Signed-off-by: Jaeha Choi <email@example.com>
A quote from Wikipedia/Pageview:
Wikipedia pageviews of certain types of articles correlate with changes in stock market prices, box office success of movies, spread of disease among other applications of datamining. Since search engines directly influence what is popular on Wikipedia such statistics may provide a more unfiltered and real-time view into what people are searching for on the Web and societal interests.
Create MariaDB database
CREATE DATABASE *database_name* DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;
CREATE USER '*user*'@'localhost' IDENTIFIED BY '*password*';
GRANT ALL PRIVILEGES ON *database_name*.* TO '*user*'@'localhost';
Apply access changes
(Optional) Depending on the MariaDB version, you may need to run the following commands:
Create a plain text file called .auth in the same directory as the executable.
Add the following items to the .auth file in order, one per line:
Make sure namespaces.txt is in the same directory as the executable.
Run the executable with the --initialize option to generate a default config.json file if necessary. Check Configuration Option to see the available options.
Initial setup is complete.
WikiPageviewDB has three modes.
If .nextfile is present, data from date_in_next_file to latest_file_available will be processed.
.nextfile was not present, data from latest_file_available will be processed, and generate
2020-09-20 11 AM.
./WikiPageviewDB "2020-09-20-00" "2020-09-21-00"
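The two arguments in the command above look like year-month-day-hour timestamps. As a minimal sketch of how such a range could be expanded into hourly steps, assuming the end argument is exclusive (the README does not say whether it is), the following Go program parses both values and lists the hours in between. The names hourLayout and hoursBetween are hypothetical and are not taken from main.go.

```go
package main

import (
	"fmt"
	"time"
)

// hourLayout matches arguments such as "2020-09-20-00" (year-month-day-hour).
const hourLayout = "2006-01-02-15"

// hoursBetween parses both arguments and returns every whole hour from start
// up to, but not including, end. The half-open range is an assumption.
func hoursBetween(startArg, endArg string) ([]time.Time, error) {
	start, err := time.Parse(hourLayout, startArg)
	if err != nil {
		return nil, fmt.Errorf("invalid start %q: %w", startArg, err)
	}
	end, err := time.Parse(hourLayout, endArg)
	if err != nil {
		return nil, fmt.Errorf("invalid end %q: %w", endArg, err)
	}
	if end.Before(start) {
		return nil, fmt.Errorf("end %s is before start %s", endArg, startArg)
	}
	var hours []time.Time
	for t := start; t.Before(end); t = t.Add(time.Hour) {
		hours = append(hours, t)
	}
	return hours, nil
}

func main() {
	hours, err := hoursBetween("2020-09-20-00", "2020-09-21-00")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(len(hours), "hourly steps")
	fmt.Println("first:", hours[0].Format(hourLayout))
	fmt.Println("last: ", hours[len(hours)-1].Format(hourLayout))
}
```

Validating both timestamps before doing any work means a mistyped date fails immediately instead of partway through a long import.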
Run with --help to get information about parameters.
.nextfile, which keeps track of progress for future execution.
.nextfile will fetch one last file and generate the
default_page: Default web page to get page view data. If changed, the target website must have an identical format.
default_log_dir: Default directory to store
default_temp_dir: Default directory to store downloaded files.
page_view_threshold: Threshold to determine if article/pageview data should be stored. That is, only articles with a page view count greater than the threshold are stored.
log_progress_freq: Currently not being used.
max_cpu_core_count: Currently not being used.
temp_auto_delete: Currently not being used.
domain_code: List of domains to allow. E.g. ["en", "en.m"] will only allow data from English Wikipedia and mobile English Wikipedia pages.
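To illustrate how page_view_threshold and domain_code might interact, here is a minimal Go sketch. It assumes that each line of a pageview dump has the form "domain_code page_title view_count response_size", and that the config values are an integer and a list of strings. The Config struct, parseConfig, and keepLine are hypothetical names for illustration only; the actual implementation in main.go may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
	"strings"
)

// Config mirrors two of the documented keys; the Go field types are assumptions.
type Config struct {
	PageViewThreshold int      `json:"page_view_threshold"`
	DomainCode        []string `json:"domain_code"`
}

// parseConfig decodes the JSON content of a config file.
func parseConfig(data []byte) (Config, error) {
	var cfg Config
	err := json.Unmarshal(data, &cfg)
	return cfg, err
}

// keepLine reports whether one dump line passes both filters.
// Assumed line format: "<domain_code> <page_title> <view_count> <response_size>".
func keepLine(cfg Config, line string) bool {
	fields := strings.Fields(line)
	if len(fields) < 3 {
		return false // malformed line
	}
	allowed := false
	for _, d := range cfg.DomainCode {
		if d == fields[0] {
			allowed = true
			break
		}
	}
	if !allowed {
		return false
	}
	views, err := strconv.Atoi(fields[2])
	if err != nil {
		return false
	}
	// Strictly greater than the threshold, as the option description states.
	return views > cfg.PageViewThreshold
}

func main() {
	raw := []byte(`{"page_view_threshold": 10, "domain_code": ["en", "en.m"]}`)
	cfg, err := parseConfig(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	lines := []string{"en Main_Page 42 0", "de Hauptseite 99 0", "en.m Go 10 0"}
	for _, line := range lines {
		fmt.Println(line, "->", keepLine(cfg, line))
	}
}
```

With these sample values, only the first line is kept: the second comes from a domain that is not in the list, and the third has a view count equal to the threshold rather than greater than it.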