A quote from Wikipedia/Pageview:
Wikipedia pageviews of certain types of articles correlate with changes in stock market prices, box office success of movies, spread of disease among other applications of datamining. Since search engines directly influence what is popular on Wikipedia such statistics may provide a more unfiltered and real-time view into what people are searching for on the Web and societal interests.
Create MariaDB database
CREATE DATABASE *database_name* DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;
Create user
CREATE USER '*user*'@'localhost' IDENTIFIED BY '*password*';
Give access
GRANT ALL PRIVILEGES ON *database_name*.* TO '*user*'@'localhost';
Apply access changes
FLUSH PRIVILEGES;
(Optional) Depending on the MariaDB version, you may need to run the following commands:
Create a plain text file called `.auth` in the same directory as the executable.
Add the following items to the `.auth` file in order, separated by newlines:
Check that `namespaces.txt` is in the same directory as the executable.
Run the executable with the `-i` or `--initialize` option to generate a default `config.json` file.
Edit the `config.json` file if necessary. See the configuration options for the available settings.
Initial setup is complete.
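
With the database and user from the steps above in place, a Go program would typically reach them through a data source name (DSN). The sketch below builds one in the format used by `github.com/go-sql-driver/mysql`, matching the `utf8mb4` character set and collation chosen when the database was created. The host, port, and the helper name `buildDSN` are illustrative assumptions, and this is not necessarily how WikiPageviewDB itself connects.

```go
package main

import (
	"fmt"
	"net/url"
)

// buildDSN assembles a DSN of the form
//   user:password@tcp(host:port)/dbname?param=value
// buildDSN is a hypothetical helper; host and port are assumptions.
func buildDSN(user, password, host string, port int, database string) string {
	params := url.Values{}
	// Match the character set and collation used in CREATE DATABASE.
	params.Set("charset", "utf8mb4")
	params.Set("collation", "utf8mb4_general_ci")
	// Encode sorts the parameters by key, so the output is deterministic.
	return fmt.Sprintf("%s:%s@tcp(%s:%d)/%s?%s",
		user, password, host, port, database, params.Encode())
}

func main() {
	fmt.Println(buildDSN("user", "password", "localhost", 3306, "database_name"))
	// prints: user:password@tcp(localhost:3306)/database_name?charset=utf8mb4&collation=utf8mb4_general_ci
}
```

For passwords that contain special characters, the driver's `mysql.Config` type and its `FormatDSN` method are a safer way to assemble the same string.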
WikiPageviewDB has three modes.

1. No arguments: `./WikiPageviewDB` uses the `.next` file.
   - If the `.next` file is present, data from date_in_next_file to latest_file_available is processed.
   - If the `.next` file is not present, data from latest_file_available is processed and a `.next` file is generated.
2. One timestamp: `./WikiPageviewDB "2020-09-20-11"` processes the data for 2020-09-20 11 AM.
3. Two timestamps: `./WikiPageviewDB "2020-09-20-00" "2020-09-21-00"` processes the data for 2020-09-20.

Timestamps use the format `YYYY-MM-DD-HH`.
Use `-h` or `--help` to get information about the parameters.
The first mode relies on the `.next` file, which keeps track of progress for future executions.
Running without a `.next` file fetches only the latest available file and generates the `.next` file.
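
To make the timestamp arguments concrete, here is a minimal Go sketch that parses `YYYY-MM-DD-HH` values and lists the hourly files a range would cover. The function name `hoursBetween` is hypothetical, and treating the second timestamp as exclusive is an assumption drawn from the example in which `"2020-09-20-00" "2020-09-21-00"` covers 2020-09-20; the actual tool may handle the boundary differently.

```go
package main

import (
	"fmt"
	"time"
)

// hourLayout is the Go reference-time spelling of YYYY-MM-DD-HH.
const hourLayout = "2006-01-02-15"

// hoursBetween returns every hourly timestamp from start up to, but not
// including, end. The exclusive end is an assumption (see the lead-in).
func hoursBetween(start, end string) ([]string, error) {
	s, err := time.Parse(hourLayout, start)
	if err != nil {
		return nil, fmt.Errorf("invalid start %q: %w", start, err)
	}
	e, err := time.Parse(hourLayout, end)
	if err != nil {
		return nil, fmt.Errorf("invalid end %q: %w", end, err)
	}
	var hours []string
	// Step one hour at a time until the end timestamp is reached.
	for t := s; t.Before(e); t = t.Add(time.Hour) {
		hours = append(hours, t.Format(hourLayout))
	}
	return hours, nil
}

func main() {
	hours, err := hoursBetween("2020-09-20-00", "2020-09-21-00")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(len(hours), hours[0], hours[len(hours)-1])
	// prints: 24 2020-09-20-00 2020-09-20-23
}
```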

Configuration (`config.json`) options

- `default_page`: Default web page to get page view data from. If changed, the target website must have an identical format.
- `default_log_dir`: Default directory to store the `log.txt` file.
- `default_temp_dir`: Default directory to store downloaded files.
- `page_view_threshold`: Threshold that determines whether article/pageview data is stored, i.e. only articles with a page view count greater than the threshold are stored.
- `log_progress_freq`: Currently not used.
- `max_cpu_core_count`: Currently not used.
- `temp_auto_delete`: Currently not used.
- `domain_code`: List of domains to allow. E.g. `["en", "en.m"]` only allows data from English Wikipedia and mobile English Wikipedia pages.

Dependencies:

- github.com/PuerkitoBio/goquery
- github.com/go-sql-driver/mysql
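
As an illustration of how the `config.json` options described above could be consumed, the following Go sketch decodes a few of the keys and applies the `page_view_threshold` and `domain_code` filters. The struct, the field types, the helper names `loadConfig` and `keep`, and the sample values are all assumptions made for this example; only the JSON key names come from the option list.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Config mirrors some of the documented keys. Field types are assumptions.
type Config struct {
	DefaultPage       string   `json:"default_page"`
	DefaultLogDir     string   `json:"default_log_dir"`
	DefaultTempDir    string   `json:"default_temp_dir"`
	PageViewThreshold int      `json:"page_view_threshold"`
	DomainCode        []string `json:"domain_code"`
}

// loadConfig decodes a JSON document into a Config (hypothetical helper).
func loadConfig(raw string) (Config, error) {
	var c Config
	err := json.Unmarshal([]byte(raw), &c)
	return c, err
}

// keep reports whether a record should be stored: its view count must be
// greater than the threshold and its domain must be in the allow list.
// In this sketch an empty allow list rejects everything.
func (c Config) keep(domain string, views int) bool {
	if views <= c.PageViewThreshold {
		return false
	}
	for _, d := range c.DomainCode {
		if d == domain {
			return true
		}
	}
	return false
}

func main() {
	// Sample values are hypothetical, not the tool's defaults.
	cfg, err := loadConfig(`{"page_view_threshold": 100, "domain_code": ["en", "en.m"]}`)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(cfg.keep("en", 250), cfg.keep("de", 250), cfg.keep("en", 100))
	// prints: true false false
}
```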