A tool to download Wikipedia Pageview data and store it in a database.
Jaeha Choi 3cd226fd0d
Disable debug printing
8 months ago

Wikipedia Pageview to Database

What's the use of Wikipedia Pageview?

A quote from Wikipedia/Pageview:

Wikipedia pageviews of certain types of articles correlate with changes in stock market prices, box office success of movies, spread of disease among other applications of datamining. Since search engines directly influence what is popular on Wikipedia such statistics may provide a more unfiltered and real-time view into what people are searching for on the Web and societal interests.

Initial Setup

Database Setup

  1. Create MariaDB database

    • CREATE DATABASE *database_name* DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;
  2. Create user

    • CREATE USER '*user*'@'localhost' IDENTIFIED BY '*password*';
  3. Give access

    • GRANT ALL PRIVILEGES ON *database_name*.* TO '*user*'@'localhost';
  4. Apply access changes

    • FLUSH PRIVILEGES;
  5. (Optional) Depending on the MariaDB version, you may need to run the following commands:

    • SET GLOBAL innodb_file_format=Barracuda;
    • SET GLOBAL innodb_file_per_table=ON;
    • SET GLOBAL innodb_large_prefix=1;
  6. Create a plain text file called .auth in the same directory as the executable.

  7. Add the following items to the .auth file in order, one per line:

    1. Host
    2. Database name
    3. User
    4. Password
    5. Port

WikiPageviewDB Setup

  1. Check if namespaces.txt is in the same directory as the executable.

  2. Run the executable with the -i or --initialize option to generate a default config.json file.

  3. Edit the config.json file if necessary. See "Configuration (config.json) options" for the available options.

  4. Initial setup is complete.

Executing WikiPageviewDB

WikiPageviewDB has three modes.

  1. Passing no arguments will fetch the latest data available, starting from a date in the .next file.
    • ./WikiPageviewDB
      • If the .next file is present, data from date_in_next_file to latest_file_available is processed.
      • If the .next file is not present, only the latest available file is processed and a .next file is generated.
  2. Passing one argument ("DATE") will fetch data for the specified date and hour.
    • ./WikiPageviewDB "2020-09-20-11"
      • This will download data for 2020-09-20 11 AM.
  3. Passing two arguments ("START_DATE" "END_DATE") will fetch data within the specified date range, where the start date is inclusive and the end date is exclusive.
    • ./WikiPageviewDB "2020-09-20-00" "2020-09-21-00"
      • This will download all data available for 2020-09-20.
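
To make the inclusive-start, exclusive-end rule concrete, the following Go sketch enumerates the hours covered by a range. It assumes one data file per hour, which the YYYY-MM-DD-HH format suggests; `dateLayout` and `hoursInRange` are hypothetical names, not functions from main.go.

```go
package main

import (
	"fmt"
	"time"
)

// dateLayout is the Go reference layout matching YYYY-MM-DD-HH.
const dateLayout = "2006-01-02-15"

// hoursInRange returns every hour t with start <= t < end.
func hoursInRange(start, end string) ([]time.Time, error) {
	s, err := time.Parse(dateLayout, start)
	if err != nil {
		return nil, err
	}
	e, err := time.Parse(dateLayout, end)
	if err != nil {
		return nil, err
	}
	var hours []time.Time
	for t := s; t.Before(e); t = t.Add(time.Hour) {
		hours = append(hours, t)
	}
	return hours, nil
}

func main() {
	hours, err := hoursInRange("2020-09-20-00", "2020-09-21-00")
	if err != nil {
		fmt.Println("bad date:", err)
		return
	}
	fmt.Println(len(hours), "hours, first:", hours[0].Format(dateLayout),
		"last:", hours[len(hours)-1].Format(dateLayout))
}
```

For the example range this yields 24 hours, from 2020-09-20-00 through 2020-09-20-23; the end value 2020-09-21-00 itself is excluded.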

Miscellaneous

  • Format for a date is YYYY-MM-DD-HH.
  • You can also pass in -h or --help to get information about parameters.
  • All modes create a .next file, which keeps track of progress for future executions.
  • Passing no arguments with no .next file will fetch one last file and generate the .next file.
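
The argument handling described above could be dispatched along these lines. This is only a sketch: `selectMode` and the mode names it returns are hypothetical, and the real main.go may parse its flags differently.

```go
package main

import (
	"fmt"
	"os"
)

// selectMode maps the command-line arguments (without the program name)
// to a mode name. The names are illustrative only.
func selectMode(args []string) (string, error) {
	// A single flag argument selects help or initialization.
	if len(args) == 1 {
		switch args[0] {
		case "-h", "--help":
			return "help", nil
		case "-i", "--initialize":
			return "initialize", nil
		}
	}
	switch len(args) {
	case 0:
		return "resume", nil // continue from the .next file, or fetch the latest file
	case 1:
		return "single", nil // one DATE
	case 2:
		return "range", nil // START_DATE (inclusive) and END_DATE (exclusive)
	}
	return "", fmt.Errorf("expected at most 2 arguments, got %d", len(args))
}

func main() {
	mode, err := selectMode(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("mode:", mode)
}
```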

Configuration (config.json) options

  • default_page : Default web page to get page view data. If changed, the target website must have an identical format.
  • default_log_dir : Default directory to store log.txt file.
  • default_temp_dir : Default directory to store downloaded file.
  • page_view_threshold : Threshold that determines whether article/pageview data is stored, i.e. only articles with a page view count greater than the threshold are stored.
  • log_progress_freq : Currently not being used.
  • max_cpu_core_count : Currently not being used.
  • temp_auto_delete : Currently not being used.
  • domain_code : List of domains to allow. E.g. ["en", "en.m"] will only allow data from English Wikipedia and mobile English Wikipedia page.

Dependencies

  1. github.com/PuerkitoBio/goquery
    • Used to check the latest file available.
  2. github.com/go-sql-driver/mysql
    • Used to read and write page view data in MariaDB.