This project takes Wikipedia pageview data files, retrievable from https://dumps.wikimedia.org/other/pageviews/, which record how many views each page received.
It combines the views for each page across all regions and discards any page with fewer than 100 views.
This layer of the project returns the top 10 pages by combined view count, in descending order.
- Hive
- Big-Data-Europe's Hive docker container - https://github.com/big-data-europe/docker-hadoop
- More details in the first part, at the link at the bottom of this README
- Combine the view counts from each region
- Discard any pages with fewer than 100 views
- Store the results in an easily retrievable format
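The combine/filter/top-10 logic above can be sketched in the shell. The sample file below assumes the pageviews dump's space-separated `domain_code page_title count_views total_response_size` line layout (an assumption based on the dump's documentation, not this project's code):

```shell
# Sample input in the assumed pageviews dump layout:
# domain_code page_title count_views total_response_size
cat > sample_pageviews.txt <<'EOF'
en Main_Page 70 0
en.m Main_Page 60 0
de Main_Page 20 0
en Obscure_Page 40 0
EOF

# Sum views per page across all domains/regions, keep only pages with
# at least 100 combined views, then print the top 10 by view count.
awk '{ views[$2] += $3 }
     END { for (p in views) if (views[p] >= 100) print views[p], p }' \
    sample_pageviews.txt | sort -rn | head -n 10
# prints: 150 Main_Page
```

Here `Main_Page` survives with 70 + 60 + 20 = 150 views, while `Obscure_Page` (40 views) is discarded by the 100-view threshold.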
To-do list:
- Implement an easy-to-understand script that runs the steps below
Set up Big Data Europe's Hive container:

```shell
git clone https://github.com/big-data-europe/docker-hive && cd docker-hive
docker-compose up -d
```
Copy the output from your previous layer's namenode container to the host, then into the new Hive stack's namenode container, and load it into HDFS (find the container names/IDs with `docker ps`):

```shell
docker cp {YOUR HADOOP NAMENODE CONTAINER ID}:output output
docker cp output {YOUR docker-hive_namenode}:output
docker exec -it {YOUR docker-hive_namenode} bash
hadoop fs -put output output
```
Now you can leave the namenode container and run:

```shell
sbt run
```