Project1 Wikipedia viewcount

This project takes Wikipedia's pageview data files, which record how many views each page received. The files can be downloaded from https://dumps.wikimedia.org/other/pageviews/
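As a rough illustration of fetching one input file, the sketch below builds the URL of a single hourly dump. The function name `pageviews_url` is hypothetical, and the path layout (`YYYY/YYYY-MM/pageviews-YYYYMMDD-HH0000.gz`) is an assumption based on how the dump site is organised, so check it against the directory listing before relying on it.

```shell
# Build the URL of one hourly pageviews dump.
# Assumed layout: <base>/YYYY/YYYY-MM/pageviews-YYYYMMDD-HH0000.gz
# Arguments: year month day hour (zero-padded, e.g. 2021 01 01 00)
pageviews_url() {
  year=$1 month=$2 day=$3 hour=$4
  printf 'https://dumps.wikimedia.org/other/pageviews/%s/%s-%s/pageviews-%s%s%s-%s0000.gz\n' \
    "$year" "$year" "$month" "$year" "$month" "$day" "$hour"
}

# Example (needs network access):
#   curl -O "$(pageviews_url 2021 01 01 00)"
```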

The project then combines the views from each country and discards any pages with fewer than 100 views.

This layer of the project returns the top 10 pages from the resulting counts, ranked by number of views.

Technologies Used

  • Docker and docker-compose
  • Apache Hive (Big Data Europe's docker-hive container)
  • Hadoop (HDFS)
  • sbt

Features

  • Combines the view counts from each region
  • Discards any pages with fewer than 100 views
  • Stores the results in an easy-to-retrieve format
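
The combine, filter and top-10 steps can be sketched on the command line. This is only an illustration in awk, not the project's Hive code. It assumes the raw pageviews line layout `domain_code page_title view_count response_size`, and that the domain code in the first column is what this README calls a region. The function name `top_pages` is hypothetical.

```shell
# Sum views per page across all domain codes, drop pages with fewer than
# 100 total views, and print the 10 most viewed as "count title".
# Reads the named files, or standard input when no file is given.
top_pages() {
  awk '{ views[$2] += $3 }
       END {
         for (page in views)
           if (views[page] >= 100) print views[page], page
       }' "$@" |
    sort -k1,1nr -k2,2 |   # highest count first, title breaks ties
    head -n 10
}
```

For example, `gunzip -c pageviews-*.gz | top_pages` would rank the pages across every file matched by the glob.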

To-do list:

  • Implement an easy-to-understand script that runs the steps below

How to run

Set up Big Data Europe's Hive container:

git clone https://github.com/big-data-europe/docker-hive && cd docker-hive
docker-compose up -d

Copy your previous data to the namenode container, then open a shell in the container and prepare the data:

docker cp {YOUR HADOOP NAMENODE CONTAINER ID}:output output
docker exec -it {YOUR docker-hive_namenode} bash
hadoop fs -get output output
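
The container placeholders above have to be filled in by hand. One possible way to look up an ID is sketched below, using `docker ps` with a name filter. The helper name `container_id` is hypothetical. The name filter matches substrings, so if more than one namenode container is running, pass a more specific name.

```shell
# Print the ID of the first running container whose name contains $1.
container_id() {
  docker ps --filter "name=$1" --format '{{.ID}}' | head -n 1
}

# Example, assuming the Hive namenode container name contains "namenode":
#   docker exec -it "$(container_id namenode)" bash
```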

Now you can leave the namenode container and run:

sbt run

See the Hadoop data processing part of this project:

https://github.com/Trenton-Serpas/Tserpas-project1-hadoop