
Trenton-Serpas/Tserpas-project1


Project1 Wikipedia view count

This project takes Wikipedia data files that record how many views each page received, available at https://dumps.wikimedia.org/other/pageviews/

It then combines the views from each country and discards any pages with fewer than 100 views.

This layer of the project returns the top 10 pages by the resulting view count, in ranked order.

Technologies Used

Features

  • Combines the view counts from each region
  • Discards any pages with fewer than 100 views
  • Stores the results in an easy-to-retrieve format

To-do list:

  • Implement an easy-to-understand script that runs the steps below

How to run

Set up Big Data Europe's Hive container

git clone https://github.com/big-data-europe/docker-hive && cd docker-hive
docker-compose up -d

Copy your previous data to the namenode container, then open a shell in the container and prepare the data

docker cp {YOUR HADOOP NAMENODE CONTAINER ID}:output output
docker exec -it {YOUR docker-hive_namenode} bash
hadoop fs -get output output

Now you can leave the namenode container and run

sbt run

See the Hadoop data processing part of this project:

https://github.com/Trenton-Serpas/Tserpas-project1-hadoop

About

A pipeline using Hive
