Wikipedia Big Data Analysis

Project Description

A Scala program that loads an uploaded file into a Hive table and queries it. Through an interactive CLI prompt, the user selects a query to run against Hadoop Hive. Hive then runs a MapReduce job through YARN on the data in HDFS.
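
The prompt flow described above could be sketched roughly as follows. This is only an illustration, not the project's actual code: the object `CliPrompt`, the `Command` types, and the menu labels are hypothetical names, and the loop only prints what it would submit instead of talking to Hive.

```scala
import scala.io.StdIn
import scala.util.Try

object CliPrompt {
  // Possible outcomes of reading one line from the user.
  sealed trait Command
  final case class RunQuery(index: Int) extends Command
  case object Quit extends Command
  final case class Unknown(input: String) extends Command

  // Hypothetical menu labels; the real program may offer different queries.
  val menu: Vector[String] = Vector(
    "Most clicked page",
    "Most clicked Wikipedia-to-Wikipedia page"
  )

  // Turn raw input into a Command; menu entries are numbered from 1.
  def parse(input: String): Command = {
    val s = input.trim.toLowerCase
    if (s == "q" || s == "quit") Quit
    else Try(s.toInt).toOption.filter(i => menu.indices.contains(i - 1)) match {
      case Some(i) => RunQuery(i - 1)
      case None    => Unknown(s)
    }
  }

  // Text of the menu shown before each prompt.
  def render: String =
    menu.zipWithIndex.map { case (label, i) => s"${i + 1}) $label" }.mkString("\n") + "\nq) quit"

  def main(args: Array[String]): Unit = {
    var running = true
    while (running) {
      println(render)
      // readLine returns null at end of input, which is treated as quit.
      parse(Option(StdIn.readLine("> ")).getOrElse("q")) match {
        case Quit        => running = false
        case RunQuery(i) => println(s"Would submit to Hive: ${menu(i)}")
        case Unknown(s)  => println(s"Unrecognized choice: $s")
      }
    }
  }
}
```

Keeping `parse` free of I/O makes the menu logic easy to check without a running Hive instance.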

Technologies Used

Scala - version 2.12.3
Java - version 11.0.9
Hadoop - version 2.7.4
Hive - version 3.2.1
Docker-compose - version 1.27.4

Features

Reads English Wikipedia Clickstream data from an HDFS cluster.
Returns a result set for different queries. Examples: What is the most clicked Wikipedia page? What is the most clicked Wikipedia-to-Wikipedia page?
Connects to Hive through an endpoint exposed by Docker.
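
As a rough illustration of these features, the two example questions could be written as HiveQL and sent over JDBC. This is a sketch under assumptions, not the project's actual queries: the table name `clickstream`, the object `ClickstreamQueries`, and the JDBC URL are hypothetical. The column names `prev`, `curr`, `type`, and `n` follow the published Wikipedia Clickstream format, and `run` only works with the Hive JDBC driver on the classpath and HiveServer2 reachable.

```scala
import java.sql.DriverManager

object ClickstreamQueries {
  // Hypothetical table name; the project's actual schema may differ.
  val table = "clickstream"

  // Most clicked page: sum the click counts (n) for each target page (curr).
  def mostClickedPage(limit: Int = 1): String =
    s"SELECT curr, SUM(n) AS clicks FROM $table GROUP BY curr ORDER BY clicks DESC LIMIT $limit"

  // Most clicked Wikipedia-to-Wikipedia pair: rows whose type is 'link'.
  def mostClickedInternalLink(limit: Int = 1): String =
    s"SELECT prev, curr, n FROM $table WHERE `type` = 'link' ORDER BY n DESC LIMIT $limit"

  // Run a query and return every row as a list of strings.
  // The default URL assumes HiveServer2 on localhost:10000.
  def run(sql: String, url: String = "jdbc:hive2://localhost:10000/default"): List[List[String]] = {
    val conn = DriverManager.getConnection(url)
    try {
      val rs = conn.createStatement().executeQuery(sql)
      val cols = rs.getMetaData.getColumnCount
      val rows = List.newBuilder[List[String]]
      while (rs.next()) rows += (1 to cols).map(i => rs.getString(i)).toList
      rows.result()
    } finally conn.close()
  }
}
```

For example, `ClickstreamQueries.run(ClickstreamQueries.mostClickedPage(10))` would return the ten most clicked pages if the assumed table exists.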

To-do list:

Reduce the data with a MapReduce job before uploading it to HDFS, for faster querying.
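
The idea behind this to-do item can be shown with a small in-memory sketch. It is not a Hadoop MapReduce job and not part of the project: the object `PreAggregate` and the case class `Click` are hypothetical, and the input is assumed to be tab-separated clickstream lines with the fields `prev`, `curr`, `type`, and `n`.

```scala
import scala.util.Try

object PreAggregate {
  // One clickstream row: referrer, target page, link type, click count.
  final case class Click(prev: String, curr: String, kind: String, n: Long)

  // Parse one tab-separated line; malformed lines yield None.
  def parse(line: String): Option[Click] = line.split("\t") match {
    case Array(prev, curr, kind, n) => Try(Click(prev, curr, kind, n.toLong)).toOption
    case _                          => None
  }

  // "Map" each row to (curr, n), then "reduce" by summing the counts per page.
  def clicksPerPage(lines: Iterator[String]): Map[String, Long] =
    lines.flatMap(parse).foldLeft(Map.empty[String, Long]) { (acc, c) =>
      acc.updated(c.curr, acc.getOrElse(c.curr, 0L) + c.n)
    }
}
```

Uploading the aggregated totals instead of the raw rows would leave Hive with far fewer records to scan for the "most clicked page" question.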

Getting Started

git clone git@github.com:revature-scalawags/speart-project-1.git

Ports that must be available:

50075
50070
10000
9083
8080
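
One way to check these ports before starting the containers is to try binding to each one. The snippet below is a minimal sketch, assuming a port is usable whenever a local server socket can be opened on it; `PortCheck` is a hypothetical helper, not part of the project.

```scala
import java.net.ServerSocket
import scala.util.Try

object PortCheck {
  // Ports listed in this README.
  val required: List[Int] = List(50075, 50070, 10000, 9083, 8080)

  // A port counts as available if a server socket can bind to it.
  def isFree(port: Int): Boolean =
    Try(new ServerSocket(port)).map(_.close()).isSuccess

  def main(args: Array[String]): Unit =
    required.foreach { p =>
      println(s"$p: ${if (isFree(p)) "available" else "in use"}")
    }
}
```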

Common troubleshooting if ports are unavailable

net stop winnat
docker ps # Then stop the containers that are occupying the ports

Usage

Set up the Hive single cluster and database

Bash
```
sh database.sh
```
Windows Command Prompt
```
./database.sh
```
Run
```
sbt run
```
