IntroToMapReduceAndSpark

This project carries out three tasks:

  1. Task 1: Explore the randomly sampled Yelp Challenge dataset using Spark to compute the following:

    1. The total number of reviews

    2. The number of reviews in a given year

    3. The number of distinct users who have written the reviews

    4. The top m users with the largest number of reviews, together with their review counts

    5. The top n frequent words in the review text. Words are lower-cased, and the punctuation marks “(”, “[”, “,”, “.”, “!”, “?”, “:”, “;”, “]”, “)” and a given list of stopwords are excluded.

  2. Task 2: Explore two datasets (review and business) together: compute the average stars for each business category and output the top n categories with the highest average stars. Implementations both with and without Spark are required, in order to compare Spark's performance against a single-machine baseline.

  3. Task 3: Build a custom partitioner to improve computational efficiency, and compute the businesses that have more than n reviews in the review file.
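The word-frequency step of Task 1 can be sketched in plain Python; in PySpark the same pipeline maps onto `flatMap` → `filter` → `reduceByKey` → `takeOrdered`. The function name and the stopword argument below are illustrative, not taken from the repository:

```python
# Sketch of the Task 1 word-frequency logic (pure Python; in PySpark the
# same steps map onto flatMap -> filter -> reduceByKey -> takeOrdered).
from collections import Counter

# The punctuation marks listed in the task description.
PUNCTUATION = set('([,.!?:;])')

def top_n_words(reviews, stopwords, n):
    """Return the n most frequent words across all review texts,
    lower-cased, with punctuation and stopwords removed."""
    counts = Counter()
    for text in reviews:
        for raw in text.lower().split():
            word = ''.join(ch for ch in raw if ch not in PUNCTUATION)
            if word and word not in stopwords:
                counts[word] += 1
    return [w for w, _ in counts.most_common(n)]
```

The other Task 1 questions (total reviews, reviews per year, distinct users, top m users) reduce to the same count-and-rank pattern over different keys.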
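Task 2's "without Spark" baseline can be sketched as a hash join followed by a grouped average. The field names (`business_id`, `stars`, `categories`) follow the Yelp dataset schema; the function name and tie-breaking rule are assumptions for illustration:

```python
# Sketch of the Task 2 baseline without Spark: join reviews with
# businesses on business_id, then average stars per category.
from collections import defaultdict

def top_categories(reviews, businesses, n):
    """reviews: iterable of (business_id, stars) pairs;
    businesses: dict mapping business_id -> list of categories.
    Returns the n categories with the highest average stars."""
    totals = defaultdict(lambda: [0.0, 0])  # category -> [star sum, count]
    for business_id, stars in reviews:
        for category in businesses.get(business_id, []):
            totals[category][0] += stars
            totals[category][1] += 1
    averages = {c: s / k for c, (s, k) in totals.items()}
    # Highest average first; ties broken alphabetically (an assumption).
    return sorted(averages.items(), key=lambda kv: (-kv[1], kv[0]))[:n]
```

In the Spark version, the same join becomes `review_rdd.join(business_rdd)` keyed on `business_id`, with the averaging done via `reduceByKey` on `(sum, count)` pairs.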
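The idea behind Task 3's custom partitioner is to hash records by `business_id` so that all reviews for one business land in the same partition, letting the per-business count run partition-locally without a shuffle. A minimal sketch, with function names assumed for illustration (in PySpark the same effect comes from `rdd.partitionBy(num_partitions, partition_func)` before the count):

```python
# Sketch of Task 3's partitioning idea: co-locate each business's
# reviews in one partition, then count per business within partitions.
from collections import defaultdict

def partition_by_business(reviews, num_partitions):
    """reviews: iterable of (business_id, review_id) pairs.
    Assigns each record a partition by hashing its business_id."""
    partitions = [[] for _ in range(num_partitions)]
    for business_id, review_id in reviews:
        idx = hash(business_id) % num_partitions
        partitions[idx].append((business_id, review_id))
    return partitions

def busy_businesses(reviews, num_partitions, n):
    """Businesses with more than n reviews; because partitioning is by
    business_id, each count is complete within its own partition."""
    result = []
    for part in partition_by_business(reviews, num_partitions):
        counts = defaultdict(int)
        for business_id, _ in part:
            counts[business_id] += 1
        result.extend(b for b, c in counts.items() if c > n)
    return result
```

The efficiency gain in Spark comes from avoiding a wide shuffle: once the data is partitioned by key, `reduceByKey` and the final filter need no cross-partition movement.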
