
# IntroToMapReduceAndSpark

This project carries out three tasks:

  1. Task 1: Explore the randomly sampled Yelp Challenge dataset using Spark to identify the following:

    1. The total number of reviews

    2. The number of reviews in a given year

    3. The number of distinct users who have written reviews

    4. The top m users with the largest numbers of reviews, together with their review counts

    5. The top n most frequent words in the review text. Words should be lowercase; the punctuation characters “(”, “[”, “,”, “.”, “!”, “?”, “:”, “;”, “]”, “)” and the given stopwords are excluded.

  2. Task 2: Explore two datasets (review and business) together: compute the average stars for each business category and output the top n categories with the highest average stars. Implementations both with and without Spark are required, to compare Spark's performance against a plain-Python baseline.

  3. Task 3: Build a custom partitioner to improve computational efficiency, and compute the businesses that have more than n reviews in the review file.
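
The text processing in Task 1 can be sketched in plain Python as follows. This is only an illustration of the per-record logic (lowercasing, stripping the listed punctuation, removing stopwords, counting the top n words); in the actual project this would run inside Spark transformations, and the `STOPWORDS` set below is a hypothetical placeholder for the stopword file supplied with the assignment.

```python
from collections import Counter

# Hypothetical stopword list -- the assignment supplies its own file.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it"}

# Punctuation characters excluded per the task description.
PUNCTUATION = "([,.!?:;])"

def tokenize(review_text):
    """Lowercase the text, strip the listed punctuation, drop stopwords."""
    text = review_text.lower()
    for ch in PUNCTUATION:
        text = text.replace(ch, " ")
    return [w for w in text.split() if w not in STOPWORDS]

def top_n_words(review_texts, n):
    """Count word frequencies across all reviews and return the top n."""
    counts = Counter()
    for text in review_texts:
        counts.update(tokenize(text))
    return counts.most_common(n)
```

In Spark, the same logic would typically appear as a `flatMap` over review texts followed by a `reduceByKey` and a top-n selection.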
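
The non-Spark side of Task 2 can be sketched as a plain-Python join and group-by. The record shapes assumed here -- a review as `(business_id, stars)` and a business as `(business_id, comma-separated category string)` -- are simplifications of the real Yelp JSON records, chosen only to show the aggregation:

```python
from collections import defaultdict

def top_categories(reviews, businesses, n):
    """reviews: iterable of (business_id, stars);
    businesses: iterable of (business_id, comma-separated category string).
    Returns the top n (category, average_stars) pairs, ties broken by name."""
    # Join key: business_id -> list of categories.
    categories = {}
    for business_id, category_str in businesses:
        categories[business_id] = [c.strip() for c in category_str.split(",")]

    # Accumulate star sums and counts per category.
    star_sum = defaultdict(float)
    star_count = defaultdict(int)
    for business_id, stars in reviews:
        for category in categories.get(business_id, []):
            star_sum[category] += stars
            star_count[category] += 1

    averages = {c: star_sum[c] / star_count[c] for c in star_sum}
    return sorted(averages.items(), key=lambda kv: (-kv[1], kv[0]))[:n]
```

Comparing this single-process baseline against a Spark `join` plus `reduceByKey` version is what exposes where Spark's overhead pays off as the data grows.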
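
For Task 3, one common shape for a custom partitioner is a deterministic hash on the business id, so that all reviews for a business land in the same partition and per-business counts need no extra shuffling. The sketch below is an assumption about the approach, not the project's actual code; the PySpark usage in the comment is illustrative only:

```python
def business_partitioner(business_id, num_partitions):
    """Deterministic hash partitioner: every review for a given business
    maps to the same partition, so counting reviews per business requires
    no further shuffle across partitions."""
    return sum(ord(ch) for ch in business_id) % num_partitions

# Hypothetical PySpark usage (names are illustrative):
#
#   pairs = reviews_rdd.map(lambda r: (r["business_id"], 1))
#   counts = pairs.partitionBy(8, lambda key: business_partitioner(key, 8)) \
#                 .reduceByKey(lambda a, b: a + b)
#   busy = counts.filter(lambda kv: kv[1] > n)  # businesses with > n reviews
```

A character-sum hash is used instead of Python's built-in `hash()` because string hashing is randomized per interpreter run (`PYTHONHASHSEED`), which would make partition assignments non-reproducible across Spark workers.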