
# IntroToMapReduceAndSpark

This project carries out three tasks:

  1. Task 1: Explore the randomly sampled Yelp Challenge dataset using Spark to identify the following:

    1. The total number of reviews

    2. The number of reviews in a given year

    3. The number of distinct users who have written reviews

    4. The top m users with the largest numbers of reviews, together with their review counts

    5. The top n most frequent words in the review text. Words should be lowercase; the punctuation characters “(”, “[”, “,”, “.”, “!”, “?”, “:”, “;”, “]”, “)” and the given stopwords are excluded.

  2. Task 2: Explore two datasets (review and business) together: compute the average stars for each business category and output the top n categories with the highest average stars. Implementations both with and without Spark are required, to compare Spark's performance against a plain-Python baseline.

  3. Task 3: Build a custom partitioner to improve computational efficiency, and compute the businesses that have more than n reviews in the review file.
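
The text processing in Task 1 can be sketched in plain Python as follows. This is only an illustration of the per-record logic (lowercasing, stripping the listed punctuation, removing stopwords, counting the top n words); in the actual project this would run inside Spark transformations, and the `STOPWORDS` set below is a hypothetical placeholder for the stopword file supplied with the assignment.

```python
from collections import Counter

# Hypothetical stopword list -- the assignment supplies its own file.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it"}

# Punctuation characters excluded per the task description.
PUNCTUATION = "([,.!?:;])"

def tokenize(review_text):
    """Lowercase the text, strip the listed punctuation, drop stopwords."""
    text = review_text.lower()
    for ch in PUNCTUATION:
        text = text.replace(ch, " ")
    return [w for w in text.split() if w not in STOPWORDS]

def top_n_words(review_texts, n):
    """Count word frequencies across all reviews and return the top n."""
    counts = Counter()
    for text in review_texts:
        counts.update(tokenize(text))
    return counts.most_common(n)
```

In Spark, the same logic would typically appear as a `flatMap` over review texts followed by a `reduceByKey` and a top-n selection.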
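
The non-Spark side of Task 2 can be sketched as a plain-Python join and group-by. The record shapes assumed here -- a review as `(business_id, stars)` and a business as `(business_id, comma-separated category string)` -- are simplifications of the real Yelp JSON records, chosen only to show the aggregation:

```python
from collections import defaultdict

def top_categories(reviews, businesses, n):
    """reviews: iterable of (business_id, stars);
    businesses: iterable of (business_id, comma-separated category string).
    Returns the top n (category, average_stars) pairs, ties broken by name."""
    # Join key: business_id -> list of categories.
    categories = {}
    for business_id, category_str in businesses:
        categories[business_id] = [c.strip() for c in category_str.split(",")]

    # Accumulate star sums and counts per category.
    star_sum = defaultdict(float)
    star_count = defaultdict(int)
    for business_id, stars in reviews:
        for category in categories.get(business_id, []):
            star_sum[category] += stars
            star_count[category] += 1

    averages = {c: star_sum[c] / star_count[c] for c in star_sum}
    return sorted(averages.items(), key=lambda kv: (-kv[1], kv[0]))[:n]
```

Comparing this single-process baseline against a Spark `join` plus `reduceByKey` version is what exposes where Spark's overhead pays off as the data grows.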
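
For Task 3, one common shape for a custom partitioner is a deterministic hash on the business id, so that all reviews for a business land in the same partition and per-business counts need no extra shuffling. The sketch below is an assumption about the approach, not the project's actual code; the PySpark usage in the comment is illustrative only:

```python
def business_partitioner(business_id, num_partitions):
    """Deterministic hash partitioner: every review for a given business
    maps to the same partition, so counting reviews per business requires
    no further shuffle across partitions."""
    return sum(ord(ch) for ch in business_id) % num_partitions

# Hypothetical PySpark usage (names are illustrative):
#
#   pairs = reviews_rdd.map(lambda r: (r["business_id"], 1))
#   counts = pairs.partitionBy(8, lambda key: business_partitioner(key, 8)) \
#                 .reduceByKey(lambda a, b: a + b)
#   busy = counts.filter(lambda kv: kv[1] > n)  # businesses with > n reviews
```

A character-sum hash is used instead of Python's built-in `hash()` because string hashing is randomized per interpreter run (`PYTHONHASHSEED`), which would make partition assignments non-reproducible across Spark workers.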