This project carries out three tasks:
- Task 1: Explore the randomly sampled Yelp Challenge dataset using Spark to identify the following:
  - The total number of reviews
  - The number of reviews in a given year
  - The number of distinct users who have written reviews
  - The top m users with the most reviews, along with each user's review count
  - The top n most frequent words in the review text. Words are converted to lowercase, and the punctuation marks “(”, “[”, “,”, “.”, “!”, “?”, “:”, “;”, “]”, “)” and the given stopwords are excluded.
- Task 2: Explore two datasets (review and business) together, compute the average stars for each business category, and output the top n categories with the highest average stars. Additionally, implementations both with and without Spark are required in order to understand the performance of Spark.
- Task 3: Build a custom partitioner to improve computational efficiency, and compute the businesses that have more than n reviews in the review file.
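
To make the word-frequency step of Task 1 and the non-Spark half of Task 2 more concrete, here is a minimal plain-Python sketch. It is an illustration under stated assumptions, not the project's actual implementation: it assumes both files are JSON-lines, that reviews carry `text`, `stars`, and `business_id` fields, and that businesses carry `business_id` and a comma-separated `categories` string. The function names `top_words` and `top_categories` are hypothetical. "Excluded" punctuation is interpreted here as stripping those characters from the text, and a category's average is taken over every review of every business in that category.

```python
import json
from collections import Counter, defaultdict

# The punctuation marks listed in Task 1.
PUNCTUATION = "([,.!?:;])"


def top_words(review_lines, stopwords, n):
    """Return the n most frequent (word, count) pairs in the review text."""
    # Translation table that deletes each listed punctuation mark.
    strip_table = str.maketrans("", "", PUNCTUATION)
    counts = Counter()
    for line in review_lines:
        text = json.loads(line)["text"]  # assumed field name
        for word in text.lower().translate(strip_table).split():
            if word not in stopwords:
                counts[word] += 1
    # Sort by count (descending), then alphabetically so ties are deterministic.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


def top_categories(review_lines, business_lines, n):
    """Return the n (category, average stars) pairs with the highest average."""
    # Map each business_id to its list of category names (assumed schema).
    categories_by_business = {}
    for line in business_lines:
        business = json.loads(line)
        raw = business.get("categories") or ""
        categories_by_business[business["business_id"]] = [
            name.strip() for name in raw.split(",") if name.strip()
        ]

    # Accumulate [sum of stars, number of reviews] per category.
    totals = defaultdict(lambda: [0.0, 0])
    for line in review_lines:
        review = json.loads(line)
        for category in categories_by_business.get(review["business_id"], []):
            totals[category][0] += review["stars"]
            totals[category][1] += 1

    averages = [(name, total / count) for name, (total, count) in totals.items()]
    return sorted(averages, key=lambda kv: (-kv[1], kv[0]))[:n]
```

Both functions accept any iterable of JSON strings, so an open file handle can be passed directly, for example `top_categories(open(review_path), open(business_path), 10)` with hypothetical path variables. Timing this baseline against the Spark version is one way to carry out the comparison Task 2 asks for.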