IntroToMapReduceAndSpark

This project carries out three tasks:

Task 1: Explore the randomly sampled Yelp Challenge dataset using Spark to identify the following:
1. The total number of reviews
2. The number of reviews in a given year
3. The number of distinct users who have written the reviews
4. The top m users with the most reviews, along with their review counts
5. The top n most frequent words in the review text. Words are converted to lowercase, and the punctuation marks “(”, “[”, “,”, “.”, “!”, “?”, “:”, “;”, “]”, “)” and the given stopwords are excluded.
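The word-count step in item 5 can be sketched in plain Python. This is a minimal illustration, not the project's Spark code: the function name `top_n_words`, the sample reviews, and the sample stopwords are hypothetical, and it assumes punctuation is deleted (rather than replaced by a space) and that ties are broken alphabetically.

```python
from collections import Counter

# The ten punctuation marks listed in item 5.
PUNCTUATION = "([,.!?:;])"


def top_n_words(texts, stopwords, n):
    """Return the n most frequent words as (word, count) pairs."""
    # Assumption: punctuation characters are deleted, not replaced by spaces.
    strip_table = str.maketrans("", "", PUNCTUATION)
    counts = Counter()
    for text in texts:
        # Lowercase, drop punctuation, split on whitespace (the "map" side).
        words = text.lower().translate(strip_table).split()
        # Accumulate counts per word (the "reduce" side), skipping stopwords.
        counts.update(w for w in words if w not in stopwords)
    # Sort by count descending, then alphabetically so ties are deterministic.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


# Hypothetical sample data, not taken from the Yelp dataset.
reviews = ["Great food, great service!", "The food was cold."]
print(top_n_words(reviews, {"the", "was"}, 2))
```

In a Spark version, the per-review tokenizing would typically become a `flatMap` and the counting a `reduceByKey`, but that mapping is an assumption about how the task could be structured.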
Task 2: Explore two datasets (review and business) together: compute the average stars for each business category and output the top n categories with the highest average stars. An implementation both with and without Spark is required, so that Spark's performance can be compared against a plain implementation.
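A rough sketch of the non-Spark side of Task 2 might look like the following. It rests on several assumptions: the field names `business_id`, `stars`, and `categories`, that `categories` is a comma-separated string, and that the average is taken over review stars rather than over each business's own rating. The function name `top_categories` and the sample records are hypothetical.

```python
from collections import defaultdict


def top_categories(reviews, businesses, n):
    """Return the n categories with the highest average review stars."""
    # Build business_id -> list of category names (the join key lookup).
    # Assumption: "categories" is a comma-separated string and may be missing.
    categories_of = {
        b["business_id"]: [c.strip() for c in b["categories"].split(",") if c.strip()]
        for b in businesses
        if b.get("categories")
    }
    # category -> [sum of stars, number of reviews]
    totals = defaultdict(lambda: [0.0, 0])
    for r in reviews:
        # Reviews of businesses without category data are skipped.
        for category in categories_of.get(r["business_id"], []):
            totals[category][0] += r["stars"]
            totals[category][1] += 1
    averages = [(c, total / count) for c, (total, count) in totals.items()]
    # Highest average first; ties broken alphabetically.
    return sorted(averages, key=lambda kv: (-kv[1], kv[0]))[:n]


# Hypothetical sample records.
businesses = [
    {"business_id": "b1", "categories": "Pizza, Italian"},
    {"business_id": "b2", "categories": "Pizza"},
]
sample_reviews = [
    {"business_id": "b1", "stars": 5.0},
    {"business_id": "b2", "stars": 3.0},
    {"business_id": "b1", "stars": 4.0},
]
print(top_categories(sample_reviews, businesses, 2))
```

Timing a function like this against the Spark job on the same input is one way to make the comparison the task asks for.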
Task 3: Build a custom partitioner to improve computational efficiency, and compute the businesses that have more than n reviews in the review file.
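The idea behind Task 3 can be illustrated without Spark by simulating partitions as lists. This is only a sketch under stated assumptions: the partitioning rule (sum of character codes modulo the partition count), the function names, and the sample IDs are all hypothetical and are not the project's actual partitioner.

```python
from collections import Counter


def business_partitioner(business_id, num_partitions):
    """Map a business ID to a partition index in [0, num_partitions)."""
    # Assumption: a simple deterministic rule. Python's built-in hash() for
    # strings varies between processes, so it is avoided here.
    return sum(ord(ch) for ch in business_id) % num_partitions


def businesses_with_more_than(review_business_ids, n, num_partitions=4):
    """Return sorted (business_id, count) pairs for businesses with > n reviews."""
    # Route every review to the partition chosen by the custom partitioner.
    partitions = [[] for _ in range(num_partitions)]
    for business_id in review_business_ids:
        partitions[business_partitioner(business_id, num_partitions)].append(business_id)
    result = []
    # All reviews of one business share a partition, so each partition
    # can be counted and filtered independently of the others.
    for partition in partitions:
        counts = Counter(partition)
        result.extend((bid, c) for bid, c in counts.items() if c > n)
    return sorted(result)


# Hypothetical sample: one business ID per review.
ids = ["b1", "b2", "b1", "b3", "b1", "b2"]
print(businesses_with_more_than(ids, 1))
```

In PySpark, a single-argument variant of such a function could be supplied as the `partitionFunc` argument of `RDD.partitionBy(numPartitions, partitionFunc)`; whether that improves efficiency depends on how evenly the keys spread across partitions.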
