Big Data project: data science analysis of the Yelp dataset.
• Scope: designed and answered data science questions on the unstructured Yelp dataset (~8 GB).
• Tools: Apache Spark/Hadoop (PySpark).
• Environment: Linux (Bash).
• Challenge: accessing fields inside nested JSON objects.
• Solution: representing the objects as dictionaries made indexing the nested fields simple.
• Results: gained useful insights about reviews, businesses, and users; e.g., the number of reviews is only weakly correlated with the number of fans.
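
The dictionary-based approach to nested JSON and the correlation check mentioned above can be sketched in plain Python. This is a minimal illustration, not the project's actual PySpark code: the sample records, the `stats` nesting, and the helper names `get_nested` and `pearson` are all hypothetical, and the values are made up for demonstration only.

```python
import json
import math

# Hypothetical JSON lines with a nested structure; the field layout and the
# values are invented for illustration and are not taken from the Yelp data.
RAW_LINES = [
    '{"user_id": "u1", "stats": {"review_count": 10, "fans": 1}}',
    '{"user_id": "u2", "stats": {"review_count": 25, "fans": 0}}',
    '{"user_id": "u3", "stats": {"review_count": 3, "fans": 2}}',
    '{"user_id": "u4", "stats": {"review_count": 40, "fans": 5}}',
]


def get_nested(record, path, default=None):
    """Follow a sequence of keys through nested dictionaries.

    Returns `default` if any key along the path is missing.
    """
    current = record
    for key in path:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Parse each JSON line into a dictionary, then index the nested fields.
records = [json.loads(line) for line in RAW_LINES]
review_counts = [get_nested(r, ("stats", "review_count"), 0) for r in records]
fans = [get_nested(r, ("stats", "fans"), 0) for r in records]

r = pearson(review_counts, fans)
print(f"Pearson correlation (toy data): {r:.3f}")
```

On the full dataset the same idea would presumably be expressed with Spark rather than Python lists; the sketch only shows why dictionary indexing makes the nested fields easy to reach.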