A CLI application that uses Hive to query a YARN cluster running in a local container and answer four questions about popular and not-so-popular movies.
GroupLens Research has collected ratings datasets and made them available on the MovieLens website. The MovieLens 25M dataset, a stable benchmark dataset, describes 5-star rating and free-text tagging activity: 25,000,095 ratings and 1,093,360 tag applications across 62,423 movies by 162,541 users. It includes tag genome data with 15 million relevance scores across 1,129 tags. The data was generated between January 1995 and November 2019 and released in 11/2019.
This application uses Hive on top of a YARN cluster to query the MovieLens dataset and answer the following questions:
- What are the most popular movies ever?
- What are the 'worst' popular movies?
- What are some good but unpopular movies?
- What movies correlate closely to their tag descriptions?
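
As an illustration, the first question above could be phrased as a HiveQL query built in Scala. This is only a sketch and not necessarily how the application does it: it assumes Hive tables named `ratings` and `movies` (hypothetical names) that carry the `movieId`, `rating`, and `title` columns from the MovieLens CSV files, and it treats "popular" as "most rated". The object and method names are also made up for the example.

```scala
object PopularMovies {
  // Builds a HiveQL query that lists the most-rated movies.
  // Table names `ratings` and `movies` are assumptions; the column
  // names follow the MovieLens ratings.csv and movies.csv headers.
  def mostRatedQuery(limit: Int = 10): String =
    s"""SELECT m.movieId, m.title, COUNT(*) AS num_ratings, AVG(r.rating) AS avg_rating
       |FROM ratings r
       |JOIN movies m ON r.movieId = m.movieId
       |GROUP BY m.movieId, m.title
       |ORDER BY num_ratings DESC
       |LIMIT $limit""".stripMargin
}
```

Sorting the same aggregate by `avg_rating` instead, with a minimum or maximum on `num_ratings`, would be one way to approach the "worst popular" and "good but unpopular" questions.
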
- Scala and SBT: https://www.scala-lang.org/download/2.12.8.html
- JDK (v11): https://jdk.java.net/15/
- Hive-jdbc driver via library dependency: v3.1.2
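
The requirements above might translate into a `build.sbt` along these lines. This is a minimal sketch: the project name is a placeholder, and the Scala version is taken from the download link above.

```scala
// build.sbt (sketch; the project name is hypothetical)
name := "movielens-hive-cli"

// Scala version matching the download link in the requirements
scalaVersion := "2.12.8"

// Hive JDBC driver, version as listed in the requirements
libraryDependencies += "org.apache.hive" % "hive-jdbc" % "3.1.2"
```
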
These datasets can be acquired from MovieLens.
The dataset used was their 25M Dataset. The README for this data can be viewed here.
View the files needed in your HDFS:
This application looks for your Hive cluster at the default address, http://localhost:10000.
No username or password is required.
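
A connection with these defaults could look roughly like the sketch below. The Hive JDBC driver addresses HiveServer2 with a `jdbc:hive2://` URL, so the sketch assumes that scheme for the same host and port. The `default` database and the `HiveConnection` object are assumptions made for illustration, not names taken from the application.

```scala
import java.sql.{Connection, DriverManager}

object HiveConnection {
  // Builds a HiveServer2 JDBC URL. The database name "default" is an assumption.
  def jdbcUrl(host: String = "localhost", port: Int = 10000, database: String = "default"): String =
    s"jdbc:hive2://$host:$port/$database"

  // Opens a connection with empty credentials, since none are required.
  // Needs the hive-jdbc driver on the classpath and a running HiveServer2.
  def connect(url: String = jdbcUrl()): Connection = {
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    DriverManager.getConnection(url, "", "")
  }
}
```

Calling `HiveConnection.connect()` returns a standard `java.sql.Connection`, which should be closed once the queries have run.
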
Repo
My Github
Email: [email protected]
None known at the moment.
If any are discovered, please feel free to contact me. Cheers. 😄