Skip to content

dusdoom/movie_scraper

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

35 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Movie review scraper

not sure if this could be useful for someone so i will just throw it here

About

A very rough code for a webscraper made with the intent of building a dataset of movie reviews from multiple sources to be used some projects, mainly a movie recommendation system. Each spider should be given a list of url or implement it's own URL builder function that will scrape data based on a predefined flow, parsing all the reviews to the same format.

Documentation

This documentation should contain resources generic enough to implement the same spider on most review websites using any library, framework, language or tool. For my case, i choose to use the Scrapy framework. overview

Scraping flow

The flowchart bellow is supposed to represent the process for each movie and should be compatible with most website review structure. This does not mean that a spider should be responsible for every website (which should be pretty complex to do) but declares a step-by-step to be performed on each one. scraping flowchart

"Top user" should be a definition of how prestigious is the user as a movie reviewer. Each source should have it's own algorithm to determine this. While this could mean being a Top Critic in Rotten Tomatoes, it could be a user with more than 200 reviews in AdoroCinema. Altough this process seems to be part of the cleaning process, defining top users during the scraping stage helps to prevent unnecessary pages to be parsed, speeding up the process and saving some bandwich.

Data structure

This is supposed to represent the data scrapped, focusing solely on the the reviews and users. Later on, this could be incremented with movie metadatas from a open source API or be stored in a database with movieName_releaseDate as a ID, for example.

Field Description
sentiment The sentiment of the review text(positive, negative). Most times this will be based on the rating
userUrl The URL of the user who wrote the review
movieName The name of the movie
movieYear Release year of the movie
rating Original and normalized rating from the review (e.g. 8.5/10, 4/5, A+, C.).
text The review text
userName The user name
isRelevantRating Checks if the review is highly relevant to the movie, e.g. a review from a top critic or has multiple likes
source The source of the review (e.g. IMDB, Rotten Tomatoes, etc.)

Cleaning pipeline

Rotten tomatoes

overview

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages