The purpose of this project is to build a scraper function that takes any Rotten Tomatoes movie URL as input and returns all the information available on Rotten Tomatoes about the specified movie. In the end, our goals for the project are:
- To have a database of different movie names and a Python script that iterates through the database.
- At each iteration, to construct the appropriate URL for the movie name.
- To feed the constructed URL into the function named `printData`, which will return the information about the movie.
- Using the `BeautifulSoup` library to parse HTML elements, we were able to precisely pinpoint the relevant information we want to extract for a given movie. We then generalized this functionality into a method that takes in a movie URL, named `printData`, which is created inside `dataScraping.py` (a sketch follows this list).
- We then found a dataset named `movies_data.csv` that contains 45,000 movie names. By parsing the dataset, we can extract each movie name and use it to construct the URL that goes into our `printData` method. Since Rotten Tomatoes does not have one set naming convention for all its movies, we found several conventions that cover the majority of movies, and we use five try/except blocks nested in each other so that if one convention is wrong, we can try the others (sketched after this list). If all conventions fail, we write the movie's index in `movies_data.csv` to the `error.txt` file and carry on to the next movie down the list. In the end, all of the movies that were successfully scraped are stored inside the file named `data-scraped.csv`. All of these processes are illustrated in `dataDownload.py`.
- Since our data scraping took a considerably long time, we decided to run multiple scrapers simultaneously to speed up the process. We ran seven scrapers in parallel (see the sketch after this list), and our total scraping time was around 2 hours and 30 minutes, with each process scraping around 6,430 movies. All of the initial scraped files are stored inside the `Scraped Files` folder, each following the naming convention `[starting index]-[ending index - 1].csv`.
- We then combined all the segmented datasets that were successfully scraped into `combined.csv` inside the `Scraped Files` folder (a pandas sketch of this step also follows this list).
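As a rough illustration of the scraping step, here is a minimal sketch of what a `printData`-style function could look like with `BeautifulSoup`. The selector and the single field below are illustrative assumptions; the real method in `dataScraping.py` extracts many more attributes, and Rotten Tomatoes' markup changes over time.

```python
from urllib.request import urlopen
from bs4 import BeautifulSoup

def printData(url):
    # Fetch the page and parse it with BeautifulSoup.
    soup = BeautifulSoup(urlopen(url).read(), "html.parser")
    # Illustrative field: the movie title from the page's <h1> tag.
    title = soup.find("h1")
    print("Title:", title.get_text(strip=True) if title else "nothing")
```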
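The naming-convention fallback can be sketched as follows. The slug transformations are illustrative guesses at typical conventions, and the five nested try/except blocks from `dataDownload.py` are flattened into a loop here for brevity:

```python
from urllib.error import HTTPError

BASE = "https://www.rottentomatoes.com/m/"

def candidate_urls(name):
    # A few plausible slug conventions (the real script tries five).
    slug = name.lower()
    cleaned = "".join(c for c in slug if c.isalnum() or c == " ")
    return [BASE + slug.replace(" ", "_"),
            BASE + cleaned.replace(" ", "_"),
            BASE + cleaned.replace(" ", "")]

def scrape_movie(name, index):
    for url in candidate_urls(name):
        try:
            printData(url)             # from dataScraping.py
            return True
        except HTTPError:              # wrong convention, try the next one
            continue
    with open("error.txt", "a") as f:  # every convention failed
        f.write(f"{index}\n")
    return False
```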
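We ran the seven scrapers over disjoint index ranges; a sketch of the same idea with Python's `multiprocessing` module (the `scrape_range` helper is hypothetical) could look like this:

```python
from multiprocessing import Process

def scrape_range(start, end):
    # Hypothetical helper: scrape movies_data.csv rows [start, end)
    # and write the results to "<start>-<end - 1>.csv".
    ...

if __name__ == "__main__":
    chunk = 45000 // 7  # roughly 6,430 movies per process
    workers = [Process(target=scrape_range, args=(i * chunk, (i + 1) * chunk))
               for i in range(7)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```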
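Finally, combining the per-process output files is a straightforward pandas concatenation. The glob pattern below assumes the `[starting index]-[ending index - 1].csv` naming described above:

```python
import glob
import pandas as pd

# Gather every segmented "<start>-<end>.csv" file and stack them.
parts = sorted(glob.glob("Scraped Files/*-*.csv"))
combined = pd.concat((pd.read_csv(p) for p in parts), ignore_index=True)
combined.to_csv("Scraped Files/combined.csv", index=False)
```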
- Simply fork and clone this repository onto your computer.
- Run `pip install bs4` for BeautifulSoup and `pip install pandas` for the pandas DataFrame library in your terminal.
- Go into `dataScraping.py` and, under the section `Run First`, uncomment that part of the code and run the script by typing `python3 dataScraping.py` to make sure there is no error in the method. If you did not encounter any errors, proceed to the next step; if you did, please read below.
  - If there is a `[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed` error, it is most likely because you do not have the appropriate certificates. Follow this Stack Overflow link if you did not install your Python using `homebrew`; if you did use `homebrew`, follow this Stack Overflow link instead. For the latter option, `certificate.py` is provided to make it faster to set up the needed certificates (see the snippet after this list).
- Then, to see the data scraper go through the list of movies inside `movies_data.csv`, run `python3 dataDownload.py`.
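For the certificate issue mentioned above, one widely cited Stack Overflow workaround is to fall back to an unverified HTTPS context. This is shown purely as an illustration of the kind of fix `certificate.py` speeds up; note that it disables certificate verification, so it is only appropriate for experiments like this scraper:

```python
import ssl

# Fall back to an unverified HTTPS context so urlopen stops raising
# [SSL: CERTIFICATE_VERIFY_FAILED]. This disables certificate checking.
ssl._create_default_https_context = ssl._create_unverified_context
```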
- Inside this repository, a Jupyter Notebook named `websiteScraper.ipynb` was created as a faster demonstration of what was described above, without forking the repository. Simply click on the notebook and it will direct you to a static HTML page where you can see the code structure and layout.
- We managed to successfully scrape 18,445 movies out of the 45,000 movies inside the `movies_data.csv` database, which is equivalent to a 40.9889% success rate for our scraper.
- The main problem we struggled with throughout this project was finding the appropriate URL conventions to increase our scraper's success rate; there are so many conventions that it was difficult to capture all of them.
- Additionally, we found that for the majority of the movies we scraped, many attributes were missing; when this happens, the scraper puts "nothing" in the corresponding data slots. Below is an example of a movie that has all the attributes:

While a bad example would look like this:
The bad example shows missing values such as the Critics Consensus, Tomatometer, and Total Count, which were absent from multiple movies that we scraped. Therefore, we went over our `combined.csv` file and cleaned it into `combined_cleaned.csv`, where only movies with the majority of the necessary attributes, such as Critics Consensus, Tomatometer, and Total Count, are kept (see the sketch below). As a result, our scraped movies went from 18,445 to 11,382 after we cleaned the `combined.csv` file.
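The cleaning pass can be reproduced with pandas along these lines. The column names are assumptions based on the attributes listed above, and the filter mirrors the scraper's habit of writing the literal string "nothing" for missing values:

```python
import pandas as pd

df = pd.read_csv("Scraped Files/combined.csv")
# Assumed column names for the key attributes.
required = ["Critics Consensus", "Tomatometer", "Total Count"]
# Keep only rows where every key attribute is present and not "nothing".
keep = df[required].notna().all(axis=1) & (df[required] != "nothing").all(axis=1)
df[keep].to_csv("combined_cleaned.csv", index=False)
```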
- We hope that with a dataset of 11,382 movies we will be able to construct a regression analysis of the possible factors that make the audience rating high for one movie and low for another.
- Additionally, we are interested in utilizing OpenAI's GPT-2 language model to construct a synopsis from an invented movie title, critics rating, and audience score.
- Trung Bui
- Minh Hua
- Quang Pham