Most of the data cleaning was already done in the Data source section, but several issues remain:
- Release days are negative for records dated before the release date. Those rows should be removed and their box office added to the next row.
- The country field of the films is inconsistent, so each film's nationality is determined from its production company, producer, screenwriter and director.
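The first cleaning step above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `fold_pre_release_rows` and the field names `release_day` and `box_office` are hypothetical, and the rows are assumed to belong to a single film and to be sorted chronologically.

```python
def fold_pre_release_rows(rows):
    """Drop rows with a negative release day and add their box office
    to the next remaining row.

    `rows` is assumed to be a chronologically sorted list of dicts for
    one film, each with 'release_day' and 'box_office' keys
    (hypothetical field names).
    """
    cleaned = []
    carry = 0.0  # box office collected from dropped pre-release rows
    for row in rows:
        if row["release_day"] < 0:
            # Pre-release row: remember its box office, drop the row.
            carry += row["box_office"]
            continue
        merged = dict(row)  # copy so the input is not modified
        merged["box_office"] = row["box_office"] + carry
        carry = 0.0
        cleaned.append(merged)
    # If a film has only pre-release rows, nothing is emitted and the
    # carried amount is discarded; handle that case separately if needed.
    return cleaned


sample = [
    {"release_day": -2, "box_office": 1.0},
    {"release_day": -1, "box_office": 2.0},
    {"release_day": 1, "box_office": 10.0},
    {"release_day": 2, "box_office": 5.0},
]
cleaned_sample = fold_pre_release_rows(sample)
```

With the sample above, the two pre-release rows disappear and their combined box office is added to the first row on or after the release date.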
- main data: 中国票房 (China Box Office)
Crawled data for the top 25 films by box office in each year from 2008 to 2017, including film names, categories, countries, box office, ticket prices and dates.
- contrasting data: Mtimes(时光网)
To ensure the data is reliable, the total box office of every film is compared between 中国票房 and Mtimes. If the difference is more than 8%, the box office data should be checked manually against a third site (IMDb, for example).
The disadvantage of Mtimes: there is no weekly box office data.
- correction data: IMDb
In some cases, especially for foreign films, the weekly data is missing. IMDb data for those films is used to fill the gap when this happens.
The disadvantage of IMDb: data for some Chinese films is missing.
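The 8% consistency check between 中国票房 and Mtimes described above could look like the sketch below. The function name `needs_manual_check` is hypothetical. The text does not say which figure the percentage is measured against, so the sketch assumes the difference is taken relative to the main-source (中国票房) total.

```python
def needs_manual_check(main_total, mtimes_total, threshold=0.08):
    """Return True when the two totals differ by more than `threshold`.

    Assumption: the difference is measured relative to the main-source
    total; the original text does not specify the reference value.
    """
    if main_total <= 0:
        raise ValueError("main_total must be positive")
    relative_diff = abs(main_total - mtimes_total) / main_total
    return relative_diff > threshold


# Illustrative totals only, not real box office figures.
close_enough = needs_manual_check(100.0, 107.0)   # 7% apart
too_far_apart = needs_manual_check(100.0, 109.0)  # 9% apart
```

Films for which the function returns True would then be compared manually against the third site.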