In this project we have compared the properties of Clickbait titles vs Non-Clickbait titles. We have then classified the data using simple classifier models such as SVMs, Logistic Regression and XGBoost.
The project works on Hindi text. It was created as the course project for the Spring '21 course Computational Linguistics-1.
- Preprocess the dataset
- Analysis of data
- Plotting graphs for all the analyses
- Making word clouds for the different named entities present
- Comparing the list of entities present in Clickbait vs Non-Clickbait
The analysis is based on:
- Number of Tokens
- Presence of Question Marks
- Presence of Exclamation Marks
- Presence of Quotations
- Number of Stopwords
- Presence of Numerals
- The Entities Present
- POS Tags
In Clickbait.ipynb, all of the above analyses are performed for Clickbait vs Non-Clickbait sentences.
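As a rough illustration of the surface features listed above, the following minimal sketch computes them for a single title. It is not the notebook's code: the function name `surface_features`, the whitespace tokenization (the project uses NLTK), and the tiny `STOPWORDS` set (the project loads stopword.txt) are all assumptions made to keep the example self-contained. Entities and POS tags are left out because they need polyglot and stanza models.

```python
# Hypothetical stand-in for the contents of stopword.txt.
STOPWORDS = {"आप", "ये", "और", "में"}

# Straight and curly quotation marks treated as "quotations" (an assumption).
QUOTE_CHARS = set("\"'“”‘’")

# Punctuation stripped from token edges before stopword matching.
EDGE_PUNCT = "?!.,:;\"'“”‘’।"


def surface_features(title, stopwords=STOPWORDS):
    """Return simple surface features for one title."""
    # Whitespace split is a simplification of the NLTK tokenization step.
    tokens = title.split()
    words = [tok.strip(EDGE_PUNCT) for tok in tokens]
    return {
        "num_tokens": len(tokens),
        "has_question_mark": "?" in title,
        "has_exclamation_mark": "!" in title,
        "has_quotation": any(ch in QUOTE_CHARS for ch in title),
        "num_stopwords": sum(word in stopwords for word in words),
        # str.isdigit() is also true for Devanagari digits such as "५".
        "has_numeral": any(ch.isdigit() for ch in title),
    }


if __name__ == "__main__":
    print(surface_features("क्या आप जानते हैं ये 5 बातें?"))
```

Applying such a function to every row of the dataset would give one feature table per class, which can then be plotted or compared.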
Classifier.ipynb contains the scaled version of the generated data and the code to make predictions on the test dataset. We use more than 70% of the training data to train the models and the rest to verify their accuracy.
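The scale, split, train and verify workflow might look roughly like the sketch below, which uses scikit-learn. It is only an illustration under stated assumptions: the helper `train_and_evaluate`, the 75% training fraction, the random seed, and the synthetic feature matrix are hypothetical and are not taken from Classifier.ipynb. Logistic Regression stands in for any of the three classifiers.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def train_and_evaluate(X, y, train_fraction=0.75, seed=0):
    """Fit a scaled logistic regression and return (model, held-out accuracy)."""
    # Hold out part of the labelled data to verify accuracy.
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, train_size=train_fraction, random_state=seed, stratify=y
    )
    # Scaling is fitted on the training part only, inside the pipeline.
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)
    return model, accuracy_score(y_val, model.predict(X_val))


# Synthetic stand-in for the feature table (assumption: 6 numeric features).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

model, acc = train_and_evaluate(X, y)
print(f"held-out accuracy: {acc:.4f}")
```

Swapping `LogisticRegression` for `sklearn.svm.SVC` or an XGBoost classifier would leave the rest of the pipeline unchanged.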
The dataset contains 41800 Hindi sentences, each labelled either 0 (Non-Clickbait) or 1 (Clickbait).
We have used:
- pandas to analyse the data
- the re module to clean the data
- NLTK to tokenize
- A custom stopword file (stopword.txt) to identify stopwords
- the polyglot library for NER
- the wordcloud Python package and matplotlib.pyplot to make word clouds
- stanza for POS tagging
- matplotlib and seaborn to make the graphs
The highest accuracies achieved by the three classifiers were 0.6874, 0.7068 and 0.7001.
report.pdf contains the analysis of the graphs obtained in the process, along with some other observations.