This repository has been archived by the owner on May 5, 2023. It is now read-only.

Clickbait Classifier

This is a very simple attempt at classifying article titles into one of two groups: "clickbait" (à la Buzzfeed and Clickhole) or "news" (à la The New York Times). I was curious whether this could be done accurately: I can't think of a good definition for "clickbait", but I know it when I see it.

Setup

poetry

If you have poetry installed, setup is minimal. Run poetry install once if the dependencies are not installed yet, then run the demos with poetry run:

# train the classifier and show the top features
poetry run python -m clickbait_classifier.classifier
# enter an interactive classifier loop
poetry run python -m clickbait_classifier.interactive

pip

If you don't use poetry, you can create a virtualenv, install the dependencies with pip, and then run the code directly:

python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
python -m clickbait_classifier.classifier
python -m clickbait_classifier.interactive

nix

If you have nix, you can use nix-shell or nix develop or direnv or lorri to get all the necessary dependencies, including Poetry.

If you use flakes, you can run the demos without installing anything:

# train the classifier and show the top features
nix run github:peterldowns/clickbait-classifier#classifier
# enter an interactive classifier loop
nix run github:peterldowns/clickbait-classifier#interactive

Usage

The code is pretty messy, but the general idea is that there is some article data in the clickbait_classifier/data/ directory, and classifier.py uses it for training. You can download more data from Buzzfeed and Clickhole using the tools in scripts/:

python ./scripts/scrape_buzzfeed.py > ./clickbait_classifier/data/buzzfeed2.json
python ./scripts/scrape_clickhole.py > ./clickbait_classifier/data/clickhole2.json
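
To make the training step a bit more concrete, here is a minimal, self-contained sketch of a title classifier: a multinomial Naive Bayes model over word n-grams. It is an illustration under my own assumptions, not a copy of classifier.py. The TinyNaiveBayes class, the ngrams helper, and the toy titles are all made up for this example, and the real code trains on the files in the data directory.

```python
import math
import re
from collections import Counter


def ngrams(title, max_n=3):
    """Lowercase a title and return its word n-grams for n = 1..max_n."""
    words = re.findall(r"[a-z0-9']+", title.lower())
    return [" ".join(words[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(words) - n + 1)]


class TinyNaiveBayes:
    """Hypothetical multinomial Naive Bayes with add-one smoothing."""

    def fit(self, titles, labels):
        self.classes = sorted(set(labels))
        self.counts = {c: Counter() for c in self.classes}
        self.docs = Counter(labels)  # number of titles per class
        for title, label in zip(titles, labels):
            self.counts[label].update(ngrams(title))
        self.vocab = set().union(*self.counts.values())
        return self

    def predict_proba(self, title):
        """Return {class: probability} for a single title."""
        total_docs = sum(self.docs.values())
        log_scores = {}
        for c in self.classes:
            denom = sum(self.counts[c].values()) + len(self.vocab)
            score = math.log(self.docs[c] / total_docs)  # class prior
            for gram in ngrams(title):
                if gram in self.vocab:  # skip n-grams never seen in training
                    score += math.log((self.counts[c][gram] + 1) / denom)
            log_scores[c] = score
        # Normalise in log space so long titles do not underflow.
        top = max(log_scores.values())
        exp = {c: math.exp(s - top) for c, s in log_scores.items()}
        norm = sum(exp.values())
        return {c: v / norm for c, v in exp.items()}


# Toy, made-up training data (an assumption for this sketch only).
TITLES = [
    "21 Puppies That Will Make You Smile",
    "10 Things Every 90s Kid Remembers",
    "15 Times Cats Were Absolutely Hilarious",
    "President Outlines New Plan For Iraq War",
    "Special Report: China Trade Campaign",
    "House Challenged On New York Budget Plan",
]
LABELS = ["clickbait"] * 3 + ["news"] * 3

model = TinyNaiveBayes().fit(TITLES, LABELS)
probs = model.predict_proba("19 Puppies That Will Make You Laugh")
label = max(probs, key=probs.get)
print(", ".join(f"{p:.2%} {c}" for c, p in probs.items()), "->", label)
```

With only six training titles the probabilities are far more extreme than anything a real model would report, so treat the numbers as a demonstration of the mechanics only.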

If you feel like testing a few article titles, you can get a simple testing loop like so:

python ./clickbait_classifier/interactive.py

This will load the classifier, train it, and then present you with a simple loop where you can paste in article titles and see the results. You can quit with Ctrl-C. For example:

clickbait-classifier/ $ ./interactive.py
Loading classifier (may take time to train.)
Classification report:
             precision    recall  f1-score   support

  clickbait       0.91      0.62      0.74       172
       news       0.90      0.98      0.94       621

avg / total       0.91      0.91      0.90       793


  -9.0500 10 things         -5.3044 new
  -9.0500 11 things         -5.7492 bush
  -9.0500 13 times          -5.8460 overview
  -9.0500 15 times          -5.9519 iraq
  -9.0500 19 puppies        -5.9645 war
  -9.0500 2014              -5.9828 president
  -9.0500 2015              -5.9852 clinton
  -9.0500 21                -6.1021 special
  -9.0500 23 life           -6.1206 nation
  -9.0500 24                -6.1464 report
  -9.0500 25                -6.1778 campaign
  -9.0500 27                -6.2223 china
  -9.0500 33                -6.2880 york
  -9.0500 35                -6.2880 new york
  -9.0500 90s               -6.2994 plan
  -9.0500 90s kid           -6.3191 special report
  -9.0500 90s kids          -6.3523 says
  -9.0500 90s kids rejoice    -6.4277 big
  -9.0500 90s sitcom        -6.4423 challenged
  -9.0500 absolute          -6.4465 house
Done.

Article title: 43 Reasons 2014 Was The Best Year Ever To Be A Nerd
(95.13% clickbait, 4.87% news) -> clickbait

Article title: Protesters And Police Clash In Missouri For A Second Night
(19.32% clickbait, 80.68% news) -> news

Article title: 29 Christmas Vines That Will Make You Laugh Every Time
(88.25% clickbait, 11.75% news) -> clickbait

Article title: New Subprime Boom Ties Risky Loans to Car Titles
(10.98% clickbait, 89.02% news) -> news

Article title: ^C
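
A quick note on reading the classification report in the session above: the f1-score column is the harmonic mean of precision and recall. The sketch below recomputes it from the per-class numbers in the report. The f1_score helper is written here for illustration and is not part of the project.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Per-class (precision, recall) copied from the report above.
report = {"clickbait": (0.91, 0.62), "news": (0.90, 0.98)}
for name, (p, r) in report.items():
    print(f"{name}: f1 = {f1_score(p, r):.2f}")
```

The recomputed values round to 0.74 and 0.94, matching the report. The low clickbait recall (0.62) is the interesting number: the classifier rarely mislabels news as clickbait, but it misses a good share of the 172 clickbait titles in the test set.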