An exploration of the users responsible for circulating fake news on Twitter, looking specifically at how likely it is that an article shared by a given user is fake.
Run the scripts for tweet and retweet collection:

```sh
python collect_tweets.py
python collect_retweets.py
```
Run the next script to create a list of user IDs from the collected tweets and retweets:

```sh
python create_dataset.py
```
Use tweego to collect information about the users from their IDs:

```sh
tweego -d "fakenewsnet_dataset" -k "keys.json" -n "all" -u
```
Filter the dataset to include only users that:
- Have shared more than 2 news articles
- Follow no more than 5,000 other users
- Have at least 5,000 followers

```sh
python filter_dataset.py
```
The reasoning behind these constraints:
- Users who have shared at least a few articles establish a sharing pattern
- Users following more than 5k accounts have most likely done so with a bot
- Users with at least 5k followers can spread news among a large community
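The filtering step above can be sketched with pandas. The column names here are assumptions for illustration; the actual `filter_dataset.py` may use different ones.

```python
import pandas as pd

def filter_users(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only users that satisfy the three constraints described above."""
    return df[
        (df["articles_shared"] > 2)        # shared more than 2 news articles
        & (df["friends_count"] <= 5000)    # follows at most 5k accounts
        & (df["followers_count"] >= 5000)  # has at least 5k followers
    ]

# Toy data: only user 1 satisfies all three constraints
users = pd.DataFrame({
    "user_id": [1, 2, 3],
    "articles_shared": [5, 1, 10],
    "friends_count": [100, 200, 9000],
    "followers_count": [6000, 7000, 8000],
})
print(filter_users(users)["user_id"].tolist())  # [1]
```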
Use the tweego tool to construct the user network and store it as a .gml file:

```sh
tweego -d "fakenewsnet_dataset" -k "keys.json" -n "all_10k" -so -g
```
Add features to the nodes, such as the number of news articles shared, whether the user is verified, and the fraction of shared news that is fake:

```sh
python edit_graph.py
```
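A minimal sketch of what this step might look like with networkx; the attribute names and toy profiles are assumptions, not the actual contents of `edit_graph.py`:

```python
import networkx as nx

G = nx.Graph()
G.add_edge("alice", "bob")

# Hypothetical per-user profile data collected earlier
profiles = {
    "alice": {"articles": 10, "fake": 4, "verified": True},
    "bob":   {"articles": 6,  "fake": 5, "verified": False},
}

for user, p in profiles.items():
    G.nodes[user]["num_articles"] = p["articles"]
    G.nodes[user]["verified"] = int(p["verified"])
    # fraction of this user's shared articles that were fake
    G.nodes[user]["fake_ratio"] = p["fake"] / p["articles"]

nx.write_gml(G, "annotated_graph.gml")  # persist for the classification notebooks
```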
Run the classification notebooks in any order:
- GNN
- node2vec

Purpose of each notebook:
- GNN: Uses different graph neural networks to classify fake and real users
- node2vec: Uses node2vec combined with different ML models to classify fake and real users
To build a classification model that finds patterns in ego networks and detects users who share predominantly fake news, two things are required: a dataset of edges between users, and a database of tweets and retweets manually classified as real or fake. No such dataset exists already, but one can be generated.
Tweego is a tool that generates second-order ego networks for Twitter users: for a given set of users, it collects all friends of friends and builds a graph.
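A second-order ego network is simply the subgraph of nodes within two hops of a user, which networkx can express directly with `ego_graph` (the graph here is a toy example):

```python
import networkx as nx

G = nx.Graph()
# 'u' has friends f1 and f2; ff1 is a friend-of-friend; x is three hops away
G.add_edges_from([("u", "f1"), ("u", "f2"), ("f1", "ff1"), ("ff1", "x")])

# radius=2 keeps the user, their friends, and friends-of-friends
ego2 = nx.ego_graph(G, "u", radius=2)
print(sorted(ego2.nodes()))  # ['f1', 'f2', 'ff1', 'u'] — 'x' is excluded
```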
FakeNewsNet is a fake news data repository containing two comprehensive datasets that include news content, social context, and dynamic information. The full paper can be found here. The news is obtained from two fact-checking websites, which provide ground-truth labels for fake and real news:
- PolitiFact: In PolitiFact, journalists and domain experts review the political news and provide fact-checking evaluation results to claim news articles as fake or real.
- GossipCop: GossipCop is a website that fact-checks entertainment stories aggregated from various media outlets. It provides rating scores on a scale of 0 to 10 to classify a story along the spectrum from fake to real.
The most important feature of FakeNewsNet is that it also downloads the tweets and retweets sharing these news articles from Twitter. This means we can retrieve the profiles of users who shared those tweets and combine them with our list of verified users to see how many fake and real news articles each verified user shared.
Sample size of users: 9687

| | Friends | Followers | Listed | Statuses | Articles shared | Fake ratio |
|---|---|---|---|---|---|---|
mean | 1648.99 | 257043.19 | 1299.34 | 137100.58 | 17.11 | 0.42 |
median | 1189.00 | 17216.00 | 256.00 | 72894.00 | 5.00 | 0.33 |
min | 0.00 | 5000.00 | 0.00 | 139.00 | 3.00 | 0.0 |
max | 5000.00 | 72123733.00 | 215288.00 | 8318206.00 | 55768.00 | 1.00 |
To create the labels, the ratio of fake news shared to the total number of articles shared is used. The FakeNewsNet dataset contains real and fake news from both PolitiFact and GossipCop. First, the number of fake and real news articles a user has shared is counted by matching the user's display name or ID against the user sharing each tweet. From these counts across both sources (PolitiFact and GossipCop), the ratio of fake to total news shared is computed.
If more than half the news articles a user has tweeted are fake, the user is assigned a label of 1; otherwise, a label of 0. A label of 1 therefore means the account shares mostly fake news, and a label of 0 means the news shared is mostly real.
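The labelling rule above can be written down in a few lines; the counts used here are purely illustrative:

```python
def make_label(fake_count: int, real_count: int) -> tuple[float, int]:
    """Return (fake ratio, label): label 1 if more than half the shares are fake."""
    total = fake_count + real_count
    ratio = fake_count / total
    return ratio, 1 if ratio > 0.5 else 0

print(make_label(6, 4))  # (0.6, 1) -> mostly fake
print(make_label(2, 8))  # (0.2, 0) -> mostly real
```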
Using nucoll, it is possible to generate a GML file of a user's first- and second-degree relationships on Twitter. To build the graph, nucoll retrieves the handle's friends (or followers) and all friends-of-friends (second-degree relationships), then looks for friendship relations among those friends/followers.
In this case, the handle supplied to nucoll is @verified, which produces a file with all first- and second-degree relationships of users who are friends of @verified.
Because of Twitter's very restrictive API rate limits, generating the edge list of all 330k+ verified users is not feasible, so the users are filtered. The following restrictions were applied:
- The user must have shared at least one real and one fake article
- The user must be following fewer than 10k people, since it is highly unlikely that a user with more than 10,000 friends followed them all manually; such accounts have probably used bots
When these constraints are applied, around 3,000 users are left. The edge list for these users is stored in a .gml file, which can be imported to create a networkx graph.
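Loading a .gml edge list into networkx is a one-liner; the round trip below uses a toy graph since the real filename is project-specific:

```python
import networkx as nx

# Write a small example graph to GML, then read it back the way the
# classification notebooks would import the real edge list.
H = nx.path_graph(4)
nx.write_gml(H, "example.gml")

G = nx.read_gml("example.gml")
print(G.number_of_nodes(), G.number_of_edges())  # 4 3
```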
Two different approaches are taken to build a classification model.
**Node2vec**

Node2vec learns continuous representations for nodes in a graph. The implementation of node2vec used can be found here.

After combining the node2vec embeddings with the node features, the following classifiers are trained:
- Random forest
- SVM
- Logistic regression
- XGBoost
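The second stage, concatenating embeddings with node features and fitting a standard classifier, can be sketched as follows. The embeddings and labels here are random stand-ins for the real node2vec output:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_users = 200
embeddings = rng.normal(size=(n_users, 64))  # placeholder for node2vec output
node_feats = rng.normal(size=(n_users, 3))   # e.g. articles, verified, fake_ratio
X = np.hstack([embeddings, node_feats])      # embeddings + node features
y = rng.integers(0, 2, size=n_users)         # 0 = mostly real, 1 = mostly fake

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(X.shape)  # (200, 67)
```

The same `X`, `y` pair would be fed to the other classifiers (random forest, SVM, XGBoost) in exactly the same way.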
**Graph Neural Networks**

GNNs operate directly on the graph structure.
- GraphSAGE - Learns the embedding for each node in an inductive way. Each node is represented by the aggregation of its neighborhood. Thus, even if a new node unseen during training time appears in the graph, it can still be properly represented by its neighboring nodes.
- Graph Convolutional Networks - A neural network architecture designed to operate directly on graphs
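The neighbourhood-aggregation idea behind GraphSAGE can be illustrated with a minimal one-layer NumPy sketch using a mean aggregator; the weights are random placeholders, not a trained model:

```python
import numpy as np

def sage_layer(features, adj, W_self, W_neigh):
    """One GraphSAGE layer: h_v = ReLU(W_self·x_v + W_neigh·mean(x_u, u in N(v)))."""
    deg = adj.sum(axis=1, keepdims=True)
    neigh_mean = (adj @ features) / np.maximum(deg, 1)  # mean over neighbours
    h = features @ W_self.T + neigh_mean @ W_neigh.T
    return np.maximum(h, 0)  # ReLU

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))           # 4 nodes, 3 input features each
A = np.array([[0, 1, 1, 0],           # adjacency matrix (no self-loops)
              [1, 0, 0, 1],
              [1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)
H = sage_layer(X, A, rng.normal(size=(8, 3)), rng.normal(size=(8, 3)))
print(H.shape)  # (4, 8) — an 8-dimensional embedding per node
```

Because each node's embedding depends only on its neighbourhood, the same layer applies to nodes unseen during training, which is what makes the approach inductive.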
For a baseline, classifiers are trained on only the sentiment and Empath features, without any network information.

| | Accuracy | Precision | Recall | f1 Score |
|---|---|---|---|---|
Naive Bayes | 0.771 | 0.660 | 0.660 | 0.660 |
KNN | 0.747 | 0.680 | 0.630 | 0.600 |
Logistic Reg | 0.775 | 0.690 | 0.690 | 0.680 |
SVM | 0.716 | 0.720 | 0.720 | 0.720 |
XGBoost | 0.761 | 0.710 | 0.710 | 0.710 |
Random Forest | 0.767 | 0.660 | 0.660 | 0.660 |

Results for the GNN models:

| | Accuracy | Precision | Recall | f1 Score |
|---|---|---|---|---|
GraphSage | 0.730 | 0.710 | 0.850 | 0.773 |
GCN | 0.671 | 0.654 | 0.827 | 0.709 |
GAT | 0.541 | 0.543 | 0.987 | 0.650 |

Results for node2vec combined with the ML classifiers:

| | Accuracy | Precision | Recall | f1 Score |
|---|---|---|---|---|
Naive Bayes | 0.732 | 0.630 | 0.620 | 0.610 |
KNN | 0.727 | 0.670 | 0.670 | 0.670 |
Logistic Reg | 0.787 | 0.680 | 0.680 | 0.670 |
SVM | 0.728 | 0.730 | 0.730 | 0.720 |
XGBoost | 0.793 | 0.730 | 0.730 | 0.720 |
Random Forest | 0.785 | 0.670 | 0.660 | 0.660 |
The classifiers are trained on the embeddings learnt by the GraphSAGE, GCN, and GAT models.
**GraphSAGE**

| | Accuracy | Precision | Recall | f1 Score |
|---|---|---|---|---|
| Naive Bayes | 0.724 | 0.720 | 0.720 | 0.720 |
| KNN | 0.691 | 0.690 | 0.690 | 0.690 |
| Logistic Reg | 0.700 | 0.700 | 0.700 | 0.700 |
| SVM | 0.719 | 0.730 | 0.720 | 0.710 |
| XGBoost | 0.713 | 0.720 | 0.710 | 0.710 |
| Random Forest | 0.679 | 0.680 | 0.680 | 0.680 |

**GCN**

| | Accuracy | Precision | Recall | f1 Score |
|---|---|---|---|---|
| Naive Bayes | 0.702 | 0.700 | 0.700 | 0.700 |
| KNN | 0.621 | 0.670 | 0.620 | 0.520 |
| Logistic Reg | 0.716 | 0.730 | 0.720 | 0.710 |
| SVM | 0.725 | 0.740 | 0.730 | 0.720 |
| XGBoost | 0.729 | 0.730 | 0.730 | 0.730 |
| Random Forest | 0.695 | 0.690 | 0.690 | 0.690 |

**GAT**

| | Accuracy | Precision | Recall | f1 Score |
|---|---|---|---|---|
| Naive Bayes | 0.662 | 0.670 | 0.660 | 0.660 |
| KNN | 0.646 | 0.670 | 0.650 | 0.530 |
| Logistic Reg | 0.717 | 0.730 | 0.720 | 0.710 |
| SVM | 0.719 | 0.740 | 0.720 | 0.710 |
| XGBoost | 0.724 | 0.730 | 0.720 | 0.720 |
| Random Forest | 0.595 | 0.600 | 0.600 | 0.090 |
**Accuracy**

| | Naive Bayes | KNN | Logistic Reg | SVM | XGBoost | Random Forest |
|---|---|---|---|---|---|---|
Baseline | 0.659 | 0.628 | 0.686 | 0.716 | 0.710 | 0.662 |
GraphSAGE | 0.724 | 0.691 | 0.700 | 0.719 | 0.713 | 0.679 |
GCN | 0.702 | 0.621 | 0.716 | 0.725 | 0.729 | 0.695 |
Node2vec | 0.610 | 0.671 | 0.678 | 0.728 | 0.728 | 0.659 |
GAT | 0.662 | 0.646 | 0.717 | 0.719 | 0.724 | 0.595 |
From the table above, it is evident that incorporating the structure of the network increases classification accuracy when identifying users who act as fake news sources.
If you use this repository in your research, please cite:
```bibtex
@misc{Saha_TwitterFakeNet_2020,
    author = {Saha, Aveek},
    month = {3},
    title = {{TwitterFakeNet}},
    url = {https://github.com/Aveek-Saha/TwitterFakeNet},
    year = {2020}
}
```