MCAT: Moderation Capture and Analysis Toolkit

MCAT is a collection of Python functions that collect and analyze platform moderation policies and practices.

Current scripts use Selenium to collect moderation statuses from Facebook, Instagram, and YouTube posts listed in a local CSV. Moderation statuses include notices such as "This post is not available" or "This post has been removed for [x] policy infringement." Because these scripts are scrapers, they tend to break whenever a platform changes its page layout or wording, so please expect errors.

Future functions should include:

Collecting and analyzing platform policies from the Platform Governance Archive (v. 1 and 2) and other datasets from peer-reviewed research;
Verifying if a search term is labeled or banned by TikTok;
Getting tweet and other post search rankings to study demotion;

and more.

Installation

Clone the repository and install the required packages:

```
git clone https://github.com/yourusername/mcat.git
cd mcat
pip install -r requirements.txt
```

Requirements

The requirements.txt file includes all necessary packages:

pandas
selenium
beautifulsoup4

Additional Requirements

Chrome WebDriver: Download the ChromeDriver release that matches the version of Chrome installed on your system from the ChromeDriver downloads page. Ensure that the chromedriver executable is on your system’s PATH, or specify its location in the script.
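The two options above (PATH lookup or an explicit location) can be sketched roughly as follows. This is a minimal illustration, not the code used in the MCAT scripts: the helper names `resolve_chromedriver` and `make_driver` are hypothetical, and the sketch assumes Selenium 4, where the driver location is passed through a `Service` object.

```python
import shutil
from pathlib import Path
from typing import Optional


def resolve_chromedriver(explicit_path: Optional[str] = None) -> Optional[str]:
    """Return a chromedriver path, or None if none can be found.

    Hypothetical helper: an explicit path wins, otherwise the PATH is searched.
    """
    if explicit_path:
        candidate = Path(explicit_path).expanduser()
        return str(candidate) if candidate.is_file() else None
    # Fall back to whatever "chromedriver" resolves to on the system PATH.
    return shutil.which("chromedriver")


def make_driver(explicit_path: Optional[str] = None):
    """Start Chrome through Selenium 4 using the resolved driver path."""
    # Imported here so the path helper above also works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    driver_path = resolve_chromedriver(explicit_path)
    if driver_path is None:
        raise FileNotFoundError("chromedriver not found; check PATH or the explicit path")
    return webdriver.Chrome(service=Service(executable_path=driver_path))
```

Calling `make_driver()` with no argument relies on the PATH; passing a path such as the `chromedriver_mac_arm64` folder's executable would pin a specific driver instead.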

Homebrew (on macOS, if not already installed). Open your command line (e.g. Terminal) and type:

```
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
```

Chrome (if not already installed). Download it from the Google Chrome website, or open your command line (e.g. Terminal) and type:
```
brew install --cask google-chrome
```

Post Status Scrapers

The post status scrapers collect moderation statuses from Facebook, Instagram, and YouTube posts listed in a local CSV.
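As a rough sketch of the idea, a scraper can load a post's page and look for a known moderation notice in the visible text. The code below is illustrative only and is not taken from the MCAT scripts: `classify_status`, `fetch_page_text`, the status labels, and the phrase list are all assumptions, and the phrases are shortened from the two example notices quoted earlier. Real platform wording differs per platform and changes over time.

```python
# Assumed phrases, based on the example notices in this README.
STATUS_PHRASES = {
    "unavailable": ["is not available"],
    "removed": ["has been removed"],
}


def classify_status(page_text: str) -> str:
    """Map visible page text to a coarse moderation status label."""
    # Lowercase and collapse whitespace so line breaks do not hide a phrase.
    text = " ".join(page_text.lower().split())
    for status, phrases in STATUS_PHRASES.items():
        if any(phrase in text for phrase in phrases):
            return status
    return "no_notice_found"


def fetch_page_text(url: str) -> str:
    """Load a URL in Chrome via Selenium and return the visible body text."""
    # Imported here so classify_status can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        return driver.find_element(By.TAG_NAME, "body").text
    finally:
        driver.quit()
```

A status for one post would then be `classify_status(fetch_page_text(url))`. Matching on visible text rather than on page structure is a design choice that survives some layout changes, but it still breaks when a platform rewords its notices.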

Usage

Specify the path to your CSV and whether you'd like the script to process all CSVs from a folder or just one file. Please modify the script where indicated by comments.
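The choice between one CSV and a folder of CSVs could look roughly like the sketch below. It is a hedged illustration rather than the scrapers' actual code: the function names and the `url` column name are assumptions, so adjust them to match your own files.

```python
import csv
from pathlib import Path
from typing import Iterator, List


def list_csv_files(path: str) -> List[Path]:
    """Return every CSV in a folder, or the single file if a file path is given."""
    target = Path(path)
    if target.is_dir():
        return sorted(target.glob("*.csv"))
    return [target]


def read_post_urls(csv_path: Path, url_column: str = "url") -> Iterator[str]:
    """Yield non-empty post URLs from the assumed URL column of one CSV."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            url = (row.get(url_column) or "").strip()
            if url:
                yield url
```

Looping over `list_csv_files(path)` and then over `read_post_urls(csv_file)` gives the full list of posts to check, whichever input mode was chosen.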

Run the script with Python in your Terminal or other command line, replacing the placeholder below with the scraper for your platform from the "Post status scrapers" folder:

```
python path/to/scraper_script.py
```

or run it from your local script editor.

TextDiff Calculator

This script calculates the absolute and relative text differences between rows in a policy CSV file or between policy markdown files. It can handle text differences over specified time intervals (day, week, month, or year).
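To compare policy versions per time interval, each dated version first has to be assigned to a day, week, month, or year. The sketch below shows one way to do that with the standard library. It is an assumption about how such bucketing might work, not the logic in textdiff_calculator.py, and `interval_key` is a hypothetical helper. Weeks here follow the ISO calendar, so the first days of January can belong to the last week of the previous year.

```python
from datetime import date


def interval_key(day: date, time_interval: str) -> str:
    """Return a label identifying the interval that a date falls into."""
    if time_interval == "day":
        return day.isoformat()
    if time_interval == "week":
        # ISO weeks start on Monday; the ISO year may differ from day.year.
        iso_year, iso_week, _ = day.isocalendar()
        return f"{iso_year}-W{iso_week:02d}"
    if time_interval == "month":
        return f"{day.year}-{day.month:02d}"
    if time_interval == "year":
        return str(day.year)
    raise ValueError(f"unsupported time interval: {time_interval!r}")
```

Versions that share a key fall into the same interval, so differences can then be computed between the last version of one interval and the last version of the next.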

Features

Calculates text differences for a specified column in one or several CSV files stored in the current repository directory.
Calculates text differences for markdown files in the current repository directory.
Outputs a CSV with original columns plus additional columns for text differences, time intervals, and changes (text removals and additions per time interval).
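The kind of output described above can be illustrated with a small word-level diff built on Python's `difflib`. This is a sketch under stated assumptions, not the script's implementation: the function name `text_difference` is hypothetical, "absolute" is taken to mean the number of words added plus removed, and "relative" is that count divided by the word count of the earlier text. The actual script may define both differently.

```python
import difflib
from typing import Any, Dict, List


def text_difference(old: str, new: str) -> Dict[str, Any]:
    """Compare two policy texts word by word and summarise the changes."""
    old_words, new_words = old.split(), new.split()
    removed: List[str] = []
    added: List[str] = []
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            removed.extend(old_words[i1:i2])
        if tag in ("replace", "insert"):
            added.extend(new_words[j1:j2])
    absolute = len(removed) + len(added)
    # Assumed definition: changed words relative to the earlier text's length.
    relative = absolute / max(len(old_words), 1)
    return {"removed": removed, "added": added,
            "absolute": absolute, "relative": relative}
```

For example, comparing "Hate speech is not allowed" with "Hate speech and harassment are not allowed" reports one removed word and three added words. Diffing on words rather than characters keeps the removals and additions readable when they are written to a CSV column.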

Usage

Edit the script to specify your input type, file path, and parameters.

Open textdiff_calculator.py and modify the following lines as per your requirements:

```python
if __name__ == "__main__":
    # Directly specify your input CSV or markdown directory and parameters here
    input_type = "markdown"  # choose "markdown" or "csv" depending on what kind of policy files you want to process
    input_path = "path/to/your/file/or/folder/with/files"  # path to your CSV file or Markdown format directory

    if input_type == "csv":
        date_column = "Date"  # Specify what the column with dates is called
        text_column = "Text"  # Specify what the column with text is called
        time_interval = "week"  # Choose day, week, month, or year for text differences
        process_csv(input_path, date_column, text_column, time_interval)
    elif input_type == "markdown":
        time_interval = "week"  # Ensure the time interval is consistent
        process_markdown_files(input_path, time_interval)
```

Run the script with Python in your Terminal or other command line:

```
python textdiff_calculator.py
```

or run it from your local script editor.

Methods

Methods and scripts have been developed by Ivan Kisjes, Emillie de Keulenaar and Frank Anemaet; members of the Open Intelligence Lab & Digital Methods Initiative at the universities of Amsterdam and Groningen; collaborators from various universities (University of Groningen's Centre for Media and Journalism Studies, University of Bremen, PUC Rio, Sciences Po Média Lab); and a little help from GPT-4.

Current scripts follow the workflow described in the chapter cited below. Please cite this chapter when using this repo.

de Keulenaar, E. and Rogers, R. (2024) ‘After deplatforming: the return of trace research for the study of platform effects’, in T. Venturini et al. (eds) The SAGE Handbook of Data and Society: An Interdisciplinary Reader in Critical Data Studies. London: SAGE (to be published in June of 2024).

The majority of moderation policies available in the "Content moderation policies" folder have been made available by the Open Terms Archive and the Platform Governance Archive(s), the latter of which is maintained by the University of Bremen. Please cite them when using policies tagged as "PGA" as follows:

Katzenbach, C., et al. (2023). The Platform Governance Archive. Centre for Media, Communication and Information Research (ZeMKI), University of Bremen. DOI: 10.17605/OSF.IO/XSBPT. URL: https://platformgovernancearchive.org.

More Platform Policy Datasets

For now, MCAT focuses specifically on Community Guidelines. The Open Terms Archive and the Platform Governance Archive have a lot more to offer. Check out:

The Open Terms Archive datasets, including for Generative AI, P2B Compliance, online dating, French elections and digital services, and other policies collected by volunteers.
The Platform Governance Archive, namely PGA v. 1 (YouTube TOS, Facebook, Twitter, and Instagram circa 2005-2021) and PGA v. 2 (everything from Bluesky to WeChat, 2021-ongoing).

Consider citing the PGA 1 archive as follows:

Dataset PGA v1: Katzenbach, C., Kopps, A., Magalhaes, J. C., Redeker, D., Sühr, T. (2023). Platform Governance Archive (PGA) v1. [data set]. DOI: 10.17605/OSF.IO/XSBPT. URL: https://www.platformgovernancearchive.org/data/dataset-pga-v1-historical-dataset/.

and PGA 2:

Dataset PGA v2: Katzenbach, C., Dergacheva, D., Fischer, A., Kopps, A., Kolesnikov, S., Redeker, D., Viejo Otero, P. (2023). Platform Governance Archive (PGA) v2. [data set]. DOI: 10.17605/OSF.IO/XSBPT. URL: https://www.platformgovernancearchive.org/data/dataset-pga-v2-ongoing-collection/

edekeulenaar/MCAT-moderation-analysis-toolkit
