MAGPIE: Multi-Task Media-Bias Analysis of Generalization of Pre-Trained Identification of Expressions

This repository contains all resources from the paper "MAGPIE: Multi-Task Media-Bias Analysis of Generalization of Pre-Trained Identification of Expressions". MAGPIE is the first large-scale multi-task learning (MTL) approach for detecting media bias. To train MAGPIE, we introduce the Large Bias Mixture (LBM), a comprehensive pre-training composition of 59 bias-related tasks encompassing linguistic bias, gender bias, group bias, and others.

1. Getting started

2. MTL Framework

   1. Datasets
   2. Training

3. Reproduce the results

4. Run your experiments

5. Citation

1. Getting started

Install Python dependencies

To use the framework or run inference, first install the Python dependencies:

pip install -r requirements.txt

wandb.ai API Access

Our training framework uses Weights & Biases to track all experiments. Please add your API key to the local.env file. You can get an API key for free at wandb.ai.
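As a rough illustration of how a key stored in local.env could reach Weights & Biases, the sketch below parses a simple KEY=VALUE file and exports the key into the process environment. It assumes the file uses KEY=VALUE lines and that the entry is named WANDB_API_KEY (the environment variable the wandb client reads); the repository's own init_wandb.py may handle this differently, and load_env_file and export_wandb_key are hypothetical helper names.

```python
import os
from pathlib import Path


def load_env_file(path="local.env"):
    """Parse KEY=VALUE lines from an env-style file into a dict.

    Blank lines, comment lines starting with '#', and lines without '='
    are skipped. Surrounding quotes around the value are removed.
    """
    values = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def export_wandb_key(path="local.env"):
    """Copy WANDB_API_KEY (assumed entry name) from the file into os.environ."""
    key = load_env_file(path).get("WANDB_API_KEY")
    if not key:
        raise RuntimeError(f"No WANDB_API_KEY entry found in {path}")
    os.environ["WANDB_API_KEY"] = key
```

Calling export_wandb_key() before wandb.init() would let the client pick up the key without an interactive login, provided the assumptions above match your local.env.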

2. MTL Framework

This repository contains code for training models in a multi-task learning fashion through the MTL framework.

Datasets

We make our Large Bias Mixture (LBM) collection available in the datasets directory. All datasets are in a processed and cleaned state. Each dataset has a corresponding preprocessor class under the preprocessing directory and a corresponding preprocessing script under the scripts/preprocessors directory. Note, however, that preprocessing operates on the raw data, which can be found in our Hugging Face repository PLACEHOLDER FOR ANONYMOUS SUBMISSION.

Training

├─ training
     |
     ├─── data
     |
     ├─── model
     |    └─── optimization
     |    
     ├─── trainer
     |
     └─── tokenizer
          └─── mb-mtl-tokenizer

The training subdirectory consists of three main components:

- The data directory contains the data structures.
- The model directory contains the definition of the model architecture and the classes for gradient manipulation.
- The trainer directory contains the main trainer.py class, which orchestrates the whole multi-task training.

For further details, please refer to the training directory.
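To make "gradient manipulation" a little more concrete: the example configuration later in this README selects AggregationMethod.MEAN, which presumably averages the gradients that different tasks produce for the shared parameters. The sketch below shows that idea on plain Python lists. It is an illustration under that assumption, not the code in the model directory, and mean_aggregate is a hypothetical name.

```python
from statistics import fmean


def mean_aggregate(task_gradients):
    """Average per-task gradients element-wise.

    task_gradients holds one flat list of floats per task, all of the
    same length (one entry per shared parameter).
    """
    if not task_gradients:
        raise ValueError("need at least one task gradient")
    length = len(task_gradients[0])
    if any(len(grad) != length for grad in task_gradients):
        raise ValueError("all task gradients must have the same length")
    # zip(*...) groups the i-th entry of every task gradient together.
    return [fmean(column) for column in zip(*task_gradients)]
```

With two tasks whose gradients for the shared parameters are [1.0, 2.0] and [3.0, 6.0], the aggregated update direction is [2.0, 4.0].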

3. Reproduce the results

All experiments can be reproduced by running the scripts in the scripts directory. Each experiment subdirectory in scripts/ has a run_experiment.py file that defines the whole experiment.

- scripts/ablation_study contains an evaluation of the HSES and Resurrection optimization strategies
- scripts/gradts_task_selection contains a four-step pipeline for selecting the auxiliary tasks based on the GradTS algorithm
- scripts/hyperparameter_tuning contains a hyperparameter search for robust selection of hyperparameters
- scripts/lbm_taxonomy_analysis contains a script for co-training tasks based on task families
- scripts/evaluation_robust contains the final MAGPIE evaluation over 30 random seeds
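The directory layout above can be turned into a launch command as follows. This is a small sketch that assumes the scripts are started from the repository root with the current Python interpreter; experiment_command is a hypothetical helper, and individual experiments may expect additional setup (for example the wandb key or the task-selection steps) that is not shown here.

```python
import sys
from pathlib import Path

# Experiment directories named in this README.
EXPERIMENTS = [
    "ablation_study",
    "gradts_task_selection",
    "hyperparameter_tuning",
    "lbm_taxonomy_analysis",
    "evaluation_robust",
]


def experiment_command(name, repo_root="."):
    """Build the argv list that runs scripts/<name>/run_experiment.py."""
    if name not in EXPERIMENTS:
        raise ValueError(f"unknown experiment: {name!r}")
    script = Path(repo_root) / "scripts" / name / "run_experiment.py"
    return [sys.executable, str(script)]


# Example (not executed here):
#   import subprocess
#   subprocess.run(experiment_command("evaluation_robust"), check=True)
```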

4. Run your experiments

You can run your own experiments at several levels of customization. The training can be adjusted in the following ways:

- Add your own datasets/tasks:
  - put your dataset into the /datasets/YOUR_DATASET/ folder
  - define your dataset as a learning task in the training initialization file
- Define your own task-specific head in the model heads class. Classification, regression, and language modelling heads are already implemented.
- Choose an encoder-only model and define it in enums/model_checkpoints.
- Adjust the fixed training parameters (e.g., MAX_NUMBER_OF_STEPS, random seed) in config.py.

Write your own execution script with the desired training parameters. An example:

import wandb

from config import head_specific_lr, head_specific_max_epoch, head_specific_patience
from enums.aggregation_method import AggregationMethod
from enums.model_checkpoints import ModelCheckpoint
from enums.scaling import LossScaling
# Split is used below but was not imported in the original example;
# the module path here is an assumption, adjust it to where Split is defined.
from enums.splits import Split
from training.data import YOUR_TASK_A, YOUR_TASK_B, YOUR_TASK_C
from training.model.helper_classes import EarlyStoppingMode, Logger
from training.trainer.trainer import Trainer
from utils import set_random_seed

EXPERIMENT_NAME = "EXPERIMENT NAME"

tasks = [YOUR_TASK_A, YOUR_TASK_B, YOUR_TASK_C]

# Prepare the data of every subtask before training.
for t in tasks:
    for st in t.subtasks_list:
        st.process()

config = {
    "sub_batch_size": 32,
    "eval_batch_size": 128,
    "initial_lr": 4e-5,
    "dropout_prob": 0.1,
    "hidden_dimension": 768,
    "input_dimension": 768,
    "aggregation_method": AggregationMethod.MEAN,
    "early_stopping_mode": EarlyStoppingMode.HEADS,
    "loss_scaling": LossScaling.STATIC,
    "num_warmup_steps": 10,
    "pretrained_path": None,
    "resurrection": True,
    "model_name": "YOUR_MODEL_NAME",
    "head_specific_lr_dict": head_specific_lr,
    "head_specific_patience_dict": head_specific_patience,
    "head_specific_max_epoch_dict": head_specific_max_epoch,
    "logger": Logger(EXPERIMENT_NAME),
}

set_random_seed()  # default is 321
wandb.init(project=EXPERIMENT_NAME, name="YOUR_MODEL_NAME")

# Train on all tasks, evaluate on the test split, and store the model.
trainer = Trainer(task_list=tasks, LM=ModelCheckpoint.ROBERTA, **config)
trainer.fit()
trainer.eval(split=Split.TEST)
trainer.save_model()
wandb.finish()
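The configuration above passes a per-head patience dictionary together with EarlyStoppingMode.HEADS, which suggests that each task head is stopped independently once its validation loss stops improving. The sketch below illustrates that general idea with a small tracker. It is an assumption about the behaviour and not the repository's implementation; HeadPatienceTracker is a hypothetical class, and the Resurrection strategy is not modelled.

```python
class HeadPatienceTracker:
    """Track validation losses per head and flag heads that ran out of patience."""

    def __init__(self, patience_by_head):
        # patience_by_head maps a head name to the number of consecutive
        # non-improving evaluations tolerated before the head is stopped.
        self.patience = dict(patience_by_head)
        self.best_loss = {head: float("inf") for head in self.patience}
        self.bad_evals = {head: 0 for head in self.patience}

    def update(self, head, val_loss):
        """Record a validation loss; return True if the head should stop."""
        if val_loss < self.best_loss[head]:
            self.best_loss[head] = val_loss
            self.bad_evals[head] = 0
        else:
            self.bad_evals[head] += 1
        return self.bad_evals[head] >= self.patience[head]

    def active_heads(self):
        """Return the heads that still have patience left."""
        return [h for h in self.patience if self.bad_evals[h] < self.patience[h]]
```

A training loop could call update() after each evaluation of a head and skip batches for heads that are no longer in active_heads().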

5. Citation

Please cite us as:

@inproceedings{Horych2024a,
title = {MAGPIE: Multi-Task Analysis of Media-Bias Generalization with Pre-Trained Identification of Expressions},
author = {Tomas Horych and Martin Wessel and Jan Philip Wahle and Terry Ruas and Jerome Wassmuth and Andre Greiner-Petter and Akiko Aizawa and Bela Gipp and Timo Spinde},
url = {https://media-bias-research.org/wp-content/uploads/2024/04/Horych2024a.pdf},
year = {2024},
date = {2024-02-01},
urldate = {2024-02-01},
booktitle = {Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation},
keywords = {nlp,bias},
pubstate = {published},
tppubtype = {inproceedings}
}
