Status: Work in Progress
Project Type: Master's Thesis Project
news-crawler-slm is a framework designed to assist Small Language Models (SLMs) in extracting relevant information from the HTML content of news articles published by any source. This project leverages the Fundus framework, which eases the creation of structured datasets for news articles. Such datasets are essential for training and evaluating language models in this domain.
The primary goal of this project is to improve the adaptability of SLMs to various sources of web news content, enabling them to handle diverse styles and structures found across different publishers. This will be achieved by:
- Finetuning SLMs: Training models on datasets derived from Fundus articles to improve their extraction accuracy on unfamiliar publishers.
- Evaluation: Comparing the performance of these finetuned models against similar language models using a benchmark dataset created with Fundus's manual extraction rules.
- Data Extraction: Scripts to extract and preprocess relevant HTML data from Fundus-sourced articles.
- SLM Finetuning: Scripts for model training and finetuning on the prepared datasets.
- Evaluation and Analysis: Methods to evaluate and compare model performance on unseen publishers.
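Evaluating on unseen publishers implies that no publisher appears in both the training and the test data. The following is a minimal sketch of such a split, not the project's actual implementation. The function name `split_by_publisher` and the `"publisher"` record key are hypothetical and chosen only for illustration.

```python
import random


def split_by_publisher(records, test_fraction=0.2, seed=0):
    """Split records so that train and test share no publisher.

    `records` is assumed to be a list of dicts that each carry a
    "publisher" key (a hypothetical field name, not taken from the repo).
    """
    # Sort first so the shuffle is reproducible for a given seed.
    publishers = sorted({record["publisher"] for record in records})
    rng = random.Random(seed)
    rng.shuffle(publishers)

    # Hold out at least one publisher whenever any records exist.
    n_test = max(1, round(len(publishers) * test_fraction))
    held_out = set(publishers[:n_test])

    train = [r for r in records if r["publisher"] not in held_out]
    test = [r for r in records if r["publisher"] in held_out]
    return train, test
```

Splitting at the publisher level rather than the article level keeps page templates seen during training out of the test set, which is what the "unseen publishers" evaluation is meant to measure.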
- Clone the repository:
git clone https://github.com/stolzenp/news-crawler-slm.git
- Install dependencies [WIP]:
pip install -r requirements.txt
- Set up the configuration by following the Configuration section.
To perform dataset generation, model training, and evaluation, run the provided scripts from the project root. Commands are documented in this usage guide.
Use the following command to generate a dataset:
python -m data_extraction.crawl_articles
To fine-tune a Small Language Model (SLM), execute:
python -m model_training.finetune_model
To evaluate a resulting model, execute:
python -m model_training.evaluate_model
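The metrics computed by the evaluation script are not listed here. As one illustration of how a model's extracted text could be scored against a reference extraction, the sketch below computes a token-overlap F1. The choice of metric and the name `token_f1` are assumptions for this example and may differ from what `model_training.evaluate_model` does.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Whitespace-token F1 between a predicted and a reference text."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()

    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)

    # Multiset intersection counts each shared token at most as often
    # as it occurs in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

A score of 1.0 means the prediction and the reference contain the same tokens with the same counts. Token order is ignored, so this measure says nothing about whether paragraphs were extracted in the right sequence.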
To customize this project:
- Open the config.json file.
- Configure any necessary parameters:
"model_name_or_path": "<model_of_choice>"
This project will welcome contributions once the associated thesis is completed. If you are interested in contributing, please open an issue or submit a pull request.
This project is licensed under the MIT License. See the LICENSE file for more details.
- Fundus Framework: For providing tools for generating structured datasets essential for model training and evaluation.