BERTu Ġurnalistiku: Intermediate pre-training of BERTu on news articles and fine-tuning for Question Answering using SQuAD
Final Year Project submitted in partial fulfilment of the requirements for the degree of B.Sc. I.T. (Hons.) in Software Development.
This project addresses the scarcity of Artificial Intelligence resources for Maltese, a low-resource language, focusing on extractive Question Answering and domain-adapted models. While language models and supporting resources such as datasets are widely available for languages like English, the absence of comparable resources for Maltese presents a significant gap that this research project seeks to close.
Firstly, the project compiled a corpus of Maltese news articles to serve as the basis for further training Maltese language models on text from a particular domain of knowledge. Secondly, it translated the Stanford Question Answering Dataset (SQuAD) into Maltese, addressing the lack of a state-of-the-art dataset for training Maltese language models on extractive Question Answering, and enriched the translated dataset through a manual annotation effort. Subsequently, the project fine-tuned Maltese Large Language Models to perform extractive Question Answering. Moreover, the study further pre-trained a Maltese language model on the corpus of news articles before fine-tuning it on the translated SQuAD data, to analyse whether further pre-training on text from a particular domain benefits performance on extractive Question Answering. Through a suite of experiments and optimisations, the performance of the resulting Maltese models was evaluated against state-of-the-art models in English. These experiments examine how three factors affect the performance of a Maltese language model on extractive Question Answering: the pre-training corpus that teaches the model the fundamentals of the language, further pre-training on a corpus from a domain of knowledge such as news articles, and the size of the fine-tuning dataset used to train the model for the task.
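For illustration, the following is a minimal sketch of how such fine-tuning could be carried out with the Hugging Face Transformers library. The checkpoint identifier "MLRS/BERTu", the data file "squad_mt_train.json", the output directory "bertu-squad-mt", and all hyperparameters are assumptions made for the example, not values taken from the project; the translated data is assumed to be stored in the flattened SQuAD schema (context, question, answers) with every example answerable, SQuAD-v1 style.

```python
# Minimal sketch: fine-tuning a Maltese BERT checkpoint for extractive QA.
# Assumed placeholders: "MLRS/BERTu" as base checkpoint, "squad_mt_train.json"
# holding the translated data in the flat SQuAD schema, and the
# hyperparameters below. None of these are taken from the project itself.
from datasets import load_dataset
from transformers import (
    AutoModelForQuestionAnswering,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
    default_data_collator,
)

MODEL_ID = "MLRS/BERTu"          # assumed base checkpoint identifier
MAX_LENGTH, DOC_STRIDE = 384, 128

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForQuestionAnswering.from_pretrained(MODEL_ID)
dataset = load_dataset("json", data_files={"train": "squad_mt_train.json"})

def preprocess(examples):
    # Tokenise question/context pairs; long contexts are split into
    # overlapping windows so no answer span is lost to truncation.
    tokenized = tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",
        max_length=MAX_LENGTH,
        stride=DOC_STRIDE,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )
    sample_map = tokenized.pop("overflow_to_sample_mapping")
    offsets_list = tokenized.pop("offset_mapping")
    starts, ends = [], []
    for i, offsets in enumerate(offsets_list):
        # Assumes every example has at least one answer (SQuAD-v1 style).
        answer = examples["answers"][sample_map[i]]
        start_char = answer["answer_start"][0]
        end_char = start_char + len(answer["text"][0])
        seq_ids = tokenized.sequence_ids(i)
        ctx_start = seq_ids.index(1)
        ctx_end = len(seq_ids) - 1 - seq_ids[::-1].index(1)
        # Windows that do not fully contain the answer are labelled (0, 0).
        if offsets[ctx_start][0] > start_char or offsets[ctx_end][1] < end_char:
            starts.append(0)
            ends.append(0)
        else:
            # Map the answer's character offsets to token positions.
            idx = ctx_start
            while idx <= ctx_end and offsets[idx][0] <= start_char:
                idx += 1
            starts.append(idx - 1)
            idx = ctx_end
            while idx >= ctx_start and offsets[idx][1] >= end_char:
                idx -= 1
            ends.append(idx + 1)
    tokenized["start_positions"] = starts
    tokenized["end_positions"] = ends
    return tokenized

train_set = dataset["train"].map(
    preprocess, batched=True, remove_columns=dataset["train"].column_names
)
trainer = Trainer(
    model=model,
    args=TrainingArguments("bertu-squad-mt", learning_rate=3e-5,
                           num_train_epochs=2, per_device_train_batch_size=16),
    train_dataset=train_set,
    data_collator=default_data_collator,
)
trainer.train()
```

In the domain-adaptation setting, the same recipe applies after the base checkpoint has first been further pre-trained on the news corpus with a masked-language-modelling objective, before the Question Answering head is attached.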
Furthermore, the project developed a chat User Interface that leverages the Maltese Large Language Models trained for extractive Question Answering, making inference efficient, transparent, and accessible even to non-technical users.
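A minimal sketch of the inference step that such an interface could wrap, using the Transformers question-answering pipeline; "bertu-squad-mt" is the placeholder checkpoint name from the sketch above, not the project's released model:

```python
# Minimal inference sketch; "bertu-squad-mt" is the placeholder checkpoint
# produced by the fine-tuning sketch above.
from transformers import pipeline

qa = pipeline("question-answering", model="bertu-squad-mt")

# Example in Maltese: "Valletta is the capital city of Malta." /
# "What is the capital city of Malta?"
context = "Il-Belt Valletta hija l-belt kapitali ta' Malta."
question = "X'inhi l-belt kapitali ta' Malta?"

result = qa(question=question, context=context)
# The pipeline returns the extracted answer span, its character offsets in
# the context, and a confidence score; surfacing these in the interface is
# what keeps the extraction transparent to the user.
print(result["answer"], result["score"], result["start"], result["end"])
```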
Overall, this project contributed to the advancement of Artificial Intelligence for Maltese, particularly in Question Answering and domain-adapted models, expanding the accessibility and applicability of the field.