IS450TextMiningG2T2

Text Mining Project IS450 (Legal Text Summarization and Topic Modelling)

General info

Document review in law is the last stage before production. It helps lawyers understand their cases and formulate theories for trial, and it can also serve broader purposes, such as carrying out due diligence assessments. It is considered the most intensive and costly stage, taking up 70% to 80% of the cost of e-discovery. To control costs, reviewers have to narrow their scope to reduce the number of documents that need to be reviewed.

To improve the process, we decided to perform topic modelling and extractive summarization so that lawyers can get a brief overview of the legal documents. The results of the various models are compared against each other to determine the best-performing models.

Technologies

The project is created with:

Python
NLTK
Gensim
Sumy
ROUGE
Scikit-learn
spaCy

Analysis

We used 9 extractive summarization algorithms and analysed their ROUGE scores. Of these, the Edmundson (Cue) method and the BERT-Extractive method performed best (Table 1). Overall, extractive summarization did not perform very well, with ROUGE scores below 50%. This could be because the reference summaries in our dataset were heavily paraphrased, so sentences copied from the source share few exact words with them. To improve on this, we can explore abstractive summarization techniques or combined summarization, where we shorten the sentences after summarising.

Table 1
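
To illustrate why heavily paraphrased references can pull ROUGE down for extractive summaries, here is a minimal sketch of ROUGE-1 computed from scratch. It is not the scoring code used in this project; the helper names `tokenize` and `rouge_1` and the two toy sentences are assumptions made up for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters and digits as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge_1(reference, candidate):
    """Return ROUGE-1 precision, recall and F1 based on clipped unigram overlap."""
    ref_counts = Counter(tokenize(reference))
    cand_counts = Counter(tokenize(candidate))
    # Counter intersection keeps the minimum count per token (clipping).
    overlap = sum((ref_counts & cand_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    if overlap == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy example (invented): a paraphrased reference summary versus a
# sentence lifted verbatim from the source document.
reference = "The bill raises funding for rural schools"
extracted = "This Act increases appropriations for education in rural areas"

scores = rouge_1(reference, extracted)
print({name: round(value, 2) for name, value in scores.items()})
```

Both sentences say roughly the same thing, but only "for" and "rural" match exactly, so the score stays low even though the extracted sentence is relevant.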

From our topic modelling implementation, we found that both models produced overlapping topics related to transportation and security, as well as wildlife and environmental conservation (Images 1 and 2). LDA's results focus heavily on employment, government, grants and taxes, while BERTopic's results focus more on education, technology and immigration.

Image 1

Image 2
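
One possible way to check topic overlap between two models more systematically is to compare the top words of each topic. The sketch below does this with Jaccard similarity. The function names `jaccard` and `best_matches`, the topic labels, and the word lists are all hypothetical placeholders, not output from our LDA or BERTopic models.

```python
def jaccard(words_a, words_b):
    """Jaccard similarity between two word lists, treated as sets."""
    a, b = set(words_a), set(words_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def best_matches(topics_a, topics_b):
    """For each topic in topics_a, find the most similar topic in topics_b."""
    matches = {}
    for name_a, words_a in topics_a.items():
        name_b, score = max(
            ((name_b, jaccard(words_a, words_b)) for name_b, words_b in topics_b.items()),
            key=lambda pair: pair[1],
        )
        matches[name_a] = (name_b, score)
    return matches


# Hypothetical top words per topic, for illustration only.
lda_topics = {
    "lda_0": ["transportation", "security", "aviation", "safety"],
    "lda_1": ["tax", "income", "credit", "employment"],
}
bertopic_topics = {
    "bert_0": ["wildlife", "conservation", "habitat", "species"],
    "bert_1": ["transportation", "security", "airport", "safety"],
}

for lda_name, (bert_name, score) in best_matches(lda_topics, bertopic_topics).items():
    print(f"{lda_name} best matches {bert_name} with Jaccard {score:.2f}")
```

A high score suggests the two models found a similar topic, while a score of zero suggests the topic is specific to one model.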

When analysing the models' performance, BERTopic performed better, with a coherence score of 0.61276 compared with 0.48869 for LDA. This could be because LDA ignores semantic relationships among words due to its Bag-of-Words representation. BERTopic, on the other hand, can be fitted with advanced embedding models, which are used to cluster semantically similar documents. Another reason for BERTopic's high coherence score could be its clustering step, which treats documents that do not fit any topic as outliers and groups them under Topic -1. This is not favourable for us, as we would like all documents to fall under a specified topic. Human evaluation is needed to better assess the topic modelling performance and coherence.
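
The Bag-of-Words limitation mentioned above can be shown with a small sketch using only the standard library. It is an illustration, not part of our LDA pipeline; the helper `bag_of_words` and the example sentences are assumptions.

```python
from collections import Counter


def bag_of_words(text):
    """Represent a text as word counts, discarding word order."""
    return Counter(text.lower().split())


# Word order is lost: these two sentences mean different things
# but produce identical word counts.
doc_a = "the landlord sued the tenant"
doc_b = "the tenant sued the landlord"
same_counts = bag_of_words(doc_a) == bag_of_words(doc_b)

# Synonyms are unrelated: "attorney" and "lawyer" are separate
# vocabulary entries, with nothing linking them.
doc_c = "the attorney filed an appeal"
doc_d = "the lawyer filed an appeal"
unshared = set(bag_of_words(doc_c)) ^ set(bag_of_words(doc_d))

print("Identical counts despite different meaning:", same_counts)
print("Words treated as unrelated:", sorted(unshared))
```

An embedding-based representation would typically place "attorney" and "lawyer" close together, which is one reason an embedding-based model can group semantically similar documents more easily.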
