Text Mining Project IS450 (Legal Text Summarization and Topic Modelling)

dian-farah/IS450TextMiningG2T2

IS450TextMiningG2T2

Text Mining Project IS450 (Legal Text Summarization and Topic Modelling)

General info

In law, document review is the last stage before production. It helps lawyers understand their cases and formulate theories for trial, and it can also serve broader purposes such as due diligence assessments. It is considered the most intensive and costly stage, taking up 70% to 80% of the cost of e-discovery. To control costs, reviewers have to narrow their scope and minimize the number of documents that need to be reviewed.

To improve the process, we performed topic modelling and extractive summarization so that lawyers can get a brief overview of the legal documents. The results of the various models are compared against each other to determine the best performing ones.

Technologies

The project is created with:

  • Python
  • NLTK
  • Gensim
  • Sumy
  • ROUGE
  • Scikit-learn
  • spaCy

Analysis

We ran 9 extractive summarization algorithms and analysed their ROUGE scores. Of these, the Edmundson (Cue) method and the BERT-Extractive method performed best (Table 1). Overall, extractive summarization did not perform very well, with ROUGE scores below 50%. This could be because the reference summaries in our dataset were heavily paraphrased, so sentences lifted verbatim from the source share few exact words with them. To improve on this, we can explore abstractive summarization techniques, or a combined approach in which the extracted sentences are shortened after summarising.

Table 1 (image)
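To illustrate why heavily paraphrased references can depress ROUGE for extractive summaries, here is a minimal, standard-library sketch of ROUGE-1 (unigram overlap). It is not the ROUGE package used in the project, and the function names and example sentences are hypothetical, chosen only for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphanumeric tokens only."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge_1(candidate, reference):
    """Return (precision, recall, f1) from clipped unigram overlap."""
    cand = Counter(tokenize(candidate))
    ref = Counter(tokenize(reference))
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


# Hypothetical sentences, not taken from the project's dataset.
extracted = "The court dismissed the appeal because the claim was filed late."
verbatim_ref = "The court dismissed the appeal because the claim was filed late."
paraphrased_ref = "Judges rejected the challenge since it missed the filing deadline."

print("verbatim reference:   ", rouge_1(extracted, verbatim_ref))
print("paraphrased reference:", rouge_1(extracted, paraphrased_ref))
```

Both references describe the same event, but the paraphrased one shares only the word "the" with the extracted sentence, so its score drops sharply even though the meaning is preserved.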

From our topic modelling implementation, we found that both models produced overlapping topics related to transportation and security, as well as wildlife and environmental conservation (Image 1 and 2). Beyond that overlap, LDA's results focus heavily on employment, government, grants and taxes, while BERTopic's focus more on education, technology and immigration.

Image 1 (image)

Image 2 (image)
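One way to quantify topic overlap between two models is to compare the top words of each topic with Jaccard similarity. The sketch below is an assumption about how such a comparison could be done, not the method used in this project. The topic names and word lists are hypothetical and are not the actual model output.

```python
def jaccard(words_a, words_b):
    """Jaccard similarity between two collections of words."""
    a, b = set(words_a), set(words_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_matches(topics_a, topics_b):
    """For each topic in topics_a, find its most similar topic in topics_b."""
    matches = {}
    for name_a, words_a in topics_a.items():
        name_b = max(topics_b, key=lambda n: jaccard(words_a, topics_b[n]))
        matches[name_a] = (name_b, jaccard(words_a, topics_b[name_b]))
    return matches


# Hypothetical top words per topic, for illustration only.
lda_topics = {
    "lda_0": ["transportation", "security", "airport", "highway", "safety"],
    "lda_1": ["tax", "employment", "grant", "government", "income"],
}
bertopic_topics = {
    "bt_0": ["security", "transportation", "aviation", "airport", "screening"],
    "bt_1": ["education", "student", "school", "technology", "immigration"],
}

matches = best_matches(lda_topics, bertopic_topics)
for lda_name, (bt_name, score) in matches.items():
    print(f"{lda_name} -> {bt_name} (Jaccard = {score:.2f})")
```

A high score flags a pair of topics that both models found, while a best score of zero marks a topic that appears in only one model.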

When analysing the models' performance, BERTopic performed better, with a coherence score of 0.61276 compared with LDA's 0.48869. This could be because LDA ignores semantic relationships among words due to its bag-of-words representation. BERTopic, on the other hand, can be fitted with advanced embedding models that are used to cluster semantically similar documents. Another reason for BERTopic's high coherence score could be its clustering step, under which documents that are not clustered into any topic are grouped under Topic -1 as outliers. This is not favourable for us, as we would like every document to fall under a specified topic. Human evaluation is needed to better assess topic modelling performance and coherence.
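The bag-of-words limitation can be seen in a small standard-library sketch. Each document becomes a vector of word counts, so near-synonyms occupy unrelated dimensions and contribute nothing to similarity. The helper names and example phrases are hypothetical. An embedding model, not shown here because it needs a pretrained model, would be expected to place the first two phrases close together.

```python
import math
from collections import Counter


def bow_vector(text):
    """Bag-of-words vector: word -> count, ignoring word order."""
    return Counter(text.lower().split())


def cosine(vec_a, vec_b):
    """Cosine similarity between two sparse count vectors."""
    shared = vec_a.keys() & vec_b.keys()
    dot = sum(vec_a[t] * vec_b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical phrases, for illustration only.
doc_a = "attorney files lawsuit"
doc_b = "lawyer submits claim"      # similar meaning, no shared words
doc_c = "attorney files appeal"     # different outcome, two shared words

print("a vs b:", cosine(bow_vector(doc_a), bow_vector(doc_b)))
print("a vs c:", cosine(bow_vector(doc_a), bow_vector(doc_c)))
```

Under bag-of-words, the paraphrase pair scores zero while the pair that merely shares surface words scores higher, which is the kind of semantic gap that embedding-based clustering aims to close.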

Additional Files
