Text Mining Project IS450 (Legal Text Summarization and Topic Modelling)
Document review in law is the final stage before production. It helps lawyers understand their cases and formulate theories for trial, and can also serve broader purposes such as due diligence assessments. It is considered the most intensive and costly stage, accounting for 70% to 80% of the cost of e-discovery. To control costs, reviewers must narrow their scope to minimise the number of documents that need to be reviewed.
To improve the process, we decided to perform topic modelling and extractive summarization so that lawyers can get a brief overview of the legal documents. The results of the various models are compared against each other to determine the best-performing ones.
The project is created with:
- Python
- NLTK
- Gensim
- Sumy
- ROUGE
- Scikit-learn
- Spacy
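The core idea behind the extractive summarizers we compare is to score each sentence and keep the highest-scoring ones verbatim. As a rough illustration only (the project itself uses Sumy's implementations, not this code), here is a minimal frequency-based sketch in the spirit of Luhn's method; the stopword list and regex tokenizer are simplifications:

```python
import re
from collections import Counter

# Tiny illustrative stopword list; the project uses NLTK's full list.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "was", "for", "on"}

def frequency_summarize(text, n_sentences=1):
    """Score each sentence by the average frequency of its content words
    and return the top-n sentences in their original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]
    freq = Counter(words)
    scored = []
    for i, sent in enumerate(sentences):
        tokens = [w for w in re.findall(r"[a-z']+", sent.lower()) if w not in STOPWORDS]
        score = sum(freq[t] for t in tokens) / max(len(tokens), 1)
        scored.append((score, i, sent))
    top = sorted(scored, reverse=True)[:n_sentences]
    return " ".join(s for _, _, s in sorted(top, key=lambda x: x[1]))

summary = frequency_summarize(
    "The court reviewed the contract. "
    "The contract dispute concerned contract terms. "
    "Lunch was served."
)
# → "The contract dispute concerned contract terms."
```

The sentence repeating the document's most frequent content word ("contract") wins; real summarizers refine the scoring (TF-IDF weights, cue phrases, sentence position) but keep this select-and-copy structure.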
We used 9 extractive summarization algorithms and analysed their ROUGE scores. Of these, the Edmundson (Cue) method and the BERT-Extractive method performed best (Table 1). Overall, extractive summarization did not perform very well, with ROUGE scores below 50%. This could be because the reference summaries in our dataset were heavily paraphrased. To improve on this, we could explore abstractive summarization techniques, or combined summarization, where sentences are shortened after being extracted.
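To make the evaluation concrete: ROUGE-1 measures unigram overlap between a system summary and the reference, which is exactly why heavy paraphrasing in the reference depresses scores even for good summaries. A minimal sketch of the computation (the project itself used the ROUGE library, not this code):

```python
from collections import Counter

def rouge_1(candidate, reference):
    """ROUGE-1 unigram overlap between a candidate and a reference summary.
    Returns (recall, precision, f1); tokenization here is naive whitespace splitting."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())          # clipped unigram matches
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 2 * recall * precision / (recall + precision) if overlap else 0.0
    return recall, precision, f1

r, p, f = rouge_1("the cat sat", "the cat sat down")
# r = 0.75 (3 of 4 reference unigrams matched), p = 1.0
```

A paraphrased reference ("the feline was seated") would share almost no unigrams with a perfectly reasonable extract, giving a near-zero score, which is consistent with the sub-50% results above.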
Table 1
From our topic modelling implementation, we found an overlap in topics related to transportation and security, as well as wildlife and environmental conservation, for both models (Images 1 and 2). We realised that LDA's results focus heavily on employment, government, grants and taxes, while BERTopic focuses more on education, technology and immigration.
Image 1
Image 2
When analysing the models' performance, BERTopic performed better, with a coherence score of 0.61276 compared to LDA's 0.48869. This could be because LDA ignores semantic relationships among words due to its Bag-of-Words representation, whereas BERTopic can be fitted with advanced embedding models that cluster semantically similar documents. Another reason for BERTopic's higher coherence score could be its clustering approach, under which documents that are not clustered into any topic are grouped under Topic -1. This is not favourable for us, as we would like every document to fall under a specified topic. Human evaluation is needed to better assess topic modelling performance and coherence.