GitHub - PreferredAI/ReIntNTM: ReIntNTM: A Tool for extracting better topics from neural topic models

ReIntNTM

Towards Reinterpreting Neural Topic Models via Composite Activations, EMNLP'22

Key Idea

The topics read off a neural topic-word distribution can be combined, via composition, into better topics. The result is a better interpretation of the same topic-word distribution.
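
As a toy illustration of composition, the sketch below sums two rows of a small topic-word matrix and reads off the top words of the result. The vocabulary, the weights, the `top_words` and `compose` helpers, and the choice of summation as the composition rule are all assumptions made for illustration; the paper defines the actual composite activations.

```python
# Toy topic-word weights: one row per topic, one column per vocabulary word.
# All values are made up for illustration.
vocab = ["goal", "match", "team", "vote", "party", "coach"]
beta = [
    [0.9, 0.8, 0.3, 0.0, 0.1, 0.2],  # topic 0
    [0.2, 0.1, 0.9, 0.0, 0.0, 0.8],  # topic 1
    [0.0, 0.1, 0.2, 0.9, 0.8, 0.0],  # topic 2
]


def top_words(weights, n=3):
    """Return the n words with the largest weights, highest first."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return [vocab[i] for i in order[:n]]


def compose(rows):
    """Assumed composition rule: element-wise sum of the given topic rows."""
    return [sum(column) for column in zip(*rows)]


original_topics = [top_words(row) for row in beta]
composite_01 = top_words(compose([beta[0], beta[1]]))

print("original:", original_topics)
print("composite of topics 0 and 1:", composite_01)
```

In this toy case the composite ranks "coach" above "match", a word set that neither original topic produces on its own.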

Neural Topic Model-Agnostic Approach

Steps:

1. Train a Neural Topic Model and obtain its topic-word distribution.
2. Mining step: generate a pool of candidate topics (both composite and original).
3. Solving step: select topics from the pool by optimizing the preferred metric(s), using one of the following formulations:
   1. Greedy using heuristics
   2. Multi-Dimensional Knapsack Problem (MDKP)
   3. Maximum-Weight Budget Independent Set (MWBIS)
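
The solving step can be pictured on a toy pool. The sketch below is not the code in `algo/`. It assumes each candidate topic carries a precomputed score, treats two topics as conflicting when they share a word, and picks at most `budget` non-conflicting topics: first greedily, then by exhaustive search as an exact baseline. The pool, the scores, the conflict rule, and the function names are hypothetical.

```python
from itertools import combinations

# Hypothetical candidate pool: name -> (score, top words).
pool = {
    "A": (0.50, {"goal", "team", "coach"}),
    "B": (0.45, {"team", "match", "league"}),
    "C": (0.30, {"vote", "party", "poll"}),
    "D": (0.48, {"coach", "player", "club"}),
}


def conflicts(a, b):
    """Assumed conflict rule: two topics conflict if they share any word."""
    return bool(pool[a][1] & pool[b][1])


def total_score(names):
    return sum(pool[n][0] for n in names)


def greedy_select(budget):
    """Take topics in descending score order, skipping conflicting ones."""
    chosen = []
    for name in sorted(pool, key=lambda n: pool[n][0], reverse=True):
        if len(chosen) == budget:
            break
        if all(not conflicts(name, c) for c in chosen):
            chosen.append(name)
    return chosen


def exact_select(budget):
    """Exhaustively search all conflict-free subsets of size <= budget."""
    best, best_score = [], 0.0
    for k in range(1, budget + 1):
        for subset in combinations(sorted(pool), k):
            if any(conflicts(a, b) for a, b in combinations(subset, 2)):
                continue
            score = total_score(subset)
            if score > best_score:
                best, best_score = list(subset), score
    return best


greedy = greedy_select(2)
exact = exact_select(2)
print("greedy:", greedy, round(total_score(greedy), 2))
print("exact: ", exact, round(total_score(exact), 2))
```

On this pool the greedy pass commits to the single best topic and ends up with a lower total than the exhaustive search, which is the kind of gap an exact formulation is meant to close. Exhaustive search only works for tiny pools; pools with hundreds of candidates need a solver.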

The solving step relies on scores estimated with gensim, so we recommend conducting the final evaluation against a large reference corpus, for example with Palmetto.

Requirements

Python >= 3.6

Your choice of solver (either):

1. Using gurobipy directly: installation instructions (license required)

python -m pip install gurobipy

2. Solvers via CVXPY: installation instructions
   - with the gurobipy solver (see option 1)
   - with GLPK_MI via CVXOPT (no license required)
   - with SCIP: installation instructions (recommended, no license required)

conda install cvxpy cvxopt numpy pyscipopt==3.5.0 -c conda-forge

More details on using CVXPY found here
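
To check which of the packages above are already importable in the current environment, a small standard-library sketch follows. The module list and the `available_modules` helper are hypothetical conveniences, not part of this repository. Finding a module does not confirm that a license or a working solver backend is present.

```python
from importlib.util import find_spec

# Import names of the optional solver packages mentioned above.
SOLVER_MODULES = ["gurobipy", "cvxpy", "cvxopt", "pyscipopt"]


def available_modules(names=SOLVER_MODULES):
    """Map each module name to True/False without importing the module."""
    return {name: find_spec(name) is not None for name in names}


status = available_modules()
for name, found in status.items():
    print(f"{name}: {'found' if found else 'missing'}")
```

If CVXPY is installed, `cvxpy.installed_solvers()` reports which solver backends it can actually use.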

Play data provided

The play data comes from CTM (Bianchi et al. 2021) models trained using OCTIS by MIND-Lab with 20 and 50 topics, generating pools of 988 and 1414 composite and component topics, respectively.

Code

algo/cvxpy_based.py - CVXPY-based solutions for MWBIS & MDKP
algo/gp_based.py - gurobipy-based solutions for MWBIS & MDKP
algo/normal.py - heuristics-based greedy solution & utility functions

Examples were run on Python 3.6, AMD EPYC 7502 @ 2.50GHz, 512GB RAM

Tutorials/Examples

gp_example.ipynb
- solver examples using gurobipy directly
- greedy heuristic examples
- topic examples from play data
cvxpy_example.ipynb
- solver examples using CVXPY with various solvers
mining_example.ipynb
- demonstration of the complete pipeline

Ethics Statement

We understand that some corpora might produce topics containing groups of words that might cause offense, owing to sensitivities around politically charged affairs. This mainly affects news corpora, as they are built on historical events. The use of the reinterpretation process is largely dependent on the corpus that the NTM is trained on.

Errata

This is a typo in the paper and does not affect any results.

Definition of NPMI (Equation 2) should be:

$\textrm{NPMI}(\mathcal{T}) = \frac{1}{K} \sum_{t \in \mathcal{T}} \frac{\sum_{n_i \in t}\sum_{\substack{n_j \in t,\ n_j \neq n_i}} npmi(n_i,n_j)}{l(l-1)}$

Instead of:

$\textrm{NPMI}(\mathcal{T}) = \frac{1}{K} \sum_{t \in \mathcal{T}} \frac{\sum_{n_i \in t}\sum_{\substack{n_j \in t,\ n_j \neq n_i}} npmi(n_i,n_j)}{l(l-1)/2}$
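
To see why the denominator is $l(l-1)$: the double sum visits every ordered pair $(n_i, n_j)$ with $n_j \neq n_i$, and there are $l(l-1)$ of those, so each unordered pair is counted twice. The sketch below checks this on made-up pairwise scores for a single topic. The words, the scores, and the helper names are hypothetical, and `pair_npmi` stands in for whatever pairwise estimator is used.

```python
from itertools import combinations, permutations

# Made-up symmetric pairwise npmi scores for one topic with l = 3 words.
scores = {
    frozenset(("goal", "match")): 0.4,
    frozenset(("goal", "team")): 0.2,
    frozenset(("match", "team")): 0.3,
}


def pair_npmi(a, b):
    """Look up the (symmetric) score of a word pair."""
    return scores[frozenset((a, b))]


def ordered_sum(words):
    """Double sum over ordered pairs (n_i, n_j) with n_j != n_i."""
    return sum(pair_npmi(a, b) for a, b in permutations(words, 2))


def unordered_sum(words):
    """Sum over unordered pairs, each counted once."""
    return sum(pair_npmi(a, b) for a, b in combinations(words, 2))


topic = ["goal", "match", "team"]
l = len(topic)

corrected = ordered_sum(topic) / (l * (l - 1))        # corrected Equation 2
pair_mean = unordered_sum(topic) / (l * (l - 1) / 2)  # mean over unordered pairs
misprinted = ordered_sum(topic) / (l * (l - 1) / 2)   # misprinted denominator

print(corrected, pair_mean, misprinted)
```

The corrected form equals the mean pairwise score, while the misprinted denominator would double it.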

Citation

If you find our work helpful, we appreciate a citation!

@inproceedings{lim-lauw-2022-towards,
    title = "Towards Reinterpreting Neural Topic Models via Composite Activations",
    author = "Lim, Jia Peng  and
      Lauw, Hady",
    booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.emnlp-main.242",
    pages = "3688--3703",
    abstract = "Most Neural Topic Models (NTM) use a variational auto-encoder framework producing K topics limited to the size of the encoder{'}s output. These topics are interpreted through the selection of the top activated words via the weights or reconstructed vector of the decoder that are directly connected to each neuron. In this paper, we present a model-free two-stage process to reinterpret NTM and derive further insights on the state of the trained model. Firstly, building on the original information from a trained NTM, we generate a pool of potential candidate {``}composite topics{''} by exploiting possible co-occurrences within the original set of topics, which decouples the strict interpretation of topics from the original NTM. This is followed by a combinatorial formulation to select a final set of composite topics, which we evaluate for coherence and diversity on a large external corpus. Lastly, we employ a user study to derive further insights on the reinterpretation process.",
}
