Topic modeling tools in R using the reticulate library as an interface to the Python package BERTopic.
The package bertopicr is based on the Python package BERTopic by Maarten Grootendorst (https://github.com/MaartenGr/BERTopic) and provides tools for performing unsupervised topic modeling. Topic modeling is a method for discovering the abstract "topics" that occur in a collection of documents. This package integrates BERTopic into R through the reticulate package, allowing seamless R-Python interoperability. It includes functions for visualization and analysis of topic modeling results, making it easier to explore topics within text data.
The Python package BERTopic is described in the paper:
@article{grootendorst2022bertopic,
title={BERTopic: Neural topic modeling with a class-based TF-IDF procedure},
author={Grootendorst, Maarten},
journal={arXiv preprint arXiv:2203.05794},
year={2022}
}
To install the package from GitHub, use the following command in R:
devtools::install_github("tpetric7/bertopicr")
Ensure that you have the devtools package installed. If not, you can install it using:
install.packages("devtools")
This package requires a Python environment with specific packages to run BERTopic models. You can set up the environment using the following steps:
- Install Python: Ensure Python is installed on your system. You can download it from python.org. To check if Python is installed, run:
python --version
- Create a Virtual Environment: It is recommended to create a virtual environment for the required Python packages:
python -m venv r-bertopic
- Activate the Virtual Environment:
  - On Windows:
r-bertopic\Scripts\activate
  - On macOS and Linux:
source r-bertopic/bin/activate
- Install Required Python Packages:
Clone (or download and unzip) the repository from GitHub:
git clone https://github.com/tpetric7/bertopicr.git
Change the working directory to the inst folder inside the cloned bertopicr repository:
cd bertopicr/inst
Use the requirements.txt file included in the package to install the necessary Python packages:
pip install -r requirements.txt
If your computer has a suitable GPU, it is recommended to install the CUDA version of PyTorch in order to substantially accelerate processing. For Windows (see https://pytorch.org/get-started/locally/):
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
Make sure to upgrade pip if necessary:
python -m pip install --upgrade pip
- Running the Setup Function in R:
Alternatively, after installing devtools and the bertopicr package, run the setup function in R to install the Python dependencies:
library(reticulate)
use_python("path/to/your/Python/env/r-bertopic")
library(bertopicr)
setup_python_environment()
This function will set up the required Python environment and install all necessary packages.
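Before fitting a model, it can help to confirm from R that the Python modules are importable. The following is a minimal sketch, not part of bertopicr: the helper name check_python_modules and the module list are assumptions chosen for illustration (the list mirrors the modules mentioned in this README). It relies on reticulate::py_module_available(), which reports whether a module can be imported.

```r
# Hypothetical helper (not part of bertopicr): report which Python modules
# reticulate can import. The default module list is an assumption based on
# the modules mentioned in this README.
# `is_available` defaults to reticulate::py_module_available(); it is a
# parameter only so the helper can be tried without a Python installation.
check_python_modules <- function(
    modules = c("bertopic", "sentence_transformers", "numpy", "spacy"),
    is_available = reticulate::py_module_available) {
  found <- vapply(modules, is_available, logical(1))
  missing <- names(found)[!found]
  if (length(missing) > 0) {
    message("Missing Python modules: ", paste(missing, collapse = ", "))
  }
  invisible(found)
}

# Possible usage, after pointing reticulate at the virtual environment
# (use_virtualenv() accepts the name of or the path to a virtual environment):
# reticulate::use_virtualenv("path/to/your/Python/env/r-bertopic", required = TRUE)
# check_python_modules()
```

If any module is reported as missing, installing the packages from requirements.txt into the same environment is the likely remedy.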
Download and install ollama (https://ollama.com/) or lm-studio (https://lmstudio.ai/). To install a language model, run the following command in a terminal (e.g., for llama3.1):
ollama pull llama3.1
If the ollama server does not start automatically, use:
ollama serve
In lm-studio, select a language model from the menu and start the server.
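To check from R whether a local server is answering, a small base R sketch such as the following may be useful. The helper name server_reachable is hypothetical, and the URLs in the usage comments assume the default ports (11434 for ollama, 1234 for lm-studio); adjust them if your server is configured differently.

```r
# Hypothetical helper: TRUE if an HTTP request to `url_string` succeeds,
# FALSE if the connection fails or times out.
server_reachable <- function(url_string, timeout_sec = 5) {
  # Temporarily shorten the timeout used for internet operations
  old <- options(timeout = timeout_sec)
  on.exit(options(old), add = TRUE)
  con <- url(url_string)
  on.exit(try(close(con), silent = TRUE), add = TRUE)
  tryCatch({
    suppressWarnings(readLines(con, n = 1, warn = FALSE))
    TRUE
  }, error = function(e) FALSE)
}

# Possible usage (default ports are an assumption):
# server_reachable("http://localhost:11434/api/tags")  # ollama: lists pulled models
# server_reachable("http://localhost:1234/v1/models")  # lm-studio: OpenAI-compatible API
```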
On the spaCy website (https://spacy.io/models), choose a language model for your language. For Slovenian, you can install the following model:
python -m spacy download sl_core_news_md
The reticulate package allows you to interface with Python from R. When using functions that rely on Python, you need to load Python modules dynamically within the R functions. For example:
# Example function using reticulate to load BERTopic
#' Run BERTopic on Text Data
#'
#' This function runs BERTopic on a given set of text data and returns the
#' topic assignments and topic probabilities.
#' @param texts A character vector of text documents.
#' @return A list with the topic assignments (first element) and the
#'   document-topic probabilities (second element).
#' @export
run_bertopic <- function(texts) {
  # Use your own Python environment
  # (use_python() expects the path to the environment's Python executable)
  reticulate::use_python("path/to/your/python/env/r-bertopic", required = TRUE)
  # Initialize the Python binding
  reticulate::py_config()
  # Import necessary Python modules
  bertopic <- reticulate::import("bertopic")
  sentence_transformers <- reticulate::import("sentence_transformers")
  SentenceTransformer <- sentence_transformers$SentenceTransformer
  # Embeddings
  embedding_model <- SentenceTransformer("BAAI/bge-m3") # for multiple languages
  embeddings <- embedding_model$encode(texts, show_progress_bar = TRUE)
  # Initialize BERTopic model
  topic_model <- bertopic$BERTopic(embedding_model = embedding_model, calculate_probabilities = TRUE)
  # Fit the model on the text data
  fit_transform <- topic_model$fit_transform(texts, embeddings)
  topics <- fit_transform[[1]]
  probs <- fit_transform[[2]]
  return(list(topics, probs))
}
This example demonstrates how to use reticulate to load Python modules and perform topic modeling directly from R.
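As an alternative to importing inside each function, the reticulate documentation describes a pattern for R packages in which a module is imported once in .onLoad with delay_load = TRUE. The sketch below illustrates that general pattern; it is not a description of how bertopicr itself is implemented.

```r
# Package-level placeholder; filled in when the package is loaded
bertopic <- NULL

.onLoad <- function(libname, pkgname) {
  # delay_load = TRUE postpones the actual import until the module is first
  # used, so loading the R package does not require Python to be initialized
  bertopic <<- reticulate::import("bertopic", delay_load = TRUE)
}
```

With this pattern, package functions can refer to bertopic$BERTopic directly, and the user can still select a Python environment after the package has been loaded.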
Once the package and Python environment are set up, you can use the following functions to perform topic modeling and visualize results:
# Example usage
library(bertopicr)
library(dplyr)
# Load sample data
url <- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-01-18/chocolate.csv"
chocolate <- readr::read_csv(url)
# Sample text data
texts <- chocolate$most_memorable_characteristics
# Run BERTopic on the text data
topic_model <- run_bertopic(texts)
# Analyze the topic results
topic_results <- tibble(Text = texts, Topic = topic_model[[1]], Probability = apply(topic_model[[2]], 1, max))
# Display the topics
topic_results
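A quick summary of the assignments often helps before visualizing anything. The following base R sketch uses small made-up vectors as stand-ins for the topics and probabilities returned above, so it runs without Python; the values are illustrative only. BERTopic labels documents that it could not assign to a topic with -1.

```r
# Toy stand-ins for the topic assignments and probability matrix
# (made-up values for illustration, one row per document)
topics <- c(0L, 1L, -1L, 0L, 1L, 0L)
probs <- matrix(
  c(0.80, 0.10,
    0.20, 0.70,
    0.30, 0.30,
    0.90, 0.05,
    0.10, 0.60,
    0.75, 0.15),
  ncol = 2, byrow = TRUE
)

# Highest topic probability per document
max_prob <- apply(probs, 1, max)

# Number of documents per topic, largest topic first;
# topic -1 collects the outlier documents
topic_counts <- as.data.frame(table(Topic = topics), stringsAsFactors = FALSE)
names(topic_counts)[2] <- "Documents"
topic_counts <- topic_counts[order(-topic_counts$Documents), ]

# Share of documents not assigned to any topic
outlier_share <- mean(topics == -1)
```

A large outlier share can be a sign that the clustering settings need adjusting for your data.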
The package provides functions for visualizing topics, distributions, and hierarchical structures. Here are some examples:
# Visualize topics
visualize_topics(topic_model)
# Visualize topic distribution
visualize_distribution(topic_model, text_id = 1, probabilities = probs)
# Visualize the hierarchical structure of topics
visualize_hierarchy(topic_model)
You can use custom functions to extract specific information from your topic models. For example, you can extract representative documents, display how topics develop over time, or show topic frequencies within predefined classes or groups:
# Get representative documents
representative_docs <- get_most_representative_docs(df_docs, topic_nr = 3, n_docs = 5)
# Visualize topics over time
visualize_topics_over_time(topic_model, topics_over_time, timestamps)
# Visualize topics per class
visualize_topics_per_class(topic_model, topics_per_class)
For model training, dimension reduction, and cluster selection, run the enclosed Quarto example file topics_spiegel.qmd.
We welcome contributions! If you would like to contribute to this package, please follow these steps:
- Fork the repository on GitHub.
- Create a new branch for your feature or bug fix.
- Make your changes and test them.
- Submit a pull request with a description of your changes.
Please ensure that your code follows the package's style and guidelines, and that you include tests for any new features.
This package is licensed under the MIT License. You are free to use, modify, and distribute this software, provided that proper attribution is given to the original author.