A library designed to compare texts within a pandas DataFrame, highlighting changes and computing similarity ratios. It utilizes difflib.SequenceMatcher to perform detailed comparisons and generate results that can be easily interpreted and displayed in HTML format.
Medium.com - How to compare texts in Pandas
- HTML Output: Generate an HTML representation of the comparison results, which highlights
additions
,deletions
, andmodifications
in the text. - Similarity Assessment: Compute a similarity ratio for each pair of compared texts, allowing quick assessment of text changes.
- Flexible Integration: Designed to work directly with pandas DataFrames, making it easy to integrate into existing data processing pipelines.
Installation:
pip install pandas-text-comparer
Necessary imports:
from IPython import display
import pandas as pd
from pandas_text_comparer import TextComparer
Nice-to-haves (not required). Read more here and here
from pandarallel import pandarallel # multi-core processing
from tqdm.auto import tqdm # progress bar
tqdm.pandas()
pandarallel.initialize(progress_bar=True)
Specify the names of your columns and run the computation.
# A toy GPT-4 generated dataset. Replace with your data
df = pd.read_csv("https://github.com/n-splv/pandas-text-comparer/raw/main/data/demo/review-responses.csv.gz")
comparer = TextComparer(df, column_a="llm_response", column_b="human_response")
comparer.run()
Generate an HTML table. It can be viewed with IPython.display
in Jupyter.
Alternatively, you can write it to a file and open in any web browser.
html = comparer.get_html()
display.HTML(html)
Sort the result by ratio
- difflib.SequenceMatcher's metric of similarity between two texts, on the scale from 0 to 1. Higher values mean
that the texts are more similar.
html = comparer.get_html(sort_by_ratio="desc") # or "asc"
Add any columns from the original data to the HTML by simply passing a slice of
the DataFrame to get_html
method.
columns = ["review_id", "company_name"]
html = comparer.get_html(df[columns])
When you provide any pandas object with an index (i.e. pd.DataFrame, pd.Series or pd.Indes) as an argument to get_html
,
it is also used to filter the rows.
filt = df.company_name == "FitFusion"
# Filter rows & add columns
html = comparer.get_html(df[filt])
# Just filter rows
html = comparer.get_html(df[filt].index)
A comparer stores its results in a DataFrame - comparer.result
. This data can be persisted and used later on to create
a new comparer. This way, you avoid the re-computation:
result_filepath = "data/comparer_result.csv"
comparer.result.to_csv(result_filepath)
# Don't forget to specify the index column
loaded_result = pd.read_csv(result_filepath, index_col=0)
new_comparer = TextComparer.from_result(loaded_result)
Also, if you need to further process your data based on the computed similarities of texts, just grab this column from the result:
df["similarity_ratio"] = comparer.result.ratio