
Synthetic Review Generation

Problem

A common challenge in AI tasks such as intent detection, sentiment analysis, slot filling, and recommendation is the limited availability of datasets for model training. The objective of this project is to create synthetic datasets that mimic realistic human-written text and can be used to train, design, and evaluate such AI systems.
Given a large dataset of real e-commerce reviews, the task is to generate synthetic reviews.

Dataset

Amazon Reviews dataset: https://amazon-reviews-2023.github.io/main.html
For fine-tuning the model, a subset of the data (“Supplements/Vitamins”) has been used. Download link: https://drive.google.com/file/d/1o9IvevRbxKagdE-Op1BJRl8iJUc-0kJ4/view?usp=sharing

Approach

The pre-trained GPT-2 117M model has been fine-tuned on the real reviews.
• Number of epochs = 30; training time: 2 hrs 10 mins on an NVIDIA Tesla P100 GPU.
• Few-shot learning was also used: for every product, two sample reviews with the same product title are taken from the training data and used as examples for 2-shot prompting, as sketched below.
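
A minimal sketch of the 2-shot generation step, assuming the fine-tuned checkpoint is loaded with the Hugging Face transformers library. The checkpoint path finetuned-gpt2-reviews, the prompt template, and the sampling parameters are illustrative assumptions, not the exact ones used in this repo.

```python
# Sketch only: "finetuned-gpt2-reviews" is a placeholder path for the
# fine-tuned checkpoint; the prompt template and sampling settings are
# assumptions for illustration.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("finetuned-gpt2-reviews")
model = GPT2LMHeadModel.from_pretrained("finetuned-gpt2-reviews")

def generate_review(product_title, example_reviews):
    # Build a 2-shot prompt from two real reviews of the same product.
    prompt = ""
    for review in example_reviews[:2]:
        prompt += f"Product: {product_title}\nReview: {review}\n\n"
    prompt += f"Product: {product_title}\nReview:"

    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        max_new_tokens=80,
        do_sample=True,
        top_p=0.95,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Strip the prompt so only the newly generated review remains.
    full_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return full_text[len(prompt):].strip()

print(generate_review(
    "Vitamin C 1000mg Tablets",
    ["Great quality, noticed a difference in two weeks.",
     "Easy to swallow and no aftertaste."],
))
```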

Evaluation

• We use a classifier to check whether synthetic data can be detected. After generating the synthetic reviews, we mix them with real ones and label both (real = 0, synthetic = 1), then train a BERT model to separate the two classes. BERT was able to classify them quite well, so the synthetic texts were identifiable. A sketch of this setup follows.
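
A hedged sketch of that detector using the Hugging Face Trainer API; the toy data, the bert-base-uncased checkpoint, and all hyperparameters are illustrative assumptions.

```python
# Sketch of the real-vs-synthetic detector. Toy lists stand in for the
# actual review corpora; the model name and hyperparameters are
# assumptions for illustration.
import numpy as np
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

real = ["Works great, arrived on time.", "Too sweet for my taste."]
synthetic = ["This product is good product.", "I like the vitamin very much."]

data = Dataset.from_dict({
    "text": real + synthetic,
    "label": [0] * len(real) + [1] * len(synthetic),  # real = 0, synthetic = 1
}).train_test_split(test_size=0.5)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

data = data.map(
    lambda batch: tokenizer(batch["text"], truncation=True,
                            padding="max_length", max_length=128),
    batched=True)

def accuracy(eval_pred):
    logits, labels = eval_pred
    return {"accuracy": float((np.argmax(logits, axis=-1) == labels).mean())}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="detector", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=data["train"],
    eval_dataset=data["test"],
    compute_metrics=accuracy,
)
trainer.train()
# High accuracy here means the synthetic reviews are easy to tell apart.
print(trainer.evaluate())
```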
• We compare the term-frequency distributions of the real and synthetic data. The distributions of the top 10 most frequent terms are broadly similar, though not identical (see the comparison sketch below).

(Figure: frequency of the top 10 terms in real vs. synthetic reviews)
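
One way to reproduce this comparison; the regex tokenization, the tiny stop-word list, and the sample reviews are assumptions for illustration.

```python
# Sketch of the term-frequency comparison. Tokenization and the stop-word
# list are assumptions; real_reviews / synthetic_reviews stand in for the
# actual corpora.
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "i", "it", "is", "to", "of", "this", "for"}

def top_terms(reviews, k=10):
    counts = Counter()
    for text in reviews:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts.most_common(k)

real_reviews = ["Great vitamins, I feel more energetic.",
                "These tablets are easy to swallow."]
synthetic_reviews = ["Good vitamins, very good energy.",
                     "The tablets taste fine and work well."]

print("real:     ", top_terms(real_reviews))
print("synthetic:", top_terms(synthetic_reviews))
```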

• We also check semantic similarity. The mean semantic similarity score between real and synthetic reviews is 0.1878, which suggests the synthetic reviews are only weakly similar to the real ones. The sketch below shows how such a score can be computed.

(Figure: real-to-synthetic semantic similarity scores)
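A sketch of the semantic-similarity check; the repo does not name the embedding model that produced the 0.1878 figure, so sentence-transformers with all-MiniLM-L6-v2 is an assumption here.

```python
# Sketch of the semantic-similarity score. The embedding model is an
# assumption; the repo does not specify which one was actually used.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

real_reviews = ["Great vitamins, I feel more energetic.",
                "These tablets are easy to swallow."]
synthetic_reviews = ["Good vitamins, very good energy.",
                     "The tablets taste fine and work well."]

real_emb = model.encode(real_reviews, convert_to_tensor=True)
syn_emb = model.encode(synthetic_reviews, convert_to_tensor=True)

# Mean cosine similarity over all real/synthetic pairs.
scores = util.cos_sim(real_emb, syn_emb)
print("mean real-to-synthetic similarity:", scores.mean().item())
```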

The synthetic data preparation strategy needs further work.

Fine-Tuned Model

The fine-tuned GPT-2 117M model can be downloaded from the link below.
https://drive.google.com/file/d/1aNz5GdHRR1fPKhs5pAHxQzqkQlYy88_u/view?usp=sharing
