Enhancing Editorial Tasks: A Case Study on Rewriting Customer Help Page Contents Using Large Language Models
We introduce a German-language dataset comprising Frequently Asked Question-Answer pairs: raw FAQ drafts, their revisions by professional editors, and LLM-generated revisions. The data was used to investigate the use of large language models (LLMs) to enhance the editorial process of rewriting customer help pages.
The input data was provided by Deutsche Telekom AG (DT), a large German telecommunications company. The corpus comprises 56 question-answer pairs addressing potential customer inquiries across various topics, including additional SIM cards, Netflix subscriptions, relocation, changing mobile service providers, house connection orders, hardware order and delivery status, and fixed-line internet and TV setup. For each FAQ pair, the raw input is provided by specialized departments, and a rewritten gold output is crafted by a professional editor at DT. The final dataset also includes LLM-generated FAQ pairs.
On this dataset, we evaluate the performance of four large language models (LLMs) through diverse prompts tailored for the rewriting task. We conduct automatic evaluations of content and text quality using ROUGE, BERTScore, and ChatGPT. Furthermore, we let professional editors assess the helpfulness of automatically generated FAQ revisions for editorial enhancement. Our findings indicate that LLMs can produce FAQ reformulations beneficial to the editorial process. We observe minimal performance discrepancies among LLMs for this task, and our survey on helpfulness underscores the subjective nature of editors' perspectives on editorial refinement.
For detailed results, please see our paper accepted at INLG 2024, Tokyo, Japan (see also the Citation section below).
The data is provided in JSON format.
The folder `data/faq-data` contains raw FAQ drafts (`input` field), revisions by professional editors (`reference` field), and up to 3 LLM-generated revisions for each input (`predictions` field). The JSON files also contain scores for each prediction (BERTScore, ROUGE, and GPT-4 scores with respect to hallucination, informativeness, and coherence). At the end of each file, overall average scores for each metric are provided. Each file contains the following top-level fields:
- `instances`: the list of evaluated instances.
- `prompt_id`: a `string` feature specifying the LLM, prompt, and task type.
- `description`: a `string` feature.
- `prompt_type`: a `string` feature.
- `system_prompts`: a `string` feature listing the generic system prompt.
- `user_prompts`: the list of user prompts, one per instance.
- `requests`: a list of LLM request parameters, one per instance.
- `responses`: the list of raw LLM responses plus metadata, one per instance.
- `evaluation_overall`: a list of the metrics used, along with the mean score across all LLM-generated revisions contained in the JSON file.
Each `instance` contains the following fields (a skeleton of the full file structure follows the list):

- `instance_id`: the instance identifier, a `string` feature.
- `input`: the draft / input text of an FAQ entry, consisting of a reference `url` field (a `string` feature), a `question` (`string` feature), and an `answer` (`string` feature).
- `predictions`: a list of up to 3 predictions (generated rewrites) of the input generated by the LLM. Each prediction consists of an `idx` field (`int` feature), plus a generated `question` (`string` feature) and `answer` (`string` feature). Each prediction also contains a list of `evaluation` scores (content overlap metrics like `ROUGE`, and GPT-4-based evaluations such as `hallucinations`).
- `reference`: the human-written reference FAQ, consisting of a `question` (`string` feature), an `answer` (`string` feature), and a `url` (`string` feature) reference.
- `use_case`: a `string` feature describing the general topic of the FAQ instance.
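Putting the two field lists together, a data file roughly follows the skeleton below. All values are placeholders; in particular, the exact representation of the `evaluation` and `evaluation_overall` score entries may differ in the released files:

```json
{
  "prompt_id": "...",
  "description": "...",
  "prompt_type": "...",
  "system_prompts": "...",
  "user_prompts": ["..."],
  "requests": ["..."],
  "responses": ["..."],
  "instances": [
    {
      "instance_id": "...",
      "use_case": "...",
      "input":     { "url": "...", "question": "...", "answer": "..." },
      "reference": { "url": "...", "question": "...", "answer": "..." },
      "predictions": [
        { "idx": 0, "question": "...", "answer": "...", "evaluation": ["..."] }
      ]
    }
  ],
  "evaluation_overall": ["..."]
}
```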
Note that the reference URLs may have changed or may no longer contain the exact content we used for this dataset, as the pages are continuously updated to reflect new information.
The folder `data/prompts` contains an overview of the prompts.
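As a minimal sketch of how the FAQ data can be loaded and iterated in Python (the directory name follows the description above, while the `*.json` glob pattern and the printed fields are purely illustrative):

```python
import json
from pathlib import Path

# Iterate over the prediction files in data/faq-data (glob pattern assumed).
for path in sorted(Path("data/faq-data").glob("*.json")):
    with path.open(encoding="utf-8") as f:
        data = json.load(f)

    print(f"{path.name}: {data['prompt_id']} ({len(data['instances'])} instances)")

    for instance in data["instances"]:
        draft = instance["input"]["question"]      # raw FAQ question from the specialized department
        gold = instance["reference"]["question"]   # question as rewritten by a professional editor
        rewrites = [p["question"] for p in instance["predictions"]]  # up to 3 LLM-generated rewrites
        print(instance["use_case"], "|", draft, "->", gold, "|", len(rewrites), "rewrites")
```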
If you use the dataset, please cite the following paper:
@inproceedings{gabryszak-etal-2024-enhancing-editorial,
title = "Enhancing Editorial Tasks: A Case Study on Rewriting Customer Help Page Contents Using Large Language Models",
author = {Gabryszak, Aleksandra and
R{\"o}der, Daniel and
Binder, Arne and
Sion, Luca and
Hennig, Leonhard},
editor = "Mahamood, Saad and
Minh, Nguyen Le and
Ippolito, Daphne",
booktitle = "Proceedings of the 17th International Natural Language Generation Conference",
month = sep,
year = "2024",
address = "Tokyo, Japan",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.inlg-main.33",
pages = "402--411",
abstract = "In this paper, we investigate the use of large language models (LLMs) to enhance the editorial process of rewriting customer help pages. We introduce a German-language dataset comprising Frequently Asked Question-Answer pairs, presenting both raw drafts and their revisions by professional editors. On this dataset, we evaluate the performance of four large language models (LLM) through diverse prompts tailored for the rewriting task. We conduct automatic evaluations of content and text quality using ROUGE, BERTScore, and ChatGPT. Furthermore, we let professional editors assess the helpfulness of automatically generated FAQ revisions for editorial enhancement. Our findings indicate that LLMs can produce FAQ reformulations beneficial to the editorial process. We observe minimal performance discrepancies among LLMs for this task, and our survey on helpfulness underscores the subjective nature of editors{'} perspectives on editorial refinement.",
}
The data is released under the terms of the CC BY-SA 4.0 license.