
Tests And Metrics


Alignment metrics

  1. Accuracy %
  2. Robustness %
  3. Text quality
    • Language
    • Sentiment
    • Emotion
    • Readability index
    • Perplexity
    • Coherence
    • Conciseness

Adversarial metrics

  1. ASR% (Attack Success Rate) for each adversary type
    • Bias
    • Content moderation
    • Prompt Injection
    • Training data extraction
  2. PII detected.
  3. Toxicity detected.

RAG Metrics

  1. Retrieval
    • Context Precision
    • Context Recall
    • Context Relevance
  2. Generation
    • Faithfulness
    • Answer Relevancy
  3. End to End
    • Answer Semantic Similarity
    • Answer Correctness

A detailed description of the tests is provided below:

Accuracy Tests


Application Accuracy

It defines the overall system accuracy, calculated from the proportion of passed tests for a given set of test data.

Answer Relevancy

It measures how relevant and appropriate the generated answer is to the given prompt. A lower score is assigned to answers that are incomplete or contain redundant information and higher scores indicate better relevancy. This metric is computed using the question, the context and the answer.


Usage: Document/Text Summarization, Q&A, Conversational Chatbots.

Answer Similarity

The concept of Answer Semantic Similarity pertains to the assessment of the semantic resemblance between the generated answer and the ground truth. This evaluation is based on the ground truth and the answer, with values falling within the range of 0 to 1.

Usage: Document Summarization, Q&A, Conversational Chatbots, Language Translation.
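
One way to compute such a score, shown purely as an illustration, is to embed both texts and take the cosine similarity of the embeddings. The sketch below assumes the open-source sentence-transformers package and the all-MiniLM-L6-v2 model; neither is necessarily what this framework uses.

```python
# Illustrative sketch: answer similarity via sentence embeddings.
# Assumes the sentence-transformers package; the model name is an example.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

answer = "The Eiffel Tower is located in Paris, France."
ground_truth = "The Eiffel Tower stands in Paris."

# Encode both texts and take the cosine similarity of their embeddings.
embeddings = model.encode([answer, ground_truth], convert_to_tensor=True)
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()

print(f"Answer similarity: {similarity:.2f}")
```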

Answer Correctness

The assessment of Answer Correctness involves gauging the accuracy of the generated answer when compared to the ground truth. This evaluation relies on the ground truth and the answer, with scores ranging from 0 to 1. A higher score indicates a closer alignment between the generated answer and the ground truth, signifying better correctness.

Usage: Document/Text Summarization, Q&A, Conversational Chatbots.

Rouge-L

It is based on the longest common subsequence (LCS) between our model output and reference, i.e. the longest sequence of words (not necessarily consecutive, but still in order) that is shared between both. A longer shared sequence should indicate more similarity between the two sequences.

For example, take a reference R and a candidate summary C:

  • R: The cat is on the mat.
  • C: The cat and the dog.

The LCS is the 3-gram “the cat the” (remember that the words are not necessarily consecutive), which appears in both R and C.

  • ROUGE-L precision is the ratio of the length of the LCS to the number of unigrams in C.
  • ROUGE-L precision = 3/5 = 0.6
  • ROUGE-L recall is the ratio of the length of the LCS to the number of unigrams in R.
  • ROUGE-L recall = 3/6 = 0.5
  • ROUGE-L F1-score = 2 * (precision * recall) / (precision + recall) ≈ 0.55

As a rough guide, a ROUGE-L F1-score below 0.4 is considered low, a score between 0.4 and 0.5 is moderate, and a score above 0.5 is considered good.

Usage: Document/Text Summarization, Language Translation.
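
The worked example above can be reproduced with a small longest-common-subsequence routine. The sketch below is a simplified, word-level illustration; production pipelines typically use a dedicated library such as rouge-score.

```python
# Simplified ROUGE-L sketch for the R/C example above (illustration only).
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

reference = "the cat is on the mat".split()  # 6 unigrams
candidate = "the cat and the dog".split()    # 5 unigrams

lcs = lcs_length(reference, candidate)              # 3 ("the cat ... the")
precision = lcs / len(candidate)                    # 3/5 = 0.6
recall = lcs / len(reference)                       # 3/6 = 0.5
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.55
print(precision, recall, round(f1, 2))
```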

BERT Score

The main idea behind BERTScore is to compute a similarity score between the embeddings of the candidate text and the reference text. This score reflects the similarity in the contextual representations of the words, which captures not only lexical but also contextual similarity.

Here's how BERTScore is calculated:

  1. Embedding Extraction: First, BERT embeddings are extracted for both the candidate text and the reference text. These embeddings are representations of the input text in a high-dimensional vector space, capturing both word meaning and context.
  2. Similarity Calculation: Next, the cosine similarity is computed between the candidate embeddings and each of the reference embeddings. This yields a similarity score for each reference.
  3. Precision and Recall Computation: BERTScore computes precision and recall based on the similarity scores obtained in the previous step. Precision measures how many words in the candidate text are semantically similar to the reference text, while recall measures how many words in the reference text are captured by the candidate text.
  4. F1 Score: Finally, the F1 score is calculated from precision and recall. This harmonic mean provides a single measure that balances the trade-off between precision and recall.

The BERTScore is typically calculated at the sentence level and then aggregated across multiple sentences to obtain an overall score for the entire document.

BERTScore has been shown to correlate well with human judgments of text quality and has become increasingly popular as an evaluation metric for various text generation tasks due to its ability to capture both lexical and contextual similarity. A higher score is better:

  • < 0.4 - bad
  • 0.4-0.6 - average
  • 0.6-0.8 - good
  • > 0.8 - excellent

Usage: Document/Text Summarization, Language Translation.
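
A minimal sketch of computing BERTScore with the open-source bert-score package is shown below; this is only an illustration and may differ from the wrapper used by the framework.

```python
# Illustration only: BERTScore via the bert-score package (pip install bert-score).
from bert_score import score

candidates = ["The cat sat on the mat."]
references = ["A cat was sitting on the mat."]

# Returns precision, recall and F1 tensors, one value per candidate/reference pair.
P, R, F1 = score(candidates, references, lang="en")
print(f"BERTScore F1: {F1[0].item():.3f}")
```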

Robustness Tests


Application Robustness

It defines the overall system robustness, calculated from the proportion of passed tests for a given set of test data.

Meta Data/Properties

Meta properties for text data refer to characteristics or attributes of the sentence that are related to its structure, context, or usage. The addition of meta properties, e.g. intents, entities, language style and variation, helps us evaluate the generalization abilities of the solution under test.

Add Negation

Adding negation to a sentence involves perturbing the input to express the opposite or negated meaning of the original statement.

Add Typos

Adding typos to a sentence refers to intentionally introducing errors in the spelling of the text. This can include misspelled words or other mistakes commonly found in written language.

Change Location

Replaces geographic location names in the original input text.

Change Names

Replaces people's names in the original input text.

Expand Contractions

It expands contractions (where present), i.e. shortened forms of two or more words that are combined by omitting one or more letters and replacing them with an apostrophe, e.g. can't (cannot), don't (do not).

Add/Remove Punctuations

It helps us add or remove punctuation in the original input text.

Add Lexicons

Lexicon refers to the vocabulary or dictionary of words and phrases that are used in a particular language or by a particular person, group, or profession. Using these words helps us better contextualize our data to the use case and test the generalization capabilities of the solution as well.

Synonyms

Substitutes synonymous words in the input text based on tags.

Antonyms

Substitutes antonyms in the input text based on tags.

Add Context

It helps us add a suffix/prefix to the input and see how that affects the response. This comes in handy when we create adversarial questions from our alignment questions by adding the appropriate suffixes/prefixes.
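
As an illustration of how such perturbations can be generated, the sketch below implements two of the simpler ones described above (adding typos and expanding contractions) as plain Python functions; the actual test suite may implement them differently.

```python
# Illustrative perturbation helpers; not necessarily how the test suite does it.
import random
import re

CONTRACTIONS = {"can't": "cannot", "don't": "do not", "won't": "will not", "it's": "it is"}

def add_typo(text: str, seed: int = 0) -> str:
    """Swap two adjacent characters inside one randomly chosen word."""
    random.seed(seed)
    words = text.split()
    i = random.randrange(len(words))
    w = words[i]
    if len(w) > 3:
        j = random.randrange(len(w) - 1)
        words[i] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    return " ".join(words)

def expand_contractions(text: str) -> str:
    """Expand a small, illustrative set of English contractions."""
    pattern = re.compile("|".join(re.escape(c) for c in CONTRACTIONS), re.IGNORECASE)
    return pattern.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

print(add_typo("Please schedule the meeting for tomorrow."))
print(expand_contractions("I can't attend, don't wait for me."))
```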

RAG Tests


Faithfulness

It measures the factual consistency of the generated answer against the given context. It is calculated from the answer and the retrieved context. The score is scaled to the (0, 1) range; higher is better.


Usage: Document/Text Summarization, Q&A, Conversational Chatbots.

Context Precision

It evaluates whether all of the ground-truth-relevant items present in the contexts are ranked highly. Ideally, all the relevant chunks should appear at the top ranks. This metric is computed using the question, the ground truth and the contexts, with values ranging between 0 and 1; higher scores indicate better precision.


Usage: Document/Text Summarization, Q&A, Conversational Chatbots.

Context Relevancy

It gauges the relevancy of the retrieved context, calculated based on both the question and contexts. The values fall within the range of (0, 1), with higher values indicating better relevancy.


Usage: Document/Text Summarization, Q&A, Conversational Chatbots.

Context Recall

It measures the extent to which the retrieved context aligns with the annotated answer, treated as the ground truth. It is computed based on the ground truth and the retrieved context, and the values range between 0 and 1, with higher values indicating better performance.


Usage: Document/Text Summarization, Q&A.
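
The retrieval and generation metrics above mirror those popularized by the open-source ragas library. Purely as an illustration (the exact harness, column names and API version used here may differ), a typical evaluation call looks like this:

```python
# Illustration only: RAG metrics computed with the ragas library.
# Column names and the evaluate() API follow ragas' documented usage and may
# vary between versions.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

data = {
    "question": ["When was the Eiffel Tower completed?"],
    "answer": ["The Eiffel Tower was completed in 1889."],
    "contexts": [["The Eiffel Tower was finished in 1889 for the World's Fair."]],
    "ground_truth": ["It was completed in 1889."],
}

result = evaluate(
    Dataset.from_dict(data),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)  # per-metric scores in the (0, 1) range
```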

Adversarial Tests

Content Moderation - Profanity: Are the LLM's responses vulgar or obscene?
Content Moderation - Toxicity: Do the LLM's responses contain hate speech, insults, or harassment?
Content Moderation - Racist: Do the LLM's responses contain racist slurs or ethnic abuse?
Content Moderation - Sexist: Does the LLM's response show prejudice or discrimination based on sex/gender?
Bias - Demographic: Does the LLM's response show bias toward demographic groups?
Bias - Cultural: Does the LLM's response show bias toward cultural backgrounds?
Bias PII - Gender: Does the LLM's response show bias based on the gender of the user?
Bias PII - Sex: Does the LLM's response show bias based on sex and sexual orientation?
Bias PII - Age: Does the LLM's response show bias based on the user's age?
Bias PII - Race: Does the LLM's response show bias based on the user's race?
Bias PII - Religion: Does the LLM's response show bias based on the user's religion?
Bias PII - Citizenship: Does the LLM's response show bias based on the user's citizenship?
Bias PII - Disability: Does the LLM's response show bias based on the user's disability?
Prompt Injection: If extra text or context is added to a language model's input, it might affect the model's output. Does it tilt the generated text toward the attacker's desired results or change the model's responses, moving them toward particular topics, tones, or styles?
Training Data Extraction: Does the LLM's response give away parts of its training data when prompts including, e.g., special characters, line breaks, or word repetitions are given as input?

Text Quality and Sentiment Tests


Coherence

It shows how logical and consistent the response is with respect to the input prompt. A higher score is better.

Conciseness

It shows how brief the answer is for the given input prompt. A higher score is better.

Harmfulness

It shows whether the response is potentially damaging, offensive, or detrimental in nature. A lower score is better.

Maliciousness

It shows whether the response has the potential to cause harm to individuals, organizations, or systems.

Flesch-Kincaid Grade (Readability)

The Flesch-Kincaid Grade Level is a readability metric used to assess the complexity of written text in English. It provides an estimate of the educational grade level required to understand the text. The metric is based on two factors: average sentence length (measured in words) and average syllables per word.

Flesch-Kincaid Grade Level is calculated:

  1. Calculate the average number of words per sentence (AWPS) by dividing the total number of words in the text by the total number of sentences. AWPS = Total words / Total sentences
  2. Calculate the average number of syllables per word (ASPW) by dividing the total number of syllables in the text by the total number of words. ASPW = Total syllables / Total words
  3. Use the following formula to calculate the Flesch-Kincaid Grade Level: FKGL = 0.39 * AWPS + 11.8 * ASPW - 15.59

The resulting FKGL score represents the grade level at which the text is readable. For example, a FKGL score of 8.0 indicates that the text is readable by an average eighth-grade student.
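
A minimal sketch of this calculation is given below; the syllable counter is a rough vowel-group heuristic included only for illustration.

```python
import re

def count_syllables(word: str) -> int:
    """Rough heuristic: count groups of consecutive vowels (illustration only)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text: str) -> float:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    awps = len(words) / len(sentences)                          # average words per sentence
    aspw = sum(count_syllables(w) for w in words) / len(words)  # average syllables per word
    return 0.39 * awps + 11.8 * aspw - 15.59

print(round(flesch_kincaid_grade("The cat sat on the mat. It was warm and sunny."), 2))
```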

Flesch Reading Ease (Readability)

The Flesch Reading Ease is another readability metric developed by Rudolf Flesch. It provides a numerical score to indicate how easy or difficult a piece of text is to read. The score is based on two factors: the average sentence length (measured in words) and the average number of syllables per word.

The Flesch Reading Ease formula is:

Flesch Reading Ease = 206.835 − 1.015 × (average words per sentence) − 84.6 × (average syllables per word)

  • 90-100: Very easy to read. Easily understood by an average 11-year-old student.
  • 80-89: Easy to read. Understandable by a 13- to 15-year-old student.
  • 70-79: Fairly easy to read. Understandable by a 16- to 17-year-old student.
  • 60-69: Standard readability. Understandable by an average 18- to 19-year-old student.
  • 50-59: Fairly difficult to read. Suitable for college graduates.
  • 30-49: Difficult to read. Suitable for college students and professionals.

  • 0-29: Very difficult to read. Best understood by university graduates.
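
In practice this score is usually obtained from an off-the-shelf helper rather than computed by hand. A minimal illustration, assuming the open-source textstat package:

```python
# Illustration only: Flesch Reading Ease via textstat (pip install textstat).
import textstat

text = "The quick brown fox jumps over the lazy dog. It runs fast."
print(textstat.flesch_reading_ease(text))  # higher means easier to read
```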

Automated Readability Index

Automated Readability Index (ARI) is a readability test that provides an estimate of the educational grade level required to understand a piece of text. It's like the Flesch-Kincaid Grade Level but uses a slightly different formula.

The formula for ARI is:

ARI = 4.71 × (characters / words) + 0.5 × (words / sentences) − 21.43

Where:

  • "characters" is the total number of characters in the text.
  • "words" is the total number of words in the text.
  • "sentences" is the total number of sentences in the text.

The resulting ARI score corresponds to a U.S. grade level. For example, an ARI score of 8.0 indicates that the text is readable by an average eighth-grade student.
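
A minimal sketch of the ARI formula, using simple letter/word/sentence counts for illustration only:

```python
import re

def automated_readability_index(text: str) -> float:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    characters = sum(len(w) for w in words)  # letters only; spaces and punctuation ignored
    return 4.71 * (characters / len(words)) + 0.5 * (len(words) / len(sentences)) - 21.43

print(round(automated_readability_index("The cat sat on the mat. It was warm and sunny."), 2))
```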

Language Detection

To ensure coherence in responses, it's essential to detect the language of the input question and generate an output that aligns with the same language. This process helps evaluate whether the model responds in the language of the question or appropriately translates it if a translation use case is in play. This methodology ensures linguistic consistency and enhances the user experience by providing responses in the preferred language of interaction.
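
As a simple illustration, the language of the question and of the response can be compared with an off-the-shelf detector; the sketch below assumes the langdetect package, but any detector would do.

```python
# Illustration only: check that the response language matches the question language.
from langdetect import detect  # pip install langdetect

question = "¿Cuál es la capital de Francia?"
response = "La capital de Francia es París."

q_lang, r_lang = detect(question), detect(response)
print(q_lang, r_lang, q_lang == r_lang)  # e.g. 'es' 'es' True
```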

Sentiment Analysis

Sentiment analysis involves determining the sentiment or opinion expressed in a text, categorizing it as positive, negative, or neutral. BERT (Bidirectional Encoder Representations from Transformers) can enhance sentiment analysis by leveraging its powerful pre-trained language model, which captures both word meaning and context, thus enabling effective understanding of text sentiment.

  • Positive sentence - Typically expresses favorable or uplifting sentiments. E.g. I had a fantastic day at the beach with my friends.
  • Neutral Sentence - Doesn't convey strong positive or negative emotions. E.g. The sky is blue.
  • Negative Sentence - Typically expresses unfavorable or pessimistic sentiments. E.g. I failed the exam and now I feel miserable.

Emotion Analysis

Emotion analysis is like sentiment analysis, but instead of just focusing on whether the sentence is positive or negative, we classify the sentence based on its mood, such as happy, sad, etc. Here again we use the BERT model to classify each sentence into the classes with a confidence score; the class with the highest confidence score is assigned to the sentence.

The output of this process is the classification of the sentence in terms of sentiment and emotion. This facilitates detection of any harmful intent or words used in the input, enables understanding of the emotion conveyed in the question, and allows analysis of whether the chatbot's responses remain positive regardless of the input sentiment.
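
For illustration, both classifications can be obtained with Hugging Face pipelines; the emotion model named below is one publicly available example and not necessarily the one used by this framework.

```python
# Illustration only: sentiment and emotion classification with transformers pipelines.
from transformers import pipeline

sentiment = pipeline("sentiment-analysis")  # default English sentiment model
emotion = pipeline(
    "text-classification",
    model="j-hartmann/emotion-english-distilroberta-base",  # example emotion model
)

text = "I failed the exam and now I feel miserable."
print(sentiment(text))  # e.g. [{'label': 'NEGATIVE', 'score': ...}]
print(emotion(text))    # e.g. [{'label': 'sadness', 'score': ...}]
```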

Perplexity (WIP)

Perplexity in the context of language models, including LLMs (Large Language Models), refers to a measure of how well the model predicts a sample of text. It is commonly used to evaluate the performance of language models by assessing how accurately they can predict the next word in a sequence of words.
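
A minimal sketch of how perplexity can be computed for a piece of text with a small causal language model (GPT-2 is used here purely for illustration):

```python
# Illustration only: perplexity of a text under a small causal LM (GPT-2).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = "The quick brown fox jumps over the lazy dog."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # The model returns the average cross-entropy loss over the tokens;
    # exponentiating it gives the perplexity.
    loss = model(**inputs, labels=inputs["input_ids"]).loss

print(f"Perplexity: {torch.exp(loss).item():.2f}")  # lower = more predictable text
```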