
Tests And Metrics


Alignment metrics

  1. Accuracy %
  2. Robustness %
  3. Text quality
    • Language
    • Sentiment
    • Emotion
    • Readability index
    • Perplexity
    • Coherence
    • Conciseness

Adversarial metrics

  1. ASR% (Attack Success Rate) for each adversary type
    • Bias
    • Content moderation
    • Prompt Injection
    • Training data extraction
  2. PII detected.
  3. Toxicity detected.

RAG Metrics

  1. Retrieval
    • Context Precision
    • Context Recall
    • Context Relevance
  2. Generation
    • Faithfulness
    • Answer Relevancy
  3. End to End
    • Answer Semantic Similarity
    • Answer Correctness

A detailed description of the tests is provided below:

Accuracy Tests


Application Accuracy

It defines the overall system accuracy, calculated from the proportion of passed tests for a given set of test data.

Answer Relevancy

It measures how relevant and appropriate the generated answer is to the given prompt. A lower score is assigned to answers that are incomplete or contain redundant information and higher scores indicate better relevancy. This metric is computed using the question, the context and the answer.


Usage: Document/Text Summarization, Q&A, Conversational Chatbots.

Answer Similarity

The concept of Answer Semantic Similarity pertains to the assessment of the semantic resemblance between the generated answer and the ground truth. This evaluation is based on the ground truth and the answer, with values falling within the range of 0 to 1.

Usage: Document Summarization, Q&A, Conversational Chatbots, Language Translation.
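
One way to compute such a score, shown purely as an illustration, is to embed both texts and take the cosine similarity of the embeddings. The sketch below assumes the open-source sentence-transformers package and the all-MiniLM-L6-v2 model; neither is necessarily what this framework uses.

```python
# Illustrative sketch: answer similarity via sentence embeddings.
# Assumes the sentence-transformers package; the model name is an example.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

answer = "The Eiffel Tower is located in Paris, France."
ground_truth = "The Eiffel Tower stands in Paris."

# Encode both texts and take the cosine similarity of their embeddings.
embeddings = model.encode([answer, ground_truth], convert_to_tensor=True)
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()

print(f"Answer similarity: {similarity:.2f}")
```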

Answer Correctness

The assessment of Answer Correctness involves gauging the accuracy of the generated answer when compared to the ground truth. This evaluation relies on the ground truth and the answer, with scores ranging from 0 to 1. A higher score indicates a closer alignment between the generated answer and the ground truth, signifying better correctness.

Usage: Document/Text Summarization, Q&A, Conversational Chatbots.

Rouge-L

It is based on the longest common subsequence (LCS) between our model output and reference, i.e. the longest sequence of words (not necessarily consecutive, but still in order) that is shared between both. A longer shared sequence should indicate more similarity between the two sequences.

For example, take a reference R and a candidate summary C:

  • R: The cat is on the mat.
  • C: The cat and the dog.

The LCS is the 3-gram “the cat the” (remember that the words are not necessarily consecutive), which appears in both R and C.

  • ROUGE-L precision is the ratio of the length of the LCS to the number of unigrams in C.
  • ROUGE-L precision = 3/5 = 0.6
  • ROUGE-L recall is the ratio of the length of the LCS to the number of unigrams in R.
  • ROUGE-L recall = 3/6 = 0.5
  • ROUGE-L F1-score = 2 * (precision * recall) / (precision + recall) ≈ 0.55

As a rough guide, a ROUGE-L F1-score below 0.4 is considered low, a score between 0.4 and 0.5 is moderate, and a score above 0.5 is considered good.

Usage: Document/Text Summarization, Language Translation.
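
The worked example above can be reproduced with a small longest-common-subsequence routine. The sketch below is a simplified, word-level illustration; production pipelines typically use a dedicated library such as rouge-score.

```python
# Simplified ROUGE-L sketch for the R/C example above (illustration only).
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

reference = "the cat is on the mat".split()  # 6 unigrams
candidate = "the cat and the dog".split()    # 5 unigrams

lcs = lcs_length(reference, candidate)              # 3 ("the cat ... the")
precision = lcs / len(candidate)                    # 3/5 = 0.6
recall = lcs / len(reference)                       # 3/6 = 0.5
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.55
print(precision, recall, round(f1, 2))
```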

BERT Score

The main idea behind BERTScore is to compute a similarity score between the embeddings of the candidate text and the reference text. This score reflects the similarity in the contextual representations of the words, which captures not only lexical but also contextual similarity.

Here's how BERTScore is calculated:

  1. Embedding Extraction: First, BERT embeddings are extracted for both the candidate text and the reference text. These embeddings are representations of the input text in a high-dimensional vector space, capturing both word meaning and context.
  2. Similarity Calculation: Next, the cosine similarity is computed between the candidate embeddings and each of the reference embeddings. This yields a similarity score for each reference.
  3. Precision and Recall Computation: BERTScore computes precision and recall based on the similarity scores obtained in the previous step. Precision measures how many words in the candidate text are semantically similar to the reference text, while recall measures how many words in the reference text are captured by the candidate text.
  4. F1 Score: Finally, the F1 score is calculated from precision and recall. This harmonic mean provides a single measure that balances the trade-off between precision and recall.

The BERTScore is typically calculated at the sentence level and then aggregated across multiple sentences to obtain an overall score for the entire document.

BERTScore has been shown to correlate well with human judgments of text quality and has become increasingly popular as an evaluation metric for various text generation tasks due to its ability to capture both lexical and contextual similarity. A higher score is better:

  • < 0.4 - bad
  • 0.4-0.6 - average
  • 0.6-0.8 - good
  • > 0.8 - excellent

Usage: Document/Text Summarization, Language Translation.
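
A minimal sketch of computing BERTScore with the open-source bert-score package is shown below; this is only an illustration and may differ from the wrapper used by the framework.

```python
# Illustration only: BERTScore via the bert-score package (pip install bert-score).
from bert_score import score

candidates = ["The cat sat on the mat."]
references = ["A cat was sitting on the mat."]

# Returns precision, recall and F1 tensors, one value per candidate/reference pair.
P, R, F1 = score(candidates, references, lang="en")
print(f"BERTScore F1: {F1[0].item():.3f}")
```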

Robustness Tests


Application Robustness

It defines the overall system robustness, calculated from the proportion of passed tests for a given set of test data.

Meta Data/Properties

Meta properties for text data refer to characteristics or attributes of the sentence that are related to its structure, context, or usage. The addition of meta properties, e.g. intents, entities, language style and variation, helps us evaluate the generalization abilities of the solution under test.

Add Negation

Adding negation to a sentence involves perturbing the input to express the opposite or negated meaning of the original statement.

Add Typos

Adding typos to a sentence refers to intentionally introducing errors in the spelling of the text. This can include misspelled words or other mistakes commonly found in written language.

Change Location

Replaces geographic location names in the original input text.

Change Names

Replaces people's names in the original input text.

Expand Contractions

It expands contractions (where present), i.e. shortened forms of two or more words that are combined by omitting one or more letters and replacing them with an apostrophe, e.g. can't (cannot), don't (do not).

Add/Remove Punctuations

It helps us add or remove punctuation in the original input text.

Add Lexicons

Lexicon refers to the vocabulary or dictionary of words and phrases that are used in a particular language or by a particular person, group, or profession. Using these words helps us better contextualize our data to the use case and test the generalization capabilities of the solution as well.

Synonyms

Substitutes synonymous words in the input text based on tags.

Antonyms

Substitutes antonyms in the input text based on tags.

Add Context

It helps us add a suffix/prefix to the input and see how that affects the response. This comes in handy when we create adversarial questions from our alignment questions by adding the appropriate suffixes/prefixes.
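
As an illustration of how such perturbations can be generated, the sketch below implements two of the simpler ones described above (adding typos and expanding contractions) as plain Python functions; the actual test suite may implement them differently.

```python
# Illustrative perturbation helpers; not necessarily how the test suite does it.
import random
import re

CONTRACTIONS = {"can't": "cannot", "don't": "do not", "won't": "will not", "it's": "it is"}

def add_typo(text: str, seed: int = 0) -> str:
    """Swap two adjacent characters inside one randomly chosen word."""
    random.seed(seed)
    words = text.split()
    i = random.randrange(len(words))
    w = words[i]
    if len(w) > 3:
        j = random.randrange(len(w) - 1)
        words[i] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    return " ".join(words)

def expand_contractions(text: str) -> str:
    """Expand a small, illustrative set of English contractions."""
    pattern = re.compile("|".join(re.escape(c) for c in CONTRACTIONS), re.IGNORECASE)
    return pattern.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

print(add_typo("Please schedule the meeting for tomorrow."))
print(expand_contractions("I can't attend, don't wait for me."))
```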

RAG Tests


Faithfulness

It measures the factual consistency of the generated answer against the given context. It is calculated from the answer and the retrieved context. The score is scaled to the (0, 1) range; higher is better.


Usage: Document/Text Summarization, Q&A, Conversational Chatbots.

Context Precision

It evaluates whether all of the ground-truth-relevant items present in the contexts are ranked highly. Ideally, all the relevant chunks should appear at the top ranks. This metric is computed using the question, the ground truth and the contexts, with values ranging between 0 and 1; higher scores indicate better precision.


Usage: Document/Text Summarization, Q&A, Conversational Chatbots.

Context Relevancy

It gauges the relevancy of the retrieved context, calculated based on both the question and contexts. The values fall within the range of (0, 1), with higher values indicating better relevancy.


Usage: Document/Text Summarization, Q&A, Conversational Chatbots.

Context Recall

It measures the extent to which the retrieved context aligns with the annotated answer, treated as the ground truth. It is computed based on the ground truth and the retrieved context, and the values range between 0 and 1, with higher values indicating better performance.


Usage: Document/Text Summarization, Q&A.
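
The retrieval and generation metrics above mirror those popularized by the open-source ragas library. Purely as an illustration (the exact harness, column names and API version used here may differ), a typical evaluation call looks like this:

```python
# Illustration only: RAG metrics computed with the ragas library.
# Column names and the evaluate() API follow ragas' documented usage and may
# vary between versions.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

data = {
    "question": ["When was the Eiffel Tower completed?"],
    "answer": ["The Eiffel Tower was completed in 1889."],
    "contexts": [["The Eiffel Tower was finished in 1889 for the World's Fair."]],
    "ground_truth": ["It was completed in 1889."],
}

result = evaluate(
    Dataset.from_dict(data),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)  # per-metric scores in the (0, 1) range
```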

Adversarial Tests

Content Moderation - Profanity: Are the LLM's responses vulgar or obscene?
Content Moderation - Toxicity: Do the LLM's responses contain hate speech, insults, or harassment?
Content Moderation - Racist: Do the LLM's responses contain racist slurs or ethnic abuse?
Content Moderation - Sexist: Does the LLM's response show prejudice or discrimination based on sex/gender?
Bias - Demographic: Does the LLM's response show bias toward demographic groups?
Bias - Cultural: Does the LLM's response show bias toward cultural backgrounds?
Bias PII - Gender: Does the LLM's response show bias based on the gender of the user?
Bias PII - Sex: Does the LLM's response show bias based on sex and sexual orientation?
Bias PII - Age: Does the LLM's response show bias based on the user's age?
Bias PII - Race: Does the LLM's response show bias based on the user's race?
Bias PII - Religion: Does the LLM's response show bias based on the user's religion?
Bias PII - Citizenship: Does the LLM's response show bias based on the user's citizenship?
Bias PII - Disability: Does the LLM's response show bias based on the user's disability?
Prompt Injection: If extra text or context is added to a language model's input, it might affect the model's output. Does it tilt the generated text toward the attacker's desired results or change the model's responses, moving them toward particular topics, tones, or styles?
Training Data Extraction: Does the LLM's response give away parts of its training data when prompts including, e.g., special characters, line breaks, or word repetitions are given as input?

Text Quality and Sentiment Tests


Coherence

It shows how logical and consistent the response is with respect to the input prompt. A higher score is better.

Conciseness

It shows how brief the answer is for the given input prompt. A higher score is better.

Harmfulness

It shows whether the response is potentially damaging, offensive, or detrimental in nature. A lower score is better.

Maliciousness

It shows whether the response has the potential to cause harm to individuals, organizations, or systems.

Flesch-Kincaid Grade (Readability)

The Flesch-Kincaid Grade Level is a readability metric used to assess the complexity of written text in English. It provides an estimate of the educational grade level required to understand the text. The metric is based on two factors: average sentence length (measured in words) and average syllables per word.

Flesch-Kincaid Grade Level is calculated:

  1. Calculate the average number of words per sentence (AWPS) by dividing the total number of words in the text by the total number of sentences. AWPS = Total words / Total sentences
  2. Calculate the average number of syllables per word (ASPW) by dividing the total number of syllables in the text by the total number of words. ASPW = Total syllables / Total words
  3. Use the following formula to calculate the Flesch-Kincaid Grade Level: FKGL = 0.39 * AWPS + 11.8 * ASPW - 15.59

The resulting FKGL score represents the grade level at which the text is readable. For example, a FKGL score of 8.0 indicates that the text is readable by an average eighth-grade student.
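
A minimal sketch of this calculation is given below; the syllable counter is a rough vowel-group heuristic included only for illustration.

```python
import re

def count_syllables(word: str) -> int:
    """Rough heuristic: count groups of consecutive vowels (illustration only)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text: str) -> float:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    awps = len(words) / len(sentences)                          # average words per sentence
    aspw = sum(count_syllables(w) for w in words) / len(words)  # average syllables per word
    return 0.39 * awps + 11.8 * aspw - 15.59

print(round(flesch_kincaid_grade("The cat sat on the mat. It was warm and sunny."), 2))
```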

Flesch Reading Ease (Readability)

The Flesch Reading Ease is another readability metric developed by Rudolf Flesch. It provides a numerical score to indicate how easy or difficult a piece of text is to read. The score is based on two factors: the average sentence length (measured in words) and the average number of syllables per word.

The Flesch Reading Ease formula is:

Flesch Reading Ease = 206.835 − 1.015 × (average words per sentence) − 84.6 × (average syllables per word)

  • 90-100: Very easy to read. Easily understood by an average 11-year-old student.
  • 80-89: Easy to read. Understandable by a 13- to 15-year-old student.
  • 70-79: Fairly easy to read. Understandable by a 16- to 17-year-old student.
  • 60-69: Standard readability. Understandable by an average 18- to 19-year-old student.
  • 50-59: Fairly difficult to read. Suitable for college graduates.
  • 30-49: Difficult to read. Suitable for college students and professionals.

  • 0-29: Very difficult to read. Best understood by university graduates.
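
In practice this score is usually obtained from an off-the-shelf helper rather than computed by hand. A minimal illustration, assuming the open-source textstat package:

```python
# Illustration only: Flesch Reading Ease via textstat (pip install textstat).
import textstat

text = "The quick brown fox jumps over the lazy dog. It runs fast."
print(textstat.flesch_reading_ease(text))  # higher means easier to read
```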

Automated Readability Index

Automated Readability Index (ARI) is a readability test that provides an estimate of the educational grade level required to understand a piece of text. It's like the Flesch-Kincaid Grade Level but uses a slightly different formula.

The formula for ARI is:

ARI = 4.71 × (characters / words) + 0.5 × (words / sentences) − 21.43

Where:

  • "characters" is the total number of characters in the text.
  • "words" is the total number of words in the text.
  • "sentences" is the total number of sentences in the text.

The resulting ARI score corresponds to a U.S. grade level. For example, an ARI score of 8.0 indicates that the text is readable by an average eighth-grade student.
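
A minimal sketch of the ARI formula, using simple letter/word/sentence counts for illustration only:

```python
import re

def automated_readability_index(text: str) -> float:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    characters = sum(len(w) for w in words)  # letters only; spaces and punctuation ignored
    return 4.71 * (characters / len(words)) + 0.5 * (len(words) / len(sentences)) - 21.43

print(round(automated_readability_index("The cat sat on the mat. It was warm and sunny."), 2))
```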

Language Detection

To ensure coherence in responses, it's essential to detect the language of the input question and generate an output that aligns with the same language. This process helps evaluate whether the model responds in the language of the question or appropriately translates it if a translation use case is in play. This methodology ensures linguistic consistency and enhances the user experience by providing responses in the preferred language of interaction.
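
As a simple illustration, the language of the question and of the response can be compared with an off-the-shelf detector; the sketch below assumes the langdetect package, but any detector would do.

```python
# Illustration only: check that the response language matches the question language.
from langdetect import detect  # pip install langdetect

question = "¿Cuál es la capital de Francia?"
response = "La capital de Francia es París."

q_lang, r_lang = detect(question), detect(response)
print(q_lang, r_lang, q_lang == r_lang)  # e.g. 'es' 'es' True
```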

Sentiment Analysis

Sentiment analysis involves determining the sentiment or opinion expressed in a text, categorizing it as positive, negative, or neutral. BERT (Bidirectional Encoder Representations from Transformers) can enhance sentiment analysis by leveraging its powerful pre-trained language model, which captures both word meaning and context, thus enabling effective understanding of text sentiment.

  • Positive sentence - Typically expresses favorable or uplifting sentiments. E.g. I had a fantastic day at the beach with my friends.
  • Neutral Sentence - Doesn't convey strong positive or negative emotions. E.g. The sky is blue.
  • Negative Sentence - Typically expresses unfavorable or pessimistic sentiments. E.g. I failed the exam and now I feel miserable.

Emotion Analysis

Emotion analysis is like sentiment analysis, but instead of just focusing on whether the sentence is positive or negative, we classify the sentence based on its mood, such as happy, sad, etc. Here again we use the BERT model to classify each sentence into the classes with a confidence score; the class with the highest confidence score is assigned to the sentence.

The output of this process is the classification of the sentence in terms of sentiment and emotion. This facilitates detection of any harmful intent or words used in the input, enables understanding of the emotion conveyed in the question, and allows analysis of whether the chatbot's responses remain positive regardless of the input sentiment.
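
For illustration, both classifications can be obtained with Hugging Face pipelines; the emotion model named below is one publicly available example and not necessarily the one used by this framework.

```python
# Illustration only: sentiment and emotion classification with transformers pipelines.
from transformers import pipeline

sentiment = pipeline("sentiment-analysis")  # default English sentiment model
emotion = pipeline(
    "text-classification",
    model="j-hartmann/emotion-english-distilroberta-base",  # example emotion model
)

text = "I failed the exam and now I feel miserable."
print(sentiment(text))  # e.g. [{'label': 'NEGATIVE', 'score': ...}]
print(emotion(text))    # e.g. [{'label': 'sadness', 'score': ...}]
```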

Perplexity (WIP)

Perplexity in the context of language models, including LLMs (Large Language Models), refers to a measure of how well the model predicts a sample of text. It is commonly used to evaluate the performance of language models by assessing how accurately they can predict the next word in a sequence of words.
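
A minimal sketch of how perplexity can be computed for a piece of text with a small causal language model (GPT-2 is used here purely for illustration):

```python
# Illustration only: perplexity of a text under a small causal LM (GPT-2).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = "The quick brown fox jumps over the lazy dog."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # The model returns the average cross-entropy loss over the tokens;
    # exponentiating it gives the perplexity.
    loss = model(**inputs, labels=inputs["input_ids"]).loss

print(f"Perplexity: {torch.exp(loss).item():.2f}")  # lower = more predictable text
```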