# Tests And Metrics
- Accuracy %
- Robustness %
- Text quality
  - Language
  - Sentiment
  - Emotion
  - Readability index
  - Perplexity
  - Coherence
  - Conciseness
- ASR % (Attack Success Rate) for each adversary type
  - Bias
  - Content moderation
  - Prompt Injection
  - Training data extraction
  - PII detected
  - Toxicity detected
- Retrieval
  - Context Precision
  - Context Recall
  - Context Relevance
- Generation
  - Faithfulness
  - Answer Relevancy
- End to End
  - Answer Semantic Similarity
  - Answer Correctness
A detailed description of the tests is provided below:
Test Name | Description |
---|---|
Application Accuracy | It defines the system accuracy, calculated from the passed tests for a given set of test data. |
Answer Relevancy | It measures how relevant and appropriate the generated answer is to the given prompt. Incomplete answers or answers containing redundant information receive lower scores; higher scores indicate better relevancy. This metric is computed using the question, the context, and the answer. Usage: Document/Text Summarization, Q&A, Conversational Chatbots. |
Answer Similarity | Answer Semantic Similarity assesses the semantic resemblance between the generated answer and the ground truth. This evaluation is based on the ground truth and the answer, with values falling within the range of 0 to 1. Usage: Document Summarization, Q&A, Conversational Chatbots, Language Translation. |
Answer Correctness | Answer Correctness gauges the accuracy of the generated answer when compared to the ground truth. This evaluation relies on the ground truth and the answer, with scores ranging from 0 to 1. A higher score indicates closer alignment between the generated answer and the ground truth, signifying better correctness. Usage: Document/Text Summarization, Q&A, Conversational Chatbots. |
Rouge-L | It is based on the longest common subsequence (LCS) between the model output and the reference, i.e. the longest sequence of words (not necessarily consecutive, but still in order) that is shared between both. A longer shared sequence indicates more similarity between the two sequences. E.g. with reference R: "The cat is on the mat." and candidate summary C: "The cat and the dog.", the LCS is the 3-gram "the cat the" (remember that the words are not necessarily consecutive), which appears in both R and C. See the scoring sketch after this table. Usage: Document/Text Summarization, Language Translation. |
BERT Score | The main idea behind BERTScore is to compute a similarity score between the embeddings of the candidate text and the reference text. This score reflects the similarity in the contextual representations of the words, capturing not only lexical but also contextual similarity: token embeddings of the candidate and the reference are matched greedily by cosine similarity and aggregated into precision, recall, and F1 scores. BERTScore is typically calculated at the sentence level and then aggregated across multiple sentences to obtain an overall score for the entire document. It has been shown to correlate well with human judgments of text quality and has become increasingly popular as an evaluation metric for various text generation tasks thanks to its ability to capture both lexical and contextual similarity. A higher score is better. Usage: Document/Text Summarization, Language Translation. |
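For illustration, here is a minimal sketch of how the reference/candidate pair from the Rouge-L row could be scored with ROUGE-L, BERTScore, and embedding-based semantic similarity. It assumes the open-source `rouge-score`, `bert-score`, and `sentence-transformers` Python packages, which are not necessarily what this project uses internally:

```python
# pip install rouge-score bert-score sentence-transformers
from rouge_score import rouge_scorer
from bert_score import score as bert_score
from sentence_transformers import SentenceTransformer, util

reference = "The cat is on the mat."
candidate = "The cat and the dog."

# ROUGE-L: longest-common-subsequence overlap between candidate and reference.
rouge = rouge_scorer.RougeScorer(["rougeL"]).score(reference, candidate)["rougeL"]
print(f"ROUGE-L F1: {rouge.fmeasure:.2f}")

# BERTScore: greedy cosine matching of contextual token embeddings.
P, R, F1 = bert_score([candidate], [reference], lang="en")
print(f"BERTScore F1: {F1.item():.2f}")

# Answer Semantic Similarity: cosine similarity of sentence embeddings.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode([reference, candidate])
print(f"Semantic similarity: {util.cos_sim(embeddings[0], embeddings[1]).item():.2f}")
```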
Test Name | Description |
---|---|
Application Robustness | It defines the system robustness, calculated from the passed tests for a given set of test data. |
Meta Data/Properties | Meta properties for text data refer to characteristics or attributes of a sentence related to its structure, context, or usage. Adding meta properties, e.g. intents, entities, language style and variation, helps us evaluate the generalization abilities of the solution under test. |
Add Negation | Adding negation to a sentence involves perturbing the input to express the opposite or negated meaning of the original statement. |
Add Typos | Adding typos to a sentence refers to intentionally introducing errors in the spelling of the text. This can include misspelled words or other mistakes commonly found in written language (see the perturbation sketch after this table). |
Change Location | Replacing the geo-location names in the original input text. |
Change Names | Replacing people's names provided in the original input text. |
Expand Contractions | It focuses on expanding contractions (where present): shortened forms of two or more words that are combined by omitting one or more letters and replacing them with an apostrophe, e.g. can't (cannot), don't (do not), etc. |
Add/Remove Punctuations | It helps us add or remove punctuation in the original input text. |
Add Lexicons | A lexicon is the vocabulary or dictionary of words and phrases used in a particular language or by a particular person, group, or profession. Using these words helps us better contextualize our data to the use case and also test the generalization capabilities of the solution. |
Synonyms | Substitutes synonymous words in the input text based on tags. |
Antonyms | Substitutes antonyms in the input text based on tags. |
Add Context | It helps us add a suffix/prefix to the input and see how that affects the response. This comes in handy when we create adversarial questions from our alignment questions by adding suffixes/prefixes accordingly. |
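As an illustration of the kinds of perturbations listed above, here is a minimal, hypothetical sketch of two of them, typo injection and contraction expansion; the helper names and the tiny contraction map are made up for this example, and the actual test suite may implement these perturbations differently:

```python
import random

# A tiny illustrative contraction map; a real implementation would use a fuller list.
CONTRACTIONS = {"can't": "cannot", "don't": "do not", "won't": "will not"}

def expand_contractions(text: str) -> str:
    """Expand Contractions: replace shortened forms with their full words."""
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    return text

def add_typo(text: str, seed: int = 0) -> str:
    """Add Typos: swap two adjacent characters inside one sufficiently long word."""
    rng = random.Random(seed)
    words = text.split()
    candidates = [i for i, w in enumerate(words) if len(w) > 3]
    if not candidates:
        return text
    i = rng.choice(candidates)
    w, j = words[i], rng.randrange(len(words[i]) - 1)
    words[i] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    return " ".join(words)

print(expand_contractions("I can't believe you don't know."))  # expanded forms
print(add_typo("Please summarize the quarterly report."))      # one swapped pair
```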
Test Name | Description |
---|---|
Faithfulness | It measures the factual consistency of the generated answer against the given context. It is calculated from the answer and the retrieved context, and scaled to the (0, 1) range. Higher is better (see the evaluation sketch after this table). Usage: Document/Text Summarization, Q&A, Conversational Chatbots. |
Context Precision | It evaluates whether all of the ground-truth-relevant items present in the contexts are ranked highly. Ideally, all the relevant chunks should appear at the top ranks. This metric is computed using the question, the ground truth, and the contexts, with values ranging between 0 and 1; higher scores indicate better precision. Usage: Document/Text Summarization, Q&A, Conversational Chatbots. |
Context Relevancy | It gauges the relevancy of the retrieved context, calculated based on both the question and the contexts. The values fall within the range of (0, 1), with higher values indicating better relevancy. Usage: Document/Text Summarization, Q&A, Conversational Chatbots. |
Context Recall | It measures the extent to which the retrieved context aligns with the annotated answer, treated as the ground truth. It is computed from the ground truth and the retrieved context, and the values range between 0 and 1, with higher values indicating better performance. Usage: Document/Text Summarization, Q&A. |
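These metric names and definitions match those of the open-source `ragas` library, so here is a minimal sketch of how such an evaluation could be wired up. It assumes `ragas` and `datasets` are installed and an LLM-judge API key is configured; note that exact column names vary between `ragas` versions:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

# One evaluation record: question, retrieved contexts, generated answer, ground truth.
data = {
    "question": ["Where is the Eiffel Tower?"],
    "contexts": [["The Eiffel Tower is a landmark on the Champ de Mars in Paris, France."]],
    "answer": ["The Eiffel Tower is in Paris."],
    "ground_truth": ["The Eiffel Tower is located in Paris, France."],
}

result = evaluate(
    Dataset.from_dict(data),
    metrics=[faithfulness, context_precision, context_recall, answer_relevancy],
)
print(result)  # per-metric scores in the 0-1 range, higher is better
```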
Test Name | Description |
---|---|
Content Moderation - Profanity | Are the LLM's responses vulgar or obscene? |
Content Moderation - Toxicity | Do the LLM's responses contain hate speech, insults, or harassment? (See the moderation sketch after this table.) |
Content Moderation - Racist | Do the LLM's responses contain racist slurs or ethnic abuse? |
Content Moderation - Sexist | Does the LLM's response show prejudice or discrimination based on sex/gender? |
Bias - Demographic | Does the LLM's response show bias toward demographic groups? |
Bias - Cultural | Does the LLM's response show bias toward cultural backgrounds? |
Bias PII - Gender | Does the LLM's response show bias based on the gender of the user? |
Bias PII - Sex | Does the LLM's response show bias based on sex and sexual orientation? |
Bias PII - Age | Does the LLM's response show bias based on the user's age? |
Bias PII - Race | Does the LLM's response show bias based on the user's race? |
Bias PII - Religion | Does the LLM's response show bias based on the user's religion? |
Bias PII - Citizenship | Does the LLM's response show bias based on the user's citizenship? |
Bias PII - Disability | Does the LLM's response show bias based on the user's disability? |
Prompt Injection | If extra text or context is added to a language model's input, it might affect the model's output. Does it tilt the generated text toward the attacker's desired results, or change the model's responses by steering them toward particular topics, tones, or styles? |
Training Data Extraction | Does the LLM's response give away parts of its training data when prompts including e.g. special characters, line breaks, or word repetitions are given as input? |
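As one way to automate content-moderation checks like these, here is a minimal sketch using the open-source `detoxify` package; the package choice and the 0.5 threshold are assumptions for illustration, not necessarily what this suite uses:

```python
# pip install detoxify
from detoxify import Detoxify

# Score an LLM response across toxicity-style categories (each in [0, 1]).
response = "You are a wonderful person."
scores = Detoxify("original").predict(response)

# Flag any category whose probability exceeds a chosen threshold.
THRESHOLD = 0.5
flagged = {label: p for label, p in scores.items() if p > THRESHOLD}
print(scores)
print("flagged:", flagged or "none")
```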
Test Name | Description |
---|---|
Coherence | It shows how logical and consistent the response is with respect to the input prompt. A higher score is better. |
Conciseness | It shows how brief the answer is for the given input prompt. A higher score is better. |
Harmfulness | It shows whether the response is potentially damaging, offensive, or detrimental in nature. A lower score is better. |
Maliciousness | It shows whether the response has the potential to cause harm to individuals, organizations, or systems. A lower score is better. |
Flesch-Kincaid Grade (Readability) | The Flesch-Kincaid Grade Level (FKGL) is a readability metric used to assess the complexity of written text in English. It provides an estimate of the educational grade level required to understand the text, based on two factors: average sentence length (measured in words) and average syllables per word. It is calculated as FKGL = 0.39 × (total words / total sentences) + 11.8 × (total syllables / total words) − 15.59. The resulting score represents the grade level at which the text is readable; for example, an FKGL score of 8.0 indicates that the text is readable by an average eighth-grade student. See the readability sketch after this table. |
Flesch Reading Ease (Readability) | The Flesch Reading Ease is another readability metric, developed by Rudolf Flesch. It provides a numerical score indicating how easy or difficult a piece of text is to read, based on two factors: the average sentence length (measured in words) and the average number of syllables per word. The formula is: 206.835 − 1.015 × (total words / total sentences) − 84.6 × (total syllables / total words). For example, a score of 0-29 means the text is very difficult to read and best understood by university graduates. |
Automated Readability Index | The Automated Readability Index (ARI) is a readability test that estimates the educational grade level required to understand a piece of text. It is like the Flesch-Kincaid Grade Level but uses a slightly different formula: ARI = 4.71 × (characters / words) + 0.5 × (words / sentences) − 21.43, where characters is the number of letters and digits, words is the number of words, and sentences is the number of sentences. The resulting ARI score corresponds to a U.S. grade level; for example, an ARI score of 8.0 indicates that the text is readable by an average eighth-grade student. |
Language Detection | To ensure coherence in responses, it is essential to detect the language of the input question and generate output in the same language. This check helps evaluate whether the model responds in the language of the question, or appropriately translates it if a translation use case is in play. It ensures linguistic consistency and enhances the user experience by providing responses in the preferred language of interaction. See the language and sentiment sketch after this table. |
Sentiment Analysis | Sentiment analysis involves determining the sentiment or opinion expressed in a text, categorizing it as positive, negative, or neutral. BERT (Bidirectional Encoder Representations from Transformers) can enhance sentiment analysis through its powerful pre-trained language model, which captures both word meaning and context, enabling effective understanding of text sentiment. |
Emotion Analysis | Emotion analysis is like sentiment analysis, but instead of just deciding whether a sentence is positive or negative, we classify the sentence by mood, such as happy, sad, etc. Here again we use a BERT model to classify each sentence into the classes with a confidence score; the class with the highest confidence score is assigned to the sentence. The output of this process is the classification of the sentence in terms of sentiment and emotion. This facilitates detection of any harmful intent or words in the input, enables understanding of the emotion conveyed in the question, and allows analysis of whether the chatbot's responses are positive regardless of the input sentiment. |
Perplexity (WIP) | Perplexity, in the context of language models including LLMs (Large Language Models), is a measure of how well the model predicts a sample of text. It is commonly used to evaluate the performance of language models by assessing how accurately they predict the next word in a sequence. A sketch follows this table. |
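The three readability indices above can be computed directly from their formulas; for a quick check, here is a minimal sketch assuming the open-source `textstat` package (an illustration, not necessarily the implementation used here):

```python
# pip install textstat
import textstat

text = "The Eiffel Tower is a wrought-iron lattice tower in Paris, France."

print("Flesch-Kincaid Grade:", textstat.flesch_kincaid_grade(text))
print("Flesch Reading Ease: ", textstat.flesch_reading_ease(text))
print("ARI:                 ", textstat.automated_readability_index(text))
```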
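For the language, sentiment, and emotion checks, here is a hedged sketch using `langdetect` and Hugging Face `transformers`; the emotion model named below is just one publicly available example, not necessarily the one used by this suite:

```python
# pip install langdetect transformers torch
from langdetect import detect
from transformers import pipeline

question = "Où se trouve la tour Eiffel ?"
answer = "La tour Eiffel se trouve à Paris."

# Language check: the answer should match the language of the question.
print(detect(question), detect(answer))  # e.g. 'fr' 'fr'

# Sentiment via the default BERT-family classifier.
sentiment = pipeline("sentiment-analysis")
print(sentiment("I love this helpful chatbot!"))

# Emotion classification with per-class confidence scores.
emotion = pipeline(
    "text-classification",
    model="j-hartmann/emotion-english-distilroberta-base",
)
print(emotion("I am so happy with this answer!"))
```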
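Perplexity is the exponential of the average negative log-likelihood the model assigns to each next token. Here is a minimal sketch using Hugging Face `transformers`, with GPT-2 purely as a small public stand-in model:

```python
# pip install transformers torch
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = "The cat is on the mat."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels returns the mean cross-entropy (negative log-likelihood) loss.
    loss = model(**inputs, labels=inputs["input_ids"]).loss

print("perplexity:", torch.exp(loss).item())  # lower is better
```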