[External] [CrC] Factuality Sentence-by-Sentence Evaluation Guidelines

Task summary

In this task, you will be presented with the response from an AI chatbot to a prompt given by a user. The response will be split into spans (usually single sentences), and the goal is to evaluate the factual accuracy of each of those spans.

[Screenshot: task layout with a sample prompt and response in English]

The task inputs include:
● The prompt addressed to the AI chatbot.
● The response from the AI chatbot.
● A target span (usually a single sentence) to be evaluated.
  ○ The span can contain multiple claims. For example, the highlighted span in the example above, "However, he soon became interested in computer science and information retrieval, and his PhD thesis was on 'The Design and Analysis of a Database for Semantic Text Retrieval'", makes three claims:
    1. Broder became interested in computer science.
    2. Broder became interested in information retrieval.
    3. Broder's PhD thesis was on "The Design and Analysis of a Database for Semantic Text Retrieval".
  ○ The span should be interpreted in the context of the response. For example, the first list item in the example below needs to be interpreted as "Cervantes wrote Don Quixote in 1605" and the second one as "Cervantes wrote La Gitanilla in 1613":

[Screenshot: example response with list items]

Annotation flow

For each of the highlighted spans in the chatbot response, we need your input to understand (1) whether the span makes any falsifiable factual claims and, if it does, (2) how factually accurate those claims are. To that end, we need you to do some Internet research to find URLs with information that either supports or contradicts the claim(s) in each sentence.

Sentences containing factual information

The first thing to determine is whether the span to be evaluated includes falsifiable factual claims, in order to answer the first rating question: "Does this sentence attempt to convey any factual information?". That is the case when:
● The span contains specific factual information
  ○ e.g., "Tripping over pets causes more than 86,000 falls each year in the United States that are serious enough to result in a trip to the emergency room."
● The span makes general claims about the world or contains common knowledge
  ○ e.g., "Most major cities have activities and entertainment for people of all ages."
● The span contains evaluative statements
  ○ e.g., "Rambutan is a delicious fruit."
● The span contains advice and instructions
  ○ e.g., "To make pour over coffee, first measure out roughly 1g of coffee per 16ml of water."

On the contrary, it doesn't contain factual information if:
● The span is purely conversational
  ○ e.g., "Good night." or "Do you like a good deal?"
● The span is a generic disclaimer
  ○ e.g., "Speak to a doctor before making decisions."
● The span is a first-person opinion
  ○ e.g., "Sorry, I don't know about that topic."
● The span contains phrases that structure the response, including phrases that introduce a list, summaries of other parts of the response, or conclusions that follow from other parts of the response
  ○ e.g., for the response "Benefits of coffee include: 1) Brain function: Coffee can improve ...", the span "Benefits of coffee include:" should be rated as not requiring attribution.
● The span rephrases the user's input for confirmation
  ○ e.g., "Sure, here are some ideas for a kid’s fourth birthday party:"
● The span contains "Additional resources" links. There is no need to verify whether the URLs exist or contain factual information.
● The span only contains fictional or creative content that the system made up
  ○ e.g., "Once upon a time, there was a cowboy named Jack."

NOTE: If fictional content is mixed with factual content, or is referred to in spans mainly containing factual information, consider the span as conveying factual information, e.g.:
● There is a mix of creative content and factual assertions: "Jack time-traveled back to 1863 when Lincoln gave the Gettysburg Address."
● The span contains non-fictional facts written in artistic styles, e.g., generated poems about Abraham Lincoln's achievements.
● The span contains verifiable facts about established fictional works, e.g., plot points in Lord of the Rings.

Also check the "Corner cases" section for edge-case guidance, as well as for sentences that can be skipped.

If the span is deemed to contain factual information, answer "Yes" to the first question and move on to the next fields; if it doesn't contain factual information, select "No", skip the rest of the annotations, and click on Submit.
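
The criteria above for the first rating question can be summarized as a small decision rule. The Python sketch below is only an illustration, not part of the annotation tool: the `SpanKind` labels, `NON_FACTUAL_KINDS`, and both function names are hypothetical, and it assumes the annotator has already decided which kinds of content a span contains. Treating any factual element as decisive is this sketch's reading of the NOTE on mixed content.

```python
from enum import Enum, auto


class SpanKind(Enum):
    """Hypothetical labels for the kinds of content named in the guidelines."""
    # Kinds that convey factual information
    SPECIFIC_FACT = auto()
    GENERAL_CLAIM = auto()
    EVALUATIVE_STATEMENT = auto()
    ADVICE_OR_INSTRUCTIONS = auto()
    FACTS_ABOUT_FICTIONAL_WORK = auto()
    # Kinds that do not convey factual information
    CONVERSATIONAL = auto()
    GENERIC_DISCLAIMER = auto()
    FIRST_PERSON_OPINION = auto()
    STRUCTURING_PHRASE = auto()
    REPHRASED_USER_INPUT = auto()
    ADDITIONAL_RESOURCES_LINK = auto()
    MADE_UP_FICTION = auto()


# Assumption: this set mirrors the "doesn't contain factual information" list.
NON_FACTUAL_KINDS = frozenset({
    SpanKind.CONVERSATIONAL,
    SpanKind.GENERIC_DISCLAIMER,
    SpanKind.FIRST_PERSON_OPINION,
    SpanKind.STRUCTURING_PHRASE,
    SpanKind.REPHRASED_USER_INPUT,
    SpanKind.ADDITIONAL_RESOURCES_LINK,
    SpanKind.MADE_UP_FICTION,
})


def conveys_factual_information(kinds):
    """Return True if at least one kind found in the span is factual.

    A span that mixes fiction with factual assertions therefore counts
    as conveying factual information.
    """
    return any(kind not in NON_FACTUAL_KINDS for kind in kinds)


def first_question_answer(kinds):
    """Map the decision to the "Yes"/"No" answer of the first question."""
    return "Yes" if conveys_factual_information(kinds) else "No"


# "Once upon a time, there was a cowboy named Jack."
pure_fiction = {SpanKind.MADE_UP_FICTION}
# "Jack time-traveled back to 1863 when Lincoln gave the Gettysburg Address."
mixed = {SpanKind.MADE_UP_FICTION, SpanKind.SPECIFIC_FACT}

print(first_question_answer(pure_fiction))  # No
print(first_question_answer(mixed))         # Yes
```

The hard part of the task, deciding which kinds of content a span actually contains, remains a human judgment; the sketch only shows how those judgments combine into the "Yes"/"No" answer.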
Accuracy ratings

If the span is found to contain factual information, rate the accuracy of said factual information as follows:
● "Inaccurate": At least one claim in the span is factually incorrect.
  ○ Use this label when there is any contradicting URL and no supporting URLs for that claim.
  ○ Some claims contradict common knowledge (e.g., "white is a dark color") or are simple to disprove (e.g., "'dream' is a 4-letter word"). In such cases, rate as "Inaccurate" even when there are no contradicting URLs.
  ○ When selecting "Inaccurate", an additional question to assess the severity of the inaccuracy will pop up; see below for guidance.
● "Unsupported": At least one claim in the span is likely made up. If you have spent a reasonable amount of time (e.g., 10 minutes) researching and found no supporting or contradicting URLs, and the claim is not clearly correct or incorrect, use this label.
  ○ Use the Comments field to briefly summarize the research process.
  ○ ⚠️ IMPORTANT: If no information is found on the web that supports a claim, but you would expect that information to be easily found, label the sentence as "Inaccurate" rather than "Unsupported".
    ■ Imagine the chatbot response contains a sentence like "Steven Spielberg's latest movie is My life is better with you in it". When you search the web, you don't find anything that supports or contradicts the claim, but you would expect Spielberg's latest movie to be found in, e.g., Wikipedia or IMDB. Not finding any evidence in this case is actually evidence that the sentence is "Inaccurate".
    ■ Imagine, on the other hand, a sentence like "The captain of the basketball team in Matalascañas (Huelva, Spain) is 30 years old". After a few minutes of research, you have found nothing about the captain of this basketball team, so you can't conclude whether the claim is accurate or inaccurate (unlike Spielberg's latest movie, you wouldn't expect to find information about the captain of a basketball team from a small Spanish town), so you label it as "Unsupported".
● "Disputed" (e.g., controversial / mixed opinion / pros and cons): At least one claim has both supporting and contradicting evidence that cannot be reconciled and does not have a universal consensus.
  ○ Add both supporting and contradicting URLs in the fields below.
  ○ Important: Not all subjective claims and opinions should be labeled as "Disputed". If the statement is agreed upon by virtually all experts ("the earth is round"), mark the claim as "Accurate". Similarly, if the statement is broadly disagreed with, mark it as "Inaccurate".
  ○ The main difference between "Unsupported" and "Disputed" is that for "Disputed"