Tasking Guidelines - Project Shield
We value high-quality human data. LLM usage on this project is strictly prohibited; any usage found will lead to removal from the project.

Table of Contents
● Task and Major Guidelines
● Code
● Apps
● Notes
● Prompt cannot be rated
● Rubrics

Rubrics
1. Punt - If a model refuses to answer the prompt, this should be considered a "punt". If you answered Yes, stop rating.
2. Instruction Following - Did the response follow all instructions given in the prompt?
3. Contextual Awareness - How well does the response use context from the earlier conversation?
4. Content Relevance - Is the content in the response relevant to the user's prompt?
5. Content Completeness - How complete is the content provided in the response?
6. Writing Style and Tone - Was the response well-written (i.e., high-quality, conversational prose that is engaging and digestible) and stylistically aligned with the response guidelines?
7. Collaborativity - How well did the AI Assistant act as a collaborative partner in the response? Please refer to the prompt and any additional conversational context in order to make your assessment.
8. Truthfulness - Is the content in the response factually correct?
9. Information Retrieval - If applicable, how well did the model retrieve the correct information (from documents, files, or emails) for the response?
10. Code - Does the code use the most appropriate tool(s), function(s), and parameter(s) to generate the response?
11. Code Output - Does the code output provide the correct information for the response to fulfill the user's intent?
12. Overall Response Quality - How good is the response overall?
IMPORTANT
1. If you can see text in uploads, treat it as an extension of the prompt. Skip any prompts with uploaded files whose text could identify a private individual in any way.
2. Uploads with images: Skip any prompts with uploaded images of people (e.g., photos showing people's faces).
3. Do not skip prompts with uploaded images that do not contain people, such as photos of landscapes and pets.

Task and Major Guidelines
You will see a user prompt and two AI-generated responses, along with the code and API calls for each response. Try to understand the user's intent using the context from the prompt and/or the conversation history, then rate the responses from the user's perspective.

You will rate each response across several dimensions, followed by an overall quality rating. Use the ratings you give across each dimension to determine the overall quality rating. When comparing two responses, directly compare their overall quality and provide an explanation for your rating. Please consider all of these dimensions when assessing and comparing the overall quality of the responses.

Code
In addition to the AI-generated responses, you will often see output generated by executing the code. Use this information to contextualize your ratings.
1. Check whether the code calls the right tool(s) to address the user's request.
○ Examples: google_flights.search, google_hotels.search_hotels, gemkick_corpus.search, etc.
2. Check whether the code uses the correct parameters to address the user's request. Checking the comments in the code and the output of the code execution can help you assess the code's accuracy. Do not assess the verbosity or formatting of the 'output', because it is not the output shown to the user.
○ Taking google_flights.search as an example, the function parameters include destination, origin, latest_departure_date, etc.

Tools available to the model
You will find the tools below used in the AI response.
[This list does not include all available tools. More tools may be added in the future as needed.]
● Youtube: search for videos, associated information, and YouTube content
● Google_maps: location searches, directions, etc.
● Google_search: search the web and provide relevant results
● Google_flights: search for flights, compare prices, and look at other flight-related data (such as flight durations)
● Google_hotels: search for hotels and compare prices
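To make the parameter check in the Code section concrete, here is a minimal sketch of the pattern a reviewer might see and verify. The function name google_flights.search and the parameters origin, destination, and latest_departure_date come from the guidelines above; the stub implementation, the route (SFO to JFK), and the date are purely hypothetical, since the real tool API is internal to the model's environment.

```python
# Hypothetical stand-in for google_flights.search. The real tool is not
# publicly available; this stub simply echoes its arguments so we can
# illustrate the reviewer's check: do the parameters match the request?
def search(origin, destination, latest_departure_date):
    return {
        "origin": origin,
        "destination": destination,
        "latest_departure_date": latest_departure_date,
    }

# Suppose the user asked for flights from SFO to JFK departing by 2024-06-01
# (an invented example request).
call = search(origin="SFO", destination="JFK",
              latest_departure_date="2024-06-01")

# Reviewer-style check: each parameter reflects the user's stated intent.
assert call["origin"] == "SFO"
assert call["destination"] == "JFK"
assert call["latest_departure_date"] == "2024-06-01"
```

In practice you perform this check by reading the code and its comments rather than running anything: confirm the tool chosen matches the task (flights vs. hotels vs. web search) and that each parameter value is traceable to something the user actually asked for.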