AlderleyAI/coqa_chat

Name: AlderleyAI/coqa_chat
Creator: AlderleyAI
Published: 2023-06-28 14:43:13
License: No description available

Source: Hugging Face. Updated: 2023-06-28. Indexed: 2024-03-04.

Download link:

https://hf-mirror.com/datasets/AlderleyAI/coqa_chat

Resource description:

---
task_categories:
- question-answering
language:
- en
size_categories:
- 100K<n<1M
---

# Dataset Card for CoQA_Chat

## Dataset Description

A data set for training LLMs for in-context or document question-answering conversations.

- Point of Contact: info@alderley.ai

### Dataset Summary

This dataset is an amended version of the CoQA dataset, with the question responses rewritten to be more conversational and with a greater emphasis on returning contextually relevant information with the answer.

CoQA is a large-scale dataset for building Conversational Question Answering systems. The goal of the CoQA challenge is to measure the ability of machines to understand a text passage and answer a series of interconnected questions that appear in a conversation. CoQA is pronounced as coca.

https://stanfordnlp.github.io/coqa/

### Supported Tasks

In-context and document question answering

### Languages

English only

## Dataset Structure

We provide both csv and jsonl files.

### Data Fields

The csv and jsonl datasets have the following attributes:

- id: Matches the original CoQA id (string)
- local_order: Order of the question within a user/assistant chat conversation (integer)
- context: Matches the original CoQA context (string)
- question: Matches the original CoQA question (string)
- answer: Conversational answer to the question, evolved from the original CoQA answer (string)

### Data Splits

The original training and validation datasets have been combined into a single data split.

## Dataset Creation

### Curation Rationale

This data set is specifically intended to support the training of large language models for in-context question-answering or document question-answering conversations. Small Instruct and Chat trained LLMs struggle with this task and tend to ignore the provided context when generating an output. This data set is designed to support the training of small LLMs that excel at this task.
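The data fields listed above lend themselves to a simple grouping step before training. The following is a minimal sketch, not the authors' tooling: it assumes that rows sharing an `id` belong to the same conversation (the card does not state this explicitly), and the two sample rows are made up purely to show the record shape. The helper name `group_conversations` is hypothetical.

```python
import json
from collections import defaultdict

# Two made-up rows in the documented shape (id, local_order, context,
# question, answer). Real rows would be read from the dataset's jsonl file.
SAMPLE_JSONL = """\
{"id": "abc", "local_order": 2, "context": "Brunei is in Southeast Asia.", "question": "Where is it?", "answer": "Brunei is in Southeast Asia."}
{"id": "abc", "local_order": 1, "context": "Brunei is in Southeast Asia.", "question": "What country is discussed?", "answer": "The country discussed is Brunei."}
"""


def group_conversations(jsonl_text):
    """Group rows by id (assumed to identify a conversation) and
    sort each group by local_order."""
    conversations = defaultdict(list)
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        row = json.loads(line)
        conversations[row["id"]].append(row)
    for rows in conversations.values():
        rows.sort(key=lambda r: r["local_order"])
    return dict(conversations)


conversations = group_conversations(SAMPLE_JSONL)
for conv_id, rows in conversations.items():
    print(conv_id, [r["question"] for r in rows])
```

If the `id` turns out to be unique per question rather than per passage, grouping on `context` instead would be the natural alternative.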
### Source Data

#### Initial Data Collection and Normalization

CoQA

https://huggingface.co/datasets/coqa

https://stanfordnlp.github.io/coqa/

This new answer data set was generated from the original CoQA data set over several days by querying gpt-3.5-turbo with the following prompt:

```
system_intel = """In the dataset provided to you, there are several questions with two corresponding reference text for the answer. Each item in this dataset has an ID, a question, and two reference text answers. Your task is to use this information to create a concise and conversationally natural answer.
When writing your response, incorporate the essential elements from the question, reference text and answer, avoiding the use of pronouns. Instead, use the specific name or title of the entity being referred to. If a question can be answered with 'yes' or 'no', begin with that before providing a brief explanation.
Do not introduce new information, but do make sure that your response can stand on its own, even without the original question for context. However, strive to keep your answers succinct and avoid excessive context.
Each of your answers should be returned as a valid JSON object, with the keys "id" and "answer" surrounded by double quotes (""). If you need to use quotes within your answer, use single quotes ('') to keep the JSON formatting correct.
Here are a few examples:
For [28960 'What is the official name of Brunei?' /n 'Brunei, officially the Nation of Brunei' ‘Nation of Brunei’], output: {"id" : 28960, "answer" : "The official name of Brunei is the Nation of Brunei."}.
For [28961, 'Where is it geographically?' /n 'sovereign state located on the north coast of the island of Borneo in Southeast Asia' ‘Southeast Asia], output: {"id" :28961, "answer": "Brunei is located on the north coast of Borneo in Southeast Asia."}.
For [28962, 'What body of water is it by?'/n 'Apart from its coastline with the South China Sea' ‘South China Sea’], output: {"id": 28962, "answer": "Brunei is by the South China Sea."}.
For [28963, 'When did Sultan Bolkaih rule?' /n 'Sultan Bolkiah (reigned 1485–1528' ‘1485-1528’], output: {"id": 28963, "answer" : "Sultan Bolkaih ruled from 1485 to 1528."}.
For [28964, 'What modern day areas did he rule over?' /n'including modern-day Sarawak and Sabah' ‘Sarawak and Sabah’], output: {"id": 28964, "answer" : "Sultan Bolkaih ruled over modern-day Sarawak and Sabah."}.
Strive to keep your answers conversational, direct and to the point. Include only as much context as necessary for the answer to make sense on its own, do not use pronouns.
"""

# item_list is the batch of items sent with each request; its construction is not shown in the card.
prompt = f"Here is the dataset: {item_list}"
```

## Considerations for Using the Data

### Discussion of Biases

Data is in English only.

Only a limited number of responses are true negative responses, where the context contains insufficient information to answer the question. In addition, these are not uniformly answered with a specific token set; rather, the answer will mention that the information is unknown but will also include contextual information from the question. For example, in answer to `"Where did he go afterwards?"` the answer may say `"It's unknown where he went afterwards."`

### Other Known Limitations

The questions in this data set reflect a back-and-forth conversation between a user and an assistant, and the user questions include dummy pronouns and pronouns that require prior information from either the previous questions or the previous responses. Consequently, for best results in training, it may help to include several question-and-answer turns between the user and the assistant in the training data before the final question is posed. This will encourage the model to learn about pronoun use in questioning. Here's an example:

User: 'what is the official name of the Vat?'

A: 'The official name of the Vat is Vatican Apostolic Library, although it is more commonly called the Vatican Library or simply the Vat.'

User: 'where is it?'

A: 'The Vatican Library is located in Vatican City and is the library of the Holy See.'

User: 'how many printed books does it contain?'

A:

### Contributions

Alderley.ai
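The multi-turn layout suggested under Other Known Limitations can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `build_chat_prompt`, the `Context:` label, and the placeholder passage are hypothetical and are not part of the dataset or the card. The turns themselves are taken from the example above.

```python
def build_chat_prompt(context, turns, final_question):
    """Render prior (question, answer) turns plus the final question
    as one prompt string that ends with an open Assistant turn."""
    lines = [f"Context: {context}", ""]  # "Context:" label is an assumed convention
    for question, answer in turns:
        lines.append(f"User: {question}")
        lines.append(f"Assistant: {answer}")
    lines.append(f"User: {final_question}")
    lines.append("Assistant:")  # left open for the model to complete
    return "\n".join(lines)


prompt = build_chat_prompt(
    context="(passage about the Vatican Library goes here)",  # placeholder text
    turns=[
        (
            "what is the official name of the Vat?",
            "The official name of the Vat is Vatican Apostolic Library, "
            "although it is more commonly called the Vatican Library or simply the Vat.",
        ),
        (
            "where is it?",
            "The Vatican Library is located in Vatican City and is the library of the Holy See.",
        ),
    ],
    final_question="how many printed books does it contain?",
)
print(prompt)
```

Keeping earlier turns in the prompt gives the model the antecedent for pronouns such as "it" in the final question, which is the point the limitation note makes.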

Provider:

AlderleyAI

Summary of original information

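Dataset card: CoQA_Chat

Dataset description

The CoQA_Chat dataset is for training large language models (LLMs) for in-context or document question-answering conversations.

Dataset summary

This dataset is an amended version of the CoQA dataset, with the question responses modified to be more conversational and with an emphasis on returning contextually relevant information in the answer.

CoQA is a large-scale dataset for building conversational question answering systems. The goal of the CoQA challenge is to measure the ability of machines to understand a text passage and answer a series of interconnected questions that appear in a conversation.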

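Supported tasks

In-context and document question answering

Languages

English only

Dataset structure

The dataset is provided as csv and jsonl files.

Data fields

The csv and jsonl datasets have the following attributes:

id: Matches the original CoQA id (string)
local_order: Order of the question within a user/assistant chat conversation (integer)
context: Matches the original CoQA context (string)
question: Matches the original CoQA question (string)
answer: Conversational answer to the question, evolved from the original CoQA answer (string)

Data splits

The original training and validation datasets have been combined into a single data file.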

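Dataset creation

Curation rationale

This dataset is specifically intended to support training large language models for in-context or document question-answering conversations. Small instruct-trained and chat-trained LLMs perform poorly on this task and tend to ignore the provided context when generating output. The dataset is designed to support training small LLMs that excel at this task.

Source data

Initial data collection and normalization

CoQA

The new answer dataset was generated from the original CoQA dataset over several days by querying gpt-3.5-turbo.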
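The generation prompt in the dataset card asks gpt-3.5-turbo to return each answer as a JSON object with the keys "id" and "answer". The card does not describe how those responses were post-processed, so the following is only a hypothetical validation step, with an assumed helper name `parse_answer`, showing how such a response could be checked before being stored.

```python
import json


def parse_answer(raw):
    """Return (id, answer) when raw is a JSON object with an "id" key and a
    string "answer"; return None for anything else."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None  # not valid JSON
    if not isinstance(obj, dict):
        return None  # valid JSON, but not an object
    if "id" not in obj or not isinstance(obj.get("answer"), str):
        return None  # missing key or non-string answer
    return obj["id"], obj["answer"].strip()


# Example output copied from the prompt's own few-shot examples.
good = parse_answer(
    '{"id" : 28960, "answer" : "The official name of Brunei is the Nation of Brunei."}'
)
bad = parse_answer("Brunei is by the South China Sea.")
print(good)
print(bad)
```

Rejecting malformed responses rather than repairing them keeps the sketch simple; a real pipeline might retry the request instead.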

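Considerations for using the data

Discussion of biases

The data is in English only. Only a limited number of responses are true negative responses, where the context lacks sufficient information to answer the question. In addition, these responses do not use a uniform token set; instead, the answer states that the information is unknown while also including contextual information from the question. For example, the answer to "Where did he go afterwards?" may be "It's unknown where he went afterwards."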
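Because negative responses are not marked with a fixed token, finding them requires some kind of text matching. The sketch below is a rough keyword heuristic, not a method described by the dataset authors: the cue phrases in `NEGATIVE_CUES` are assumptions (the card gives only the "It's unknown ..." example), and the function name `looks_negative` is hypothetical.

```python
import re

# Assumed cue phrases; only "unknown" is attested in the card's example.
NEGATIVE_CUES = re.compile(
    r"\b(unknown|not (?:mentioned|stated|specified))\b",
    re.IGNORECASE,
)


def looks_negative(answer):
    """Heuristically flag answers that say the context lacks the information."""
    return bool(NEGATIVE_CUES.search(answer))


print(looks_negative("It's unknown where he went afterwards."))
print(looks_negative("Brunei is by the South China Sea."))
```

A heuristic like this will miss negative answers phrased in other ways and may flag answers that merely quote the word "unknown", so any counts it produces should be treated as approximate.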

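Other known limitations

The questions in this dataset reflect a back-and-forth conversation between a user and an assistant, and the user questions include pronouns and dummy pronouns that require information from earlier questions or responses. For best results, it may therefore be necessary to include several question-and-answer turns between the user and the assistant in the training data before the final question is posed. This will encourage the model to learn pronoun use in question answering.

Contributions

Alderley.ai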

The content above was collected and summarized by 遇见数据集.
