grimulkan/wikipedia-document-question-answer

Name: grimulkan/wikipedia-document-question-answer
Creator: grimulkan
Published: 2024-01-13 04:10:43
License: 暂无描述

Hugging Face2024-01-13 更新2024-03-04 收录

下载链接：

https://hf-mirror.com/datasets/grimulkan/wikipedia-document-question-answer

下载链接

链接失效反馈

官方服务：

资源简介：

--- license: unknown --- Multi-round questions and answers for randomly selected Wikipedia articles of varying lengths, in fastchat JSON format, generated by `gpt-4-1106-preview`. OpenAI terms apply. This was designed to train a 32K context-length model. Check the total conversation lengths before using data items for training to ensure that they fit inside your target context window, and discard queries that don't fit. - Both the questions and answers were generated by GPT4, based on the document. Only information from the included document in the first prompt was considered (and this was verified using GPT4). - With 25% probability, questions that do not have an answer in the document were asked, to discourage hallucinations. - With 15% probability, the raw article/document was provided followed by a question. Otherwise, some background about the task at hand was included. - Articles were augmented in varivarious random ways (sub-headings removed, bullets removed, citations/background removed, etc.) Only 60 entries are included but they are long and multi-round (whatever I could fit in a budget of ~$1000 in API calls).

提供机构：

grimulkan

原始信息汇总

数据集概述

数据来源与格式

数据集包含多轮问答，基于随机选择的维基百科文章生成，采用fastchat JSON格式。
数据由gpt-4-1106-preview生成，遵循OpenAI的使用条款。

数据设计目的

设计用于训练32K上下文长度的模型。
在使用数据项进行训练前，需检查总对话长度，确保其适合目标上下文窗口，并丢弃不适合的查询。

数据生成规则

问题和答案均由GPT-4生成，仅基于初始提示中包含的文档信息。
有25%的概率生成文档中无答案的问题，以减少幻觉现象。
有15%的概率在问题前提供原始文章或文档，否则提供任务背景信息。
文章以多种随机方式进行增强（如删除子标题、项目符号、引用等）。

数据集规模

数据集包含60个条目，每个条目较长且为多轮对话。
数据集生成预算约为$1000的API调用费用。

5,000+

优质数据集

54 个

任务类型

进入经典数据集