five

GenQA

收藏
魔搭社区2026-05-09 更新2024-06-25 收录
下载链接:
https://modelscope.cn/datasets/swift/GenQA
下载链接
链接失效反馈
官方服务:
资源简介:
# Dataset Card for GenQA <!--Can we put the image here? --> ![image](genqa_production_facility.jpg) GenQA consists of over 10M cleaned and deduplicated instruction samples generated from a handful of carefully designed prompts. Our experiments indicate that prompt engineering alone can yield millions of diverse training samples with quality as good as (or in some cases surpassing) high-cost human labelling. To the best of our knowledge, this is the largest finetuning dataset written autonomously by an LLM without conditioning on human questions or using complex multi-stage pipelines. This dataset is also valuable to combine with other existing datasets, or to quickly generate synthetic data to cover blindspots of existing data sources. ## Accessing the dataset Each split can be automatically downloaded and loaded for browsing, processing, or training in a few lines of code using the [`datasets`](https://huggingface.co/docs/datasets/en/process) library. Here is an example of downloading the dataset and then selecting the `"academic"` split. Optionally, pass the split argument to `load_dataset` directly. ``` >>> from datasets import load_dataset >>> ds = load_dataset("tomg-group-umd/GenQA") >>> ds DatasetDict({ code: Dataset({ features: ['text', 'prompt', 'template', 'idx', 'category'], num_rows: 513483 }) dialog: Dataset({ features: ['text', 'prompt', 'template', 'idx', 'category'], num_rows: 819154 }) general: Dataset({ features: ['text', 'prompt', 'template', 'idx', 'category'], num_rows: 304920 }) math: Dataset({ features: ['text', 'prompt', 'template', 'idx', 'category'], num_rows: 515509 }) mmlu: Dataset({ features: ['text', 'prompt', 'template', 'idx', 'category'], num_rows: 2409841 }) multiple_choice: Dataset({ features: ['text', 'prompt', 'template', 'idx', 'category'], num_rows: 372610 }) writing: Dataset({ features: ['text', 'prompt', 'template', 'idx', 'category'], num_rows: 932362 }) academic: Dataset({ features: ['text', 'prompt', 'template', 'idx', 'category'], num_rows: 4210076 }) task: Dataset({ features: ['text', 'prompt', 'template', 'idx', 'category'], num_rows: 1004179 }) }) >>> ds["academic"] Dataset({ features: ['text', 'prompt', 'template', 'idx', 'category'], num_rows: 4210076 }) ``` ### Chat templating The dataset contains both single turn and multi-turn conversations in many of the splits. For ease of use, each conversation (contained in the `text` field) is formatted in the same way as a list of json objects. This format is chosen for compatibility with the [`transformers`](https://github.com/huggingface/transformers) library's built in support for [chat templates](https://huggingface.co/docs/transformers/main/en/chat_templating). Here is an example from the `dialog` split: ``` >>> ds["dialog"][123]["text"] [{'content': 'What were the Opium Wars?', 'role': 'user'}, {'content': 'The Opium Wars were a series of conflicts in the 19th century between the British Empire and the Qing dynasty of China over the British desire to sell opium to China. The First Opium War (1839-1842) ended with the Treaty of Nanking, which gave the British control over Hong Kong and opened five Chinese ports to foreign trade. The Second Opium War (1856-1860) ended with the Treaty of Tientsin, which gave the British and French further concessions in China, including the right to station troops in Beijing. The Opium Wars had a profound impact on China, weakening the Qing dynasty and leading to the rise of foreign imperialism in the country.', 'role': 'assistant'}, {'content': 'What role did opium play in the wars?', 'role': 'user'}, {'content': "Opium played a central role in the Opium Wars. The British East India Company had been selling opium to China since the late 18th century, and by the early 19th century, opium had become one of China's most valuable imports. However, the Qing government was concerned about the growing opium addiction problem in the country and in 1839, it banned the import of opium. The British government, which was heavily dependent on the opium trade for revenue, refused to accept the ban, and this led to the outbreak of the First Opium War.", 'role': 'assistant'}, {'content': 'How did the wars affect China?', 'role': 'user'}, {'content': 'The Opium Wars had a devastating impact on China. The wars weakened the Qing dynasty and led to the rise of foreign imperialism in the country. The Treaty of Nanking gave the British control over Hong Kong and opened five Chinese ports to foreign trade. This led to an influx of foreign goods and ideas into China, which had a profound impact on Chinese society. The wars also led to the loss of Chinese territory, as the British acquired Hong Kong and other territories in Southeast Asia.', 'role': 'assistant'}, {'content': 'What were the long-term consequences of the Opium Wars?', 'role': 'user'}, {'content': 'The Opium Wars had a number of long-term consequences for China. The wars weakened the Qing dynasty and led to the rise of foreign imperialism in the country. The wars also led to the loss of Chinese territory, as the British acquired Hong Kong and other territories in Southeast Asia. The wars also had a significant impact on Chinese society, as they led to an influx of foreign goods and ideas into the country. This led to a number of changes in Chinese society, including the growth of nationalism and the decline of traditional Chinese values.', 'role': 'assistant'}] ``` ## Dataset Details ### Dataset Description The associated manuscript contains a detailed explanation of each split. Each split pertains to a different range of topics. Each instance within a split contains simulated conversations between a human user and an LLM virtual assistant. Many splits contain instances with multiple turns in which the human and machine alternate questions and responses. Each sample contains Unicode formatted text representing a conversation between a user and an assistant. A unified format is used for both single and multi-turn conversations. - **Paper:** [GenQA: Generating Millions of Instructions from a Handful of Prompts (2406.10323)](https://arxiv.org/abs/2406.10323)] - **Curated by:** Researchers from the University of Maryland, College Park. - **Language(s) (NLP):** English. - **License:** The dataset is available under the Creative Commons NonCommercial (CC BY-NC 4.0). ### Uses <!-- Address questions around how the dataset is intended to be used. --> The dataset is intended for researchers and individuals in the academic community focused on solving problems related to large language model (LLM) finetuning at scale. We hope this enables the performance of controlled scientific studies that result in better science and ultimately new techniques to address the existing deficiencies with existing models. ### Direct Use <!-- This section describes suitable use cases for the dataset. --> This dataset is suitable for use cases involving open research on industrial-scale finetuning practices, leveraging its scale and diversity to advance knowledge in this area. ### Out-of-Scope Use <!-- This section addresses misuse, malicious use, and uses that the dataset will not work well for. --> The dataset should not be used for applications requiring verified factual accuracy, critical decision-making, or any malicious or unethical activities. ## Dataset Structure <!-- This section provides a description of the dataset fields, and additional information about the dataset structure such as criteria used to create the splits, relationships between data points, etc. --> Detailed information on how each split was curated is provided in the manuscript and paper. Dataset fields: - text: contains actual text for training (with input and output) - prompt: specifies the requirement of instruction generation - template: optional description of the type of prompt template used for instruction generated. - category: separates the Code data into coding, library and markdown. ## Dataset Creation ### Curation Rationale GenQA was created to demonstrate the effectiveness of autonomously written instruction datasets, and to produce a research instruction dataset of size comparable to commercial instruction sets ### Source Data <!-- This section describes the source data (e.g. news text and headlines, social media posts, translated sentences, ...). --> The data consists of question and answer pairs simulating conversation between a user and an assistant. #### Data Collection and Processing <!-- This section describes the data collection and processing process such as data selection criteria, filtering and normalization methods, tools and libraries used, etc. --> Questions were written by the Gemini Language model and deduplicated in the processing stage. All collection and processing stages were done using Python. #### Who are the source data producers? <!-- This section describes the people or systems who originally created the data. It should also include self-reported demographic or identity information for the source data creators if this information is available. --> Correspondence to Tom Goldstein (tomg@umd.edu) ### Annotations [optional] <!-- If the dataset contains annotations which are not part of the initial data collection, use this section to describe them. --> N/A. #### Annotation process <!-- This section describes the annotation process such as annotation tools used in the process, the amount of data annotated, annotation guidelines provided to the annotators, interannotator statistics, annotation validation, etc. --> N/A. #### Who are the annotators? <!-- This section describes the people or systems who created the annotations. --> N/A. #### Personal and Sensitive Information <!-- State whether the dataset contains data that might be considered personal, sensitive, or private (e.g., data that reveals addresses, uniquely identifiable names or aliases, racial or ethnic origins, sexual orientations, religious beliefs, political opinions, financial or health data, etc.). If efforts were made to anonymize the data, describe the anonymization process. --> N/A. ## Bias, Risks, and Limitations <!-- This section is meant to convey both technical and sociotechnical limitations. --> The dataset may contain factual inaccuracies and was not manually checked for factual correctness. ### Recommendations <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. --> Users should be aware of the potential factual inaccuracies and exercise caution when using the dataset by verifying critical information and mitigatating any potential biases and errors. ## Citation [optional] <!-- If there is a paper or blog post introducing the dataset, the APA and Bibtex information for that should go in this section. --> **BibTeX:** [More Information Needed] **APA:** [More Information Needed] ## Glossary [optional] <!-- If relevant, include terms and calculations in this section that can help readers understand the dataset or dataset card. --> [More Information Needed] ## More Information [optional] [More Information Needed] ## Dataset Card Authors [optional] [More Information Needed] ## Dataset Card Contact [More Information Needed]

# GenQA 数据集卡片 <!--Can we put the image here? --> ![image](genqa_production_facility.jpg) GenQA 包含超过1000万条经过清洗与去重的指令样本,这些样本由少量精心设计的提示词(prompt)生成。我们的实验表明,仅依靠提示词工程(prompt engineering)即可生成数百万条多样化的训练样本,其质量可媲美(在部分场景下甚至超越)高成本的人工标注样本。据我们所知,这是首个由大语言模型(LLM)自主生成的大规模微调数据集,且无需依赖人类提问或复杂的多阶段流水线。该数据集还可与其他现有数据集结合使用,或用于快速生成合成数据以填补现有数据源的空白。 ## 数据集获取 每个拆分子集均可通过 [`datasets`](https://huggingface.co/docs/datasets/en/process) 库,仅用数行代码即可自动下载、加载,用于浏览、处理或模型训练。以下为下载该数据集并选取`"academic"`拆分子集的示例代码,也可直接向`load_dataset`传入拆分参数以快速加载。 >>> from datasets import load_dataset >>> ds = load_dataset("tomg-group-umd/GenQA") >>> ds DatasetDict({ code: Dataset({ features: ['text', 'prompt', 'template', 'idx', 'category'], num_rows: 513483 }) dialog: Dataset({ features: ['text', 'prompt', 'template', 'idx', 'category'], num_rows: 819154 }) general: Dataset({ features: ['text', 'prompt', 'template', 'idx', 'category'], num_rows: 304920 }) math: Dataset({ features: ['text', 'prompt', 'template', 'idx', 'category'], num_rows: 515509 }) mmlu: Dataset({ features: ['text', 'prompt', 'template', 'idx', 'category'], num_rows: 2409841 }) multiple_choice: Dataset({ features: ['text', 'prompt', 'template', 'idx', 'category'], num_rows: 372610 }) writing: Dataset({ features: ['text', 'prompt', 'template', 'idx', 'category'], num_rows: 932362 }) academic: Dataset({ features: ['text', 'prompt', 'template', 'idx', 'category'], num_rows: 4210076 }) task: Dataset({ features: ['text', 'prompt', 'template', 'idx', 'category'], num_rows: 1004179 }) }) >>> ds["academic"] Dataset({ features: ['text', 'prompt', 'template', 'idx', 'category'], num_rows: 4210076 }) ### 对话模板 该数据集的多个拆分子集均包含单轮与多轮对话数据。为便于使用,所有对话(存储于`text`字段中)均采用与JSON对象列表一致的格式。该格式的设计目的是兼容 [`transformers`](https://github.com/huggingface/transformers) 库原生支持的[对话模板](https://huggingface.co/docs/transformers/main/en/chat_templating)功能。以下为`dialog`拆分子集的一段示例数据: >>> ds["dialog"][123]["text"] [{'content': 'What were the Opium Wars?', 'role': 'user'}, {'content': 'The Opium Wars were a series of conflicts in the 19th century between the British Empire and the Qing dynasty of China over the British desire to sell opium to China. The First Opium War (1839-1842) ended with the Treaty of Nanking, which gave the British control over Hong Kong and opened five Chinese ports to foreign trade. The Second Opium War (1856-1860) ended with the Treaty of Tientsin, which gave the British and French further concessions in China, including the right to station troops in Beijing. The Opium Wars had a profound impact on China, weakening the Qing dynasty and leading to the rise of foreign imperialism in the country.', 'role': 'assistant'}, {'content': 'What role did opium play in the wars?', 'role': 'user'}, {'content': "Opium played a central role in the Opium Wars. The British East India Company had been selling opium to China since the late 18th century, and by the early 19th century, opium had become one of China's most valuable imports. However, the Qing government was concerned about the growing opium addiction problem in the country and in 1839, it banned the import of opium. The British government, which was heavily dependent on the opium trade for revenue, refused to accept the ban, and this led to the outbreak of the First Opium War.", 'role': 'assistant'}, {'content': 'How did the wars affect China?', 'role': 'user'}, {'content': 'The Opium Wars had a devastating impact on China. The wars weakened the Qing dynasty and led to the rise of foreign imperialism in the country. The Treaty of Nanking gave the British control over Hong Kong and opened five Chinese ports to foreign trade. This led to an influx of foreign goods and ideas into China, which had a profound impact on Chinese society. The wars also led to the loss of Chinese territory, as the British acquired Hong Kong and other territories in Southeast Asia.', 'role': 'assistant'}, {'content': 'What were the long-term consequences of the Opium Wars?', 'role': 'user'}, {'content': 'The Opium Wars had a number of long-term consequences for China. The wars weakened the Qing dynasty and led to the rise of foreign imperialism in the country. The wars also led to the loss of Chinese territory, as the British acquired Hong Kong and other territories in Southeast Asia. The wars also had a significant impact on Chinese society, as they led to an influx of foreign goods and ideas into the country. This led to a number of changes in Chinese society, including the growth of nationalism and the decline of traditional Chinese values.', 'role': 'assistant'}] ## 数据集详情 ### 数据集概述 配套手稿中对每个拆分子集均有详细说明,各拆分子集对应不同的主题范畴。每个拆分子集中的样本均为人类用户与大语言模型(LLM)虚拟助手的模拟对话。部分拆分子集还包含多轮对话样本,人类与助手会交替提出问题并给出回复。每条样本均采用Unicode格式的文本,代表用户与助手的对话内容,且单轮与多轮对话使用统一的格式规范。 - **论文**:[GenQA: 从少量提示词生成百万级指令数据集 (2406.10323)](https://arxiv.org/abs/2406.10323)] - **整理方**:美国马里兰大学帕克分校研究团队。 - **语言(自然语言处理)**:英语。 - **授权协议**:本数据集采用知识共享非商业(CC BY-NC 4.0)协议发布。 ### 数据集用途 本数据集面向专注于大规模大语言模型(LLM)微调相关研究的科研人员与学术群体。我们期望该数据集能够支持开展可控的科学研究,从而推动相关领域的学术进步,并最终开发出解决现有模型缺陷的新技术。 ### 直接适用场景 本数据集适用于工业级微调实践的开放研究场景,可借助其规模与多样性优势推动该领域的学术认知升级。 ### 不适用场景 本数据集不得用于需要验证事实准确性、关键决策制定的应用场景,或任何恶意、不道德的活动。 ## 数据集结构 手稿与论文中详细说明了各拆分子集的整理规则。 数据集字段说明: - text:用于训练的实际文本(包含输入与输出内容) - prompt:指令生成的具体要求 - template:生成指令所使用的提示词模板类型的可选说明 - category:将代码类数据分为编程、库与Markdown三个子类 ## 数据集构建 ### 整理初衷 GenQA 的构建旨在验证自主生成指令数据集的有效性,并打造规模可与商业指令数据集媲美的科研级指令数据集。 ### 源数据 本数据集包含模拟用户与助手对话的问答对。 #### 数据收集与处理 所有问题均由Gemini语言模型生成,并在处理阶段完成去重。整个收集与处理流程均通过Python实现。 #### 源数据生产者 联络人:Tom Goldstein(邮箱:tomg@umd.edu) ### 标注信息(可选) 无。 #### 标注流程 无。 #### 标注人员 无。 #### 个人与敏感信息 无。 ## 偏差、风险与局限性 本数据集可能包含事实性错误,且未经过人工事实核查。 ### 使用建议 用户应知晓数据集可能存在事实性错误,使用时需谨慎验证关键信息,并采取措施缓解潜在的偏差与错误。 ## 引用信息(可选) **BibTeX格式**: 需补充更多信息 **APA格式**: 需补充更多信息 ## 术语表(可选) 需补充更多信息 ## 更多信息(可选) 需补充更多信息 ## 数据集卡片作者(可选) 需补充更多信息 ## 数据集卡片联络人 需补充更多信息
提供机构:
maas
创建时间:
2024-06-27
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作