TheoremQA, lima, im-feeling-curious, Puffin, cc_sbu_align, qa_feedback, SLF5K, blended_skill_talk, GSM-IC, ChatAlpaca, PKU-SafeRLHF-10K, Dolly, WebGPT, Code Alpaca, openapi-function-invocations-25k, LongForm
Source: GitHub, updated 2024-04-27, indexed 2024-05-31
Download link:
https://github.com/voidful/awesome-chatgpt-dataset
Resource overview:
TheoremQA: We annotated 800 QA pairs, covering more than 350 theorems across mathematics, EE&CS, physics, and finance.
LIMA: LIMA: Less Is More for Alignment
im-feeling-curious: This public dataset is extracted from Google's 'I'm Feeling Curious' feature.
Puffin: The Puffin dataset. Exactly 3,000 examples, each response created using GPT-4.
cc_sbu_align: MiniGPT-4 dataset
qa_feedback: We reconstructed the ASQA data and collected human feedback. We named the generated dataset qa-feedback.
SLF5K: An English dataset for abstractive summarization tasks, containing 5K unique samples.
blended_skill_talk: A dataset containing 7k dialogues, designed to demonstrate multiple dialogue modes: displaying personality, empathy, and knowledge.
GSM-IC: Grade-School Math with Irrelevant Context (GSM-IC)
ChatAlpaca: The data currently contains a total of 10,000 conversations, 95,558 utterances.
PKU-SafeRLHF-10K: This is the first dataset of its kind, containing 10k instances and safety preferences.
Dolly: A corpus of over 15,000 records generated by thousands of Databricks employees to enable large language models to demonstrate the magical interactivity of ChatGPT.
WebGPT: This is the dataset of all comparisons labeled as suitable for reward modeling at the end of the WebGPT project.
Code Alpaca: Code generation tasks involving 20,022 samples
openapi-function-invocations-25k: The construction of this dataset involved a systematic procedure combining manual extraction and AI-assisted synthesis.
LongForm: The LongForm dataset created by leveraging English corpus examples and enhanced instructions.
Created:
2023-04-22
Original information summary
Dataset overview
Dataset list
| Dataset name | Size | Language | Description | License |
|---|---|---|---|---|
| TheoremQA | 1K | English | We annotated 800 QA pairs covering 350+ theorems spanning across Math, EE&CS, Physics and Finance. | mit |
| lima | 1K | English | LIMA: Less Is More for Alignment | CC BY-NC-SA |
| im-feeling-curious | 3K | English | This public dataset is an extract from Google's "I'm Feeling Curious" feature. | - |
| Puffin | 3K | English | Puffin dataset. Exactly 3,000 examples with each response created using GPT-4. | apache-2.0 |
| cc_sbu_align | 4K | English | MiniGPT-4 dataset | BSD 3-Clause License |
| qa_feedback | 4K | English | We reconstruct the ASQA data and collect human feedback for it. We name the resulting dataset qa-feedback. | - |
| SLF5K | 5K | English | The Summarization with Language Feedback (SLF5K) dataset is an English-language dataset containing 5K unique samples that can be used for the task of abstractive summarization. | apache-2.0 |
| blended_skill_talk | 7K | English | A dataset of 7k conversations explicitly designed to exhibit multiple conversation modes: displaying personality, having empathy, and demonstrating knowledge. | - |
| GSM-IC | 8K | English | Grade-School Math with Irrelevant Context (GSM-IC) | - |
| ChatAlpaca | 10K | English | The data currently contain a total of 10,000 conversations with 95,558 utterances. | Apache-2.0 license |
| PKU-SafeRLHF-10K | 10K | English | PKU-SafeRLHF-10K, which is the first dataset of its kind and contains 10k instances with safety preferences. | - |
| Dolly | 15K | English | databricks-dolly-15k is a corpus of more than 15,000 records generated by thousands of Databricks employees to enable large language models to exhibit the magical interactivity of ChatGPT. | CC 3.0 |
| WebGPT | 20K | English | This is the dataset of all comparisons that were marked as suitable for reward modeling by the end of the WebGPT project. | - |
| Code Alpaca | 20K | English | Code generation task involving 20,022 samples | - |
| openapi-function-invocations-25k | 25K | English | The construction of this dataset involved a systematic procedure combining manual extraction and AI-assisted synthesis. | mit |
| LongForm | 28K | English | The LongForm dataset is created by leveraging English corpus examples with augmented instructions. | The LongForm project is subject to a MIT License with custom limitations for restrictions imposed by OpenAI (for the instruction generation part), as well as the license of language models (OPT, LLaMA, and T5). |
| chatbot_arena_conversations | 33K | English | This dataset contains 33K cleaned conversations with pairwise human preferences. It is collected from 13K unique IP addresses on the Chatbot Arena from April to June 2023. | |
| HC3 | 37K | English, Chinese | 37,175 instructions generated by ChatGPT and human | - |
| Anthropic_HH_Golden | 45K | English | This repository contains a new preference dataset extending the harmless dataset of Anthropic's Helpful and Harmless (HH) datasets. The original positive responses in HH are generated by a supervised fine-tuned model from Anthropic, in which harmful and unhelpful responses are frequently encountered. In this dataset, the positive responses are replaced by rewritten responses generated by GPT-4. | |
| Mol-Instructions | 48K | English | An open, large-scale biomolecular instruction dataset for large language models. | CC BY 4.0 |
| RefGPT | 50K | English, Chinese | We introduce a cost-effective method called RefGPT, which generates a vast amount of high-quality multi-turn Q&A content. | - |
| arxiv-math-instruct-50k | 50K | English | Dataset consists of question-answer pairs derived from ArXiv abstracts from math categories | - |
| Traditional Chinese Alpaca Dataset | 52K | Traditional Chinese | Translated from Alpaca Data by ChatGPT API | Apache-2.0 license |
| Cabrita Dataset | 52K | Portuguese | Translated from Alpaca Data | |
| Japanese Alpaca Dataset | 52K | Japanese | Translated from Alpaca Data by ChatGPT API | CC By NC 4.0; OpenAI terms of use |
| Alpaca Dataset | 52K | English | 175 seed instructions by OpenAI API | CC By NC 4.0; OpenAI terms of use |
| Alpaca Data Cleaned | 52K | English | Revised version of Alpaca Dataset | - |
| Alpaca GPT-4 Data | 52K | English | Generated by GPT-4 using Alpaca prompts | - |
| Alpaca GPT-4 Data (Chinese) | 52K | Chinese | Generated by GPT-4 using Chinese prompts translated from Alpaca by ChatGPT | - |
| Dynosaur | 66K | English | Dynosaur, a dynamic growth paradigm for instruction-tuning data curation. | Apache-2.0 license |
| Finance | 69K | English | 68,912 financial related instructions | - |
| evol | 70K | English | This is the training data of WizardLM. | - |
| Vicuna Dataset | 75K | English | ~100k ShareGPT conversations | - |
| InstructionTranslation | 80K | Multi-lingual | Translations were generated by M2M 12B and the output generations were limited at 512 tokens due to VRAM limit (40G). | MIT |
| Self-Instruct | 82K | English | We release a dataset that contains 52k instructions, paired with 82K instance inputs and outputs. | - |
| OASST1 | 89K | Multi-lingual | a human-generated, human-annotated assistant-style conversation corpus consisting of 161,443 messages in 35 different languages, annotated with 461,292 quality ratings, resulting in over 10,000 fully annotated conversation trees. | apache-2.0 |
| HH-RLHF | 91K | English | The data are described in the paper: Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. | MIT |
| Guanaco Dataset | 98K | English, Simplified Chinese, Traditional Chinese HK & TW, Japanese | 175 tasks from the Alpaca model | GPLv3 |
| InstructionWild | 104K | English, Chinese | 429 seed instructions and follow Alpaca to generate 52K | Research only; OpenAI terms of use |
| Camel Dataset | 107K | Multi-lingual | Role-playing between AIs (Open AI API) | - |
| Tapir-Cleaned | 117K | English | This is a revised version of the DAISLab dataset of IFTTT rules, which has been thoroughly cleaned, scored, and adjusted for the purpose of instruction-tuning. | CC BY-NC 4.0 |
| WizardLM_evol_instruct_V2_196k | 143K | English | This dataset contains a 143K mixture of evolved data from Alpaca and ShareGPT. | - |
| LLaVA Visual Instruct | 150K | English | LLaVA Visual Instruct 150K is a set of GPT-generated multimodal instruction-following data. It is constructed for visual instruction tuning and for building large multimodal models towards GPT-4 vision/language capability. | cc-by-nc-4.0 |
| Prosocial Dialog | 166K | English | 165,681 instructions produced by GPT-3 rewrites questions and human feedback | - |
| COIG | 191K | Chinese | Chinese Open Instruction Generalist (COIG) project to maintain a harmless, helpful, and diverse set of Chinese instruction corpora. | apache-2.0 |
| orca-chat | 198K | English | This is a cleaned, pruned, and clustered version of orca to form a conversation-style dataset. The process involves removing samples with very high similarity and grouping instructions to form conversations. | |
| Unnatural Instructions | 241K | English | A large dataset of creative and diverse instructions, collected with virtually no human labor. | MIT |
| SHP | 385K | English | SHP is a dataset of 385K collective human preferences over responses to questions/instructions in 18 different subject areas, from cooking to legal advice. | Reddit non-exclusive, non-transferable, non-sublicensable, and revocable license |
| dromedary | 361K | English | Dromedary-Verbose-Clone is a synthetic dataset of 360k instructions and demonstrations. | cc-by-nc-4.0 |
| ultrachat | 404K | English | To ensure generation quality, two separate ChatGPT Turbo APIs are adopted in generation, where one plays the role of the user to generate queries and the other generates the response. | cc-by-nc-4.0 |
| ign_clean_instruct_dataset_500k | 509K | English | This dataset contains ~508k prompt-instruction pairs with high quality responses. It was synthetically created from a subset of Ultrachat prompts. It does not contain any alignment focused responses or NSFW content. | apache-2.0 |
| ELI5 | 559K | English | The ELI5 dataset is an English-language dataset of questions and answers gathered from three subreddits where users ask factual questions requiring paragraph-length or longer answers. | - |
| GPT4All Dataset | 806K | Multi-lingual | Subset of LAION OIG, StackOverflow questions, and the BigScience/P3 dataset. Answered by OpenAI API. | - |
| Instruct | 889K | English | 888,969 English instructions, augmentation using AllenAI NLP tools | MIT |
| MOSS | 1M | Chinese | Generated by gpt-3.5-turbo | Apache-2.0, AGPL-3.0 licenses |
| LaMini-Instruction | 3M | English | a total of 2.58M pairs of instructions and responses using gpt-3.5-turbo based on several existing resources of prompts | cc-by-nc-4.0 |
| OpenOrca | 3M | English | The OpenOrca dataset is a collection of augmented FLAN Collection data. Currently ~1M GPT-4 completions, and ~3.2M GPT-3.5 completions. | |
| Natural Instructions | 5M | Multi-lingual | 5,040,134 instructions collected from diverse NLP tasks | - |
| BELLE | 10M | Chinese | The 10M Chinese dataset is composed of subsets spanning multiple (instruction) types and multiple fields. | Research only; OpenAI terms of use |
| Firefly | 1.6M | Chinese | 1,649,398 Chinese instructions in 23 NLP tasks | - |
| OIG-43M Dataset | 43M | Multi-lingual | Together, LAION, and Ontocord.ai. | - |
| xP3 | 79M | Multi-lingual | 78,883,588 instructions collected by prompts & datasets across 46 languages & 16 NLP tasks | - |
| CodeParrot | - | python | The database was queried for all Python files with less than 1MB in size resulting in a 180GB dataset with over 20M files. | - |
| Alpaca-CoT Dataset | - | Multi-lingual | Instruction Data Collection | ODC-By |
| stack-exchange-paired | - | English | This dataset contains questions and answers from the Stack Overflow Data Dump for the purpose of preference model training. | cc-by-sa-4.0 |
| LangChainDatasets | - | English | This is a community-driven dataset repository for datasets that can be used to evaluate LangChain chains and agents. | - |
| ParlAI | - | English | 100+ popular datasets available all in one place, dialogue models, from open-domain chitchat, to task-oriented dialogue, to visual question answering. | - |
| GPTeacher | - | English | A collection of modular datasets generated by GPT-4, General-Instruct - Roleplay-Instruct - Code-Instruct - and Toolformer | - |
| silk-road/Wizard-LM-Chinese-instruct-evol | - | Chinese | Wizard-LM-Chinese | - |
| MultiWOZ | - | English | Multi-Domain Wizard-of-Oz dataset (MultiWOZ), a fully-labeled collection of human-human written conversations spanning over multiple domains and topics. | apache-2.0 |
Collected summary
Dataset introduction

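Construction
The TheoremQA dataset is built from 800 carefully annotated question-answer pairs covering more than 350 theorems across mathematics, electrical engineering and computer science (EE&CS), physics, and finance. A systematic annotation process gives the dataset broad coverage across disciplines and provides high-quality academic QA data for training large language models.
Features
TheoremQA is distinguished by the breadth of its subject areas and the depth of its theorem coverage. It includes question-answer pairs not only from traditional disciplines such as mathematics and physics but also from applied fields such as EE&CS and finance, which makes it both diverse and practical. Its moderate size also allows efficient training and validation when resources are limited.
Usage
TheoremQA can be used to train and validate large language models, particularly in scenarios that involve complex academic problems and theorem proving. Users can access the dataset directly through the Hugging Face platform and use Python scripts for preprocessing and model training. Its question-answer format makes it well suited to developing and testing question-answering systems, building knowledge graphs, and creating intelligent assistants for academic domains.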
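The preprocessing step mentioned above could look roughly like the following sketch. It assumes each record exposes `Question` and `Answer` fields and that the dataset is published on the Hugging Face Hub as `TIGER-Lab/TheoremQA` with a `test` split; none of these names appear on this page, so check them against the dataset card. The helper `to_prompt_pairs` is a hypothetical name introduced here for illustration.

```python
from typing import Dict, Iterable, List

# Assumed field names; verify against the dataset card before relying on them.
QUESTION_KEY = "Question"
ANSWER_KEY = "Answer"


def _clean(value: object) -> str:
    """Convert a raw field to a stripped string, treating None as empty."""
    return "" if value is None else str(value).strip()


def to_prompt_pairs(records: Iterable[Dict[str, object]]) -> List[Dict[str, str]]:
    """Turn raw QA records into prompt/target pairs, skipping incomplete rows."""
    pairs = []
    for record in records:
        question = _clean(record.get(QUESTION_KEY))
        answer = _clean(record.get(ANSWER_KEY))
        if not question or not answer:
            continue  # drop rows with a missing question or answer
        pairs.append({"prompt": f"Question: {question}\nAnswer:", "target": answer})
    return pairs


def load_theoremqa(repo_id: str = "TIGER-Lab/TheoremQA", split: str = "test"):
    """Load the dataset from the Hugging Face Hub.

    Requires `pip install datasets` and network access.
    The default repo id and split are assumptions, not taken from this page.
    """
    from datasets import load_dataset  # lazy import so the helper above works offline

    return load_dataset(repo_id, split=split)


if __name__ == "__main__":
    # Hand-written sample records, not real dataset rows.
    sample = [
        {"Question": "What is the derivative of x^2 at x = 3?", "Answer": 6},
        {"Question": "", "Answer": "skipped"},
    ]
    for pair in to_prompt_pairs(sample):
        print(pair)
```

Once `load_theoremqa()` succeeds, its result can be passed straight to `to_prompt_pairs`, since iterating a loaded split yields one dictionary per record.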
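Background and challenges
Background overview
TheoremQA was created by Wenhu Chen et al. and focuses on theorem-related question-answer pairs in mathematics, EE&CS, physics, and finance. It contains 800 QA pairs covering more than 350 theorems and aims to give large language models high-quality data for understanding and applying theorems. Its creation enriches the pool of theorem-related data and supports research on automated reasoning in mathematics and across disciplines.
Current challenges
Building TheoremQA involved several challenges. First, the complexity and diversity of the theorems require annotation that is both precise and comprehensive, so that the QA pairs are accurate and useful. Second, covering theorems across disciplines adds complexity and requires annotators with a broad knowledge background. Finally, ensuring that the dataset remains general and effective across different application scenarios is another significant challenge.
Common scenarios
Classic use cases
The classic use of TheoremQA is theorem verification and problem solving in mathematics, EE&CS, physics, and finance. With 800 QA pairs covering more than 350 theorems, it gives researchers and developers a resource for training and evaluating how well large language models handle complex mathematical and scientific problems.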
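Evaluating a model on QA pairs like these comes down to comparing predicted answers with reference answers. The sketch below is one simple way to do that and is not the official TheoremQA metric: it compares numerically with a relative tolerance when both answers parse as numbers and otherwise falls back to a case-insensitive string match. The function names and the 1% tolerance are assumptions chosen for illustration.

```python
import math
from typing import Iterable, Tuple


def answers_match(predicted: str, gold: str, rel_tol: float = 1e-2) -> bool:
    """Return True when a predicted answer agrees with the reference answer.

    Numeric answers are compared with a relative tolerance (an assumed 1%);
    anything else is compared as a case-insensitive, whitespace-trimmed string.
    """
    try:
        return math.isclose(float(predicted), float(gold), rel_tol=rel_tol, abs_tol=1e-9)
    except ValueError:
        return predicted.strip().lower() == gold.strip().lower()


def accuracy(pairs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of (predicted, gold) pairs that match; 0.0 for an empty input."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(answers_match(p, g) for p, g in pairs) / len(pairs)


if __name__ == "__main__":
    # Hand-written examples, not real model outputs or dataset rows.
    results = [("3.14", "3.1416"), ("True", "true"), ("5", "7")]
    print(f"accuracy = {accuracy(results):.3f}")
```

A tolerance-based comparison avoids penalising rounding differences in numeric answers, but answers given as lists or symbolic expressions would need their own handling.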
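Practical applications
TheoremQA has potential applications in education, scientific research, and financial analysis. In education, it can support intelligent tutoring systems that help students understand and apply complex mathematical and scientific theorems. In research, it can help researchers with theorem verification and problem solving, improving research efficiency. In financial analysis, it can support the validation of complex financial models and make financial decisions more rigorous and accurate.
Derived and related work
The release of TheoremQA has prompted a body of related work, particularly on theorem verification and question-answering systems. Many researchers use the dataset to train and evaluate models, developing more efficient theorem-verification algorithms and QA models. TheoremQA has also stimulated research on integrating knowledge from multiple fields and advanced the development of interdisciplinary intelligent systems. These derived works improve model performance and offer new ideas and methods for research in related areas.
The content above was collected and summarized by 遇见数据集.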



