TheoremQA, lima, im-feeling-curious, Puffin, cc_sbu_align, qa_feedback, SLF5K, blended_skill_talk, GSM-IC, ChatAlpaca, PKU-SafeRLHF-10K, Dolly, WebGPT, Code Alpaca, openapi-function-invocations-25k, LongForm

Source: GitHub · Updated 2024-04-27 · Indexed 2024-05-31

Download link:

https://github.com/voidful/awesome-chatgpt-dataset

Resource overview:

TheoremQA: We annotated 800 QA pairs covering more than 350 theorems across mathematics, EE&CS, physics, and finance.
lima: LIMA: Less Is More for Alignment.
im-feeling-curious: A public dataset extracted from Google's "I'm Feeling Curious" feature.
Puffin: The Puffin dataset. Exactly 3,000 examples, each response created using GPT-4.
cc_sbu_align: The MiniGPT-4 dataset.
qa_feedback: We reconstructed the ASQA data and collected human feedback for it. We named the resulting dataset qa-feedback.
SLF5K: An English dataset for abstractive summarization, containing 5K unique samples.
blended_skill_talk: A dataset of 7k conversations designed to exhibit multiple conversation modes: displaying personality, having empathy, and demonstrating knowledge.
GSM-IC: Grade-School Math with Irrelevant Context (GSM-IC).
ChatAlpaca: The data currently contain a total of 10,000 conversations with 95,558 utterances.
PKU-SafeRLHF-10K: The first dataset of its kind, containing 10k instances with safety preferences.
Dolly: A corpus of more than 15,000 records generated by thousands of Databricks employees to enable large language models to exhibit the magical interactivity of ChatGPT.
WebGPT: The dataset of all comparisons marked as suitable for reward modeling by the end of the WebGPT project.
Code Alpaca: A code generation task involving 20,022 samples.
openapi-function-invocations-25k: The construction of this dataset involved a systematic procedure combining manual extraction and AI-assisted synthesis.
LongForm: The LongForm dataset, created by leveraging English corpus examples with augmented instructions.

Created:

2023-04-22

Summary of original information

Dataset overview

Dataset list

Dataset name	Size	Language	Description	License
TheoremQA	1K	English	We annotated 800 QA pairs covering 350+ theorems spanning across Math, EE&CS, Physics and Finance.	mit
lima	1K	English	LIMA: Less Is More for Alignment	CC BY-NC-SA
im-feeling-curious	3K	English	This public dataset is an extract from Google's "I'm Feeling Curious" feature.	-
Puffin	3K	English	Puffin dataset. Exactly 3,000 examples with each response created using GPT-4.	apache-2.0
cc_sbu_align	4K	English	MiniGPT-4 dataset	BSD 3-Clause License
qa_feedback	4K	English	We reconstruct the ASQA data and collect human feedback for it. We name the resulting dataset qa-feedback.	-
SLF5K	5K	English	The Summarization with Language Feedback (SLF5K) dataset is an English-language dataset containing 5K unique samples that can be used for the task of abstractive summarization.	apache-2.0
blended_skill_talk	7K	English	A dataset of 7k conversations explicitly designed to exhibit multiple conversation modes: displaying personality, having empathy, and demonstrating knowledge.	-
GSM-IC	8K	English	Grade-School Math with Irrelevant Context (GSM-IC)	-
ChatAlpaca	10K	English	The data currently contain a total of 10,000 conversations with 95,558 utterances.	Apache-2.0 license
PKU-SafeRLHF-10K	10K	English	PKU-SafeRLHF-10K, which is the first dataset of its kind and contains 10k instances with safety preferences.	-
Dolly	15K	English	databricks-dolly-15k is a corpus of more than 15,000 records generated by thousands of Databricks employees to enable large language models to exhibit the magical interactivity of ChatGPT.	CC 3.0
WebGPT	20K	English	This is the dataset of all comparisons that were marked as suitable for reward modeling by the end of the WebGPT project.	-
Code Alpaca	20K	English	Code generation task involving 20,022 samples	-
openapi-function-invocations-25k	25K	English	The construction of this dataset involved a systematic procedure combining manual extraction and AI-assisted synthesis.	mit
LongForm	28K	English	The LongForm dataset is created by leveraging English corpus examples with augmented instructions.	The LongForm project is subject to a MIT License with custom limitations for restrictions imposed by OpenAI (for the instruction generation part), as well as the license of language models (OPT, LLaMA, and T5).
chatbot_arena_conversations	33K	English	This dataset contains 33K cleaned conversations with pairwise human preferences. It was collected from 13K unique IP addresses on the Chatbot Arena from April to June 2023.	-
HC3	37K	English, Chinese	37,175 instructions generated by ChatGPT and human	-
Anthropic_HH_Golden	45K	English	This repository contains a new preference dataset extending the harmless dataset of Anthropic's Helpful and Harmless (HH) datasets. The original positive responses in HH are generated by a supervised fine-tuned model of Anthropic, where harmful and unhelpful responses are frequently encountered. In this dataset, the positive responses are replaced by rewritten responses generated by GPT-4.	-
Mol-Instructions	48K	English	An open, large-scale biomolecular instruction dataset for large language models.	CC BY 4.0
RefGPT	50K	English, Chinese	We introduce a cost-effective method called RefGPT, which generates a vast amount of high-quality multi-turn Q&A content.	-
arxiv-math-instruct-50k	50K	English	Dataset consists of question-answer pairs derived from ArXiv abstracts from math categories	-
Traditional Chinese Alpaca Dataset	52K	Traditional Chinese	Translated from Alpaca Data by ChatGPT API	Apache-2.0 license
Cabrita Dataset	52K	Portuguese	Translated from Alpaca Data	-
Japanese Alpaca Dataset	52K	Japanese	Translated from Alpaca Data by ChatGPT API	CC By NC 4.0; OpenAI terms of use
Alpaca Dataset	52K	English	175 seed instructions by OpenAI API	CC By NC 4.0; OpenAI terms of use
Alpaca Data Cleaned	52K	English	Revised version of Alpaca Dataset	-
Alpaca GPT-4 Data	52K	English	Generated by GPT-4 using Alpaca prompts	-
Alpaca GPT-4 Data (Chinese)	52K	Chinese	Generated by GPT-4 using Chinese prompts translated from Alpaca by ChatGPT	-
Dynosaur	66K	English	Dynosaur, a dynamic growth paradigm for instruction-tuning data curation.	Apache-2.0 license
Finance	69K	English	68,912 financial related instructions	-
evol	70K	English	This is the training data of WizardLM.	-
Vicuna Dataset	75K	English	~100k ShareGPT conversations	-
InstructionTranslation	80K	Multi-lingual	Translations were generated by M2M 12B, and the output generations were limited to 512 tokens due to the VRAM limit (40G).	MIT
Self-Instruct	82K	English	We release a dataset that contains 52k instructions, paired with 82K instance inputs and outputs.	-
OASST1	89K	Multi-lingual	a human-generated, human-annotated assistant-style conversation corpus consisting of 161,443 messages in 35 different languages, annotated with 461,292 quality ratings, resulting in over 10,000 fully annotated conversation trees.	apache-2.0
HH-RLHF	91K	English	The data are described in the paper: Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback.	MIT
Guanaco Dataset	98K	English, Simplified Chinese, Traditional Chinese HK & TW, Japanese	175 tasks from the Alpaca model	GPLv3
InstructionWild	104K	English, Chinese	429 seed instructions, following the Alpaca approach to generate 52K	Research only; OpenAI terms of use
Camel Dataset	107K	Multi-lingual	Role-playing between AIs (Open AI API)	-
Tapir-Cleaned	117K	English	This is a revised version of the DAISLab dataset of IFTTT rules, which has been thoroughly cleaned, scored, and adjusted for the purpose of instruction-tuning.	CC BY-NC 4.0
WizardLM_evol_instruct_V2_196k	143K	English	This dataset contains a 143K mixture of evolved data from Alpaca and ShareGPT.	-
LLaVA Visual Instruct	150K	English	LLaVA Visual Instruct 150K is a set of GPT-generated multimodal instruction-following data. It is constructed for visual instruction tuning and for building large multimodal models towards GPT-4 vision/language capability.	cc-by-nc-4.0
Prosocial Dialog	166K	English	165,681 instructions produced from GPT-3 rewritten questions and human feedback	-
COIG	191K	Chinese	Chinese Open Instruction Generalist (COIG) project to maintain a harmless, helpful, and diverse set of Chinese instruction corpora.	apache-2.0
orca-chat	198K	English	This is a cleaned, pruned, and clustered version of orca to form a conversation-style dataset. The process involves removing samples with very high similarity and also grouping instructions to form conversations.	-
Unnatural Instructions	241K	English	A large dataset of creative and diverse instructions, collected with virtually no human labor.	MIT
SHP	358K	English	SHP is a dataset of 385K collective human preferences over responses to questions/instructions in 18 different subject areas, from cooking to legal advice.	Reddit non-exclusive, non-transferable, non-sublicensable, and revocable license
dromedary	361K	English	Dromedary-Verbose-Clone is a synthetic dataset of 360k instructions and demonstrations.	cc-by-nc-4.0
ultrachat	404K	English	To ensure generation quality, two separate ChatGPT Turbo APIs are adopted in generation, where one plays the role of the user to generate queries and the other generates the response.	cc-by-nc-4.0
ign_clean_instruct_dataset_500k	509K	English	This dataset contains ~508k prompt-instruction pairs with high quality responses. It was synthetically created from a subset of Ultrachat prompts. It does not contain any alignment focused responses or NSFW content.	apache-2.0
ELI5	559K	English	The ELI5 dataset is an English-language dataset of questions and answers gathered from three subreddits where users ask factual questions requiring paragraph-length or longer answers.	-
GPT4All Dataset	806K	Multi-lingual	Subset of LAION OIG, StackOverflow Question, and BigScience/P3 datasets. Answered by OpenAI API.	-
Instruct	889K	English	888,969 English instructions, augmentation using AllenAI NLP tools	MIT
MOSS	1M	Chinese	Generated by gpt-3.5-turbo	Apache-2.0, AGPL-3.0 licenses
LaMini-Instruction	3M	English	a total of 2.58M pairs of instructions and responses using gpt-3.5-turbo based on several existing resources of prompts	cc-by-nc-4.0
OpenOrca	3M	English	The OpenOrca dataset is a collection of augmented FLAN Collection data. Currently ~1M GPT-4 completions, and ~3.2M GPT-3.5 completions.	-
Natural Instructions	5M	Multi-lingual	5,040,134 instructions collected from diverse NLP tasks	-
BELLE	10M	Chinese	The 10M Chinese dataset is composed of subsets spanning multiple (instruction) types and multiple fields.	Research only; OpenAI terms of use
Firefly	16M	Chinese	1,649,398 Chinese instructions in 23 NLP tasks	-
OIG-43M Dataset	43M	Multi-lingual	Together, LAION, and Ontocord.ai.	-
xP3	79M	Multi-lingual	78,883,588 instructions collected by prompts & datasets across 46 languages & 16 NLP tasks	-
CodeParrot	-	Python	The database was queried for all Python files less than 1 MB in size, resulting in a 180 GB dataset with over 20M files.	-
Alpaca-CoT Dataset	-	Multi-lingual	Instruction Data Collection	ODC-By
stack-exchange-paired	-	English	This dataset contains questions and answers from the Stack Overflow Data Dump for the purpose of preference model training.	cc-by-sa-4.0
LangChainDatasets	-	English	This is a community-driven dataset repository for datasets that can be used to evaluate LangChain chains and agents.	-
ParlAI	-	English	100+ popular datasets available all in one place, dialogue models, from open-domain chitchat, to task-oriented dialogue, to visual question answering.	-
GPTeacher	-	English	A collection of modular datasets generated by GPT-4: General-Instruct, Roleplay-Instruct, Code-Instruct, and Toolformer	-
silk-road/Wizard-LM-Chinese-instruct-evol	-	Chinese	Wizard-LM-Chinese	-
MultiWOZ	-	English	Multi-Domain Wizard-of-Oz dataset (MultiWOZ), a fully-labeled collection of human-human written conversations spanning over multiple domains and topics.	apache-2.0
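
To filter this list programmatically, the tab-separated rows can be parsed into records. The following is a minimal sketch, not part of the original repository: the function names `parse_size` and `load_catalog` are made up for illustration, the four sample rows are copied from the table above, and a row with no fifth cell is assumed to mean that no license was listed.

```python
# Four rows copied from the table above; a missing fifth cell is treated as "-".
CATALOG_TSV = "\n".join([
    "TheoremQA\t1K\tEnglish\tWe annotated 800 QA pairs covering 350+ theorems spanning across Math, EE&CS, Physics and Finance.\tmit",
    "HC3\t37K\tEnglish, Chinese\t37,175 instructions generated by ChatGPT and human\t-",
    "Cabrita Dataset\t52K\tPortuguese\tTranslated from Alpaca Data",
    "MOSS\t1M\tChinese\tGenerated by gpt-3.5-turbo\tApache-2.0, AGPL-3.0 licenses",
])

SIZE_MULTIPLIERS = {"K": 1_000, "M": 1_000_000}


def parse_size(cell):
    """Turn a size cell such as '52K' or '3M' into an int; '-' or empty gives None."""
    cell = cell.strip()
    if cell in ("", "-"):
        return None
    suffix = cell[-1].upper()
    if suffix in SIZE_MULTIPLIERS:
        return int(float(cell[:-1]) * SIZE_MULTIPLIERS[suffix])
    return int(cell)


def load_catalog(tsv_text):
    """Parse tab-separated rows (name, size, language, description, license)."""
    rows = []
    for line in tsv_text.splitlines():
        if not line.strip():
            continue
        cells = line.split("\t")
        cells += ["-"] * (5 - len(cells))  # pad a missing license cell
        name, size, languages, description, license_ = cells[:5]
        rows.append({
            "name": name.strip(),
            "size": parse_size(size),
            "languages": [lang.strip() for lang in languages.split(",")],
            "description": description.strip(),
            "license": license_.strip(),
        })
    return rows


rows = load_catalog(CATALOG_TSV)
chinese = [row["name"] for row in rows if "Chinese" in row["languages"]]
print(chinese)  # ['HC3', 'MOSS']
```

The size column is approximate (rounded to K or M), so the parsed integers are only suitable for coarse filtering and sorting, not as exact sample counts.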

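Collected summary

Dataset introduction

Construction

TheoremQA was built by carefully annotating 800 question-answer pairs that cover more than 350 theorems across mathematics, electrical engineering and computer science (EE&CS), physics, and finance. A systematic annotation process gives the dataset broad coverage across disciplines and depth within each, providing high-quality academic QA data for training large language models.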

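Features

The distinguishing features of TheoremQA are the breadth of its disciplines and the depth of its theorem coverage. Besides QA pairs from traditional disciplines such as mathematics and physics, it extends to applied fields such as EE&CS and finance, which makes the dataset diverse and practical. Its moderate size also allows efficient training and validation when resources are limited.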

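Usage

TheoremQA can be used to train and validate large language models, particularly in scenarios that involve complex academic problems and theorem proving. Users can access the dataset directly on the Hugging Face platform and use Python scripts for preprocessing and model training. Its question-answer format makes it well suited to developing and testing question-answering systems, building knowledge graphs, and creating intelligent assistants for academic domains.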
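
The preprocessing step mentioned above might look roughly like the sketch below. It runs on in-memory records so it needs no network access. The field names (`question`, `answer`, `field`), the sample records, and the helper names `to_prompt` and `build_examples` are all assumptions made for illustration and are not taken from the dataset. In practice the records would come from the Hugging Face copy of the dataset, whose identifier and schema are not given here and should be checked on the dataset page.

```python
# Made-up placeholder records; the real dataset's schema may differ.
SAMPLE_RECORDS = [
    {"question": "  What is the derivative of x^2 at x = 3? ", "answer": 6, "field": "Math"},
    {"question": "Is every bounded monotone sequence convergent?", "answer": "True", "field": "Math"},
    {"question": "A record with no answer is skipped.", "answer": "", "field": "Physics"},
]


def to_prompt(record):
    """Format one QA record as a plain-text prompt (hypothetical template)."""
    return f"Field: {record['field']}\nQuestion: {record['question'].strip()}\nAnswer:"


def build_examples(records):
    """Return prompt/target pairs, dropping records that lack a question or an answer."""
    examples = []
    for record in records:
        question = str(record.get("question", "")).strip()
        answer = str(record.get("answer", "")).strip()
        if not question or not answer:
            continue
        examples.append({"prompt": to_prompt(record), "target": answer})
    return examples


examples = build_examples(SAMPLE_RECORDS)
print(len(examples))          # 2
print(examples[0]["prompt"])
```

Answers are converted to strings so that numeric and boolean answers share one target format. Whether that is appropriate depends on how the model output is scored.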

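Background and challenges

Background overview

TheoremQA was created by Wenhu Chen et al. and focuses on theorem-related QA pairs in mathematics, EE&CS, physics, and finance. It contains 800 QA pairs covering more than 350 theorems and aims to give large language models high-quality training data for understanding and applying theorems. Its creation enriches the pool of theorem-related data and supports research on automated reasoning in mathematics and across disciplines.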

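Current challenges

Building TheoremQA posed several challenges. First, the complexity and diversity of the theorems require annotation that is both precise and comprehensive so that the QA pairs are accurate and useful. Second, covering theorems across disciplines adds complexity and requires researchers with a broad knowledge background. Finally, ensuring that the dataset remains general and effective across different application scenarios is an important open challenge.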

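Common scenarios

Typical use cases

The typical use cases for TheoremQA center on theorem verification and problem solving in mathematics, EE&CS, physics, and finance. With 800 QA pairs covering more than 350 theorems, the dataset gives researchers and developers a resource for training and evaluating how well large language models handle complex mathematical and scientific problems.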

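Practical applications

TheoremQA has potential applications in education, scientific research, and financial analysis. In education, it can be used to develop intelligent tutoring systems that help students understand and apply complex mathematical and scientific theorems. In research, it can serve as a tool that assists with theorem verification and problem solving, improving research efficiency. In financial analysis, it can support the validation of complex financial models, making financial decisions more rigorous and accurate.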

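Derived and related work

The release of TheoremQA has prompted a body of related work, particularly in theorem verification and question-answering systems. Many researchers have used the dataset for model training and evaluation, developing more efficient theorem-verification algorithms and QA models. TheoremQA has also stimulated research on integrating knowledge from multiple fields and advanced the development of interdisciplinary intelligent systems. This derived work has improved model performance and offered new ideas and methods for research in related fields.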

The content above was collected and summarized by 遇见数据集.