sft_datablend_v1
收藏魔搭社区2025-10-09 更新2025-01-25 收录
下载链接:
https://modelscope.cn/datasets/nv-community/sft_datablend_v1
下载链接
链接失效反馈官方服务:
资源简介:
# Dataset Card
This dataset is a blend of publicly available datasets for instruction tuning, including samples from OASST, CodeContests, FLAN, T0, Open_Platypus, and GSM8K.
Note that for datasets consisting of multiple subsets, we only include subsets with permissive license for commercial use.
As a data blend, some subsets may have been sampled for more than one epoch depending on sampling ratios and dataset sizes.
## Dataset
The dataset consists of four columns:
1. conversations: user and assistant turns in a conversational format
2. mask: the turns that losses are not calculated on ("User" by default)
3. system: system prompt (empty by default)
4. dataset: dataset source
## License
The detailed license information for all the data sources utilized in the blend are listed below.
It is usable for commercial purposes as long as you follow the terms of the licenses.
| Dataset Name | License Type
| -------- | -------- |
| [OASST](https://huggingface.co/datasets/OpenAssistant/oasst1) | Apache-2.0 |
| [CodeContests](https://github.com/google-deepmind/code_contests) | CC-BY-4.0 |
| [MNLI](https://huggingface.co/datasets/multi_nli) | OANC / Creative Commons Share-Alike 3.0 Unported / Creative Commons Attribution 3.0 Unported |
| [QNLI](https://gluebenchmark.com/tasks) | CC-BY-SA-4.0 |
| [WNLI](https://cs.nyu.edu/~davise/papers/WinogradSchemas/WS.html) | Creative Commons Attribution 4.0 International License |
| [BooLQ](https://huggingface.co/datasets/google/boolq) | CC-BY-SA-3.0 |
| [DROP](https://paperswithcode.com/dataset/drop) | CC-BY-SA-4.0 |
| [OpenbookQA](https://github.com/allenai/OpenBookQA) | Apache-2.0 |
| [SQuAD v1](https://paperswithcode.com/dataset/squad) | CC-BY-SA-4.0 |
| [SQuAD v2](https://paperswithcode.com/dataset/squad) | CC-BY-SA-4.0 |
| [COPA](https://people.ict.usc.edu/~gordon/copa.html) | BSD 2-Clause License |
| [HellaSwag](https://github.com/rowanz/hellaswag/blob/master) | MIT |
| [PIQA](https://yonatanbisk.com/piqa/) |Academic Free License (“AFL”) v. 3.0 |
| [StoryCloze](https://cs.rochester.edu/nlp/rocstories/) | [Custom](https://docs.google.com/forms/d/e/1FAIpQLSe83zPs21IGH9-HC1SuUa2hfyopJOHgTHft--Ne4SOj0VoViA/viewform?c=0&w=1) |
| [ARC](https://huggingface.co/datasets/ai2_arc) | CC-BY-SA-4.0 |
| [NQ](https://huggingface.co/datasets/nq_open) | CC-BY-SA-3.0 |
| [TriviaQA](https://github.com/mandarjoshi90/triviaqa) | Apache-2.0 |
| [Paws Wiki](https://github.com/google-research-datasets/paws) | [Custom](https://github.com/google-research-datasets/paws/blob/master/LICENSE) |
| [Winogrande](https://winogrande.allenai.org/) | CC-BY |
| [WSC273](https://cs.nyu.edu/~davise/papers/WinogradSchemas/WS.html) | Creative Commons Attribution 4.0 International License |
| [CosmosQA](https://wilburone.github.io/cosmos/) | CC-BY-4.0 |
| [ReCoRD CNN/Daily Mail](https://sheng-z.github.io/ReCoRD-explorer/) | Apache-2.0 |
| [DART](https://github.com/Yale-LILY/dart) | MIT |
| [E2ENLG](https://github.com/tuetschek/e2e-dataset) | CC-BY-SA-4.0 |
| [QuAC](https://quac.ai/) | CC-BY-SA-4.0 |
| [Mathematics](https://github.com/deepmind/mathematics_dataset) | Apache-2.0 |
| [SNLI](https://nlp.stanford.edu/projects/snli/) | CC-BY-SA-4.0 |
| [Adversarial QA](https://huggingface.co/datasets/adversarial_qa) | CC-BY-SA-4.0 |
| [Amazon Polarity](https://huggingface.co/datasets/amazon_polarity) | Apache-2.0 |
| [DBPedia](https://huggingface.co/datasets/dbpedia_14) | CC-BY-SA-3.0 |
| [DuoRC](https://huggingface.co/datasets/duorc) | MIT |
| [Hotpot QA](https://huggingface.co/datasets/kilt_tasks/viewer/hotpotqa) | MIT |
| [QASC](https://huggingface.co/datasets/qasc) | CC-BY-4.0 |
| [Quarel](https://allenai.org/data/quarell) | CC-BY |
| [QuaRTz](https://allenai.org/data/quartz) | CC-BY |
| [Quoref](https://huggingface.co/datasets/quoref) | CC-BY-4.0 |
| [ROPES](https://huggingface.co/datasets/ropes) | CC-BY-4.0 |
| [Social IQA](https://allenai.org/data/socialiqa) | CC-BY |
| [Wiki Bio](https://huggingface.co/datasets/wiki_bio) | CC-BY-SA-3.0 |
| [Wiki Hop](https://huggingface.co/datasets/wiki_hop) | CC-BY-SA-3.0 |
| [ARB](https://github.com/TheDuckAI/arb) | CC-BY-4.0 |
| [tigerbot-kaggle-leetcodesolutions-en-2k](https://huggingface.co/datasets/TigerResearch/tigerbot-kaggle-leetcodesolutions-en-2k) | Apache-2.0 |
| [SciBench](https://github.com/mandyyyyii/scibench) | MIT |
| [PRM800K](https://github.com/openai/prm800k) | MIT |
| [GSM8K](https://github.com/openai/grade-school-math) | MIT |
# 数据集卡片
本数据集为面向**指令微调(instruction tuning)**的公开数据集混合集合,涵盖源自OASST、CodeContests、FLAN、T0、Open_Platypus以及GSM8K的样本。
请注意,对于包含多个子数据集的数据源,我们仅保留允许商用的宽松许可子集。
作为混合数据集,部分子集可能会根据采样比例与数据集规模进行多轮次采样。
## 数据集
本数据集包含四列:
1. **对话(conversations)**:采用对话格式的用户与助手交互轮次
2. **掩码(mask)**:无需计算损失的交互轮次(默认设置为“用户”轮次)
3. **系统提示(system prompt)**:系统提示词(默认为空)
4. **数据集来源(dataset source)**:该样本所属的数据集名称
## 许可
本混合数据集所使用的全部数据源的详细许可信息如下所列:
只要遵守各许可的条款,即可将其用于商业用途。
| 数据集名称 | 许可类型 |
| -------- | -------- |
| [OASST](https://huggingface.co/datasets/OpenAssistant/oasst1) | Apache许可证2.0版(Apache-2.0) |
| [CodeContests](https://github.com/google-deepmind/code_contests) | 知识共享署名4.0国际许可协议(CC-BY-4.0) |
| [MNLI](https://huggingface.co/datasets/multi_nli) | 开放美国国家语料库(OANC)/知识共享署名-相同方式共享3.0未移植版(CC BY-SA 3.0)/知识共享署名3.0未移植版(CC BY 3.0) |
| [QNLI](https://gluebenchmark.com/tasks) | 知识共享署名-相同方式共享4.0国际许可协议(CC-BY-SA-4.0) |
| [WNLI](https://cs.nyu.edu/~davise/papers/WinogradSchemas/WS.html) | 知识共享署名4.0国际许可协议(CC BY 4.0) |
| [BooLQ](https://huggingface.co/datasets/google/boolq) | 知识共享署名-相同方式共享3.0未移植版(CC-BY-SA-3.0) |
| [DROP](https://paperswithcode.com/dataset/drop) | 知识共享署名-相同方式共享4.0国际许可协议(CC-BY-SA-4.0) |
| [OpenbookQA](https://github.com/allenai/OpenBookQA) | Apache许可证2.0版(Apache-2.0) |
| [SQuAD v1](https://paperswithcode.com/dataset/squad) | 知识共享署名-相同方式共享4.0国际许可协议(CC-BY-SA-4.0) |
| [SQuAD v2](https://paperswithcode.com/dataset/squad) | 知识共享署名-相同方式共享4.0国际许可协议(CC-BY-SA-4.0) |
| [COPA](https://people.ict.usc.edu/~gordon/copa.html) | BSD 2条款许可证(BSD 2-Clause License) |
| [HellaSwag](https://github.com/rowanz/hellaswag/blob/master) | MIT许可证(MIT) |
| [PIQA](https://yonatanbisk.com/piqa/) | 学术自由许可证3.0版(Academic Free License v.3.0) |
| [StoryCloze](https://cs.rochester.edu/nlp/rocstories/) | [自定义许可](https://docs.google.com/forms/d/e/1FAIpQLSe83zPs21IGH9-HC1SuUa2hfyopJOHgTHft--Ne4SOj0VoViA/viewform?c=0&w=1) |
| [ARC](https://huggingface.co/datasets/ai2_arc) | 知识共享署名-相同方式共享4.0国际许可协议(CC-BY-SA-4.0) |
| [NQ](https://huggingface.co/datasets/nq_open) | 知识共享署名-相同方式共享3.0未移植版(CC-BY-SA-3.0) |
| [TriviaQA](https://github.com/mandarjoshi90/triviaqa) | Apache许可证2.0版(Apache-2.0) |
| [Paws Wiki](https://github.com/google-research-datasets/paws) | [自定义许可](https://github.com/google-research-datasets/paws/blob/master/LICENSE) |
| [Winogrande](https://winogrande.allenai.org/) | 知识共享署名许可(CC-BY) |
| [WSC273](https://cs.nyu.edu/~davise/papers/WinogradSchemas/WS.html) | 知识共享署名4.0国际许可协议(CC BY 4.0) |
| [CosmosQA](https://wilburone.github.io/cosmos/) | 知识共享署名4.0国际许可协议(CC-BY-4.0) |
| [ReCoRD CNN/Daily Mail](https://sheng-z.github.io/ReCoRD-explorer/) | Apache许可证2.0版(Apache-2.0) |
| [DART](https://github.com/Yale-LILY/dart) | MIT许可证(MIT) |
| [E2ENLG](https://github.com/tuetschek/e2e-dataset) | 知识共享署名-相同方式共享4.0国际许可协议(CC-BY-SA-4.0) |
| [QuAC](https://quac.ai/) | 知识共享署名-相同方式共享4.0国际许可协议(CC-BY-SA-4.0) |
| [Mathematics](https://github.com/deepmind/mathematics_dataset) | Apache许可证2.0版(Apache-2.0) |
| [SNLI](https://nlp.stanford.edu/projects/snli/) | 知识共享署名-相同方式共享4.0国际许可协议(CC-BY-SA-4.0) |
| [Adversarial QA](https://huggingface.co/datasets/adversarial_qa) | 知识共享署名-相同方式共享4.0国际许可协议(CC-BY-SA-4.0) |
| [Amazon Polarity](https://huggingface.co/datasets/amazon_polarity) | Apache许可证2.0版(Apache-2.0) |
| [DBPedia](https://huggingface.co/datasets/dbpedia_14) | 知识共享署名-相同方式共享3.0未移植版(CC-BY-SA-3.0) |
| [DuoRC](https://huggingface.co/datasets/duorc) | MIT许可证(MIT) |
| [Hotpot QA](https://huggingface.co/datasets/kilt_tasks/viewer/hotpotqa) | MIT许可证(MIT) |
| [QASC](https://huggingface.co/datasets/qasc) | 知识共享署名4.0国际许可协议(CC-BY-4.0) |
| [Quarel](https://allenai.org/data/quarell) | 知识共享署名许可(CC-BY) |
| [QuaRTz](https://allenai.org/data/quartz) | 知识共享署名许可(CC-BY) |
| [Quoref](https://huggingface.co/datasets/quoref) | 知识共享署名4.0国际许可协议(CC-BY-4.0) |
| [ROPES](https://huggingface.co/datasets/ropes) | 知识共享署名4.0国际许可协议(CC-BY-4.0) |
| [Social IQA](https://allenai.org/data/socialiqa) | 知识共享署名许可(CC-BY) |
| [Wiki Bio](https://huggingface.co/datasets/wiki_bio) | 知识共享署名-相同方式共享3.0未移植版(CC-BY-SA-3.0) |
| [Wiki Hop](https://huggingface.co/datasets/wiki_hop) | 知识共享署名-相同方式共享3.0未移植版(CC-BY-SA-3.0) |
| [ARB](https://github.com/TheDuckAI/arb) | 知识共享署名4.0国际许可协议(CC-BY-4.0) |
| [tigerbot-kaggle-leetcodesolutions-en-2k](https://huggingface.co/datasets/TigerResearch/tigerbot-kaggle-leetcodesolutions-en-2k) | Apache许可证2.0版(Apache-2.0) |
| [SciBench](https://github.com/mandyyyyii/scibench) | MIT许可证(MIT) |
| [PRM800K](https://github.com/openai/prm800k) | MIT许可证(MIT) |
| [GSM8K](https://github.com/openai/grade-school-math) | MIT许可证(MIT) |
提供机构:
maas
创建时间:
2025-01-20



