botp/Open-Platypus

Name: botp/Open-Platypus
Creator: botp
Published: 2023-08-17 08:56:35
License: no description provided

Hugging Face | Updated 2023-08-17 | Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/botp/Open-Platypus

Resource summary:

The OpenPlatypus dataset is designed to improve the logical reasoning of large language models (LLMs) and was used to train the Platypus2 models. It combines several source datasets, including PRM800K, ScienceQA, SciBench, ReClor, and TheoremQA. The sources were filtered using keyword search and Sentence Transformers to remove questions whose similarity exceeds 80%. In addition, roughly 200 questions that appear in Hugging Face benchmark test sets were removed. Each record has three string features: input, output, and instruction. The train split contains 24,926 examples and totals 30,418,784 bytes.

Provider:

botp

Original Information Summary

OpenPlatypus Dataset Overview

Dataset Configuration

Default configuration (default)
- Training data file path: data/train-*

Dataset Information

Features:
- input: string
- output: string
- instruction: string
Splits:
- train: 24,926 examples, 30,418,784 bytes
Download size: 15,545,530 bytes
Dataset size: 30,418,784 bytes
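
The figures above allow a quick sanity check on record shape and size. The following is a minimal sketch using only the numbers and feature names listed here; the `is_valid_record` helper is hypothetical and not part of any dataset tooling.

```python
# Figures copied from the dataset information above.
NUM_EXAMPLES = 24_926
DATASET_BYTES = 30_418_784
DOWNLOAD_BYTES = 15_545_530

# Feature names from the dataset information; all three are strings.
FEATURES = ("input", "output", "instruction")


def is_valid_record(record: dict) -> bool:
    """Hypothetical helper: True if the record has exactly the three string features."""
    return set(record) == set(FEATURES) and all(
        isinstance(record[name], str) for name in FEATURES
    )


# Average uncompressed size of one example, and how much smaller the download is.
avg_bytes_per_example = DATASET_BYTES / NUM_EXAMPLES
download_ratio = DOWNLOAD_BYTES / DATASET_BYTES

print(f"average size per example: {avg_bytes_per_example:.0f} bytes")
print(f"download size / dataset size: {download_ratio:.2f}")
```

On these figures an average example is a little over 1.2 KB, and the download is roughly half the size of the loaded dataset.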

Language

English (en)

Size Category

10K < n < 100K

Data Sources

The dataset combines the following source datasets, filtered using keyword search and Sentence Transformers to remove questions with similarity above 80%:
- PRM800K: MIT license
- ScienceQA: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license
- SciBench: MIT license
- ReClor: non-commercial license
- TheoremQA: MIT license
- nuprl/leetcode-solutions-python-testgen-gpt4: no license listed
- jondurbin/airoboros-gpt4-1.4.1: other license
- TigerResearch/tigerbot-kaggle-leetcodesolutions-en-2k: Apache-2.0 license
- openbookQA: Apache-2.0 license
- ARB: MIT license
- timdettmers/openassistant-guanaco: Apache-2.0 license
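
Because the sources carry different licenses, it can help to see which ones fall outside the permissive group. The sketch below copies the list above into a mapping and splits it in two. Treating only MIT and Apache-2.0 as permissive is an assumption made for illustration, and `split_by_license` is a hypothetical helper. The three features listed for this dataset do not include a source field, so the grouping applies to sources, not to individual records.

```python
# Source-to-license mapping, copied from the list above.
SOURCE_LICENSES = {
    "PRM800K": "MIT",
    "ScienceQA": "CC BY-NC-SA 4.0",
    "SciBench": "MIT",
    "ReClor": "non-commercial",
    "TheoremQA": "MIT",
    "nuprl/leetcode-solutions-python-testgen-gpt4": "none listed",
    "jondurbin/airoboros-gpt4-1.4.1": "other",
    "TigerResearch/tigerbot-kaggle-leetcodesolutions-en-2k": "Apache-2.0",
    "openbookQA": "Apache-2.0",
    "ARB": "MIT",
    "timdettmers/openassistant-guanaco": "Apache-2.0",
}

# Assumption for this sketch: only these two licenses count as permissive.
PERMISSIVE = {"MIT", "Apache-2.0"}


def split_by_license(mapping: dict) -> tuple[list, list]:
    """Hypothetical helper: return (permissive sources, sources needing review)."""
    permissive = [name for name, lic in mapping.items() if lic in PERMISSIVE]
    review = [name for name, lic in mapping.items() if lic not in PERMISSIVE]
    return permissive, review


permissive_sources, review_sources = split_by_license(SOURCE_LICENSES)
print("permissive:", permissive_sources)
print("needs review:", review_sources)
```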

Contamination Check

Roughly 200 questions that appear in Hugging Face benchmark test sets were removed.
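
The source does not say how benchmark overlap was detected. As one possible illustration, the sketch below drops any record whose instruction matches a benchmark question after simple text normalization. The exact-match rule, the `normalize` and `remove_contaminated` helpers, and the toy data are all assumptions, not the authors' procedure.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def remove_contaminated(records: list, benchmark_questions: list) -> list:
    """Keep only records whose normalized instruction is not a benchmark question."""
    blocked = {normalize(question) for question in benchmark_questions}
    return [r for r in records if normalize(r["instruction"]) not in blocked]


# Toy data for illustration only; these are not real dataset or benchmark entries.
records = [
    {"instruction": "What is 2 + 2?", "input": "", "output": "4"},
    {"instruction": "Name the largest planet.", "input": "", "output": "Jupiter"},
]
benchmark = ["what is 2+2"]

clean = remove_contaminated(records, benchmark)
print([r["instruction"] for r in clean])
```

Exact matching after normalization only catches near-verbatim copies; paraphrased benchmark questions would need a similarity-based check.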

Citation

@article{platypus2023,
  title={Platypus: Quick, Cheap, and Powerful Refinement of LLMs},
  author={Ariel N. Lee and Cole J. Hunter and Nataniel Ruiz},
  journal={arXiv preprint arXiv:2308.07317},
  year={2023}
}

@article{lightman2023lets,
  title={Let's Verify Step by Step},
  author={Lightman, Hunter and Kosaraju, Vineet and Burda, Yura and Edwards, Harri and Baker, Bowen and Lee, Teddy and Leike, Jan and Schulman, John and Sutskever, Ilya and Cobbe, Karl},
  journal={arXiv preprint arXiv:2305.20050},
  year={2023}
}

@inproceedings{lu2022learn,
  title={Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering},
  author={Lu, Pan and Mishra, Swaroop and Xia, Tony and Qiu, Liang and Chang, Kai-Wei and Zhu, Song-Chun and Tafjord, Oyvind and Clark, Peter and Kalyan, Ashwin},
  booktitle={The 36th Conference on Neural Information Processing Systems (NeurIPS)},
  year={2022}
}

@misc{wang2023scibench,
  title={SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models},
  author={Xiaoxuan Wang and Ziniu Hu and Pan Lu and Yanqiao Zhu and Jieyu Zhang and Satyen Subramaniam and Arjun R. Loomba and Shichang Zhang and Yizhou Sun and Wei Wang},
  year={2023},
  eprint={2307.10635},
  archivePrefix={arXiv}
}

@inproceedings{yu2020reclor,
  author={Yu, Weihao and Jiang, Zihang and Dong, Yanfei and Feng, Jiashi},
  title={ReClor: A Reading Comprehension Dataset Requiring Logical Reasoning},
  booktitle={International Conference on Learning Representations (ICLR)},
  month={April},
  year={2020}
}

@article{chen2023theoremqa,
  title={TheoremQA: A Theorem-driven Question Answering dataset},
  author={Chen, Wenhu and Yin, Ming and Ku, Max and Wan, Elaine and Ma, Xueguang and Xu, Jianyu and Xia, Tony and Wang, Xinyi and Lu, Pan},
  journal={arXiv preprint arXiv:2305.12524},
  year={2023}
}

@inproceedings{OpenBookQA2018,
  title={Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering},
  author={Todor Mihaylov and Peter Clark and Tushar Khot and Ashish Sabharwal},
  booktitle={EMNLP},
  year={2018}
}

@misc{sawada2023arb,
  title={ARB: Advanced Reasoning Benchmark for Large Language Models},
  author={Tomohiro Sawada and Daniel Paleka and Alexander Havrilla and Pranav Tadepalli and Paula Vidas and Alexander Kranias and John J. Nay and Kshitij Gupta and Aran Komatsuzaki},
  eprint={2307.13692},
  archivePrefix={arXiv},
  year={2023}
}

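Collected Summary

Dataset Introduction

Construction

Open-Platypus was built by combining and filtering several source datasets that cover science question answering, programming challenges, and logical reasoning. The sources were filtered using keyword search and Sentence Transformers, and questions with similarity above 80% were removed to keep the data high in quality and diverse.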
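
The similarity filter described above can be sketched as a greedy pass that keeps a question only if it is not too close to any question already kept. To stay self-contained, this sketch uses word-count vectors with cosine similarity as a stand-in; it is not a Sentence Transformers model, and the `embed`, `cosine`, and `deduplicate` helpers are hypothetical. Only the 0.8 threshold comes from the text.

```python
import math
import re
from collections import Counter

SIMILARITY_THRESHOLD = 0.8  # the 80% cutoff stated in the text


def embed(text: str) -> Counter:
    """Stand-in embedding: word counts (NOT a Sentence Transformers model)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def deduplicate(questions: list, threshold: float = SIMILARITY_THRESHOLD) -> list:
    """Greedy filter: drop a question if it exceeds the threshold against any kept one."""
    kept, kept_vectors = [], []
    for question in questions:
        vector = embed(question)
        if all(cosine(vector, other) <= threshold for other in kept_vectors):
            kept.append(question)
            kept_vectors.append(vector)
    return kept


# Toy questions for illustration only.
questions = [
    "What is the derivative of x squared?",
    "What is the derivative of x squared",
    "Which gas do plants absorb from the air?",
]
print(deduplicate(questions))
```

The greedy pass compares each question against every kept one, which is quadratic in the worst case; with real sentence embeddings the same loop applies, with the stand-in `embed` replaced by a model's encoder.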

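Features

The dataset focuses on improving the logical reasoning of large language models (LLMs) and has been used to train the Platypus2 models. It draws on several source datasets under different licenses, which were filtered and processed to reduce data contamination and support efficient, fair model training.

Usage

Users can download Open-Platypus directly from Hugging Face and train models on its train split. The code used to build and filter the dataset is available in the Platypus GitHub repository, which makes it possible to reproduce the dataset and extend the research.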
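
As a rough sketch of that workflow, the snippet below loads the train split with the Hugging Face `datasets` library and turns one record into a training prompt. `load_dataset` with a `split` argument is part of that library, but it needs the package installed and network access, so it is wrapped in a function that is not called here. The prompt layout in `format_example` is an assumption for illustration; the source does not specify a template.

```python
def load_open_platypus():
    """Load the train split. Requires `pip install datasets` and network access."""
    from datasets import load_dataset  # imported lazily so the rest runs without it

    return load_dataset("botp/Open-Platypus", split="train")


def format_example(record: dict) -> str:
    """Hypothetical prompt template built from the three string features."""
    parts = [f"### Instruction:\n{record['instruction']}"]
    if record["input"]:  # the input field may be empty
        parts.append(f"### Input:\n{record['input']}")
    parts.append(f"### Response:\n{record['output']}")
    return "\n\n".join(parts)


# Toy record for illustration only; not taken from the dataset.
sample = {"instruction": "Add 2 and 3.", "input": "", "output": "5"}
print(format_example(sample))
```

With the library available, `format_example(load_open_platypus()[0])` would format the first record of the train split.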

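Background and Challenges

Background Overview

The OpenPlatypus dataset, intended to improve the logical reasoning of large language models (LLMs), is the training basis for the Platypus2 models. It dates to 2023 and was led by researchers Ariel N. Lee, Cole J. Hunter, and Nataniel Ruiz. It brings together datasets from several domains, including PRM800K, ScienceQA, and SciBench, filtered using keyword search and Sentence Transformers to maintain data quality. The dataset has supported work on LLM logical reasoning by providing training data for related research.

Current Challenges

While building OpenPlatypus, the researchers faced data filtering and quality-control challenges, in particular removing highly similar questions to preserve diversity. Applying the dataset to domain problems such as logical reasoning also raises challenges of model comprehension and accuracy. To keep the dataset valid and fair, the team ran a contamination check and removed roughly 200 questions that appear in Hugging Face benchmark test sets.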

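Common Scenarios

Classic Use Cases

In AI research, and especially in the research and development of large language models, Open-Platypus is used to improve models' logical reasoning. It integrates science questions from multiple sources, filtered and processed, and serves as a training resource for models such as Platypus2.

Academic Problems Addressed

Open-Platypus addresses the limited coverage of logical reasoning in earlier training datasets and provides data for academic research in science question answering, code generation, and reading comprehension. It has been used to improve the performance of large language models on logical reasoning tasks.

Related Work

Open-Platypus is closely tied to a body of related work. It draws on datasets such as SciBench, ReClor, and TheoremQA, which are its sources and not its derivatives, and it served as the training data for the Platypus2 models.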

The content above was collected and summarized by 遇见数据集.
