Open-Platypus

ModelScope Community · Updated 2026-05-15 · Indexed 2024-05-15
Download link:
https://modelscope.cn/datasets/AI-ModelScope/Open-Platypus
Resource description:
# Open-Platypus

This dataset focuses on improving LLM logical reasoning skills and was used to train the Platypus2 models. It comprises the following datasets, which were filtered first by keyword search and then with Sentence Transformers to remove questions with a similarity above 80%:

| Dataset Name | License Type |
|---|---|
| [PRM800K](https://github.com/openai/prm800k) | MIT |
| [ScienceQA](https://github.com/lupantech/ScienceQA) | [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International](https://creativecommons.org/licenses/by-nc-sa/4.0/) |
| [SciBench](https://github.com/mandyyyyii/scibench) | MIT |
| [ReClor](https://whyu.me/reclor/) | Non-commercial |
| [TheoremQA](https://huggingface.co/datasets/wenhu/TheoremQA) | MIT |
| [`nuprl/leetcode-solutions-python-testgen-gpt4`](https://huggingface.co/datasets/nuprl/leetcode-solutions-python-testgen-gpt4/viewer/nuprl--leetcode-solutions-python-testgen-gpt4/train?p=1) | None listed |
| [`jondurbin/airoboros-gpt4-1.4.1`](https://huggingface.co/datasets/jondurbin/airoboros-gpt4-1.4.1) | other |
| [`TigerResearch/tigerbot-kaggle-leetcodesolutions-en-2k`](https://huggingface.co/datasets/TigerResearch/tigerbot-kaggle-leetcodesolutions-en-2k/viewer/TigerResearch--tigerbot-kaggle-leetcodesolutions-en-2k/train?p=2) | apache-2.0 |
| [openbookQA](https://huggingface.co/datasets/openbookqa/viewer/additional/train?row=35) | apache-2.0 |
| [ARB](https://arb.duckai.org) | MIT |
| [`timdettmers/openassistant-guanaco`](https://huggingface.co/datasets/timdettmers/openassistant-guanaco) | apache-2.0 |

## Data Contamination Check

We've removed approximately 200 questions that appear in the Hugging Face benchmark test sets. Please see our [paper](https://arxiv.org/abs/2308.07317) and [project webpage](https://platypus-llm.github.io) for additional information.
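The similarity filter mentioned at the top of this card (dropping questions whose similarity to another question exceeds 80%) can be illustrated with a minimal sketch. This is not the authors' filtering code, which lives in the Platypus GitHub repo. It assumes a greedy keep-first strategy and swaps the Sentence Transformers embedding for a toy bag-of-words vector so that it runs with the standard library only. The names `embed`, `cosine`, and `drop_near_duplicates` are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a sentence embedding (assumption): bag-of-words counts.

    A real pipeline would use a Sentence Transformers model here instead.
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def drop_near_duplicates(questions, threshold=0.8):
    """Greedily keep a question only if its similarity to every
    previously kept question is at most `threshold`."""
    kept, kept_vectors = [], []
    for question in questions:
        vector = embed(question)
        if all(cosine(vector, other) <= threshold for other in kept_vectors):
            kept.append(question)
            kept_vectors.append(vector)
    return kept


questions = [
    "What is the derivative of x squared?",
    "What is the derivative of x squared ?",  # near-duplicate of the first
    "Name the capital of France.",
]
# Keeps the first and third question; the second exceeds the 0.8 threshold.
print(drop_near_duplicates(questions))
```

The greedy pass compares each question against everything already kept, so it is quadratic in the number of questions. That is acceptable for a sketch, though a large corpus would likely need batched embeddings or an approximate nearest-neighbour index.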
## Model Info

Please see models at [`garage-bAInd`](https://huggingface.co/garage-bAInd).

## Training and filtering code

Please see the [Platypus GitHub repo](https://github.com/arielnlee/Platypus).

## Citations

```bibtex
@article{platypus2023,
  title={Platypus: Quick, Cheap, and Powerful Refinement of LLMs},
  author={Ariel N. Lee and Cole J. Hunter and Nataniel Ruiz},
  journal={arXiv preprint arXiv:2308.07317},
  year={2023}
}
```

```bibtex
@article{lightman2023lets,
  title={Let's Verify Step by Step},
  author={Lightman, Hunter and Kosaraju, Vineet and Burda, Yura and Edwards, Harri and Baker, Bowen and Lee, Teddy and Leike, Jan and Schulman, John and Sutskever, Ilya and Cobbe, Karl},
  journal={arXiv preprint arXiv:2305.20050},
  year={2023}
}
```

```bibtex
@inproceedings{lu2022learn,
  title={Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering},
  author={Lu, Pan and Mishra, Swaroop and Xia, Tony and Qiu, Liang and Chang, Kai-Wei and Zhu, Song-Chun and Tafjord, Oyvind and Clark, Peter and Kalyan, Ashwin},
  booktitle={The 36th Conference on Neural Information Processing Systems (NeurIPS)},
  year={2022}
}
```

```bibtex
@misc{wang2023scibench,
  title={SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models},
  author={Xiaoxuan Wang and Ziniu Hu and Pan Lu and Yanqiao Zhu and Jieyu Zhang and Satyen Subramaniam and Arjun R. Loomba and Shichang Zhang and Yizhou Sun and Wei Wang},
  year={2023},
  eprint={2307.10635},
  archivePrefix={arXiv}
}
```

```bibtex
@inproceedings{yu2020reclor,
  author={Yu, Weihao and Jiang, Zihang and Dong, Yanfei and Feng, Jiashi},
  title={ReClor: A Reading Comprehension Dataset Requiring Logical Reasoning},
  booktitle={International Conference on Learning Representations (ICLR)},
  month={April},
  year={2020}
}
```

```bibtex
@article{chen2023theoremqa,
  title={TheoremQA: A Theorem-driven Question Answering dataset},
  author={Chen, Wenhu and Yin, Ming and Ku, Max and Wan, Elaine and Ma, Xueguang and Xu, Jianyu and Xia, Tony and Wang, Xinyi and Lu, Pan},
  journal={arXiv preprint arXiv:2305.12524},
  year={2023}
}
```

```bibtex
@inproceedings{OpenBookQA2018,
  title={Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering},
  author={Todor Mihaylov and Peter Clark and Tushar Khot and Ashish Sabharwal},
  booktitle={EMNLP},
  year={2018}
}
```

```bibtex
@misc{sawada2023arb,
  title={ARB: Advanced Reasoning Benchmark for Large Language Models},
  author={Tomohiro Sawada and Daniel Paleka and Alexander Havrilla and Pranav Tadepalli and Paula Vidas and Alexander Kranias and John J. Nay and Kshitij Gupta and Aran Komatsuzaki},
  eprint={2307.13692},
  archivePrefix={arXiv},
  year={2023}
}
```

# Open-Platypus

This dataset aims to improve the logical reasoning ability of large language models (LLMs) and was used to train the Platypus2 family of models. It combines the datasets below. Questions were first screened by keyword search, and Sentence Transformers was then used to remove question samples with a similarity above 80%:

| Dataset Name | License Type |
|---|---|
| [PRM800K](https://github.com/openai/prm800k) | MIT |
| [MATH](https://github.com/hendrycks/math) | MIT |
| [ScienceQA](https://github.com/lupantech/ScienceQA) | Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) |
| [SciBench](https://github.com/mandyyyyii/scibench) | MIT |
| [ReClor](https://whyu.me/reclor/) | Non-commercial |
| [TheoremQA](https://huggingface.co/datasets/wenhu/TheoremQA) | MIT |
| [`nuprl/leetcode-solutions-python-testgen-gpt4`](https://huggingface.co/datasets/nuprl/leetcode-solutions-python-testgen-gpt4/viewer/nuprl--leetcode-solutions-python-testgen-gpt4/train?p=1) | None listed |
| [`jondurbin/airoboros-gpt4-1.4.1`](https://huggingface.co/datasets/jondurbin/airoboros-gpt4-1.4.1) | other |
| [`TigerResearch/tigerbot-kaggle-leetcodesolutions-en-2k`](https://huggingface.co/datasets/TigerResearch/tigerbot-kaggle-leetcodesolutions-en-2k/viewer/TigerResearch--tigerbot-kaggle-leetcodesolutions-en-2k/train?p=2) | apache-2.0 |
| [ARB](https://arb.duckai.org) | Creative Commons Attribution 4.0 International (CC BY 4.0) |
| [`timdettmers/openassistant-guanaco`](https://huggingface.co/datasets/timdettmers/openassistant-guanaco) | apache-2.0 |

## Data Contamination Check

We have removed approximately 200 question samples that appear in the Hugging Face benchmark test sets. For more details, see our [paper](https://arxiv.org/abs/2308.07317) and [project webpage](https://platypus-llm.github.io).

## Model Info

The related models are available at [`garage-bAInd`](https://huggingface.co/garage-bAInd).

## Training and filtering code

See the [Platypus GitHub repo](https://github.com/arielnlee/Platypus) for the training and filtering code.

## Citations

```bibtex
@article{platypus2023,
  title={Platypus: Quick, Cheap, and Powerful Refinement of LLMs},
  author={Ariel N. Lee and Cole J. Hunter and Nataniel Ruiz},
  journal={arXiv preprint arXiv:2308.07317},
  year={2023}
}
```

```bibtex
@article{lightman2023lets,
  title={Let's Verify Step by Step},
  author={Lightman, Hunter and Kosaraju, Vineet and Burda, Yura and Edwards, Harri and Baker, Bowen and Lee, Teddy and Leike, Jan and Schulman, John and Sutskever, Ilya and Cobbe, Karl},
  journal={arXiv preprint arXiv:2305.20050},
  year={2023}
}
```

```bibtex
@inproceedings{lu2022learn,
  title={Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering},
  author={Lu, Pan and Mishra, Swaroop and Xia, Tony and Qiu, Liang and Chang, Kai-Wei and Zhu, Song-Chun and Tafjord, Oyvind and Clark, Peter and Kalyan, Ashwin},
  booktitle={The 36th Conference on Neural Information Processing Systems (NeurIPS 2022)},
  year={2022}
}
```

```bibtex
@misc{wang2023scibench,
  title={SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models},
  author={Xiaoxuan Wang and Ziniu Hu and Pan Lu and Yanqiao Zhu and Jieyu Zhang and Satyen Subramaniam and Arjun R. Loomba and Shichang Zhang and Yizhou Sun and Wei Wang},
  year={2023},
  eprint={2307.10635},
  archivePrefix={arXiv}
}
```

```bibtex
@inproceedings{yu2020reclor,
  author={Yu, Weihao and Jiang, Zihang and Dong, Yanfei and Feng, Jiashi},
  title={ReClor: A Reading Comprehension Dataset Requiring Logical Reasoning},
  booktitle={International Conference on Learning Representations (ICLR 2020)},
  month={April},
  year={2020}
}
```

```bibtex
@article{chen2023theoremqa,
  title={TheoremQA: A Theorem-driven Question Answering dataset},
  author={Chen, Wenhu and Yin, Ming and Ku, Max and Wan, Elaine and Ma, Xueguang and Xu, Jianyu and Xia, Tony and Wang, Xinyi and Lu, Pan},
  journal={arXiv preprint arXiv:2305.12524},
  year={2023}
}
```

```bibtex
@article{hendrycksmath2021,
  title={Measuring Mathematical Problem Solving With the MATH Dataset},
  author={Dan Hendrycks and Collin Burns and Saurav Kadavath and Akul Arora and Steven Basart and Eric Tang and Dawn Song and Jacob Steinhardt},
  journal={NeurIPS},
  year={2021}
}
```

```bibtex
@misc{sawada2023arb,
  title={ARB: Advanced Reasoning Benchmark for Large Language Models},
  author={Tomohiro Sawada and Daniel Paleka and Alexander Havrilla and Pranav Tadepalli and Paula Vidas and Alexander Kranias and John J. Nay and Kshitij Gupta and Aran Komatsuzaki},
  eprint={2307.13692},
  archivePrefix={arXiv},
  year={2023}
}
```
Provider:
maas
Created:
2023-12-04