erhwenkuo/openorca-chinese-zhtw
收藏数据集概述
数据集信息
- 特征字段:
id: 字符串类型,唯一编号标识,包含 niv, t0, cot, 或 flan 以表示来源 FLAN Collection 子集。system_prompt: 字符串类型,向 GPT-3.5 或 GPT-4 API 展示的系统提示。question: 字符串类型,来自 FLAN Collection 的问题条目。response: 字符串类型,通过查询 GPT-3.5 或 GPT-4 获得的回答。
- 数据分割:
train: 包含 4233915 条数据,总大小为 6491661288 字节。
- 下载大小: 4106469779 字节
- 数据集大小: 6491661288 字节
- 语言: 中文(繁体)
- 许可证: MIT
- 任务类别:
- 对话
- 文本分类
- 标记分类
- 表格问答
- 问答
- 零样本分类
- 摘要
- 特征提取
- 文本生成
- 文本到文本生成
- 数据集名称: openorca-chinese-zhtw
- 数据集大小类别: 10M<n<100M
数据集创建
- 创建理由: 为研究人员和开发者提供增强的文本数据源,主要用于提升 FLAN Collection 数据的详细步骤推理能力。
- 源数据: 使用与 Orca 论文中描述的分布一致的技术生成数据,但存在一些差异,如 FLAN Collection 中 CoT 数据不足等。
数据集使用
- 使用案例: 适用于语言理解、自然语言处理、机器学习模型训练和模型性能评估。
- 使用注意事项: 由于数据集仍在进行中,建议定期检查更新和改进,并遵循 Orca 论文中的指南和建议。
引用
-
OpenOrca: bibtex @misc{OpenOrca, title = {OpenOrca: An Open Dataset of GPT Augmented FLAN Reasoning Traces}, author = {Wing Lian and Bleys Goodson and Eugene Pentland and Austin Cook and Chanvichet Vong and "Teknium"}, year = {2023}, publisher = {HuggingFace}, journal = {HuggingFace repository}, howpublished = {url{https://https://huggingface.co/Open-Orca}, }
-
Orca: bibtex @misc{mukherjee2023orca, title={Orca: Progressive Learning from Complex Explanation Traces of GPT-4}, author={Subhabrata Mukherjee and Arindam Mitra and Ganesh Jawahar and Sahaj Agarwal and Hamid Palangi and Ahmed Awadallah}, year={2023}, eprint={2306.02707}, archivePrefix={arXiv}, primaryClass={cs.CL} }
-
FLAN Collection: bibtex @misc{longpre2023flan, title={The Flan Collection: Designing Data and Methods for Effective Instruction Tuning}, author={Shayne Longpre and Le Hou and Tu Vu and Albert Webson and Hyung Won Chung and Yi Tay and Denny Zhou and Quoc V. Le and Barret Zoph and Jason Wei and Adam Roberts}, year={2023}, eprint={2301.13688}, archivePrefix={arXiv}, primaryClass={cs.AI} }
-
LLaMA 2: bibtex @misc{touvron2023llama, title={Llama 2: Open Foundation and Fine-Tuned Chat Models}, author={Hugo Touvron and Louis Martin and Kevin Stone and Peter Albert and Amjad Almahairi and Yasmine Babaei and Nikolay Bashlykov and Soumya Batra and Prajjwal Bhargava and Shruti Bhosale and Dan Bikel and Lukas Blecher and Cristian Canton Ferrer and Moya Chen and Guillem Cucurull and David Esiobu and Jude Fernandes and Jeremy Fu and Wenyin Fu and Brian Fuller and Cynthia Gao and Vedanuj Goswami and Naman Goyal and Anthony Hartshorn and Saghar Hosseini and Rui Hou and Hakan Inan and Marcin Kardas and Viktor Kerkez and Madian Khabsa and Isabel Kloumann and Artem Korenev and Punit Singh Koura and Marie-Anne Lachaux and Thibaut Lavril and Jenya Lee and Diana Liskovich and Yinghai Lu and Yuning Mao and Xavier Martinet and Todor Mihaylov and Pushkar Mishra and Igor Molybog and Yixin Nie and Andrew Poulton and Jeremy Reizenstein and Rashi Rungta and Kalyan Saladi and Alan Schelten and Ruan Silva and Eric Michael Smith and Ranjan Subramanian and Xiaoqing Ellen Tan and Binh Tang and Ross Taylor and Adina Williams and Jian Xiang Kuan and Puxin Xu and Zheng Yan and Iliyan Zarov and Yuchen Zhang and Angela Fan and Melanie Kambadur and Sharan Narang and Aurelien Rodriguez and Robert Stojnic and Sergey Edunov and Thomas Scialom}, year={2023}, eprint= arXiv 2307.09288 } @software{touvron2023llama, title={LLaMA: Open and Efficient Foundation Language Models}, author={Touvron, Hugo and Lavril, Thibaut and Izacard, Gautier and Martinet, Xavier and Lachaux, Marie-Anne and Lacroix, Timoth{e}e and Rozi{`e}re, Baptiste and Goyal, Naman and Hambro, Eric and Azhar, Faisal and Rodriguez, Aurelien and Joulin, Armand and Grave, Edouard and Lample, Guillaume}, journal={arXiv preprint arXiv:2302.13971}, year={2023} }



