
cfahlgren1/Capybara-Converted

Hugging Face | Updated 2024-01-02 | Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/cfahlgren1/Capybara-Converted
Resource description:
---
license: apache-2.0
task_categories:
- conversational
- question-answering
- text-generation
language:
- en
tags:
- Physics
- Biology
- Math
- Chemistry
- Culture
- Logic
- Roleplay
pretty_name: LessWrong-Amplify-Instruct
size_categories:
- 10K<n<100K
---

## This is the Official Capybara dataset. Over 10,000 multi-turn examples.

Capybara is the culmination of insights derived from synthesis techniques like Evol-instruct (used for WizardLM), Alpaca, Orca, Vicuna, Lamini, FLASK and others.
The single-turn seeds used to initiate the Amplify-Instruct synthesis of conversations are mostly based on datasets that I've personally vetted extensively and that are often highly regarded for their diversity and demonstration of logical robustness and prose, such as Airoboros, Know logic, EverythingLM and GPTeacher. They also include entirely new seed instructions derived from different sources, including certain in-house multi-turn datasets like Dove and Verified-Camel (a successor to Puffin).

The multi-turn synthetic conversation generation method is what I'm calling Amplify-Instruct, and the first resulting dataset using this method is called Capybara.
This dataset has a strong focus on information diversity across a wide range of domains, and on multi-turn conversations that strongly emphasize reasoning, logic and extrapolation about a wide range of subjects. It also contains many great examples of conversations delving into obscure sub-topics and rabbit holes across pop-culture and STEM, while maintaining natural prose.

While performing great in its current state, the current dataset used for fine-tuning is entirely contained within 20K training examples. This is 10 times smaller than many similarly performing datasets, which is significant when it comes to scaling implications once I decide to scale the use of Amplify-Instruct to significantly more examples.

- Most tokens contained in this dataset are newly synthesized and did not previously exist online.
- This leverages the Amplify-Instruct method (paper coming soon) to grow thousands of high-quality single-turn seeds into advanced and in-depth multi-turn conversations.
- Average context length per conversation is over 1,000 tokens, with 3 turns or more per example (most instruction/chat datasets on HF for fine-tuning are only 1 turn).
- Each conversation is optimized to amplify the natural raw knowledge capabilities of the model, as well as to delve deep into obscure and advanced topics.
- Aggressively filtered to remove any and all possible examples of overt moralizing/alignment, and common undesirable behaviours such as "as an AI language model", "September 2021" and "I don't have personal beliefs".

## Benchmarks.

- Benchmark results are available on the HF Leaderboard, and other benchmarks such as AGIEval, Bigbench and GPT4All were run as well.
- (The only Capybara model available on all of these benchmarks, including the HF Leaderboard, is Capybara V1, trained on Llama-2.)
- The benchmarks below are compared against fine-tunes also done on Llama-2.

![Capybara](https://i.imgur.com/OpajtNJ.jpeg)

![Capybara](https://i.imgur.com/daIZn6n.jpeg)

## Quality filtering and cleaning.

- Extensive measures were taken to filter out any conversations that contained even a single instance of overt AI moralizing/alignment, such as "As an AI language model", and common undesirable behaviours, such as conversations that include "September 2021", "I don't have personal beliefs" and other phrases I've found to be highly correlated with undesirable responses and conversation paths.

## Thank you to those of you that have indirectly contributed!

While most of the tokens within Capybara are newly synthesized and part of datasets like Puffin/Dove, we would like to credit the single-turn datasets we leveraged as seeds, which were used to generate the multi-turn data.
The datasets shown in green below are datasets that we sampled from to curate the seeds used during Amplify-Instruct synthesis for this project. However, most of the tokens in Capybara within those sections are novel tokens not present in any of the seed datasets.

Datasets in blue are in-house curations that existed prior to Capybara and were used as seeds for Capybara.

![Capybara](https://i.imgur.com/yB58OoD.jpeg)

## Dataset contamination.

We have checked the Capybara dataset for contamination against several of the most popular benchmarks and can confirm that no contamination was found besides MT-bench, which has now been cleaned out.

We leveraged minhash to check for 100%, 99%, 98% and 97% similarity matches between our data and the questions and answers in the benchmarks. We found no exact matches, nor did we find any matches down to the 97% similarity level.

The following are the benchmarks we checked for contamination against our dataset:

- HumanEval
- AGIEval
- TruthfulQA
- MMLU
- GPT4All

*Newly cleaned out as of 12/15/2023 - MT-bench

## Credits

During the curation process, there can be some relatively arduous steps when it comes to actually executing on the best experiments or concepts for how to filter examples out.

Luckily there are folks over at Nous Research who helped with expediting these processes. A big thank you to J-Supha specifically for making these types of significant contributions.

## Example Outputs from the Llama-2 7B model trained on this dataset:

![Capybara](https://img001.prntscr.com/file/img001/T9yYxR1xQSaK_UGdy3t2Cw.png)

![Capybara](https://img001.prntscr.com/file/img001/DQXqmKbsQQOIcgny1eoGNA.png)

![Capybara](https://img001.prntscr.com/file/img001/85X3L9ZxTsOKo3fUQ7GRVA.png)

## Future Plans & How you can help!

This is a relatively early build amongst the grand plans for the future of what I plan to work on!
In the near future we plan on leveraging the help of domain-specific expert volunteers to eliminate any mathematically/verifiably incorrect answers from training curations of different types of datasets.

If you have at least a bachelor's in mathematics, physics, biology or chemistry and would like to volunteer even just 30 minutes of your expertise, please contact LDJ on Discord!

Citation:

```
@article{daniele2023amplify-instruct,
  title={Amplify-Instruct: Synthetically Generated Diverse Multi-turn Conversations for Effecient LLM Training.},
  author={Daniele, Luigi and Suphavadeeprasit},
  journal={arXiv preprint arXiv:(coming soon)},
  url={https://huggingface.co/datasets/LDJnr/Capybara},
  year={2023}
}
```
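
The card describes dropping any conversation in which even a single turn contains one of a set of undesirable phrases, but it does not publish the filter itself. The following is only a minimal sketch of that idea. The blocklist holds just the three phrases quoted in the card, and the conversation layout (a list of turns with `input` and `output` keys) and the function names are assumptions made for illustration.

```python
# Hypothetical blocklist: only the three phrases quoted in the card.
# The full list used for Capybara is not given in the source.
BLOCKED_PHRASES = (
    "as an ai language model",
    "september 2021",
    "i don't have personal beliefs",
)


def is_clean(conversation):
    """Return True if no turn contains a blocked phrase (case-insensitive).

    Assumes each turn is a dict with optional "input" and "output" strings.
    """
    for turn in conversation:
        text = (turn.get("input", "") + " " + turn.get("output", "")).lower()
        if any(phrase in text for phrase in BLOCKED_PHRASES):
            return False
    return True


def filter_conversations(conversations):
    """Drop every conversation in which any single turn trips the blocklist."""
    return [conv for conv in conversations if is_clean(conv)]
```

A single matching turn removes the whole conversation rather than just that turn, which mirrors the "even a single instance" wording of the card.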
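Provider:
cfahlgren1
Summary of original information

Dataset overview

Basic information

  • License: Apache-2.0
  • Task categories:
    • Conversational
    • Question answering
    • Text generation
  • Language: English
  • Tags:
    • Physics
    • Biology
    • Math
    • Chemistry
    • Culture
    • Logic
    • Roleplay
  • Dataset name: LessWrong-Amplify-Instruct
  • Dataset size: 10K<n<100K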

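Dataset description

  • Dataset origin: built on combined insights from several synthesis techniques (such as Evol-instruct, Alpaca and Orca).
  • Seed data: mostly based on extensively vetted datasets such as Airoboros, Know logic and EverythingLM.
  • Synthesis method: a method called Amplify-Instruct, which grows high-quality single-turn seeds into in-depth multi-turn conversations.
  • Data characteristics:
    • Emphasizes reasoning, logic and extrapolation across a wide range of subjects.
    • Contains many conversations that delve into obscure sub-topics across pop culture and STEM.
    • On average, each conversation has over 1,000 tokens and 3 or more turns.
  • Quality control:
    • Aggressively filtered to remove any possible examples of moralizing/alignment and common undesirable behaviours.
    • Uses minhash to check similarity against several benchmarks to confirm there is no contamination.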
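
The minhash check mentioned above can be sketched as follows. This is not the authors' pipeline, which the source does not describe beyond the similarity levels it checked. It is a small pure-Python illustration that assumes word 3-gram shingles, 128 salted BLAKE2b hash functions and a 0.97 threshold, and all function names are hypothetical.

```python
import hashlib

NUM_PERM = 128  # number of hash functions; an illustrative choice


def shingles(text, n=3):
    """Return the set of word n-grams of a lowercased text."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash_signature(text, num_perm=NUM_PERM):
    """For each salted hash function, keep the minimum hash over all shingles."""
    grams = shingles(text)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for g in grams
        ))
    return signature


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching slots, an estimate of the Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def find_contaminated(samples, benchmark_items, threshold=0.97):
    """Return indices of samples whose estimated similarity to any benchmark
    item is at or above the threshold."""
    bench_sigs = [minhash_signature(b) for b in benchmark_items]
    flagged = []
    for i, sample in enumerate(samples):
        sig = minhash_signature(sample)
        if any(estimated_similarity(sig, b) >= threshold for b in bench_sigs):
            flagged.append(i)
    return flagged
```

Comparing every sample against every benchmark item is quadratic, so at dataset scale a locality-sensitive hashing index over the signatures would normally replace the inner loop. The brute-force form is kept here only for readability.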

Future plans

  • Plans to use domain-expert volunteers to eliminate mathematically/verifiably incorrect answers from the training data.

