LDJnr/Capybara
Source: Hugging Face. Updated 2024-06-07; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/LDJnr/Capybara
Resource description:
---
license: apache-2.0
task_categories:
- conversational
- question-answering
- text-generation
language:
- en
tags:
- Physics
- Biology
- Math
- Chemistry
- Culture
- Logic
- Roleplay
pretty_name: LessWrong-Amplify-Instruct
size_categories:
- 10K<n<100K
---
## This is the Official Capybara dataset. Over 10,000 multi-turn examples.
Capybara is the culmination of insights derived from synthesis techniques like Evol-instruct (used for WizardLM), Alpaca, Orca, Vicuna, Lamini, FLASK and others.
The single-turn seeds used to initiate the Amplify-Instruct synthesis of conversations are mostly based on datasets that I've personally vetted extensively and that are often highly regarded for their diversity and their demonstration of logical robustness and prose, such as Airoboros, Know logic, EverythingLM and GPTeacher. They also include entirely new seed instructions derived from different sources, including certain in-house multi-turn datasets like Dove and Verified-Camel (a successor to Puffin).
The multi-turn synthetic conversation generation method is what I'm calling Amplify-Instruct, and the first resulting dataset using this method is called Capybara.
This dataset has a strong focus on information diversity across a wide range of domains and on multi-turn conversations that strongly emphasize reasoning, logic and extrapolation about a wide range of subjects. It also contains many great examples of conversations delving into obscure sub-topics and rabbit holes across pop-culture and STEM, while maintaining natural prose.
While performing great in its current state, the current dataset used for fine-tuning is entirely contained within 20K training examples, which is 10 times smaller than many similarly performing datasets. This is significant when it comes to scaling implications once I decide to scale the use of Amplify-Instruct to significantly more examples.
- Most tokens contained in this dataset are newly synthesized and did not previously exist online.
- This leverages the Amplify-Instruct method (paper coming soon) to grow thousands of high-quality single-turn seeds into advanced and in-depth multi-turn conversations.
- Average context length per conversation is over 1,000 tokens and 3 turns or more per example (most instruction/chat datasets on HF for fine-tuning are only 1 turn)
- Each conversation is optimized to amplify the natural raw knowledge capabilities of the model, as well as delving deep into obscure and advanced topics.
- Aggressively filtered to remove any and all possible examples of overt moralizing/alignment, and common undesirable behaviours such as "as an AI language model", "September 2021" and "I don't have personal beliefs".
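
The turn and context-length figures above can be checked with a few lines of Python. This is a minimal sketch, not the author's tooling. It assumes a hypothetical record layout in which each conversation is a list of turns with `input` and `output` strings, and it uses whitespace-separated words as a rough stand-in for tokens, so its numbers will differ from a real tokenizer's.

```python
from statistics import mean

# Assumed (hypothetical) layout: a conversation is a list of turns, and each
# turn is a dict with "input" and "output" strings. The real schema may differ.
def conversation_stats(conversations):
    """Return (average turns, average whitespace-word count) per conversation."""
    turn_counts = [len(conv) for conv in conversations]
    word_counts = [
        sum(len(turn["input"].split()) + len(turn["output"].split()) for turn in conv)
        for conv in conversations
    ]
    return mean(turn_counts), mean(word_counts)

# Tiny made-up sample, only to show the call shape.
sample = [
    [
        {"input": "What is entropy?", "output": "A measure of disorder."},
        {"input": "Give an example.", "output": "Ice melting in a warm room."},
        {"input": "Why does it increase?", "output": "More microstates become accessible."},
    ],
    [
        {"input": "Define a prime number.", "output": "An integer above one with two divisors."},
    ],
]
avg_turns, avg_words = conversation_stats(sample)
```

Swapping the word count for a model-specific tokenizer would give figures comparable to the 1,000-token average quoted above.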
## Benchmarks.
- Resulting benchmarks are available on the HF Leaderboard, and other benchmarks have been run as well, such as AGIEval, Bigbench and GPT4All.
- (The only Capybara model available on all of these benchmarks, including the HF Leaderboard, is Capybara V1, trained on Llama-2.)
- The below benchmarks are compared against fine-tunes also done on Llama-2.


## Quality filtering and cleaning.
- Extensive measures were taken to filter out any conversations that contained even a single instance of overt AI moralizing/alignment, such as "As an AI language model", and common undesirable behaviours, such as conversations that include "September 2021", "I don't have personal beliefs" and other phrases I've found to be highly correlated with undesirable responses and conversation paths.
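
A phrase-based filter of this kind could look roughly like the following. It is a sketch under stated assumptions, not the filtering code actually used for Capybara: the blocked list contains only the three phrases quoted above, the `input`/`output` turn layout is hypothetical, and matching is a plain case-insensitive substring check.

```python
# Only the three phrases quoted in the text; the full list used for
# Capybara is not published here.
BLOCKED_PHRASES = (
    "as an ai language model",
    "september 2021",
    "i don't have personal beliefs",
)

def is_clean(conversation, blocked=BLOCKED_PHRASES):
    """Return False if any turn contains a blocked phrase (case-insensitive)."""
    for turn in conversation:
        # Hypothetical turn layout: {"input": str, "output": str}.
        text = f'{turn["input"]} {turn["output"]}'.lower()
        if any(phrase in text for phrase in blocked):
            return False
    return True

def filter_conversations(conversations, blocked=BLOCKED_PHRASES):
    """Drop every conversation with even a single blocked phrase."""
    return [conv for conv in conversations if is_clean(conv, blocked)]

examples = [
    [{"input": "Explain tides.", "output": "Tides are driven by the Moon's gravity."}],
    [{"input": "Who won in 2022?", "output": "My knowledge ends in September 2021."}],
    [{"input": "Is this ethical?", "output": "As an AI language model, I cannot say."}],
]
kept = filter_conversations(examples)
```

Dropping the whole conversation, rather than only the offending turn, matches the "even a single instance" rule described above.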
## Thank you to those of you that have indirectly contributed!
While most of the tokens within Capybara are newly synthesized and part of datasets like Puffin/Dove, we would like to credit the single-turn datasets we leveraged as seeds, which were used to generate the multi-turn data.
The datasets shown in green below are datasets that we sampled from to curate seeds used during Amplify-Instruct synthesis for this project. However, most of the tokens in Capybara within those sections are novel tokens not present in any of the seed datasets.
Datasets in blue are in-house curations that existed prior to Capybara and are now used as seeds for Capybara.

## Dataset contamination.
We have checked the Capybara dataset for contamination against several of the most popular benchmarks and can confirm that no contamination was found besides MT-bench, which has now been cleaned out.
We leveraged minhash to check for 100%, 99%, 98% and 97% similarity matches between our data and the questions and answers in benchmarks. We found no exact matches, nor did we find any matches down to the 97% similarity level.
The following are benchmarks we checked for contamination against our dataset:
- HumanEval
- AGIEval
- TruthfulQA
- MMLU
- GPT4All
- MT-bench (newly cleaned out as of 12/15/2023)
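
The MinHash check described above can be sketched with the standard library alone. The card names the technique but not its implementation, so everything specific here is an assumption: 5-word shingles, 128 hash functions built from salted BLAKE2b, and the helper names themselves. Only the 0.97 threshold comes from the text, where 97% is the lowest similarity level checked.

```python
import hashlib

def shingles(text, k=5):
    """Set of k-word shingles; k=5 is an assumed, illustrative choice."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash_signature(text, num_perm=128, k=5):
    """MinHash signature: the minimum 64-bit hash per salted hash function."""
    items = [s.encode("utf-8") for s in shingles(text, k)]
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt per hash function
        signature.append(min(
            int.from_bytes(hashlib.blake2b(item, digest_size=8, salt=salt).digest(), "big")
            for item in items
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def find_contaminated(dataset_texts, benchmark_texts, threshold=0.97):
    """Indices of dataset texts whose estimated similarity to any benchmark text meets the threshold."""
    bench_sigs = [minhash_signature(t) for t in benchmark_texts]
    flagged = []
    for i, text in enumerate(dataset_texts):
        sig = minhash_signature(text)
        if any(estimated_jaccard(sig, b) >= threshold for b in bench_sigs):
            flagged.append(i)
    return flagged

benchmark = ["the quick brown fox jumps over the lazy dog near the river"]
data = [
    "the quick brown fox jumps over the lazy dog near the river",
    "completely unrelated sentence about dataset curation and filtering methods today",
]
flagged = find_contaminated(data, benchmark)
```

This pairwise comparison is quadratic in the number of texts. At dataset scale, an LSH index over the signatures would be the usual way to avoid comparing every pair.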
## Credits:
During the curation process, there can be some relatively arduous steps when it comes to actually executing on the best experimentation or concepts for how to filter examples out.
Luckily there are folks over at Nous Research who helped expedite these processes. A big thank you to J-Supha specifically for making these types of significant contributions.
## Example Outputs from the Llama-2 7B model trained on this dataset:



## Future Plans & How you can help
This is a relatively early build within the grander plans for what I intend to work on!
In the near future we plan on leveraging the help of domain-specific expert volunteers to eliminate any mathematically/verifiably incorrect answers from training curations of different types of datasets.
If you have at least a bachelor's in mathematics, physics, biology or chemistry and would like to volunteer even just 30 minutes of your expertise, please contact LDJ on Discord!
Citation:
```
@article{daniele2023amplify-instruct,
title={Amplify-Instruct: Synthetically Generated Diverse Multi-turn Conversations for efficient LLM Training.},
author={Daniele, Luigi and Suphavadeeprasit},
journal={arXiv preprint arXiv:(coming soon)},
url={https://huggingface.co/datasets/LDJnr/Capybara},
year={2023}
}
```
Provider:
LDJnr
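Summary of original information
Dataset overview
Basic information
- License: Apache-2.0
- Task categories:
  - Conversational
  - Question answering
  - Text generation
- Language: English
- Tags:
  - Physics
  - Biology
  - Math
  - Chemistry
  - Culture
  - Logic
  - Roleplay
- Pretty name: LessWrong-Amplify-Instruct
- Size category: 10K<n<100K
Dataset features
- Generation method: Uses the Amplify-Instruct method to expand high-quality single-turn seeds into in-depth multi-turn conversations.
- Content diversity: Emphasizes information diversity across many domains, as well as reasoning, logic and extrapolation about a wide range of subjects.
- Conversation structure: Conversations average over 1,000 tokens and at least 3 turns each.
- Quality control: Aggressively filtered to remove all possible moralizing/alignment content, as well as common undesirable behaviours such as AI self-reference and specific timestamps.
Dataset applications
- Model training: The current dataset used for training contains 20K training examples, which underlines its significance for scaling.
- Performance evaluation: Evaluated on the HF Leaderboard and other benchmarks such as AGIEval, Bigbench and GPT4All.
Dataset cleaning and contributions
- Cleaning measures: Extensive measures were taken to filter out conversations containing AI moralizing/alignment content.
- Contributions: Thanks to indirect contributors, in particular for the single-turn datasets used as seeds to generate the multi-turn data.
Future plans
- Plan to enlist domain-expert volunteers to help remove mathematically/verifiably incorrect answers from the training datasets.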



