cognitivecomputations/dolphin
Source: Hugging Face · Updated 2023-12-18 · Indexed 2024-03-04
Download link: https://hf-mirror.com/datasets/cognitivecomputations/dolphin
Resource description:
---
license: apache-2.0
task_categories:
- text-generation
language:
- en
configs:
- config_name: flan1m-alpaca-uncensored
  data_files: flan1m-alpaca-uncensored.jsonl
- config_name: flan5m-alpaca-uncensored
  data_files: flan5m-alpaca-uncensored.jsonl
---
Dolphin 🐬
https://erichartford.com/dolphin
## Dataset details
This dataset is an attempt to replicate the results of [Microsoft's Orca](https://www.microsoft.com/en-us/research/publication/orca-progressive-learning-from-complex-explanation-traces-of-gpt-4/).
Our dataset consists of:
- ~1 million FLANv2 entries augmented with GPT-4 completions (flan1m-alpaca-uncensored.jsonl)
- ~3.5 million FLANv2 entries augmented with GPT-3.5 completions (flan5m-alpaca-uncensored.jsonl)
We followed the submix and system prompt distribution outlined in the Orca paper, with a few exceptions. We included all 75k CoT examples in the FLAN-1m dataset rather than sampling them. We also found that many items were duplicated, so we removed the duplicates, leaving 3.5m instructions in the ChatGPT dataset.
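The card does not say how duplicates were identified. As a rough illustration only, the sketch below assumes Alpaca-style records with `instruction` and `input` keys and treats two records as duplicates when those fields match after whitespace normalization. The function name and the sample records are hypothetical.

```python
from typing import Iterable, Iterator


def dedupe_records(records: Iterable[dict]) -> Iterator[dict]:
    """Yield each record the first time its (instruction, input) pair appears."""
    seen = set()
    for record in records:
        # Assumed duplicate key: system prompt plus question, with runs of
        # whitespace collapsed so trivially different copies still match.
        key = (
            " ".join(record.get("instruction", "").split()),
            " ".join(record.get("input", "").split()),
        )
        if key not in seen:
            seen.add(key)
            yield record


# Made-up records for illustration; they are not taken from the dataset.
sample = [
    {"instruction": "You are a helpful assistant.", "input": "What is 2 + 2?", "output": "4"},
    {"instruction": "You are a helpful assistant.", "input": "What is  2 + 2?", "output": "Four"},
    {"instruction": "You are a helpful assistant.", "input": "Name a mammal.", "output": "A dolphin."},
]
unique = list(dedupe_records(sample))
print(len(unique))  # the second record collapses into the first
```

Keeping the first occurrence is an arbitrary choice here; a real pipeline might instead prefer the longest or most recent completion.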
We then filtered out instances of alignment, refusal, avoidance, and bias in order to produce an uncensored model, on top of which you can layer your own personalized alignment LoRA.
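The filtering criteria themselves are not listed in this card. One simple way such a filter could work is a phrase match on the completion text, sketched below. The marker phrases, the `output` key, and the function names are all assumptions made for illustration, not the list actually used for Dolphin.

```python
from typing import Iterable, Iterator

# Hypothetical marker phrases, chosen only to illustrate the idea.
REFUSAL_MARKERS = (
    "as an ai language model",
    "i cannot fulfill",
    "i'm sorry, but",
)


def is_refusal(text: str) -> bool:
    """Return True if the text contains any marker phrase (case-insensitive)."""
    lowered = text.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def drop_refusals(records: Iterable[dict]) -> Iterator[dict]:
    """Yield only records whose assumed 'output' field has no marker phrase."""
    for record in records:
        if not is_refusal(record.get("output", "")):
            yield record


# Made-up records for illustration; they are not taken from the dataset.
sample = [
    {"input": "Name a mammal.", "output": "A dolphin."},
    {"input": "Tell me a joke.", "output": "As an AI language model, I cannot tell jokes."},
]
kept = list(drop_refusals(sample))
print(len(kept))
```

Substring matching is crude and will both miss paraphrased refusals and drop some legitimate answers, so any real phrase list would need manual review.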
Figure: Token distribution for GPT-3.5 completions

### Loading
```python
from datasets import load_dataset

# Load the GPT-4 completions
gpt4_dataset = load_dataset("ehartford/dolphin", data_files="flan1m-alpaca-uncensored.jsonl")
# Load the GPT-3.5 completions
gpt35_dataset = load_dataset("ehartford/dolphin", data_files="flan5m-alpaca-uncensored.jsonl")
```
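Because the card's YAML header declares two named configs, it should also be possible to select a file by config name instead of `data_files`. The wrapper below is a minimal sketch: `load_dolphin` and `CONFIGS` are hypothetical names, the default repository id is taken from this page's title (the snippet above uses `ehartford/dolphin`), and the `train` split is assumed because no split is declared for the JSONL files.

```python
# Config names as declared in the YAML header of this card.
CONFIGS = {
    "flan1m-alpaca-uncensored",  # GPT-4 completions
    "flan5m-alpaca-uncensored",  # GPT-3.5 completions
}


def load_dolphin(config: str,
                 repo: str = "cognitivecomputations/dolphin",
                 streaming: bool = True):
    """Load one Dolphin config, failing early on an unknown config name."""
    if config not in CONFIGS:
        raise ValueError(
            f"unknown config {config!r}; expected one of {sorted(CONFIGS)}"
        )
    # Imported lazily so the name check works without the library installed.
    from datasets import load_dataset

    # streaming=True iterates over the file instead of downloading it whole.
    return load_dataset(repo, config, split="train", streaming=streaming)
```

For example, `next(iter(load_dolphin("flan1m-alpaca-uncensored")))` would fetch a single record over the network, which is a cheap way to inspect the field names before committing to a full download.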
This dataset is licensed under Apache-2.0 for commercial or non-commercial use.
We currently plan to release Dolphin on:
- Xgen 7b 8k
- LLaMA 13b (Non-commercial)
- MPT 30b 8k
- LLaMA 33b (Non-commercial)
- Falcon 40b
- LLaMA 65b (Non-commercial)
Each Dolphin model that is released will be subject to the license of the foundational model on which it is trained. (LLaMA releases will be non-commercial.)
I would like to thank the motley crew of Open Source AI/ML engineers who have worked beside me in this endeavor, including:
- Wing "Caseus" Lian and NanoBit of OpenAccess AI Collective
- Rohan
- Teknium
- Pankaj Mathur
- Tom "TheBloke" Jobbins for quantizing and amplifying
- Special thanks to EdenCoder and chirper.ai for mentorship and financial sponsorship.
- Special thanks to Kilkonie for his very valued mentorship.
- All the other people in the Open Source AI community who have taught me and helped me along the way.
Provider: cognitivecomputations
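Summary of original information
Dataset overview
Dataset contents
- FLANv2 augmented data:
  - ~1 million entries with GPT-4 completions (flan1m-alpaca-uncensored.jsonl)
  - ~3.5 million entries with GPT-3.5 completions (flan5m-alpaca-uncensored.jsonl)
Dataset characteristics
- Follows the submix and system prompt distribution outlined in the Orca paper, with the following adjustments:
  - All 75k CoT examples are included in the FLAN-1m dataset, without sampling.
  - Duplicates were removed, leaving 3.5 million instructions in the ChatGPT dataset.
- Instances of alignment, refusal, avoidance, and bias were filtered out in order to produce an uncensored model on which a personalized alignment LoRA can be layered.
License
- This dataset is released under the Apache-2.0 license for both commercial and non-commercial use.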
Dataset loading example
```python
from datasets import load_dataset

# Load the GPT-4 completions
gpt4_dataset = load_dataset("ehartford/dolphin", data_files="flan1m-alpaca-uncensored.jsonl")
# Load the GPT-3.5 completions
gpt35_dataset = load_dataset("ehartford/dolphin", data_files="flan5m-alpaca-uncensored.jsonl")
```



