andysalerno/rainbowfish-v1
收藏Hugging Face2024-02-08 更新2024-03-04 收录
下载链接:
https://hf-mirror.com/datasets/andysalerno/rainbowfish-v1
下载链接
链接失效反馈官方服务:
资源简介:
---
dataset_info:
features:
- name: source
dtype: string
- name: input
dtype: string
- name: output
dtype: string
splits:
- name: train
num_bytes: 148983951
num_examples: 69980
download_size: 70573434
dataset_size: 148983951
configs:
- config_name: default
data_files:
- split: train
path: data/train-*
---
## Formatting
Formatting is compliant with ChatML. "input" is the context and "output" is the expected model output to train on.
## Details
See the repo file `generate_dataset.py` for exactly this dataset was generated.
An opinionated and filtered mix of the following datasets:
- argilla/ultrafeedback-binarized-preferences-cleaned
- heegyu/glaive-function-calling-v2-formatted
- berkeley-nest/Nectar
- argilla/distilabel-math-preference-dpo
## argilla/ultrafeedback-binarized-preferences-cleaned
**Filter**:
- `chosen-rating == 5` AND
- `len(chosen) == 2`
## heegyu/glaive-function-calling-v2-formatted
**Filter**:
- `function_description != ''`
**Transforms**:
- Added a system message randomly selected from a pool of generic system messages.
## berkeley-nest/Nectar
**Filter**:
- has an answer with `rank == 1` AND
- `turns > 1` AND
- `good_natured == True` AND
- answer.to_lower() does not start with "i'm sorry"
## argilla/distilabel-math-preference-dpo
**Filter**:
- `chosen-rating >= 9`
**Transforms**:
- Added a system message randomly selected from a pool of generic system messages.
- Removed the phrase "Take a deep breath, think step by step, and give an accurate response"
## Global formatting
All the above datasets were formatted to comply with ChatML.
提供机构:
andysalerno
原始信息汇总
数据集信息
特征
- source: 数据来源,类型为字符串。
- input: 输入内容,类型为字符串。
- output: 输出内容,类型为字符串。
数据分割
- train: 训练集,包含148,983,951字节,69,980个样本。
数据大小
- 下载大小: 70,573,434字节
- 数据集大小: 148,983,951字节
配置
- default: 默认配置,包含训练集数据文件路径为
data/train-*。
数据集生成细节
数据集来源
-
argilla/ultrafeedback-binarized-preferences-cleaned:
- 过滤条件:
chosen-rating == 5且len(chosen) == 2
- 过滤条件:
-
heegyu/glaive-function-calling-v2-formatted:
- 过滤条件:
function_description != - 转换操作: 随机添加一个系统消息。
- 过滤条件:
-
berkeley-nest/Nectar:
- 过滤条件: 包含一个
rank == 1的答案,且turns > 1,good_natured == True,答案不以"Im sorry"开头。
- 过滤条件: 包含一个
-
argilla/distilabel-math-preference-dpo:
- 过滤条件:
chosen-rating >= 9 - 转换操作: 随机添加一个系统消息,移除短语"Take a deep breath, think step by step, and give an accurate response"。
- 过滤条件:
全局格式
所有上述数据集都已格式化为符合ChatML标准。



