andysalerno/rainbowfish-v1

Name: andysalerno/rainbowfish-v1
Creator: andysalerno
Published: 2024-02-08 06:06:02
License: 暂无描述

Hugging Face2024-02-08 更新2024-03-04 收录

下载链接：

https://hf-mirror.com/datasets/andysalerno/rainbowfish-v1

下载链接

链接失效反馈

官方服务：

资源简介：

--- dataset_info: features: - name: source dtype: string - name: input dtype: string - name: output dtype: string splits: - name: train num_bytes: 148983951 num_examples: 69980 download_size: 70573434 dataset_size: 148983951 configs: - config_name: default data_files: - split: train path: data/train-* --- ## Formatting Formatting is compliant with ChatML. "input" is the context and "output" is the expected model output to train on. ## Details See the repo file `generate_dataset.py` for exactly this dataset was generated. An opinionated and filtered mix of the following datasets: - argilla/ultrafeedback-binarized-preferences-cleaned - heegyu/glaive-function-calling-v2-formatted - berkeley-nest/Nectar - argilla/distilabel-math-preference-dpo ## argilla/ultrafeedback-binarized-preferences-cleaned **Filter**: - `chosen-rating == 5` AND - `len(chosen) == 2` ## heegyu/glaive-function-calling-v2-formatted **Filter**: - `function_description != ''` **Transforms**: - Added a system message randomly selected from a pool of generic system messages. ## berkeley-nest/Nectar **Filter**: - has an answer with `rank == 1` AND - `turns > 1` AND - `good_natured == True` AND - answer.to_lower() does not start with "i'm sorry" ## argilla/distilabel-math-preference-dpo **Filter**: - `chosen-rating >= 9` **Transforms**: - Added a system message randomly selected from a pool of generic system messages. - Removed the phrase "Take a deep breath, think step by step, and give an accurate response" ## Global formatting All the above datasets were formatted to comply with ChatML.

提供机构：

andysalerno

原始信息汇总

数据集信息

特征

source: 数据来源，类型为字符串。
input: 输入内容，类型为字符串。
output: 输出内容，类型为字符串。

数据分割

train: 训练集，包含148,983,951字节，69,980个样本。

数据大小

下载大小: 70,573,434字节
数据集大小: 148,983,951字节

配置

default: 默认配置，包含训练集数据文件路径为data/train-*。

数据集生成细节

数据集来源

argilla/ultrafeedback-binarized-preferences-cleaned:
- 过滤条件: chosen-rating == 5 且 len(chosen) == 2
heegyu/glaive-function-calling-v2-formatted:
- 过滤条件: function_description !=
- 转换操作: 随机添加一个系统消息。
berkeley-nest/Nectar:
- 过滤条件: 包含一个rank == 1的答案，且turns > 1，good_natured == True，答案不以"Im sorry"开头。
argilla/distilabel-math-preference-dpo:
- 过滤条件: chosen-rating >= 9
- 转换操作: 随机添加一个系统消息，移除短语"Take a deep breath, think step by step, and give an accurate response"。

全局格式

所有上述数据集都已格式化为符合ChatML标准。

5,000+

优质数据集

54 个

任务类型

进入经典数据集