five

andysalerno/rainbowfish-v1

收藏
Hugging Face2024-02-08 更新2024-03-04 收录
下载链接:
https://hf-mirror.com/datasets/andysalerno/rainbowfish-v1
下载链接
链接失效反馈
官方服务:
资源简介:
--- dataset_info: features: - name: source dtype: string - name: input dtype: string - name: output dtype: string splits: - name: train num_bytes: 148983951 num_examples: 69980 download_size: 70573434 dataset_size: 148983951 configs: - config_name: default data_files: - split: train path: data/train-* --- ## Formatting Formatting is compliant with ChatML. "input" is the context and "output" is the expected model output to train on. ## Details See the repo file `generate_dataset.py` for exactly this dataset was generated. An opinionated and filtered mix of the following datasets: - argilla/ultrafeedback-binarized-preferences-cleaned - heegyu/glaive-function-calling-v2-formatted - berkeley-nest/Nectar - argilla/distilabel-math-preference-dpo ## argilla/ultrafeedback-binarized-preferences-cleaned **Filter**: - `chosen-rating == 5` AND - `len(chosen) == 2` ## heegyu/glaive-function-calling-v2-formatted **Filter**: - `function_description != ''` **Transforms**: - Added a system message randomly selected from a pool of generic system messages. ## berkeley-nest/Nectar **Filter**: - has an answer with `rank == 1` AND - `turns > 1` AND - `good_natured == True` AND - answer.to_lower() does not start with "i'm sorry" ## argilla/distilabel-math-preference-dpo **Filter**: - `chosen-rating >= 9` **Transforms**: - Added a system message randomly selected from a pool of generic system messages. - Removed the phrase "Take a deep breath, think step by step, and give an accurate response" ## Global formatting All the above datasets were formatted to comply with ChatML.
提供机构:
andysalerno
原始信息汇总

数据集信息

特征

  • source: 数据来源,类型为字符串。
  • input: 输入内容,类型为字符串。
  • output: 输出内容,类型为字符串。

数据分割

  • train: 训练集,包含148,983,951字节,69,980个样本。

数据大小

  • 下载大小: 70,573,434字节
  • 数据集大小: 148,983,951字节

配置

  • default: 默认配置,包含训练集数据文件路径为data/train-*

数据集生成细节

数据集来源

  • argilla/ultrafeedback-binarized-preferences-cleaned:

    • 过滤条件: chosen-rating == 5len(chosen) == 2
  • heegyu/glaive-function-calling-v2-formatted:

    • 过滤条件: function_description !=
    • 转换操作: 随机添加一个系统消息。
  • berkeley-nest/Nectar:

    • 过滤条件: 包含一个rank == 1的答案,且turns > 1good_natured == True,答案不以"Im sorry"开头。
  • argilla/distilabel-math-preference-dpo:

    • 过滤条件: chosen-rating >= 9
    • 转换操作: 随机添加一个系统消息,移除短语"Take a deep breath, think step by step, and give an accurate response"。

全局格式

所有上述数据集都已格式化为符合ChatML标准。

5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作