distilabel-internal-testing/fine-preferences-magpie-v6-tasky

Name: distilabel-internal-testing/fine-preferences-magpie-v6-tasky
Creator: distilabel-internal-testing
Published: 2024-07-17 11:59:11
License: 暂无描述

Hugging Face2024-07-17 更新2024-07-22 收录

下载链接：

https://hf-mirror.com/datasets/distilabel-internal-testing/fine-preferences-magpie-v6-tasky

下载链接

链接失效反馈

官方服务：

资源简介：

该数据集是通过distilabel工具生成的，包含一个`pipeline.yaml`文件，可以用于重现生成该数据集的流程。数据集的结构包括多个特征，如文本、ID、URL、文件路径、语言、语言评分、token计数、评分、整数评分、系统提示、对话内容、生成对话模型名称、生成内容、distilabel元数据等。数据集的一个示例展示了对话内容，涉及Gamma Ray Bursts（GRBs）对星际介质的影响及其对星系演化的意义。

This dataset contains a `pipeline.yaml` file which can be used to reproduce the pipeline that generated it in distilabel using the `distilabel` CLI. The dataset structure includes various features such as text, id, dump, url, file_path, language, language_score, token_count, score, int_score, system_prompt, conversation, gen_conv_model_name, generations, distilabel_metadata, and generations_model_names. The dataset is split into a training set with 100 examples. It is tagged as synthetic, distilabel, and rlaif.

提供机构：

distilabel-internal-testing

原始信息汇总

数据集概述

数据集结构

特征

text: 文本内容，数据类型为字符串。
id: 唯一标识符，数据类型为字符串。
dump: 数据转储信息，数据类型为字符串。
url: 数据来源的URL，数据类型为字符串。
file_path: 文件路径，数据类型为字符串。
language: 语言标识，数据类型为字符串。
language_score: 语言得分，数据类型为浮点数。
token_count: 标记数量，数据类型为整数。
score: 评分，数据类型为浮点数。
int_score: 整数评分，数据类型为整数。
system_prompt: 系统提示，数据类型为字符串。
conversation: 对话列表，包含以下子特征：
- content: 对话内容，数据类型为字符串。
- role: 角色，数据类型为字符串。
gen_conv_model_name: 生成对话的模型名称，数据类型为字符串。
generations: 生成的序列，数据类型为字符串序列。
distilabel_metadata: 元数据结构，包含以下子特征：
- raw_output_chat_generation_2: 原始输出聊天生成内容，数据类型为字符串。
generations_model_names: 生成模型名称序列，数据类型为字符串序列。

数据分割

train: 训练集，包含100个样本，总字节数为1738104。

数据集大小

下载大小: 876872字节
数据集大小: 1738104字节

配置

config_name: default
- data_files:
  - split: train
  - path: data/train-*

distilabel-internal-testing/fine-preferences-magpie-v6-tasky

数据集概述

数据集结构

特征

数据分割

数据集大小

配置

标签