Taywon/webgpt_noisy
Source: Hugging Face. Updated 2024-06-10; indexed 2024-06-12.
Download link:
https://hf-mirror.com/datasets/Taywon/webgpt_noisy
Resource description:
---
dataset_info:
  features:
  - name: question
    struct:
    - name: dataset
      dtype: string
    - name: full_text
      dtype: string
    - name: id
      dtype: string
  - name: quotes_0
    struct:
    - name: extract
      sequence: string
    - name: title
      sequence: string
  - name: answer_0
    dtype: string
  - name: tokens_0
    struct:
    - name: completion
      sequence: int64
    - name: prefix
      sequence: int64
  - name: score_0
    dtype: float64
  - name: quotes_1
    struct:
    - name: extract
      sequence: string
    - name: title
      sequence: string
  - name: answer_1
    dtype: string
  - name: tokens_1
    struct:
    - name: completion
      sequence: int64
    - name: prefix
      sequence: int64
  - name: score_1
    dtype: float64
  - name: input_ids_chosen
    sequence: int64
  - name: attention_mask_chosen
    sequence: int64
  - name: input_ids_rejected
    sequence: int64
  - name: attention_mask_rejected
    sequence: int64
  splits:
  - name: train_noise_60
    num_bytes: 312172618
    num_examples: 12663
  - name: train_noise_20
    num_bytes: 312172618
    num_examples: 12663
  download_size: 196351590
  dataset_size: 624345236
configs:
- config_name: default
  data_files:
  - split: train_noise_60
    path: data/train_noise_60-*
  - split: train_noise_20
    path: data/train_noise_20-*
---
# Dataset Card for "webgpt_noisy"
[More Information needed](https://github.com/huggingface/datasets/blob/main/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
The webgpt_noisy dataset stores, for each example, a question (source dataset, full text, and ID) and two candidate answers (`answer_0`, `answer_1`), each with supporting quotes, token IDs split into prefix and completion, and a float score. Each example also carries tokenized `chosen` and `rejected` inputs (input IDs and attention masks). There are two training splits, `train_noise_60` and `train_noise_20`, each with 12,663 examples and 312,172,618 bytes. The download size is 196,351,590 bytes and the total dataset size is 624,345,236 bytes.
Provided by:
Taywon
Summary of original information
Dataset overview
Dataset name
- Name: webgpt_noisy
Dataset features
- question (struct)
  - dataset: string
  - full_text: string
  - id: string
- quotes_0 (struct)
  - extract: sequence of string
  - title: sequence of string
- answer_0: string
- tokens_0 (struct)
  - completion: sequence of int64
  - prefix: sequence of int64
- score_0: float64
- quotes_1 (struct)
  - extract: sequence of string
  - title: sequence of string
- answer_1: string
- tokens_1 (struct)
  - completion: sequence of int64
  - prefix: sequence of int64
- score_1: float64
- input_ids_chosen: sequence of int64
- attention_mask_chosen: sequence of int64
- input_ids_rejected: sequence of int64
- attention_mask_rejected: sequence of int64
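
The feature list above can be turned into a small structural check. The following is a minimal sketch, not part of the original card: the helper name `missing_fields` is hypothetical, and it only checks that the expected keys are present in a record represented as a plain Python dict, not that the values have the declared types.

```python
# Field names copied from the feature list; the helper itself is illustrative.
STRUCT_FIELDS = {
    "question": ("dataset", "full_text", "id"),
    "quotes_0": ("extract", "title"),
    "tokens_0": ("completion", "prefix"),
    "quotes_1": ("extract", "title"),
    "tokens_1": ("completion", "prefix"),
}
FLAT_FIELDS = (
    "answer_0", "score_0", "answer_1", "score_1",
    "input_ids_chosen", "attention_mask_chosen",
    "input_ids_rejected", "attention_mask_rejected",
)


def missing_fields(record: dict) -> list:
    """Return the (dotted) names of expected fields absent from `record`."""
    missing = []
    for name, subfields in STRUCT_FIELDS.items():
        value = record.get(name)
        if not isinstance(value, dict):
            # The struct itself is absent or is not a mapping.
            missing.append(name)
            continue
        for sub in subfields:
            if sub not in value:
                missing.append(f"{name}.{sub}")
    for name in FLAT_FIELDS:
        if name not in record:
            missing.append(name)
    return missing
```

Calling `missing_fields(row)` on a row from either split should return an empty list if the row matches the schema listed here.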
Dataset splits
- train_noise_60
  - Bytes: 312172618
  - Examples: 12663
- train_noise_20
  - Bytes: 312172618
  - Examples: 12663
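
Because the dataset has no plain `train` split, a loader has to name one of the two splits explicitly. The sketch below assumes the Hugging Face `datasets` library is installed and that the repository is reachable under the id `Taywon/webgpt_noisy`; the helper `load_split` is hypothetical and only wraps `datasets.load_dataset` with a check on the split name.

```python
REPO_ID = "Taywon/webgpt_noisy"
SPLITS = ("train_noise_60", "train_noise_20")


def load_split(split: str):
    """Load one of the two splits listed on the card (needs network access)."""
    if split not in SPLITS:
        raise ValueError(f"unknown split {split!r}; expected one of {SPLITS}")
    # Imported lazily so the constants above are usable without `datasets`.
    from datasets import load_dataset

    return load_dataset(REPO_ID, split=split)


if __name__ == "__main__":
    ds = load_split("train_noise_20")
    print(ds.num_rows)       # the card lists 12663 examples per split
    print(ds.column_names)
```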
Dataset size
- Download size: 196351590 bytes
- Dataset size: 624345236 bytes
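
The size figures are mutually consistent, which the short check below makes explicit. It uses only the numbers listed on this page. Reading `download_size` as the size of the files fetched and `dataset_size` as the size of the loaded data follows the usual Hugging Face convention and is an assumption here, since the card does not say so.

```python
SPLIT_BYTES = {"train_noise_60": 312172618, "train_noise_20": 312172618}
EXAMPLES_PER_SPLIT = 12663
DOWNLOAD_SIZE = 196351590
DATASET_SIZE = 624345236

# The reported dataset size is the sum of the two split sizes.
total_bytes = sum(SPLIT_BYTES.values())
assert total_bytes == DATASET_SIZE

# Ratio of downloaded bytes to dataset bytes (about 0.31).
download_ratio = DOWNLOAD_SIZE / DATASET_SIZE

# Average record size within one split (about 24.7 kB per example).
bytes_per_example = SPLIT_BYTES["train_noise_60"] / EXAMPLES_PER_SPLIT

print(f"{download_ratio:.3f}", f"{bytes_per_example:.0f}")
```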
Configuration
- default
  - Data files:
    - Split: train_noise_60
      Path: data/train_noise_60-*
    - Split: train_noise_20
      Path: data/train_noise_20-*
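
The `path` entries in the default configuration are glob patterns that assign data files to splits. As a rough illustration, the sketch below matches file names against those patterns with Python's `fnmatch`. The shard file names in the usage example are hypothetical, and the `datasets` library resolves patterns with its own logic, so this only approximates the idea.

```python
from fnmatch import fnmatchcase

# Patterns copied from the default configuration.
DATA_FILES = {
    "train_noise_60": "data/train_noise_60-*",
    "train_noise_20": "data/train_noise_20-*",
}


def split_for(path: str):
    """Return the split whose pattern matches `path`, or None if none does."""
    for split, pattern in DATA_FILES.items():
        if fnmatchcase(path, pattern):
            return split
    return None


# Hypothetical shard names, for illustration only.
for name in (
    "data/train_noise_60-00000-of-00002.parquet",
    "data/train_noise_20-00001-of-00002.parquet",
    "README.md",
):
    print(name, "->", split_for(name))
```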



