
Taywon/webgpt_noisy

Source: Hugging Face. Updated 2024-06-10; indexed 2024-06-12.
Download link:
https://hf-mirror.com/datasets/Taywon/webgpt_noisy
Resource overview:
---
dataset_info:
  features:
  - name: question
    struct:
    - name: dataset
      dtype: string
    - name: full_text
      dtype: string
    - name: id
      dtype: string
  - name: quotes_0
    struct:
    - name: extract
      sequence: string
    - name: title
      sequence: string
  - name: answer_0
    dtype: string
  - name: tokens_0
    struct:
    - name: completion
      sequence: int64
    - name: prefix
      sequence: int64
  - name: score_0
    dtype: float64
  - name: quotes_1
    struct:
    - name: extract
      sequence: string
    - name: title
      sequence: string
  - name: answer_1
    dtype: string
  - name: tokens_1
    struct:
    - name: completion
      sequence: int64
    - name: prefix
      sequence: int64
  - name: score_1
    dtype: float64
  - name: input_ids_chosen
    sequence: int64
  - name: attention_mask_chosen
    sequence: int64
  - name: input_ids_rejected
    sequence: int64
  - name: attention_mask_rejected
    sequence: int64
  splits:
  - name: train_noise_60
    num_bytes: 312172618
    num_examples: 12663
  - name: train_noise_20
    num_bytes: 312172618
    num_examples: 12663
  download_size: 196351590
  dataset_size: 624345236
configs:
- config_name: default
  data_files:
  - split: train_noise_60
    path: data/train_noise_60-*
  - split: train_noise_20
    path: data/train_noise_20-*
---

# Dataset Card for "webgpt_noisy"

[More Information needed](https://github.com/huggingface/datasets/blob/main/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

The webgpt_noisy dataset pairs each question with two candidate answers (answer_0 and answer_1). For each answer it stores the supporting quotes, the prefix and completion token IDs, and a score. Each record also carries input IDs and attention masks for a chosen and a rejected sequence. The data ships as two training splits, train_noise_60 and train_noise_20, each with 12,663 examples and 312,172,618 bytes. The download size is 196,351,590 bytes and the total dataset size is 624,345,236 bytes.
Provider:
Taywon
Summary of original information

Dataset overview

Dataset name

  • Name: webgpt_noisy

Dataset features

  • question
    • Structure:
      • dataset: string
      • full_text: string
      • id: string
  • quotes_0
    • Structure:
      • extract: sequence of string
      • title: sequence of string
  • answer_0: string
  • tokens_0
    • Structure:
      • completion: sequence of int64
      • prefix: sequence of int64
  • score_0: float64
  • quotes_1
    • Structure:
      • extract: sequence of string
      • title: sequence of string
  • answer_1: string
  • tokens_1
    • Structure:
      • completion: sequence of int64
      • prefix: sequence of int64
  • score_1: float64
  • input_ids_chosen: sequence of int64
  • attention_mask_chosen: sequence of int64
  • input_ids_rejected: sequence of int64
  • attention_mask_rejected: sequence of int64
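
The feature list above can be mirrored as plain Python types, which may help when writing code that consumes one record. This is only a sketch: the class names (`Question`, `Quotes`, `Tokens`, `Record`), the helper `masks_match_inputs`, and all values in `example` are made up for illustration, and the check assumes the common tokenizer convention of one attention-mask entry per input ID, which the dataset card does not state.

```python
from typing import List, TypedDict


class Question(TypedDict):
    dataset: str
    full_text: str
    id: str


class Quotes(TypedDict):
    extract: List[str]
    title: List[str]


class Tokens(TypedDict):
    completion: List[int]
    prefix: List[int]


class Record(TypedDict):
    question: Question
    quotes_0: Quotes
    answer_0: str
    tokens_0: Tokens
    score_0: float
    quotes_1: Quotes
    answer_1: str
    tokens_1: Tokens
    score_1: float
    input_ids_chosen: List[int]
    attention_mask_chosen: List[int]
    input_ids_rejected: List[int]
    attention_mask_rejected: List[int]


def masks_match_inputs(record: Record) -> bool:
    """Return True when each attention mask is as long as its input IDs.

    Assumption: one mask entry per token, as most tokenizers produce.
    """
    return (
        len(record["input_ids_chosen"]) == len(record["attention_mask_chosen"])
        and len(record["input_ids_rejected"]) == len(record["attention_mask_rejected"])
    )


# Synthetic record with invented values, used only to exercise the types.
example: Record = {
    "question": {"dataset": "demo", "full_text": "Why is the sky blue?", "id": "q-0"},
    "quotes_0": {"extract": ["quote a"], "title": ["source a"]},
    "answer_0": "answer a",
    "tokens_0": {"completion": [4, 5], "prefix": [1, 2, 3]},
    "score_0": 0.0,
    "quotes_1": {"extract": ["quote b"], "title": ["source b"]},
    "answer_1": "answer b",
    "tokens_1": {"completion": [6], "prefix": [1, 2, 3]},
    "score_1": 0.0,
    "input_ids_chosen": [1, 2, 3, 4, 5],
    "attention_mask_chosen": [1, 1, 1, 1, 1],
    "input_ids_rejected": [1, 2, 3, 6],
    "attention_mask_rejected": [1, 1, 1, 1],
}

print(masks_match_inputs(example))
```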

Dataset splits

  • train_noise_60
    • Bytes: 312172618
    • Examples: 12663
  • train_noise_20
    • Bytes: 312172618
    • Examples: 12663
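
One way to pick a split by its suffix is sketched below, using `load_dataset` from the Hugging Face `datasets` library. The helper names `split_name` and `load_split` are hypothetical, and treating the suffix as a "noise level" is an assumption drawn from the split names alone; the card does not explain what 20 and 60 mean.

```python
NOISE_LEVELS = (20, 60)  # the two suffixes that appear in the split names


def split_name(noise_level: int) -> str:
    """Build a split name such as 'train_noise_20' (hypothetical helper)."""
    if noise_level not in NOISE_LEVELS:
        raise ValueError(f"unknown noise level {noise_level}; expected one of {NOISE_LEVELS}")
    return f"train_noise_{noise_level}"


def load_split(noise_level: int):
    """Load one split from the Hub. Requires the `datasets` package and network access."""
    # Imported lazily so split_name stays usable without the library installed.
    from datasets import load_dataset

    return load_dataset("Taywon/webgpt_noisy", split=split_name(noise_level))
```

Calling `load_split(20)` would download the data on first use, so it is left uncalled here.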

Dataset size

  • Download size: 196351590 bytes
  • Dataset size: 624345236 bytes
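
The figures listed here can be cross-checked with a little arithmetic. The sketch below uses only the numbers stated on this page; the variable names are illustrative. It confirms that the two split sizes add up to the stated dataset size, and it derives an average of roughly 24.65 KB per example and a download size of about 31% of the dataset size.

```python
# (num_bytes, num_examples) per split, copied from the listing.
SPLITS = {
    "train_noise_60": (312_172_618, 12_663),
    "train_noise_20": (312_172_618, 12_663),
}
DOWNLOAD_SIZE = 196_351_590
DATASET_SIZE = 624_345_236

total_bytes = sum(num_bytes for num_bytes, _ in SPLITS.values())
total_examples = sum(num_examples for _, num_examples in SPLITS.values())

bytes_per_example = total_bytes / total_examples
download_ratio = DOWNLOAD_SIZE / DATASET_SIZE

print(total_bytes == DATASET_SIZE)
print(f"{bytes_per_example:.1f} bytes per example")
print(f"download is {download_ratio:.1%} of the dataset size")
```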

Configuration

  • Default configuration (default)
    • Data files:
      • Split: train_noise_60
        • Path: data/train_noise_60-*
      • Split: train_noise_20
        • Path: data/train_noise_20-*
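
The path patterns in the configuration map data files to splits by prefix. A rough illustration with the standard-library `fnmatch` module follows. The shard file names in the example are invented, and the Hub resolves these patterns with its own globbing logic, so this is an approximation of the idea rather than the actual resolution code.

```python
from fnmatch import fnmatch
from typing import Optional

# Split-to-pattern mapping copied from the default configuration.
DATA_FILES = {
    "train_noise_60": "data/train_noise_60-*",
    "train_noise_20": "data/train_noise_20-*",
}


def split_for(path: str) -> Optional[str]:
    """Return the split whose pattern matches `path`, or None if none does."""
    for split, pattern in DATA_FILES.items():
        if fnmatch(path, pattern):
            return split
    return None


# Hypothetical shard names, for illustration only.
print(split_for("data/train_noise_60-00000-of-00001.parquet"))
print(split_for("data/test-00000-of-00001.parquet"))
```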