five

UnderstandLing/oasst1_zh

收藏
Hugging Face2023-12-30 更新2024-03-04 收录
下载链接:
https://hf-mirror.com/datasets/UnderstandLing/oasst1_zh
下载链接
链接失效反馈
官方服务:
资源简介:
--- license: apache-2.0 dataset_info: features: - name: message_id dtype: string - name: parent_id dtype: string - name: user_id dtype: string - name: created_date dtype: string - name: text dtype: string - name: role dtype: string - name: lang dtype: string - name: review_count dtype: int64 - name: review_result dtype: bool - name: deleted dtype: bool - name: rank dtype: float64 - name: synthetic dtype: bool - name: model_name dtype: 'null' - name: detoxify struct: - name: identity_attack dtype: float64 - name: insult dtype: float64 - name: obscene dtype: float64 - name: severe_toxicity dtype: float64 - name: sexual_explicit dtype: float64 - name: threat dtype: float64 - name: toxicity dtype: float64 - name: message_tree_id dtype: string - name: tree_state dtype: string - name: emojis struct: - name: count sequence: int64 - name: name sequence: string - name: labels struct: - name: count sequence: int64 - name: name sequence: string - name: value sequence: float64 splits: - name: train num_bytes: 86692100 num_examples: 84432 - name: validation num_bytes: 415393 num_examples: 399 download_size: 30971061 dataset_size: 87107493 configs: - config_name: default data_files: - split: train path: data/train-* - split: validation path: data/validation-* ---

The dataset includes multiple features such as message ID, parent ID, user ID, creation date, text content, role, language, review count, review result, deletion status, rank, synthetic flag, model name, detoxification analysis, message tree ID, tree state, emojis, and labels. The dataset is split into a training set with 84432 samples and a validation set with 399 samples. The download size of the dataset is 30971061 bytes, and the total size is 87107493 bytes.
提供机构:
UnderstandLing
原始信息汇总

数据集概述

许可证

  • Apache 2.0

数据集信息

特征

  • message_id: 字符串
  • parent_id: 字符串
  • user_id: 字符串
  • created_date: 字符串
  • text: 字符串
  • role: 字符串
  • lang: 字符串
  • review_count: 整数
  • review_result: 布尔值
  • deleted: 布尔值
  • rank: 浮点数
  • synthetic: 布尔值
  • model_name: null
  • detoxify: 结构体
    • identity_attack: 浮点数
    • insult: 浮点数
    • obscene: 浮点数
    • severe_toxicity: 浮点数
    • sexual_explicit: 浮点数
    • threat: 浮点数
    • toxicity: 浮点数
  • message_tree_id: 字符串
  • tree_state: 字符串
  • emojis: 结构体
    • count: 整数序列
    • name: 字符串序列
  • labels: 结构体
    • count: 整数序列
    • name: 字符串序列
    • value: 浮点数序列

数据分割

  • train:
    • 字节数: 86692100
    • 样本数: 84432
  • validation:
    • 字节数: 415393
    • 样本数: 399

数据集大小

  • 下载大小: 30971061 字节
  • 数据集大小: 87107493 字节

配置

  • default:
    • train: data/train-*
    • validation: data/validation-*
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作