UnderstandLing/oasst1_zh
收藏Hugging Face2023-12-30 更新2024-03-04 收录
下载链接:
https://hf-mirror.com/datasets/UnderstandLing/oasst1_zh
下载链接
链接失效反馈官方服务:
资源简介:
---
license: apache-2.0
dataset_info:
features:
- name: message_id
dtype: string
- name: parent_id
dtype: string
- name: user_id
dtype: string
- name: created_date
dtype: string
- name: text
dtype: string
- name: role
dtype: string
- name: lang
dtype: string
- name: review_count
dtype: int64
- name: review_result
dtype: bool
- name: deleted
dtype: bool
- name: rank
dtype: float64
- name: synthetic
dtype: bool
- name: model_name
dtype: 'null'
- name: detoxify
struct:
- name: identity_attack
dtype: float64
- name: insult
dtype: float64
- name: obscene
dtype: float64
- name: severe_toxicity
dtype: float64
- name: sexual_explicit
dtype: float64
- name: threat
dtype: float64
- name: toxicity
dtype: float64
- name: message_tree_id
dtype: string
- name: tree_state
dtype: string
- name: emojis
struct:
- name: count
sequence: int64
- name: name
sequence: string
- name: labels
struct:
- name: count
sequence: int64
- name: name
sequence: string
- name: value
sequence: float64
splits:
- name: train
num_bytes: 86692100
num_examples: 84432
- name: validation
num_bytes: 415393
num_examples: 399
download_size: 30971061
dataset_size: 87107493
configs:
- config_name: default
data_files:
- split: train
path: data/train-*
- split: validation
path: data/validation-*
---
The dataset includes multiple features such as message ID, parent ID, user ID, creation date, text content, role, language, review count, review result, deletion status, rank, synthetic flag, model name, detoxification analysis, message tree ID, tree state, emojis, and labels. The dataset is split into a training set with 84432 samples and a validation set with 399 samples. The download size of the dataset is 30971061 bytes, and the total size is 87107493 bytes.
提供机构:
UnderstandLing
原始信息汇总
数据集概述
许可证
- Apache 2.0
数据集信息
特征
- message_id: 字符串
- parent_id: 字符串
- user_id: 字符串
- created_date: 字符串
- text: 字符串
- role: 字符串
- lang: 字符串
- review_count: 整数
- review_result: 布尔值
- deleted: 布尔值
- rank: 浮点数
- synthetic: 布尔值
- model_name: null
- detoxify: 结构体
- identity_attack: 浮点数
- insult: 浮点数
- obscene: 浮点数
- severe_toxicity: 浮点数
- sexual_explicit: 浮点数
- threat: 浮点数
- toxicity: 浮点数
- message_tree_id: 字符串
- tree_state: 字符串
- emojis: 结构体
- count: 整数序列
- name: 字符串序列
- labels: 结构体
- count: 整数序列
- name: 字符串序列
- value: 浮点数序列
数据分割
- train:
- 字节数: 86692100
- 样本数: 84432
- validation:
- 字节数: 415393
- 样本数: 399
数据集大小
- 下载大小: 30971061 字节
- 数据集大小: 87107493 字节
配置
- default:
- train: data/train-*
- validation: data/validation-*



