tasksource/oasst1_dense_flat
Source: Hugging Face. Updated 2023-05-31; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/tasksource/oasst1_dense_flat
Description:
---
dataset_info:
  features:
  - name: message_id
    dtype: string
  - name: parent_id
    dtype: string
  - name: user_id
    dtype: string
  - name: created_date
    dtype: string
  - name: text
    dtype: string
  - name: role
    dtype: string
  - name: lang
    dtype: string
  - name: review_count
    dtype: int32
  - name: review_result
    dtype: bool
  - name: deleted
    dtype: bool
  - name: rank
    dtype: float64
  - name: synthetic
    dtype: bool
  - name: model_name
    dtype: 'null'
  - name: detoxify
    struct:
    - name: identity_attack
      dtype: float64
    - name: insult
      dtype: float64
    - name: obscene
      dtype: float64
    - name: severe_toxicity
      dtype: float64
    - name: sexual_explicit
      dtype: float64
    - name: threat
      dtype: float64
    - name: toxicity
      dtype: float64
  - name: message_tree_id
    dtype: string
  - name: tree_state
    dtype: string
  - name: emojis
    struct:
    - name: count
      sequence: int32
    - name: name
      sequence: string
  - name: labels
    struct:
    - name: count
      sequence: int32
    - name: name
      sequence: string
    - name: value
      sequence: float64
  - name: parent_text
    dtype: string
  - name: spam
    dtype: float64
  - name: fails_task
    dtype: float64
  - name: lang_mismatch
    dtype: float64
  - name: pii
    dtype: float64
  - name: not_appropriate
    dtype: float64
  - name: hate_speech
    dtype: float64
  - name: sexual_content
    dtype: float64
  - name: quality
    dtype: float64
  - name: toxicity
    dtype: float64
  - name: humor
    dtype: float64
  - name: helpfulness
    dtype: float64
  - name: creativity
    dtype: float64
  - name: violence
    dtype: float64
  splits:
  - name: train
    num_bytes: 59657796
    num_examples: 34059
  - name: validation
    num_bytes: 3164029
    num_examples: 1816
  download_size: 25173939
  dataset_size: 62821825
license: apache-2.0
---
# Dataset Card for "oasst1_dense_flat"
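The [OASST1 dataset](https://huggingface.co/datasets/OpenAssistant/oasst1), with two changes: each message's parent text is retrieved into a `parent_text` column, and only messages with dense annotations are kept (every label was rated by multiple annotators; the script below filters on a minimum per-label count greater than 2). Each label value is also flattened into its own column.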
```python
from datasets import Dataset, DatasetDict, load_dataset

d = {}
for split in ['train', 'validation']:
    df = load_dataset("OpenAssistant/oasst1")[split].to_pandas()
    # Look up each message's parent text by message_id ('' when there is no parent).
    m2t = df.set_index("message_id")['text'].to_dict()
    df['parent_text'] = df.parent_id.map(lambda x: m2t.get(x, ''))
    # Keep only labelled messages whose minimum per-label count exceeds 2.
    df = df[df.labels.map(lambda x: x is not None)]
    df = df[df.labels.map(lambda x: x['count'].min() > 2)]
    # Keep only messages carrying the most common label set (tuples are hashable, lists are not).
    labels = df.labels.map(lambda x: tuple(x['name'])).value_counts().index[0]
    df = df[df.labels.map(lambda x: tuple(x['name']) == labels)]
    # Flatten each label value into its own column.
    for label in labels:
        df[label] = df.labels.map(lambda x: x['value'][list(x['name']).index(label)])
    d[split] = Dataset.from_pandas(df, preserve_index=False)
DatasetDict(d).push_to_hub('oasst1_dense_flat')
```
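To make the three steps of the script easier to follow (parent lookup, dense filter, label flattening), here is a minimal sketch that runs on a tiny hand-made table instead of the real OASST1 data. The toy rows, the helper name `flatten_dense`, and its `min_count` parameter are illustrative assumptions, not part of the original script.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the OASST1 message table (hypothetical values).
df = pd.DataFrame({
    "message_id": ["a", "b", "c", "d"],
    "parent_id": [None, "a", "a", "b"],
    "text": ["Hi", "Hello!", "Hey", "How are you?"],
    "labels": [
        {"name": np.array(["spam", "quality"]), "value": np.array([0.0, 0.75]), "count": np.array([3, 3])},
        {"name": np.array(["spam", "quality"]), "value": np.array([0.0, 0.5]), "count": np.array([3, 2])},
        None,
        {"name": np.array(["spam", "quality"]), "value": np.array([0.25, 1.0]), "count": np.array([4, 3])},
    ],
})


def label_names(x):
    """Return a message's label names as a hashable tuple of plain strings."""
    return tuple(str(n) for n in x["name"])


def flatten_dense(df, min_count=2):
    """Attach parent_text, keep densely labelled rows, flatten label values."""
    out = df.copy()
    # Parent lookup happens before filtering, so a parent's text survives
    # even when the parent row itself is later dropped.
    id_to_text = out.set_index("message_id")["text"].to_dict()
    out["parent_text"] = out["parent_id"].map(lambda x: id_to_text.get(x, ""))
    # Dense filter: labelled rows whose smallest label count exceeds min_count.
    out = out[out["labels"].map(lambda x: x is not None)]
    out = out[out["labels"].map(lambda x: min(x["count"]) > min_count)]
    # Keep only rows that carry the most common label set.
    names = out["labels"].map(label_names).value_counts().index[0]
    out = out[out["labels"].map(lambda x: label_names(x) == names)].copy()
    # One float column per label.
    for name in names:
        out[name] = out["labels"].map(
            lambda x, name=name: float(x["value"][label_names(x).index(name)])
        )
    return out.reset_index(drop=True)


flat = flatten_dense(df)
print(flat[["message_id", "parent_text", "spam", "quality"]])
```

In this toy table, message `b` is dropped because one of its labels has a count of only 2, and `c` is dropped because it has no labels; `d` still keeps `b`'s text as its `parent_text`.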
https://github.com/LAION-AI/Open-Assistant
```
@article{kopf2023openassistant,
title={OpenAssistant Conversations--Democratizing Large Language Model Alignment},
author={K{\"o}pf, Andreas and Kilcher, Yannic and von R{\"u}tte, Dimitri and Anagnostidis, Sotiris and Tam, Zhi-Rui and Stevens, Keith and Barhoum, Abdullah and Duc, Nguyen Minh and Stanley, Oliver and Nagyfi, Rich{\'a}rd and others},
journal={arXiv preprint arXiv:2304.07327},
year={2023}
}
```
Provider:
tasksource



