davanstrien/cosmopedia_chat
Source: Hugging Face · Updated 2024-03-08 · Indexed 2024-06-22
Download link:
https://hf-mirror.com/datasets/davanstrien/cosmopedia_chat
Resource summary:
---
dataset_info:
  features:
  - name: prompt
    dtype: string
  - name: text_token_length
    dtype: int64
  - name: text
    dtype: string
  - name: seed_data
    dtype: string
  - name: format
    dtype: string
  - name: audience
    dtype: string
  - name: title
    dtype: string
  - name: generated_text
    dtype: string
  splits:
  - name: train
    num_bytes: 5362254
    num_examples: 1188
  download_size: 1902915
  dataset_size: 5362254
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
task_categories:
- text-generation
language:
- en
tags:
- synthetic
pretty_name: Cosmopedia Chat
size_categories:
- 1K<n<10K
---
# Dataset Card for Cosmopedia Chat
<p align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/60107b385ac3e86b3ea4fc34/6mMBW7gBurVT6kYpjX9L8.png" alt="Your Image" width="500">
</p>
## Dataset Details
### Dataset Description
Docs are WIP!
Rough steps to produce this data:
- Start with the [HuggingFaceTB/cosmopedia](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia) dataset
- Select the `khanacademy` config
- Filter by text length
- Remove examples containing certain text, e.g. responses starting with "sure"
- Extract a title from the original prompt in the dataset
- Pass title + text to [NousResearch/Genstruct-7B](https://huggingface.co/NousResearch/Genstruct-7B) to create user/assistant chat pairs from this title + text context
- Profit??
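The filtering and prompt-building steps above can be sketched roughly as follows. This is a minimal illustration, not the script used to build the dataset: the token thresholds, the title regex, and the helper names (`keep_example`, `extract_title`, `build_genstruct_prompt`, `prepare`) are all hypothetical, and the prompt template is an assumption that should be checked against the Genstruct-7B model card. It runs on in-memory rows so it needs no downloads.

```python
import re

# Hypothetical thresholds: the card does not state the real cut-offs.
MIN_TOKENS = 200
MAX_TOKENS = 2000

# Assumption: the title is the first double-quoted string in the prompt.
TITLE_PATTERN = re.compile(r'"([^"]+)"')

# Assumed Genstruct-style template; verify against the model card before use.
GENSTRUCT_TEMPLATE = (
    "[[[Title]]] {title}\n"
    "[[[Content]]] {content}\n\n"
    "The following is an interaction between a user and an AI assistant "
    "that is related to the above text.\n\n"
    "[[[User]]] "
)


def keep_example(row):
    """Apply the length filter and drop responses that open with 'sure'."""
    if not MIN_TOKENS <= row["text_token_length"] <= MAX_TOKENS:
        return False
    return not row["text"].lstrip().lower().startswith("sure")


def extract_title(prompt):
    """Return the first double-quoted string in the prompt, or None."""
    match = TITLE_PATTERN.search(prompt)
    return match.group(1).strip() if match else None


def build_genstruct_prompt(title, text):
    """Fill the assumed template with a title and its source text."""
    return GENSTRUCT_TEMPLATE.format(title=title, content=text)


def prepare(rows):
    """Filter rows, attach a title, and build the generation prompt."""
    prepared = []
    for row in rows:
        if not keep_example(row):
            continue
        title = extract_title(row["prompt"])
        if title is None:
            continue  # no usable title, skip the row
        prepared.append({
            **row,
            "title": title,
            "genstruct_prompt": build_genstruct_prompt(title, row["text"]),
        })
    return prepared


# Made-up rows that mimic the dataset's column names.
sample_rows = [
    {"prompt": 'Write a unit for a textbook on "Intro to fractions".',
     "text": "Fractions describe parts of a whole.", "text_token_length": 500},
    {"prompt": 'Write a unit for a textbook on "Decimals".',
     "text": "Sure, here is the unit you asked for.", "text_token_length": 500},
    {"prompt": 'Write a unit for a textbook on "Ratios".',
     "text": "Too short.", "text_token_length": 50},
]

prepared = prepare(sample_rows)
```

Only the first sample row survives: the second opens with "Sure" and the third falls below the hypothetical minimum length. Each surviving row's `genstruct_prompt` would then be passed to the generation model.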
TODO:
- More curation of what data is included to start. Some of the Cosmopedia data is not very "stand alone"
- Try to remove topics that are unlikely to be useful for chat data
- Remove bad generations
- Parse generations into user/assistant chat format
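The last two TODO items could be approached with a small parser like the one below. It is a sketch under a stated assumption: that each entry in `generated_text` marks turns with `[[[User]]]` and `[[[Assistant]]]`, which this card does not confirm. The function name `parse_generation` is hypothetical. Generations that do not form complete user/assistant pairs return `None`, which gives one simple criterion for dropping bad generations.

```python
import re

# Assumed turn markers; adjust if the real generations use a different format.
TURN_PATTERN = re.compile(r"\[\[\[(User|Assistant)\]\]\]")


def parse_generation(generated_text):
    """Split a generation into chat messages, or return None if malformed."""
    # re.split with one capture group yields:
    # [preamble, role, content, role, content, ...]
    parts = TURN_PATTERN.split(generated_text)
    messages = []
    for role, content in zip(parts[1::2], parts[2::2]):
        content = content.strip()
        if not content:
            return None  # empty turn, treat as a bad generation
        messages.append({"role": role.lower(), "content": content})

    # Require strict user/assistant alternation in complete pairs.
    roles = [message["role"] for message in messages]
    expected = ["user", "assistant"] * (len(roles) // 2)
    if not roles or roles != expected:
        return None
    return messages


example = parse_generation(
    "[[[User]]] What is a fraction? [[[Assistant]]] A part of a whole."
)
```

Any text before the first marker (for instance an echoed title and content block) is ignored, so the parser works whether or not the stored generation includes the prompt.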
Provider:
davanstrien
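Summary of original information
Dataset card for Cosmopedia Chat
Dataset details
Dataset description
Features
- prompt: string
- text_token_length: 64-bit integer
- text: string
- seed_data: string
- format: string
- audience: string
- title: string
- generated_text: string
Splits
- train: 1188 examples, 5362254 bytes in total
Size
- Download size: 1902915 bytes
- Dataset size: 5362254 bytes
Configs
- default: contains the training data files at path data/train-*
Task categories
- Text generation
Language
- English
Tags
- Synthetic
Pretty name
- Cosmopedia Chat
Size category
- 1K<n<10K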



