davanstrien/cosmopedia_chat

Name: davanstrien/cosmopedia_chat
Creator: davanstrien
Published: 2024-03-08 13:56:27
License: No description available

Hugging Face, updated 2024-03-08, indexed 2024-06-22

Download link:

https://hf-mirror.com/datasets/davanstrien/cosmopedia_chat

Resource description:

---
dataset_info:
  features:
  - name: prompt
    dtype: string
  - name: text_token_length
    dtype: int64
  - name: text
    dtype: string
  - name: seed_data
    dtype: string
  - name: format
    dtype: string
  - name: audience
    dtype: string
  - name: title
    dtype: string
  - name: generated_text
    dtype: string
  splits:
  - name: train
    num_bytes: 5362254
    num_examples: 1188
  download_size: 1902915
  dataset_size: 5362254
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
task_categories:
- text-generation
language:
- en
tags:
- synthetic
pretty_name: Cosmopedia Chat
size_categories:
- 1K<n<10K
---

# Dataset Card for Cosmopedia Chat

<p align="center">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/60107b385ac3e86b3ea4fc34/6mMBW7gBurVT6kYpjX9L8.png" alt="Your Image" width="500">
</p>

## Dataset Details

### Dataset Description

Docs are WIP!

Rough steps to produce this data:

- Start with the [HuggingFaceTB/cosmopedia](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia) dataset
- Select the `khanacademy` config
- Filter by text length
- Remove some examples containing certain text, e.g. responses starting with "sure"
- Extract a title from the original prompt in the dataset
- Pass title + text to [NousResearch/Genstruct-7B](https://huggingface.co/NousResearch/Genstruct-7B) to create user/chat pairs from this title + text context
- Profit??

TODO:

- More curation of what data is included to start. Some of the Cosmopedia data is not very "stand alone"
- Try and remove topics that are unlikely to be useful for chat data
- Remove bad generations
- Parse generations into user/assistant chat format.
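
The filtering and title-extraction steps listed above could look roughly like the following. This is a minimal sketch, not the author's actual code: the token thresholds (`MIN_TOKENS`, `MAX_TOKENS`), the helper names, and the assumption that the title appears in the prompt as a quoted phrase after the word "titled" are all hypothetical. It works on plain dictionaries that use the column names from the dataset card, so it runs without downloading anything.

```python
import re
from typing import Optional

# Hypothetical thresholds: the card only says "filter by text length".
MIN_TOKENS = 200
MAX_TOKENS = 2000

# Assumed prompt wording: ... titled "Some title" ...
# The real prompts may need a different pattern.
TITLE_PATTERN = re.compile(r'titled "([^"]+)"')


def extract_title(prompt: str) -> Optional[str]:
    """Return the first quoted phrase following 'titled', or None."""
    match = TITLE_PATTERN.search(prompt)
    return match.group(1) if match else None


def keep_row(row: dict) -> bool:
    """Apply the length filter and drop responses starting with 'sure'."""
    length_ok = MIN_TOKENS <= row["text_token_length"] <= MAX_TOKENS
    starts_with_sure = row["text"].lstrip().lower().startswith("sure")
    return length_ok and not starts_with_sure


def prepare(rows: list) -> list:
    """Filter rows and attach a 'title' field; skip rows with no title."""
    prepared = []
    for row in rows:
        if not keep_row(row):
            continue
        title = extract_title(row["prompt"])
        if title is None:
            continue
        prepared.append({**row, "title": title})
    return prepared


# Made-up rows for illustration only; they are not taken from the dataset.
sample_rows = [
    {
        "prompt": 'Write the new sub-unit titled "Intro to ratios" for a textbook.',
        "text_token_length": 850,
        "text": "Ratios compare two quantities of the same kind.",
    },
    {
        "prompt": 'Write the new sub-unit titled "Unit rates" for a textbook.',
        "text_token_length": 900,
        "text": "Sure, here is a sub-unit about unit rates.",
    },
    {
        "prompt": 'Write the new sub-unit titled "Percents" for a textbook.',
        "text_token_length": 50,
        "text": "Too short to keep.",
    },
]

prepared = prepare(sample_rows)
```

Each surviving row would then supply the title + text context passed to Genstruct-7B; that generation step is not sketched here.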

Provided by:

davanstrien

Summary of original information

Dataset Card for Cosmopedia Chat

Dataset Details

Dataset Description

Features

prompt: string
text_token_length: int64
text: string
seed_data: string
format: string
audience: string
title: string
generated_text: string
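
As an illustration of the feature list above, one record can be checked against the declared column types in plain Python. This is only a sketch: `FIELD_TYPES` and `validate_row` are hypothetical names, and mapping `string` to `str` and `int64` to `int` is an assumption about how rows appear once loaded.

```python
# Column names and dtypes from the dataset card, mapped to Python types.
FIELD_TYPES = {
    "prompt": str,
    "text_token_length": int,  # int64 in the card
    "text": str,
    "seed_data": str,
    "format": str,
    "audience": str,
    "title": str,
    "generated_text": str,
}


def validate_row(row: dict) -> list:
    """Return a list of problems found in one row; empty if the row is fine."""
    problems = []
    for name, expected in FIELD_TYPES.items():
        if name not in row:
            problems.append(f"missing field: {name}")
        elif not isinstance(row[name], expected):
            problems.append(
                f"{name}: expected {expected.__name__}, "
                f"got {type(row[name]).__name__}"
            )
    return problems


# Made-up record for illustration only.
good_row = {
    "prompt": "example prompt",
    "text_token_length": 12,
    "text": "example text",
    "seed_data": "khanacademy",
    "format": "textbook",
    "audience": "general",
    "title": "Example title",
    "generated_text": "example generation",
}

# Same record with one field removed and one wrong type.
bad_row = {k: v for k, v in good_row.items() if k != "title"}
bad_row["text_token_length"] = "12"
```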

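Splits

train: 1188 examples, 5362254 bytes in total

Size

Download size: 1902915 bytes
Dataset size: 5362254 bytes

Configs

default: contains the training data files at path data/train-*

Task categories

Text generation

Language

English

Pretty name

Cosmopedia Chat

Size category

1K<n<10K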
