
david-thrower/cosmopedia-100k-simple-text

Source: Hugging Face. Updated 2025-01-29. Indexed 2025-02-15.
Download link:
https://hf-mirror.com/datasets/david-thrower/cosmopedia-100k-simple-text
Description:
Cosmopedia is a large-scale synthetic dataset created by HuggingFaceTB, containing over 30 million files and 25 billion tokens. It was generated by Mixtral-8x7B-Instruct-v0.1, covers a wide range of topics, and is divided into 8 splits based on the seed samples used, including web samples, educational resources, and instruction-tuning datasets. It aims to support research on synthetic data. HuggingFaceTB/cosmopedia-100k is a miniature subset of Cosmopedia, suitable for small-scale testing and for training smaller models. This fork reformats that subset as a simple iterable list of strings for single-turn chat model training.
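The description does not say how each record was flattened into a string, so the following is only a minimal sketch of one plausible approach. The helper name `to_simple_strings`, the `prompt` and `text` field names, and the toy records are assumptions made for illustration; they are not taken from this listing and may not match the fork's actual processing.

```python
from typing import Iterable, List, Mapping


def to_simple_strings(
    records: Iterable[Mapping[str, object]],
    prompt_key: str = "prompt",  # assumed field name, not confirmed by the listing
    text_key: str = "text",      # assumed field name, not confirmed by the listing
) -> List[str]:
    """Flatten dict-like records into one plain string per sample."""
    samples: List[str] = []
    for record in records:
        # Treat missing or None fields as empty strings.
        prompt = str(record.get(prompt_key) or "").strip()
        text = str(record.get(text_key) or "").strip()
        if not text:
            # Skip records that carry no generated text.
            continue
        # Join prompt and completion with a blank line when both exist.
        samples.append(f"{prompt}\n\n{text}" if prompt else text)
    return samples


# Toy records standing in for real rows (hypothetical content).
toy_records = [
    {"prompt": "Explain photosynthesis simply.", "text": "Plants turn light into food."},
    {"prompt": "", "text": "A short story about a fox."},
    {"prompt": "Unused prompt", "text": ""},
]

simple_texts = to_simple_strings(toy_records)
for sample in simple_texts:
    print(repr(sample))
```

With the Hugging Face `datasets` library installed, rows obtained from `load_dataset` could be passed to the same helper, provided the field names are first checked against the actual dataset schema.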
Provider:
david-thrower