cosmopedia

Name: cosmopedia
Creator: maas
Published: 2026-05-01 12:58:04
License: not specified

ModelScope community. Updated: 2026-05-01. Indexed: 2024-08-31.

Download link:

https://modelscope.cn/datasets/swift/cosmopedia


Resource description:

# Cosmopedia v0.1

<center>
<img src="https://cdn-uploads.huggingface.co/production/uploads/61c141342aac764ce1654e43/8a9ZTW8sC4utjEPIrZegN.png" alt="Cosmopedia v0.1" width="600" height="300">
Image generated by DALL-E, the <a href="https://huggingface.co/datasets/HuggingFaceTB/miscellaneous/blob/main/cosmopedia_dalle_prompt_by_mixtral.txt">prompt</a> was generated by Mixtral-8x7B-Instruct-v0.1
</center>

**Note: Cosmopedia v0.2 is available at [smollm-corpus](https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus)**

```
User: What do you think "Cosmopedia" could mean? Hint: in our case it's not related to cosmology.

Mixtral-8x7B-Instruct-v0.1: A possible meaning for "Cosmopedia" could be an encyclopedia or collection of information about different cultures, societies, and topics from around the world, emphasizing diversity and global connectedness.
```

**Cosmopedia** is a dataset of synthetic textbooks, blogposts, stories, posts and WikiHow articles generated by [Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1). The dataset contains over **30 million files** and **25 billion tokens**, making it the largest open synthetic dataset to date.

It covers a variety of topics: we tried to map the world knowledge present in web datasets like [RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb) and [RedPajama](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T), and to generate synthetic content that covers it. This is v0.1 of Cosmopedia, with ample room for improvement and for topics to be covered more comprehensively. We hope this dataset will help the community's research efforts in the increasingly intriguing domain of synthetic data. You can find a clickable map by Nomic at [https://atlas.nomic.ai/map/cosmopedia](https://atlas.nomic.ai/map/cosmopedia).

This work is inspired by the great work of [Phi1.5](https://huggingface.co/papers/2309.05463).
You can find more details about the dataset in our **blog post**: https://huggingface.co/blog/cosmopedia

# TL;DR

This is a synthetic dataset of 30M samples generated by [Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1). It contains 8 splits, defined by the source of the seed samples used in the prompts; the model is asked to generate content related to each seed. The splits range from web samples to educational resources like Stanford, OpenStax and KhanAcademy. We also use some instruction-tuning datasets as seed samples for stories.

Here's how you can load a dataset split:

```python
from datasets import load_dataset

# Load the "stories" split, using 12 processes for download and preparation
ds = load_dataset("HuggingFaceTB/cosmopedia", "stories", split="train", num_proc=12)
ds[0]
```

If you want a smaller subset of the dataset, check [Cosmopedia-100k](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia-100k). We also trained a 1.8B model on Cosmopedia: [Cosmo-1B](https://huggingface.co/HuggingFaceTB/cosmopedian-1b).

# Dataset splits

The prompts are all based on the same idea: take a seed sample (for example an extract from a web page) and ask the model to generate new content (textbook, story, blogpost, etc.) related to that seed sample.

The dataset consists of 8 splits, defined by the source of the seed data used in the split. Some seed samples may appear more than once when we ask for a different style (e.g. academic textbook vs. blogpost) or audience (e.g. young children vs. college students). For example, each sample in `stanford` was used with 4 different prompt styles and audiences; check the `format` and `audience` columns for more details.

We observed that tailoring the audience and prompt style accordingly significantly enhances diversity; the proportion of duplicates eliminated via MinHash was under 1%.
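To make the MinHash figure above more concrete, here is a minimal, standard-library sketch of how near-duplicate detection with MinHash works in general. It is an illustration only: the helper names, the 3-word shingle size, the 64 hash functions and the use of salted BLAKE2b are all assumptions made for this example, not the settings or code used to build Cosmopedia.

```python
import hashlib
import re


def shingles(text: str, n: int = 3) -> set[str]:
    """Return the set of lowercase word n-grams (shingles) in `text`."""
    words = re.findall(r"\w+", text.lower())
    if len(words) <= n:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash_signature(items: set[str], num_hashes: int = 64) -> list[int]:
    """Keep the minimum hash value per seeded hash function.

    Each seed stands in for one random permutation of the shingle universe.
    (num_hashes=64 is an arbitrary choice for this sketch.)
    """
    if not items:
        raise ValueError("cannot build a signature for an empty set")
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


doc_a = "The water cycle describes how water moves between the oceans, the air and the land."
doc_b = "The water cycle describes how water moves between the oceans, the air and the soil."
doc_c = "Medieval castles were built with thick stone walls to resist long sieges by enemies."

sig_a, sig_b, sig_c = (minhash_signature(shingles(d)) for d in (doc_a, doc_b, doc_c))
print("near-duplicate pair:", estimated_jaccard(sig_a, sig_b))
print("unrelated pair:     ", estimated_jaccard(sig_a, sig_c))
```

Pairs whose estimated similarity exceeds a chosen threshold would be treated as duplicates. At dataset scale, signatures are usually bucketed with locality-sensitive hashing rather than compared pairwise.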
The graph below shows the distribution of seed datasets, generation formats and audiences in Cosmopedia:

<center>
<img src="https://cdn-uploads.huggingface.co/production/uploads/61c141342aac764ce1654e43/V7MGV2OrCfLO5TxKPUXs4.png" alt="distributions" width="1000" height="500">
</center>

Below are the 8 splits:

- `web_samples_v1`: this and `web_samples_v2` are the largest splits (they make up ~75% of the dataset), where we use samples from an internal web dataset similar to [RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb). These samples were selected based on their topic, using a clustering method explained in the section below.
- `web_samples_v2`: similar to `web_samples_v1`, using different samples. We call it v2 because we refined the prompts for this split (e.g. asking for more depth over breadth in the concept explanations, and requesting the model not to generate a title and introductory sentences, which might be redundant across samples).
- `stanford`: we scraped course outlines from [stanford.edu](https://explorecourses.stanford.edu/search?q=all%20courses), and each time we prompt the model with one of the course units.
- `stories`: we generated stories to add some commonsense and day-to-day knowledge to the dataset. For this split we use samples from [UltraChat](https://huggingface.co/datasets/stingning/ultrachat) (only the questions about the world [subset](https://huggingface.co/datasets/loubnabnl/ultrachat_questions_about_world)) and [OpenHermes2.5](https://huggingface.co/datasets/teknium/OpenHermes-2.5). These are synthetic instruction-tuning datasets that are already curated and cover a wide range of topics.
- `wikihow`: in this split, we asked the model to generate WikiHow articles from WikiHow titles that we scraped. The list is available [here](https://github.com/huggingface/cosmopedia/blob/main/prompts/wikihow/wikihowcom-20231012-titles.txt).
  Note that you can find more WikiHow articles in the other splits by looking for them in the `format` column.
- `openstax`: we scraped course outlines with unit introductions from [OpenStax](https://openstax.org/), a resource suggested by the [AFAIK](https://afaik.io/) team.
- `khanacademy`: we scraped the outlines for the courses on [KhanAcademy](https://www.khanacademy.org), and asked the model to generate a textbook for each.
- `automathtext`: to improve the science knowledge of the model, we use samples from the [AutoMathText](https://huggingface.co/datasets/math-ai/AutoMathText/) dataset as seed samples. The dataset covers more than just math. See this clustering [plot](https://huggingface.co/datasets/HuggingFaceTB/miscellaneous/blob/main/AMT_plots/topics_distpng.png) we made.

### Dataset features

The dataset has the following features:

- `prompt`: the prompt we used to generate the content with Mixtral-8x7B-Instruct-v0.1.
- `text`: the synthetic generated content.
- `seed_data`: the prompts include some text from another dataset or an external source; `seed_data` is the name of that dataset (e.g. web, Stanford courses...).
- `token_length`: the number of tokens in `text`, computed using [Mistral-7B](https://huggingface.co/mistralai/Mistral-7B-v0.1)'s tokenizer.
- `format`: the style of `text`, for example a textbook, a blogpost or a story. It can also be inferred from the prompt.
- `audience`: the target audience defined in the prompt.

# Dataset creation

The "Dataset splits" section already provides an overview of the data creation pipeline. In this section, we explain the topic clustering method for web samples and our iterative process for refining the prompts, in addition to decontamination.

### Topic clustering

Our goal was to generate a vast quantity of synthetic data covering a wide range of topics (essentially, anything useful found on the web) in a cleaner format like textbooks. A natural strategy was to begin with web samples, using them as seeds for the generation.
This approach, employed by Li et al. in [Phi-1.5](https://huggingface.co/papers/2309.05463), appears to be the most scalable method for synthetic data generation, given the availability of web datasets with trillions of tokens.

The prompted model uses an extract from these seed samples as a reference for generation, so the topic might matter more than the actual content of the file. To filter out less relevant topics and to provide the model with context for generating content, we first clustered millions of files from a web dataset. Then we prompted Mixtral 8x7B with extracts from 10 random samples in each cluster and asked it to find the topic they have in common and to provide an educational score for that topic. The clusters and their topics can be inspected in this [demo](https://huggingface.co/spaces/HuggingFaceTB/inspect_web_clusters), and the code is available in [text-clustering](https://github.com/huggingface/text-clustering).

The educational score seems to work for "very uneducational" topics like adult content and "highly educational" topics like College Mathematics, but isn't very informative in between. We therefore manually inspected the 145 clusters we found and discarded 35 of them. The final list of topics is available [here](https://github.com/huggingface/cosmopedia/blob/dd5cd1f7fcfae255c9cfbe704ba2187965523457/prompts/web_samples/filter_and_classify_clusters.py#L8).

We don't do any further filtering inside the clusters. We include the topic of the sample in the prompt 100% of the time for `web_samples_v1`, but only 50% of the time for `web_samples_v2`, where we tried to refine the prompts, in case the topic isn't accurate or the topic list isn't comprehensive.
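The two prompting steps described above (labelling a cluster from 10 random extracts, and including the topic in the generation prompt only part of the time) can be sketched as follows. This is a hypothetical illustration: the function names, the prompt wording and the 500-character extract length are assumptions made for this example, not the prompts or code from the Cosmopedia repository.

```python
import random


def build_topic_prompt(cluster_samples, k=10, max_chars=500, rng=None):
    """Build a topic-labelling prompt from up to `k` random extracts of one cluster.

    The wording below is made up for illustration; max_chars is an assumed limit.
    """
    rng = rng or random.Random()
    chosen = rng.sample(cluster_samples, min(k, len(cluster_samples)))
    extracts = "\n\n".join(
        f"Extract {i + 1}: {text[:max_chars]}" for i, text in enumerate(chosen)
    )
    return (
        "Here are extracts from web pages that belong to the same cluster.\n\n"
        f"{extracts}\n\n"
        "What topic do they have in common? "
        "Also give an educational score for that topic."
    )


def maybe_add_topic(base_prompt, topic, include_prob, rng):
    """Append the cluster topic to a generation prompt with probability `include_prob`."""
    if rng.random() < include_prob:
        return f"{base_prompt}\nTopic: {topic}"
    return base_prompt


rng = random.Random(0)
cluster = [f"Web page {i} about photosynthesis and plant biology." for i in range(25)]
print(build_topic_prompt(cluster, rng=rng)[:200])

# include_prob=1.0 mirrors web_samples_v1; include_prob=0.5 mirrors web_samples_v2
base = "Write a textbook section related to the extract below."
print(maybe_add_topic(base, "Plant Biology", 1.0, rng))
print(maybe_add_topic(base, "Plant Biology", 0.5, rng))
```

Dropping the topic half of the time means a mislabelled cluster only affects part of the prompts built from it, which matches the stated motivation for the 50% setting.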
Below are the clusters found in Cosmopedia:

<center>
<img src="https://cdn-uploads.huggingface.co/production/uploads/61c141342aac764ce1654e43/jMKGaE_UnEfH3j8iZYXVN.png" alt="Cosmopedia clusters" width="1200" height="750">
Cosmopedia clusters.
</center>

### Diversity

We find that when using the same seed sample multiple times, changing the generation style, the audience and the target format results in different generations that cover the same topic from different angles. For example, when asking the model for a children's textbook, we needed to remind it that it can't use complex concepts and that the tone should be adapted to children. The same goes for textbooks for college students vs. researchers: we had to emphasize the level of depth we wanted for each, and how academic the textbooks should be.

By carefully iterating on the prompts using [HuggingChat](https://huggingface.co/chat/) and then generating a few hundred samples, we managed to reduce the redundancy. For example, we noticed that the model always started the stories with "Once upon a time" and the forum posts with "A few years back". Asking it to explicitly avoid these sentences when starting the generation results in more diverse beginnings (don't worry, "Once upon a time" still appears in stories!). The same goes for blogposts and textbooks, where the introductory sentences were initially repetitive. Running MinHash deduplication on the splits detects less than 1% of the files as duplicates.

### Decontamination

Given how we generate synthetic content, there is a possibility that the seed samples or the model's training data are contaminated with benchmark data. Therefore, we run a decontamination pipeline to make sure we don't have any samples from the test benchmarks in our dataset.

We use a 10-gram overlap to retrieve potentially contaminated samples, similarly to [Phi-1](https://huggingface.co/papers/2306.11644).
After retrieving the candidates, we run a diff between the dataset sample and the benchmark sample using `difflib.SequenceMatcher` and discard the sample if `len(matched_substrings)/len(benchmark_sample) > 0.5`. We run decontamination against all the benchmarks we evaluated the Cosmo-1B model on: MMLU, HellaSwag, PIQA, SIQA, Winogrande, OpenBookQA, ARC-easy, ARC-challenge.

We report the number of contaminated samples removed from each dataset split, as well as the number of unique benchmark samples that they correspond to (in brackets):

| Dataset group | ARC Easy | ARC Challenge | BoolQ | HellaSwag | MMLU | OpenBookQA | PIQA | WinoGrande |
|---------------|----------|---------------|-------|-----------|------|------------|------|------------|
| web_samples_v1 + web_samples_v2 + stanford + openstax | 30 (13) | 19 (3) | 386 (41) | 6 (5) | 1 (1) | 0 (0) | 5 (3) | 0 (0) |
| auto_math_text + khanacademy | 4 (4) | 13 (2) | 34 (7) | 1 (1) | 0 (0) | 0 (0) | 0 (0) | 0 (0) |
| stories | 33 (20) | 20 (12) | 27 (21) | 3 (3) | 1 (1) | 2 (2) | 6 (4) | 3 (2) |

## Code

The code for topic clustering of the web samples, building the prompts, content generation and data deduplication & decontamination can be found in the [Cosmopedia GitHub repository](https://github.com/huggingface/cosmopedia).

## Citation

```
@software{benallal2024cosmopedia,
  author = {Ben Allal, Loubna and Lozhkov, Anton and Penedo, Guilherme and Wolf, Thomas and von Werra, Leandro},
  title = {Cosmopedia},
  month = February,
  year = 2024,
  url = {https://huggingface.co/datasets/HuggingFaceTB/cosmopedia}
}
```
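As a supplement to the Decontamination section, the two-stage check it describes (10-gram overlap to find candidates, then a `difflib.SequenceMatcher` diff with a 0.5 threshold) could look roughly like the sketch below. The function names are hypothetical, and two details are assumptions: n-grams are built over lowercased word tokens, and `len(matched_substrings)` is read as the total number of matched characters. The actual pipeline lives in the Cosmopedia repository and may differ.

```python
import re
from difflib import SequenceMatcher


def word_ngrams(text, n=10):
    """Set of lowercase word n-grams; empty if the text has fewer than n words."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def is_contaminated(sample, benchmark, n=10, threshold=0.5):
    """Return True if `sample` should be discarded as overlapping `benchmark`."""
    # Stage 1: only samples sharing at least one n-gram with the benchmark
    # are candidates for the more expensive diff.
    if not (word_ngrams(sample, n) & word_ngrams(benchmark, n)):
        return False
    # Stage 2: character-level diff; sum the sizes of all matching blocks
    # (assumed interpretation of "matched substrings").
    matcher = SequenceMatcher(None, sample, benchmark, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(benchmark) > threshold


benchmark_item = "Which of the following best explains why the sky appears blue during the day?"
leaky_sample = "Here is a practice question. " + benchmark_item + " Think about it carefully."
clean_sample = "A short story about a cat who learns to share toys with friends at school."

print(is_contaminated(leaky_sample, benchmark_item))
print(is_contaminated(clean_sample, benchmark_item))
```

Running the cheap n-gram check first keeps the quadratic-time diff limited to the few candidate pairs that actually share a long word sequence.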


Provider:

maas

Created:

2024-06-05

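Collected summary

Dataset introduction

Background and challenges

Background overview

Cosmopedia v0.1 is a synthetic dataset generated by Mixtral-8x7B-Instruct-v0.1. It contains over 30 million files and 25 billion tokens, spanning textbooks, blog posts, stories and WikiHow articles on a wide range of topics. The dataset is divided into 8 splits according to the source of the seed data, has been deduplicated and decontaminated, and is intended to support research on synthetic data.

The above content was collected and summarized by 遇见数据集.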