Chinese Cosmopedia: Chinese Synthetic Encyclopedia Dataset
Source: ModelScope community. Updated: 2026-05-24. Indexed: 2024-09-28.
Download link:
https://modelscope.cn/datasets/opencsg/chinese-cosmopedia
# **Chinese Cosmopedia Dataset** [[中文]](#chinese) [[English]](#english)
<a id="english"></a>
<p align="center">
<img width="600px" alt="OpenCSG" src="./logo.jpg">
</p>
<p align="center"><a href="https://portal.opencsg.com/models">[OpenCSG Community]</a> <a href="https://github.com/yuyijiong/fineweb-edu-chinese">[👾github]</a> <a href="https://cdn-uploads.huggingface.co/production/uploads/64c71b27d43e4dee51a8b31a/HU6vz21qKTEmUBCWqCFh9.jpeg">[wechat]</a> <a href="https://twitter.com/OpenCsg">[Twitter]</a> </p>
</div>
[📖Technical Report](https://arxiv.org/abs/2501.08197)
The Chinese Cosmopedia dataset contains a total of 15 million entries, approximately 60B tokens. The two core elements for constructing this synthetic dataset are seed data and prompts. Seed data determines the theme of the generated content, while prompts define the style of the data (such as textbooks, stories, tutorials, or children's books). The dataset draws from diverse sources, including Chinese Wikipedia, Baidu Baike, knowledge Q&A platforms, and technical blogs, ensuring both the breadth and authority of the content. The generated data covers various formats, including university textbooks, middle school textbooks, children's stories, general stories, and WikiHow-style tutorials. By generating multiple styles for each seed data entry, the dataset is not only suitable for academic research but also widely applicable in education, entertainment, and technology fields.
# Data Sources and Categories
The Chinese Cosmopedia dataset draws its seed data from a range of Chinese content platforms and knowledge bases, including:
- **Chinese Wikipedia**: Provides a large number of accurate, authoritative knowledge articles.
- **Baidu Baike**: As one of the most influential encyclopedia platforms in China, it supplies extensive Chinese knowledge resources for the dataset.
- **Zhihu Q&A**: Content extracted from this interactive Q&A platform, covering discussions and insights across various fields.
- **Technical Blogs**: Articles from technical communities, featuring in-depth discussions on topics ranging from programming to artificial intelligence.
These seed data sources form the core content foundation of the Chinese Cosmopedia dataset, ensuring coverage of knowledge across multiple domains.
# Data Formats and Styles
The Chinese Cosmopedia dataset places special emphasis on the style and format of the generated content, encompassing various text types from academic to practical applications. The main categories include:
- **University Textbooks**: Structured rigorously, delving into core concepts of various university-level disciplines.
- **Middle School Textbooks**: Educational content tailored for middle school students, concise and easy to understand, focusing on the transmission of basic knowledge.
- **Preschool Stories**: Designed for 5-year-old children, using simple language to help young kids understand the world and interpersonal relationships.
- **General Stories**: Using engaging plots and character dialogues to vividly illustrate specific concepts.
- **WikiHow-style Tutorials**: Detailed step-by-step guides to help users complete specific tasks.
Each text type has been adjusted in style to suit its application scenario and target audience. As a result, the Chinese Cosmopedia dataset can be applied not only to academic research but also across fields such as education, entertainment, and technology.
# Statistics
Seed Data Sources: {'blog': 2,111,009, 'baike': 10,939,121, 'wiki': 173,671, 'knowledge QA': 2,291,547}
<p align="center">
<img width="900px" alt="OpenCSG" src="./datasource.png">
</p>
Data Formats: {'preschool story': 1,637,760, 'normal story': 3,332,288, 'middle school textbook': 4,677,397, 'college textbook': 3,127,902, 'wikihow': 2,740,001}
<p align="center">
<img width="600px" alt="OpenCSG" src="./datastyle.png">
</p>
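As a quick sanity check on the figures above, the following minimal Python sketch confirms that the two breakdowns add up to the same total and converts the counts to percentage shares. The counts are copied from this section; the helper name `shares` is illustrative, and the average-length line assumes the "approximately 60B tokens" figure is exact, so treat that result as a rough estimate only.

```python
# Counts copied from the Statistics section of this card.
seed_sources = {
    "blog": 2_111_009,
    "baike": 10_939_121,
    "wiki": 173_671,
    "knowledge QA": 2_291_547,
}
data_formats = {
    "preschool story": 1_637_760,
    "normal story": 3_332_288,
    "middle school textbook": 4_677_397,
    "college textbook": 3_127_902,
    "wikihow": 2_740_001,
}


def shares(counts):
    """Return each category's share of the total as a percentage (1 decimal)."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


total_by_source = sum(seed_sources.values())
total_by_format = sum(data_formats.values())

# Both breakdowns should describe the same set of entries.
assert total_by_source == total_by_format

print(f"total entries: {total_by_source:,}")
for label, counts in (("seed source", seed_sources), ("data format", data_formats)):
    for name, pct in shares(counts).items():
        print(f"{label:12s} {name:24s} {pct:5.1f}%")

# Rough average length, ASSUMING the "approximately 60B tokens" figure is exact.
approx_tokens = 60e9
print(f"~{approx_tokens / total_by_source:,.0f} tokens per entry on average")
```

Both dictionaries sum to 15,515,348 entries, which matches the rounded "15 million" figure quoted in the overview.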
# Data Generation and Model
The data generation of the Chinese Cosmopedia dataset is based on the proprietary OpenCSG-Wukong-Enterprise-Long model developed by the OpenCSG team. This model boasts robust long-text generation capabilities, ensuring the coherence and depth of the generated content. During the data generation process, the OpenCSG team designed specialized prompts for each text type and content category to ensure that the generated data's style matches its intended format. For example, prompts for textbook content guide the model to produce rigorous and in-depth academic texts, while prompts for story content inspire the creation of vivid and engaging narratives.
Prompts Used for Generating Various Data Formats:
#### University Textbook
This is an excerpt from a webpage: "{}".
Please write a sufficiently detailed textbook chapter for college students that relates to one or more concepts in the given excerpt. You do not need to include all content from the excerpt; only extract parts suitable for textbook content, and you may freely add other relevant knowledge. Do not merely list concepts; instead, thoroughly develop and explore each concept in detail, as we prioritize in-depth understanding of the topic over breadth. Requirements:
1. Rigor: Ensure comprehensive coverage of concepts/chapters.
2. Engagement: Write in an academic, professional, and compelling tone to attract interest.
3. Application: Incorporate specific practical examples. For instance, include formulas and rigorous proofs in calculus, key dates and figures in history, and code snippets in computer operations.
4. No references are needed. The content must not contain advertisements or privacy-related information.
Please note that the content is intended for college students, who may have some basic knowledge but are not experts in the field. The content should be detailed and thought-provoking.
Start writing the textbook immediately, do not use images, and only output the textbook content.
#### Middle School Textbook
Webpage excerpt: "{}".
Create educational content related to one or more concepts in the above webpage excerpt, targeting middle school students, and make it as long and detailed as possible. You may freely add other relevant knowledge. Do not merely list concepts; instead, thoroughly develop and explore each concept in detail, as we prioritize in-depth understanding of the topic over breadth, and you do not need to include all content from the excerpt.
Do not use complex college-level topics such as calculus, as these are generally not part of middle school curricula. If the topic falls into such categories, find a simpler scientific alternative to explain using everyday examples. For example, if the topic is "linear algebra", you could discuss how to solve puzzles by arranging objects into rows and columns.
Avoid technical jargon and LaTeX, and only discuss middle school-level topics. The content must not contain advertisements or privacy-related information.
Start writing the educational content directly, and only output the educational content.
#### General Story
Write an engaging story related to the following text snippet: "{}".
You do not need to mention all content in the snippet; only use it for inspiration and unleash your creativity! You may add other knowledge. The story should include:
1. Niche concepts or interests: In-depth exploration of specific concepts, hobbies, interests, or humorous situations.
2. Unexpected plot twists or compelling conflicts: Introduce challenging situations or dilemmas.
3. Dialogue: The story must include at least one meaningful dialogue to reveal character depth, advance the plot, or uncover key parts of the mystery.
4. Reflection and insight: End with an educational new understanding or enlightening conclusion.
5. Characters in the story should use Chinese-style names. Do not include advertisements or privacy-related information.
Start telling the story immediately, and only output the story content.
#### Preschool Story
Webpage excerpt: "{}"
Create an educational children's story related to one or more concepts in the above webpage excerpt, targeting 5-year-old children who have zero knowledge of the world and interpersonal interactions.
You do not need to mention all content in the snippet; only use it for inspiration and unleash your creativity. The story should use simple terminology. You may add additional knowledge to aid understanding. Use easy-to-understand examples, and incorporate questions that 5-year-old children might ask along with their answers. The story should cover daily behaviors and the use of common items.
Do not use complex college-level topics such as calculus, as these are generally not understandable to young children. If the topic falls into such categories, find a simpler scientific alternative to explain using everyday examples. For example, if the topic is "linear algebra", you could discuss how to solve puzzles by arranging objects into rows and columns.
Start writing the story directly, and only output the story content.
#### WikiHow Tutorial
Webpage excerpt: "{}".
Write a long, highly detailed tutorial in the WikiHow style that is relevant to this webpage excerpt. The tutorial must include in-depth explanations of each step and how it helps achieve the desired outcome. You may freely add other relevant knowledge. Ensure clarity and practicality so that readers can easily follow the tutorial to complete the task. The content must not contain advertisements or privacy-related information. Do not use images. Start writing the tutorial directly.
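Each prompt above carries a `"{}"` placeholder where the seed excerpt goes. The sketch below shows one way a single seed could be expanded into one prompt per style. It rests on several assumptions: that the placeholder is filled with Python's `str.format` (the card does not say which mechanism was actually used), that the style keys follow the labels in the Statistics section, and that the helper names `build_prompt` and `build_all_prompts` and the sample seed are hypothetical. Only the opening line of each prompt is reproduced; the remaining instructions would be passed in via `instructions`.

```python
from __future__ import annotations

# Opening line of each prompt, copied from this section. The full prompts are
# longer; only the line that carries the "{}" placeholder is reproduced here.
# ASSUMPTION: the keys reuse the labels from the Statistics section.
PROMPT_OPENERS = {
    "college textbook": 'This is an excerpt from a webpage: "{}".',
    "middle school textbook": 'Webpage excerpt: "{}".',
    "normal story": 'Write an engaging story related to the following text snippet: "{}".',
    "preschool story": 'Webpage excerpt: "{}"',
    "wikihow": 'Webpage excerpt: "{}".',
}


def build_prompt(style: str, seed_text: str, instructions: str = "") -> str:
    """Fill the seed text into the style's opener and append the instructions.

    ASSUMPTION: "{}" is filled with str.format; this is an illustration, not
    the pipeline the OpenCSG team actually ran.
    """
    opener = PROMPT_OPENERS[style].format(seed_text)
    return f"{opener}\n{instructions}".rstrip()


def build_all_prompts(seed_text: str) -> dict[str, str]:
    """Build one prompt per style for the same seed excerpt."""
    return {style: build_prompt(style, seed_text) for style in PROMPT_OPENERS}


if __name__ == "__main__":
    seed = "矩阵是按行和列排列的数字表。"  # hypothetical seed excerpt
    for style, prompt in build_all_prompts(seed).items():
        print(f"[{style}] {prompt}")
```

Because the seed is passed as an argument to `format` rather than spliced into the template string, braces inside the seed text are left untouched.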
## License Agreement
Usage of the Chinese Cosmopedia dataset must comply with the OpenCSG Community License. The Chinese Cosmopedia dataset supports commercial use. If you intend to use the OpenCSG model or its derivatives for commercial purposes, you must abide by the terms and conditions outlined in both the OpenCSG Community License and the Apache 2.0 License. For commercial usage, please send an email to lorraineg@opencsg.com and obtain prior permission.
**We warmly invite developers and researchers interested in this field to follow and engage with the community, working together to advance technology. Stay tuned for the open-source release of the dataset!**
## Citation
@misc{yu2025opencsgchinesecorpusseries,
title={OpenCSG Chinese Corpus: A Series of High-quality Chinese Datasets for LLM Training},
author={Yijiong Yu and Ziyun Dai and Zekun Wang and Wei Wang and Ran Chen and Ji Pei},
year={2025},
eprint={2501.08197},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2501.08197},
}
Provider:
maas
Created:
2024-09-25