opencsg/chinese-cosmopedia
Source: Hugging Face | Updated: 2025-01-15 | Indexed: 2024-12-14
Download link:
https://hf-mirror.com/datasets/opencsg/chinese-cosmopedia
Description:
The Chinese Cosmopedia dataset contains 15 million entries totaling approximately 60B tokens. A synthetic dataset of this kind is built from two key elements: seed data and prompts. The seed data determines the topic of the generated content, while the prompt determines its style (textbook, story, tutorial, or children's book). The seeds are drawn from varied sources, including Chinese Wikipedia, Chinese Baike, knowledge Q&A sites, and technical blogs, which is intended to give the content both breadth and authority. The generated data spans several styles: university textbooks, middle school textbooks, children's stories, ordinary stories, and WikiHow-style tutorials. Because multiple styles are generated for each piece of seed data, the dataset is suited to academic research and is also applicable in education, entertainment, and technology.
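The pairing of seed data with style prompts can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline OpenCSG used: the five style names follow the description, but the template wording, the `STYLE_TEMPLATES` table, the `build_prompts` function, and the sample seed are all hypothetical.

```python
from itertools import product

# Hypothetical style templates. The real prompts used to build the
# dataset are not given in this listing; these are placeholders.
STYLE_TEMPLATES = {
    "university_textbook": "Write a university textbook section about: {seed}",
    "middle_school_textbook": "Write a middle school textbook section about: {seed}",
    "children_story": "Write a story for young children about: {seed}",
    "story": "Write a short story about: {seed}",
    "wikihow_tutorial": "Write a WikiHow-style step-by-step tutorial about: {seed}",
}


def build_prompts(seeds, templates=STYLE_TEMPLATES):
    """Pair every seed with every style template.

    The seed fixes the topic and the template fixes the style, so each
    seed yields one prompt per style.
    """
    return [
        {"seed": seed, "style": style, "prompt": template.format(seed=seed)}
        for seed, (style, template) in product(seeds, templates.items())
    ]


if __name__ == "__main__":
    # Hypothetical seed text standing in for an encyclopedia or blog excerpt.
    for item in build_prompts(["photosynthesis"]):
        print(item["style"], "->", item["prompt"])
```

Each resulting prompt would then be sent to a text-generation model; that step is omitted here because the listing does not say which model or API was used.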
Provider:
opencsg



