MAP-CC
Source: ModelScope community | Updated: 2025-12-05 | Indexed: 2024-05-15
Download link:
https://modelscope.cn/datasets/m-a-p/MAP-CC
Resource description:
# MAP-CC
[**🌐 Homepage**](https://chinese-tiny-llm.github.io) | [**🤗 MAP-CC**](https://huggingface.co/datasets/m-a-p/MAP-CC) | [**🤗 CHC-Bench**](https://huggingface.co/datasets/m-a-p/CHC-Bench) | [**🤗 CT-LLM**](https://huggingface.co/collections/m-a-p/chinese-tiny-llm-660d0133dff6856f94ce0fc6) | [**📖 arXiv**](https://arxiv.org/abs/2404.04167) | [**GitHub**](https://github.com/Chinese-Tiny-LLM/Chinese-Tiny-LLM)
An open-source Chinese pretraining dataset with a scale of 800 billion tokens, offering the NLP community high-quality Chinese pretraining data.
## Disclaimer
This model, developed for academic purposes, employs rigorously compliance-checked training data to uphold the highest standards of integrity and compliance. Despite our efforts, the inherent complexities of data and the broad spectrum of model applications prevent us from ensuring absolute accuracy or appropriateness of the model outputs in every scenario.
It is essential to highlight that our model and its associated training data are intended solely for scholarly research. We explicitly disclaim any liability for problems that may arise from improper use, interpretation errors, unlawful activities, the dissemination of false information, or any data security issues related to the utilization of our model or its training data.
We strongly encourage users to report any concerns related to data misuse, security breaches, or potential infringement issues directly to us for immediate investigation and resolution.
### Contact: {`ge.zhang@uwaterloo.ca; duxinrun2000@gmail.com`}
Our commitment to responsible data sharing and the security of our academic tools is paramount. We thank you for your cooperation in maintaining the ethical use of this technology.
## License
The MAP-CC Dataset is made available under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License ([CC BY-NC-ND 4.0](LICENSE)).
By using the MAP-CC Dataset, you accept and agree to be bound by the terms and conditions of the CC BY-NC-ND 4.0 License. This license allows users to share (copy and redistribute the material in any medium or format) the MAP-CC Dataset for non-commercial purposes only, and with no modifications or derivatives, as long as proper attribution is given to the creators. For further details, please refer to the [LICENSE](LICENSE) file.
We chose the CC BY-NC-ND 4.0 License for the MAP-CC Dataset to facilitate academic and educational use, promoting the spread of knowledge while protecting the work of the creators from unauthorized commercial use or modification.
## Usage Instructions
After downloading the parts of the dataset, you can concatenate them into a single file for each split of the dataset using the following command in a UNIX-like terminal:
```bash
cat [split].gz.part* > [split].gz
```
Replace `[split]` with the name of the dataset component you want to merge (`zh-cc`, `zh-baike`, `zh-papers`, `zh-books`, or `zh-others`). After merging, decompress the `.gz` file to access the dataset's contents.
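The merge step above can be wrapped in a small helper that also verifies the result before you decompress it. This is a minimal sketch, not part of the official tooling: the function name `merge_split` is hypothetical, and it assumes the part suffixes sort lexicographically in the order in which they must be joined (which is how the shell expands the `part*` glob).

```bash
#!/usr/bin/env bash
# Hypothetical helper: merge the downloaded parts of one split and
# verify the merged archive. Assumes parts are named <split>.gz.part*
# and that their suffixes sort in the correct join order.
merge_split() {
  local split="$1"      # e.g. zh-cc, zh-baike, zh-papers, zh-books, zh-others
  local dir="${2:-.}"   # directory holding the parts (defaults to the current one)
  local out="$dir/$split.gz"

  # The shell expands the glob in sorted order before cat concatenates the parts.
  cat "$dir/$split".gz.part* > "$out" || return 1

  # gzip -t checks archive integrity without writing the decompressed data.
  gzip -t "$out" || { echo "corrupt or incomplete archive: $out" >&2; return 1; }

  # Print the path of the merged archive for the caller.
  echo "$out"
}
```

For example, `merged=$(merge_split zh-baike ./MAP-CC) && gzip -dc "$merged" | head -n 1` would merge the `zh-baike` parts stored in a (hypothetical) `./MAP-CC` directory and print the first line of the decompressed data. A failed integrity check usually means a part is missing or was only partially downloaded.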
## Dataset Composition
The dataset consists of several components, each originating from different sources and serving various purposes in language modeling and processing. Below is a brief overview of each component:
<p>
<img src="data-ratio.png" style="float: right; width: 400px; margin-left: 10px;">
<strong>zh-cc (Chinese Common Crawl)</strong><br>
Extracts from the Common Crawl project, filtered specifically for Chinese content. This component is rich in diverse internet text, including websites, blogs, news articles, and more.<br><br>
<strong>zh-baike (Chinese Encyclopedias)</strong><br>
A collection of articles from various Chinese encyclopedias, similar to Wikipedia but including other encyclopedic sources as well.<br><br>
<strong>zh-papers (Chinese Academic Papers)</strong><br>
This component consists of academic and research papers published in Chinese. It covers a wide range of disciplines and offers technical, domain-specific language.<br><br>
<strong>zh-books (Chinese Books)</strong><br>
Comprises texts extracted from books published in Chinese. This includes literature, non-fiction, textbooks, and more.<br><br>
<strong>zh-others</strong><br>
This category is a collection of miscellaneous texts, notably including a substantial amount of QA (Question and Answer) data, alongside a variety of other texts.<br>
</p>
## Citation
```bibtex
@misc{du2024chinesetinyllmpretraining,
title={Chinese Tiny LLM: Pretraining a Chinese-Centric Large Language Model},
author={Xinrun Du and Zhouliang Yu and Songyang Gao and Ding Pan and Yuyang Cheng and Ziyang Ma and Ruibin Yuan and Xingwei Qu and Jiaheng Liu and Tianyu Zheng and Xinchen Luo and Guorui Zhou and Wenhu Chen and Ge Zhang},
year={2024},
eprint={2404.04167},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2404.04167},
}
```
Provider: maas
Created: 2024-04-13
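Dataset overview
MAP-CC is an open-source Chinese pretraining dataset of 800 billion tokens, drawing on Chinese text from the web, encyclopedias, academic papers, books, and other sources. It is released under the CC BY-NC-ND 4.0 license, which limits it to non-commercial use and does not allow modification.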



