Matrix
Source: ModelScope community. Updated 2026-05-15; indexed 2024-06-01.
Download link: https://modelscope.cn/datasets/m-a-p/Matrix
Description:
# Matrix
Matrix is an open-source pretraining dataset containing 4,690 billion (4.69 trillion) tokens. It is bilingual, with English and Chinese text, and is used to train the MAP-Neo models.
## Dataset Composition
The dataset consists of several components, each originating from different sources and serving various purposes in language modeling and processing. Below is a brief overview of each component:
<p>
<img src="https://cdn-uploads.huggingface.co/production/uploads/654907a4a1faff97850c4eff/1FWMF_t_Mhy0UQmu65Bb1.png" style="float: right; width: 400px; margin-left: 10px;">
<strong>Common Crawl</strong><br>
Extracts from the Common Crawl project, featuring a rich diversity of internet text including websites, blogs, news articles, and more.<br>
<strong>Code</strong><br>
A collection of coding-related data.<br>
<strong>Paper</strong><br>
Consists of academic and research papers covering a broad spectrum of disciplines, offering technical and domain-specific language.<br>
<strong>Book</strong><br>
Comprises texts from a range of published books, encompassing literature, non-fiction, textbooks, and more.<br>
<strong>Instruction</strong><br>
Features a collection of texts primarily in a Q&A format.<br>
<strong>Exam</strong><br>
Contains various educational materials and assessments used in academic examinations.<br>
<strong>News</strong><br>
A collection of texts from various journalistic sources, reporting on current events and news stories.<br>
<strong>Wiki</strong><br>
Articles from various encyclopedic sources, not limited to Wikipedia, covering a wide array of topics and information.<br>
<strong>Patent</strong><br>
Includes texts from patent documents, providing detailed descriptions of inventions and their applications.<br>
</p>
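
To inspect the components described above without downloading the full corpus, one option is to stream a few records. The following is a minimal sketch, not an official loader: it assumes the Hugging Face `datasets` package is installed, that the repository id `m-a-p/Matrix` mirrors the ModelScope path, and that a `train` split exists. The helper names `preview` and `stream_matrix` are hypothetical.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator


def preview(
    records: Iterable[Dict[str, Any]], n: int = 3, max_chars: int = 200
) -> Iterator[Dict[str, Any]]:
    """Yield the first n records, truncating long string values for display."""
    for record in islice(records, n):
        yield {
            key: (value[:max_chars] if isinstance(value, str) else value)
            for key, value in record.items()
        }


def stream_matrix(repo_id: str = "m-a-p/Matrix", split: str = "train"):
    """Open the dataset lazily; data is fetched only when iteration starts.

    Assumptions (not confirmed by the dataset card): the repository id and
    the split name. Adjust both to match the actual repository layout.
    """
    # Imported here so that preview() stays usable without the package.
    from datasets import load_dataset

    return load_dataset(repo_id, split=split, streaming=True)
```

With network access, `for row in preview(stream_matrix()): print(row)` would print the first three records. Streaming is chosen here because a corpus of this size is impractical to download just to look at its schema.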
## Citation
```
@article{zhang2024mapneo,
title = {MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series},
author = {Ge Zhang and Scott Qu and Jiaheng Liu and Chenchen Zhang and Chenghua Lin and Chou Leuang Yu and Danny Pan and Esther Cheng and Jie Liu and Qunshu Lin and Raven Yuan and Tuney Zheng and Wei Pang and Xinrun Du and Yiming Liang and Yinghao Ma and Yizhi Li and Ziyang Ma and Bill Lin and Emmanouil Benetos and Huan Yang and Junting Zhou and Kaijing Ma and Minghao Liu and Morry Niu and Noah Wang and Quehry Que and Ruibo Liu and Sine Liu and Shawn Guo and Soren Gao and Wangchunshu Zhou and Xinyue Zhang and Yizhi Zhou and Yubo Wang and Yuelin Bai and Yuhan Zhang and Yuxiang Zhang and Zenith Wang and Zhenzhu Yang and Zijian Zhao and Jiajun Zhang and Wanli Ouyang and Wenhao Huang and Wenhu Chen},
year = {2024},
journal = {arXiv preprint arXiv:2405.19327}
}
```
Provider: maas
Created: 2024-05-14



