CCI-Data
Source: ModelScope community. Updated 2026-01-02; indexed 2024-09-14.
Download link:
https://modelscope.cn/datasets/BAAI/CCI-Data
## Data Description
With the rapid development of large language models (LLMs), demand for high-quality datasets is growing in both industry and academia. Such datasets must not only contain a vast amount of information but also undergo rigorous screening and cleaning to ensure their accuracy and the safety of downstream models and applications. However, the public datasets currently popular in industry carry quality and security risks, and high-quality datasets are particularly scarce for Chinese. Constructing a safe Chinese dataset also faces many challenges. A dataset that has undergone strict screening and standardized processing is therefore particularly important for the innovation and development of LLMs.
Our CCI (Chinese Corpora Internet) dataset is drawn from high-quality, trustworthy internet sources within mainland China. It has undergone rigorous cleaning and deduplication, with targeted detection and filtering for content quality. The data processing rules include:
- Rule-based filtering: density-based extraction, keyword filtering, spam information filtering, conversion between simplified and traditional Chinese, etc.
- Model-based filtering: removal of low-quality content using a trained classification model
- Deduplication: both within and across datasets
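The card does not publish the actual pipeline, so the following is only a minimal sketch of what a rule-based filter followed by exact deduplication could look like. The keyword list, the length threshold, and the helper names (`passes_rules`, `dedup`, `clean`) are hypothetical. The model-based quality classifier is omitted, and exact hashing of normalized text stands in for whatever deduplication method the authors actually used.

```python
import hashlib
import unicodedata

# Hypothetical values for illustration; the real CCI rules are not published here.
SPAM_KEYWORDS = ("点击领取", "加微信")
MIN_CHARS = 20


def normalize(text: str) -> str:
    """NFKC-normalize and collapse whitespace so trivial variants hash alike."""
    return " ".join(unicodedata.normalize("NFKC", text).split())


def passes_rules(doc: dict) -> bool:
    """Keep documents that are long enough and contain no spam keyword."""
    content = doc["content"]
    if len(content) < MIN_CHARS:
        return False
    return not any(kw in content for kw in SPAM_KEYWORDS)


def dedup(docs, seen=None):
    """Yield documents whose normalized content has not been seen before.

    Sharing one `seen` set across several corpora gives cross-dataset
    deduplication in addition to within-dataset deduplication.
    """
    seen = set() if seen is None else seen
    for doc in docs:
        digest = hashlib.sha256(normalize(doc["content"]).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc


def clean(docs, seen=None):
    """Apply the rule filter first, then exact deduplication."""
    return list(dedup((d for d in docs if passes_rules(d)), seen))
```

Filtering before deduplication keeps the hash set small, since rejected documents never need to be remembered.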
Additionally, because pre-training data is large in scale and prone to leaking evaluation data, we screen the corpus against several current mainstream Chinese evaluation datasets and filter out matching content during the data processing phase.
The released CCI corpus (CCI v1.0.0) is 104GB in size. The dataset spans January 2001 to November 2023.
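One common way to reduce benchmark leakage is to drop training documents that share long character n-grams with evaluation items. The sketch below illustrates that general idea only: the card does not say which method or which benchmarks were used, and the n-gram length `n=13` and all function names are assumptions.

```python
def char_ngrams(text: str, n: int) -> set:
    """Return the set of all character n-grams in `text` (empty if too short)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def build_eval_index(eval_texts, n: int = 13) -> set:
    """Collect every n-gram that occurs in any evaluation item."""
    index = set()
    for text in eval_texts:
        index |= char_ngrams(text, n)
    return index


def is_contaminated(content: str, eval_index: set, n: int = 13) -> bool:
    """True if the document shares at least one n-gram with the evaluation data."""
    return not char_ngrams(content, n).isdisjoint(eval_index)


def decontaminate(docs, eval_texts, n: int = 13):
    """Keep only documents with no n-gram overlap against the evaluation items."""
    index = build_eval_index(eval_texts, n)
    return [d for d in docs if not is_contaminated(d["content"], index, n)]
```

A smaller `n` removes more borderline matches at the cost of more false positives, so the threshold would need tuning on real data.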
## Update
- November 29, 2023, CCI v1.0.0 released!
## Data Format
| Field | Type | Meaning |
| :-: | :-: | :-: |
| id | String | Document ID, globally unique |
| title | String | Document title |
| content | String | Content of the document |
## Sample
```json
{
"id": "a262c26c915762ae107019f2797fda03",
"title": "深圳人工智能企业闪耀东京展会",
"content": "拳头产品叫好又叫座 深圳人工智能企业闪耀东京展会 近日在东京举行的日本人工智能展上,由深圳市贸促委组织的深圳人工智能企业展团集中亮相,引起热烈关注。该展会是日本规模最大的人工智能展会,云鲸智能、思谋科技、魔耳智能、格瑞普电池、云译科技等近20家深圳人工智能代表性企业的最新人工智能产品吸引了众多当地专业观众的目光,成为展会上的一抹亮色。企业现场“揽单”,参展成果丰硕深圳市大象机器人科技有限公司是一家由海外留学人才来深创建的专注于机器人研发生产的专精特新企业,本次在东京,该公司重点展示了myCobot协作机器人和仿真宠物猫metacat等公司拳头产品。“参展期间我们接待客户数达到500位以上,有意愿成为分销伙伴、集成商或终端客户的有效意向客户近70人,成效相当不错。……"
}
```
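A record in this format can be checked with a few lines of Python. The sketch below assumes the data is stored as JSON Lines (one object per line), which the card does not state; `parse_record` is a hypothetical helper, not part of any released tooling.

```python
import json

REQUIRED_FIELDS = ("id", "title", "content")


def parse_record(line: str) -> dict:
    """Parse one JSON line and check it has the three documented string fields."""
    record = json.loads(line)
    if not isinstance(record, dict):
        raise ValueError("record is not a JSON object")
    for field in REQUIRED_FIELDS:
        if not isinstance(record.get(field), str):
            raise ValueError(f"field {field!r} is missing or not a string")
    return record
```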
## Download
The CCI dataset is simultaneously open-sourced on the [BAAI DataHub](https://data.baai.ac.cn/data) and Huggingface.
### BAAI DataHub
Follow the [CCI Dataset](https://data.baai.ac.cn/details/BAAI-CCI) link to view and download the data files.
Note that you must register on BAAI DataHub to use the data and complete a survey questionnaire before your first download.
### Huggingface
To use the data, you can load it using the following code:
```python
from datasets import load_dataset
# If the dataset is gated/private, make sure you have run huggingface-cli login
dataset = load_dataset("BAAI/CCI-Data")
```
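The full corpus is about 104GB, so it may be convenient to inspect a few records without downloading everything. The sketch below relies on the `streaming=True` option of `datasets.load_dataset`; the split name `"train"` is an assumption that the card does not confirm, and `preview` and `stream_cci` are hypothetical helpers.

```python
from itertools import islice


def preview(records, n: int = 3, max_chars: int = 80):
    """Return (id, title, truncated content) for the first `n` records."""
    return [
        (r["id"], r["title"], r["content"][:max_chars])
        for r in islice(records, n)
    ]


def stream_cci(split: str = "train"):
    """Lazily iterate over the dataset without downloading it in full.

    The split name is an assumption; check the dataset page for the real one.
    """
    from datasets import load_dataset  # imported lazily; needs network access

    return load_dataset("BAAI/CCI-Data", split=split, streaming=True)


# Example (requires network access and, if the dataset is gated, a prior login):
# for row in preview(stream_cci()):
#     print(row)
```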
## User Agreement
Users must comply with the CCI dataset usage agreement: [View Usage Agreement](https://data.baai.ac.cn/resources/agreement/cci_usage_aggrement.pdf).
## Notice
If you have any questions related to this dataset, please contact data@baai.ac.cn.
Provider:
maas
Created:
2024-09-12



