BAAI/CCI-Data
Source: Hugging Face. Updated 2024-12-17; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/BAAI/CCI-Data
Description:
With the rapid development of large language models (LLMs), demand for high-quality datasets has grown in both industry and academia. Such datasets must not only contain abundant information but also undergo rigorous screening and cleaning to ensure their accuracy and the safety of downstream models and applications. However, the public datasets now popular in the industry carry certain quality and safety risks, and high-quality datasets are particularly scarce for Chinese. Building a safe Chinese dataset also faces numerous challenges. A dataset that has undergone strict screening and standardized processing is therefore particularly important for the innovation and development of LLMs. Our CCI (Chinese Internet Corpus) dataset is composed of high-quality, trustworthy sources from mainland Chinese websites. It has undergone strict data cleaning and deduplication, along with targeted detection and filtering for content quality. The data processing rules include rule-based filtering, model-based filtering, and deduplication. In addition, because large-scale pre-training data can easily leak evaluation data, we strictly screened and filtered for current mainstream Chinese evaluation datasets during the data processing stage. The released CCI corpus (CCI v1.0.0) is 104 GB in size and spans January 2001 to November 2023.
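The description names the processing stages only at a high level, so the following is a minimal sketch of how rule-based filtering, exact deduplication, and screening against evaluation data could fit together. It is not BAAI's actual pipeline. The function names, the minimum-length rule, the SHA-256 exact-match deduplication, and the character n-gram overlap check are all assumptions made for illustration, and the model-based filtering stage is omitted entirely.

```python
import hashlib
from typing import Iterable, List, Set


def normalize(text: str) -> str:
    """Remove all whitespace so trivially reformatted copies compare equal."""
    return "".join(text.split())


def passes_rules(text: str, min_chars: int) -> bool:
    """Hypothetical rule-based filter: drop very short documents."""
    return len(normalize(text)) >= min_chars


def char_ngrams(text: str, n: int) -> Set[str]:
    """Return the set of character n-grams of the whitespace-free text."""
    compact = normalize(text)
    return {compact[i:i + n] for i in range(len(compact) - n + 1)}


def clean_corpus(
    docs: Iterable[str],
    eval_texts: Iterable[str],
    n: int = 13,          # assumed n-gram length, not taken from the source
    min_chars: int = 20,  # assumed length threshold, not taken from the source
) -> List[str]:
    """Apply the rule filter, exact deduplication, and evaluation-overlap screening."""
    eval_grams: Set[str] = set()
    for eval_text in eval_texts:
        eval_grams |= char_ngrams(eval_text, n)

    seen_hashes: Set[str] = set()
    kept: List[str] = []
    for doc in docs:
        if not passes_rules(doc, min_chars):
            continue
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate of a document already processed
        seen_hashes.add(digest)
        if char_ngrams(doc, n) & eval_grams:
            continue  # shares an n-gram with evaluation data, so drop it
        kept.append(doc)
    return kept
```

Exact hashing only catches verbatim copies. A corpus of this size would more plausibly also need near-duplicate detection, which this sketch does not attempt.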
Provider:
BAAI
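Summary of original information
Dataset overview
This dataset aims to meet the demand from industry and academia for high-quality data for large language models. It not only contains a large amount of information but has also been rigorously screened and cleaned to ensure the accuracy of the data and the safety of downstream models and applications. Public datasets currently popular in the industry carry certain quality and safety risks, and high-quality datasets are particularly scarce for Chinese. Building a rigorously screened, standardized dataset is therefore particularly important for the innovation and development of large language models.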
Collected summary
Dataset introduction

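Background and challenges
Background overview
BAAI/CCI-Data is a high-quality Chinese internet corpus dataset that has undergone strict data cleaning and deduplication. It contains 104 GB of data spanning January 2001 to November 2023 and is suitable for text generation tasks.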
The content above was collected and summarized by 遇见数据集.



