CCI Chinese Internet Corpus (CCI 中文互联网语料库)

超神经 (hyper.ai), updated 2024-01-29, indexed 2024-05-15

Download link:

https://hyper.ai/cn/datasets/29186


Resource description:

With the rapid development of large language models (LLMs), demand for high-quality datasets keeps growing in both industry and academia. Such datasets must not only contain massive amounts of information but also be rigorously screened and cleaned, to ensure their accuracy and the safety of downstream models and applications. However, the public datasets currently popular in the industry carry quality and safety risks, especially for Chinese, where high-quality datasets are particularly scarce. Building safe Chinese datasets also faces many challenges of its own. Datasets that have been strictly screened and standardized are therefore especially important for the innovation and development of LLMs.

Created:

2024-01-29

Collected summary

Dataset introduction

Background and challenges

Background overview

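The CCI Chinese Internet Corpus is a high-quality dataset built for the Chinese domain, intended to meet the need of large language models for safe and accurate data. It is sourced from websites on the mainland Chinese internet and has undergone rigorous data cleaning, filtering, and deduplication. The corpus is 104 GB in size and covers the period from January 2001 to November 2023. The dataset focuses on improving content quality in order to support the innovation and development of LLMs.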
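
The description above names three processing stages (cleaning, filtering, deduplication) without saying how they were implemented. The following is only a minimal sketch of what such a pipeline can look like, not the method actually used to build CCI. The function names, the `MIN_CHARS` threshold, and the choice of exact-match SHA-256 deduplication are all assumptions made for illustration.

```python
import hashlib
import re
import unicodedata

# Hypothetical threshold: documents shorter than this are dropped.
MIN_CHARS = 20


def clean(text):
    """Normalize Unicode to NFC and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def keep(text):
    """Toy quality filter: keep only documents of a minimum length."""
    return len(text) >= MIN_CHARS


def process(docs):
    """Clean, filter, and exact-deduplicate an iterable of documents."""
    seen = set()
    result = []
    for raw in docs:
        text = clean(raw)
        if not keep(text):
            continue
        # Exact deduplication: identical cleaned text yields the same digest.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        result.append(text)
    return result
```

Calling `process(["a" * 30 + "  \n", "a" * 30, "short"])` keeps a single 30-character document: the first two inputs become identical after cleaning, and the third falls below the assumed length threshold. Exact hashing only catches identical documents; near-duplicate detection would need a different technique.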

The content above was collected and summarized by 遇见数据集.