shjwudp/chinese-c4
Source: Hugging Face | Updated: 2023-06-20 | Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/shjwudp/chinese-c4
Resource description:
---
license: cc-by-4.0
language:
- zh
---
## Introduction
Chinese-C4 is a cleaned Chinese web-text dataset built from Common Crawl. It is 46.29GB in size and was produced by applying several cleaning strategies: Chinese-language filtering, punctuation-based heuristic cleaning, line-based hash deduplication, and repetition removal.
The dataset is open source and free for commercial use. You are welcome to use the data and the provided cleaning strategies, and to contribute cleaning strategies of your own.
You can find the cleaning script for the dataset on GitHub [c4-dataset-script](https://github.com/shjwudp/c4-dataset-script).
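
To make the line-based hash deduplication step more concrete, here is a minimal Python sketch. It is not taken from c4-dataset-script: the function name `dedup_lines`, the whitespace-stripping normalization, and the choice of SHA-1 are all assumptions made for illustration. The idea is to hash every non-empty line and keep a line only the first time its hash is seen across the whole corpus, which removes repeated navigation text and other boilerplate.

```python
import hashlib


def dedup_lines(documents):
    """Keep only the first occurrence of each line across all documents.

    Hypothetical helper for illustration; the real cleaning script may
    normalize and hash lines differently.
    """
    seen = set()      # digests of every line kept so far
    cleaned = []
    for doc in documents:
        kept = []
        for line in doc.splitlines():
            key_text = line.strip()
            if not key_text:
                continue  # blank lines carry no content
            digest = hashlib.sha1(key_text.encode("utf-8")).digest()
            if digest in seen:
                continue  # this line already appeared earlier in the corpus
            seen.add(digest)
            kept.append(line)
        cleaned.append("\n".join(kept))
    return cleaned


# The shared header line "首页" survives only in the first document.
docs = ["首页\n今天天气很好。", "首页\n明天可能下雨。"]
result = dedup_lines(docs)
```

Storing fixed-size digests instead of the raw lines keeps memory use bounded per line, which matters at Common Crawl scale; a production pipeline would typically distribute this step rather than hold one in-memory set.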
Provider:
shjwudp
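Summary of original information
Dataset overview
Dataset name
- Chinese-C4
Dataset size
- 46.29GB
Dataset language
- Chinese
Dataset source
- A cleaned internet dataset based on Common Crawl
Dataset processing
- Multiple cleaning strategies:
  - Chinese filtering
  - Heuristic cleaning based on punctuation
  - Line-based hash deduplication
  - Repetition removal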
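
The Chinese filtering and punctuation-based steps listed above could look roughly like the following Python sketch. Everything here is an assumption for illustration: the names `chinese_ratio` and `clean_document`, the 0.5 threshold, and the rule of keeping only lines that end in terminal punctuation (a heuristic used by the original English C4) are not confirmed to match what c4-dataset-script does.

```python
import re

# Basic CJK Unified Ideographs block only; extension blocks are ignored
# here to keep the sketch short.
CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")

# Sentence-ending punctuation, full-width and ASCII (assumed set).
TERMINAL_PUNCT = ("。", "!", "?", "…", "”", ".", "!", "?")


def chinese_ratio(text):
    """Fraction of non-whitespace characters that are CJK ideographs."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if CJK_CHAR.match(c)) / len(chars)


def clean_document(text, min_ratio=0.5):
    """Return the cleaned text, or None if the document is rejected.

    Hypothetical helper: min_ratio and the line rule are illustrative.
    """
    if chinese_ratio(text) < min_ratio:
        return None  # not predominantly Chinese
    kept = [
        line.strip()
        for line in text.splitlines()
        if line.strip().endswith(TERMINAL_PUNCT)  # drop menus, titles, fragments
    ]
    return "\n".join(kept) if kept else None


page = "首页 | 登录\n今天天气很好。"
cleaned = clean_document(page)
```

In this example the menu line `首页 | 登录` is dropped because it does not end in sentence punctuation, while the full sentence is kept.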
License
- CC BY 4.0
Terms of use
- Open source and available for commercial use
Related resources
- The cleaning script is available on GitHub: c4-dataset-script



