cyankiwi/c4-sample
Source: Hugging Face. Updated: 2026-03-08. Indexed: 2026-04-05.
Download link:
https://hf-mirror.com/datasets/cyankiwi/c4-sample
Resource description:
# C4 Evaluation Dataset
A multilingual subset of [allenai/c4](https://huggingface.co/datasets/allenai/c4) for measuring information-theoretic properties (perplexity, KL divergence) of LLMs.
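The two metrics named above can be sketched in a few lines of Python. This is a minimal illustration, not the dataset author's evaluation code: it assumes you already have per-token natural-log probabilities (for perplexity) or two next-token distributions over the same vocabulary (for KL divergence). The function names `perplexity` and `kl_divergence` are hypothetical.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood), using natural logs."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(P || Q) in nats for two distributions over the same vocabulary.

    Terms with p_i == 0 contribute 0 by convention. q_i must be > 0
    wherever p_i > 0, otherwise the divergence is undefined.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
```

One plausible use, stated here as an assumption, is to compare a reference model against a compressed variant by averaging `kl_divergence` over token positions on each `text` sample.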
## Dataset
- **File:** `c4_evaluation.parquet`
- **Total samples:** 2,994
- **Total tokens:** 2,007,152 (tokenized with `Qwen/Qwen3.5-35B-A3B`)
- **Columns:** `text`, `language`, `source`
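
A minimal loading sketch, assuming `c4_evaluation.parquet` has already been downloaded to the working directory and that pandas with a parquet engine such as pyarrow is installed. The helper names `load_rows` and `count_by_language` are illustrative, and the assumption that the `language` column holds codes such as `en` is not confirmed by the card.

```python
from collections import Counter
from typing import Dict, Iterable, List


def load_rows(path: str = "c4_evaluation.parquet") -> List[dict]:
    """Read the parquet file into a list of row dicts (needs pandas + pyarrow)."""
    import pandas as pd  # imported lazily so count_by_language works without pandas

    df = pd.read_parquet(path, columns=["text", "language", "source"])
    return df.to_dict(orient="records")


def count_by_language(rows: Iterable[dict]) -> Dict[str, int]:
    """Count samples per value of the `language` column."""
    return dict(Counter(row["language"] for row in rows))
```

Calling `count_by_language(load_rows())` on the real file would give a per-language sample count that can be compared against the totals listed on the card.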
## Language Composition
| Language | Code | Family | Samples | Tokens | % of tokens |
|----------|------|--------|--------:|-------:|------------:|
| English | `en` | Germanic | 1,244 | 695,908 | 34.7% |
| Chinese | `zh` | Sino-Tibetan | 316 | 245,202 | 12.2% |
| Japanese | `ja` | Japonic | 183 | 163,306 | 8.1% |
| Spanish | `es` | Romance | 236 | 160,108 | 8.0% |
| French | `fr` | Romance | 141 | 102,462 | 5.1% |
| German | `de` | Germanic | 143 | 101,514 | 5.1% |
| Korean | `ko` | Koreanic | 114 | 101,298 | 5.0% |
| Russian | `ru` | Slavic | 152 | 99,622 | 5.0% |
| Arabic | `ar` | Semitic | 141 | 99,938 | 5.0% |
| Portuguese | `pt` | Romance | 128 | 80,766 | 4.0% |
| Hindi | `hi` | Indo-Aryan | 95 | 78,244 | 3.9% |
| Tamil | `ta` | Dravidian | 101 | 78,784 | 3.9% |
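
The "% of tokens" column can be rederived from the token counts in the table. The short check below copies those counts verbatim; the variable names are illustrative.

```python
# Token counts per language code, copied from the table above.
token_counts = {
    "en": 695_908, "zh": 245_202, "ja": 163_306, "es": 160_108,
    "fr": 102_462, "de": 101_514, "ko": 101_298, "ru": 99_622,
    "ar": 99_938, "pt": 80_766, "hi": 78_244, "ta": 78_784,
}

# The per-language counts should add up to the stated total of 2,007,152.
total = sum(token_counts.values())

# Share of all tokens per language, in percent, rounded to one decimal place.
shares = {code: round(100 * n / total, 1) for code, n in token_counts.items()}

for code, pct in shares.items():
    print(f"{code}: {pct}%")
```

Because English makes up roughly a third of the tokens, a metric averaged over all tokens will weight English more heavily than a per-language average would.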
Provider:
cyankiwi



