himalaya-ai/cc100-nepali
Source: Hugging Face. Updated 2026-04-02; indexed 2026-04-12.
Download link: https://hf-mirror.com/datasets/himalaya-ai/cc100-nepali
---
language:
- ne
license: cc0-1.0
task_categories:
- text-generation
- fill-mask
task_ids:
- language-modeling
tags:
- nepali
- devanagari
- low-resource
- cc100
- monolingual
size_categories:
- 1M<n<10M
---
# CC-100 Nepali — Cleaned & Deduplicated
Cleaned, language-filtered, and deduplicated Nepali monolingual text derived from
[CC-100](https://data.statmt.org/cc-100/), suitable for transformer pretraining.
Originally published at `himalaya-ai/cc100-nepali`; the dataset contents have since been replaced with the cleaned version from `Titung/cc100-nepali-cleaned`.
## Statistics
| Split | Sentences |
|------------|---------------|
| train | 4,736,157 |
| validation | 48,328 |
| test | 48,329 |
| **total** | **4,832,814** |
### Token Statistics (train split)
| Metric | Value |
|-------------------------|-------------------|
| Total tokens | 1,003,628,468 |
| Average tokens per row | 211.91 |
| Max tokens in a row | 1,675 |
| Min tokens in a row | 19 |
> Token counts estimated using whitespace tokenization on the train split.
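
The card does not say exactly how the whitespace tokenization was run. The following is a minimal sketch of how such figures could be reproduced, assuming plain `str.split()` on each row's `text`. The function name `whitespace_token_stats` and the two sample sentences are illustrative and do not come from the dataset.

```python
def whitespace_token_stats(rows):
    """Summarise whitespace-token counts over an iterable of strings."""
    # Assumption: a "token" is any run of non-whitespace characters.
    counts = [len(text.split()) for text in rows]
    if not counts:
        raise ValueError("need at least one row")
    return {
        "total": sum(counts),
        "mean": sum(counts) / len(counts),
        "max": max(counts),
        "min": min(counts),
    }


# Illustrative input only; on the real data, pass ds["train"]["text"].
sample = ["नेपाल एक सुन्दर देश हो", "काठमाडौं नेपालको राजधानी हो"]
stats = whitespace_token_stats(sample)
print(stats)
```

Because this definition matches the description of `word_count`, summing that column over the train split should give a comparable total, provided the column was computed the same way.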
## Columns
| Column | Type | Description |
|---|---|---|
| `id` | string | Unique sentence ID |
| `text` | string | Nepali sentence |
| `char_length` | int32 | Character count |
| `word_count` | int32 | Whitespace-split word count |
| `devanagari_ratio` | float32 | Fraction of Devanagari characters |
| `source` | string | Always `cc100-ne` |
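
The derived columns can be recomputed from `text` as a sanity check. This is a rough sketch under stated assumptions, not the card's own code: "Devanagari" is taken to mean the Unicode block U+0900 to U+097F, and the ratio is computed over non-whitespace characters only. The card specifies neither choice, and `derived_columns` is a hypothetical helper name.

```python
def is_devanagari(ch):
    # Assumption: the basic Devanagari Unicode block only.
    return "\u0900" <= ch <= "\u097F"


def derived_columns(text):
    """Recompute char_length, word_count and devanagari_ratio for one row."""
    # Assumption: whitespace is excluded from the ratio's denominator.
    non_space = [ch for ch in text if not ch.isspace()]
    if non_space:
        ratio = sum(is_devanagari(ch) for ch in non_space) / len(non_space)
    else:
        ratio = 0.0
    return {
        "char_length": len(text),          # Unicode code points, not bytes
        "word_count": len(text.split()),   # whitespace-split words
        "devanagari_ratio": ratio,
    }


row = derived_columns("नेपाल abc")
print(row)
```

If the recomputed values differ slightly from the stored columns, the likely cause is a different denominator or character range in the original pipeline.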
## Cleaning Pipeline
1. Unicode normalisation (NFC + ftfy)
2. Rule-based filters (length, Devanagari ratio ≥ 0.5, boilerplate removal)
3. Language ID — fastText `lid.176.bin`, confidence ≥ 0.7
4. Exact deduplication (MD5)
5. Near-deduplication (char 13-gram Bloom filter)
6. 98/1/1 train/validation/test split, seed 42
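
Steps 1, 4, 5 and 6 above can be sketched with the standard library alone. This is an illustration under assumptions, not the pipeline that produced the dataset: the ftfy pass, the rule-based filters and the fastText language ID step are left out, and the Bloom filter size, hash count, `max_seen_fraction` threshold and all function names are hypothetical.

```python
import hashlib
import random
import unicodedata


def normalise(text):
    # Step 1 (partial): NFC only; the ftfy repair pass is not shown.
    return unicodedata.normalize("NFC", text)


def exact_dedup(texts):
    # Step 4: keep the first occurrence of each MD5 digest.
    seen, kept = set(), []
    for text in texts:
        digest = hashlib.md5(text.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            kept.append(text)
    return kept


class BloomFilter:
    # Toy Bloom filter; num_bits and num_hashes are illustrative values.
    def __init__(self, num_bits=1 << 20, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        data = item.encode("utf-8")
        for seed in range(self.num_hashes):
            h = hashlib.blake2b(data, digest_size=8, salt=bytes([seed]))
            yield int.from_bytes(h.digest(), "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


def near_dedup(texts, n=13, max_seen_fraction=0.8):
    # Step 5: drop a text when most of its char n-grams were already seen.
    # max_seen_fraction is a hypothetical threshold, not the card's value.
    bloom, kept = BloomFilter(), []
    for text in texts:
        grams = {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}
        seen = sum(g in bloom for g in grams)
        if seen / len(grams) <= max_seen_fraction:
            kept.append(text)
            for g in grams:
                bloom.add(g)
    return kept


def cut_points(n):
    # Integer arithmetic avoids float rounding at the 98% and 99% marks.
    return n * 98 // 100, n * 99 // 100


def split_98_1_1(texts, seed=42):
    # Step 6: shuffle with a fixed seed, then cut into train/validation/test.
    rows = list(texts)
    random.Random(seed).shuffle(rows)
    a, b = cut_points(len(rows))
    return rows[:a], rows[a:b], rows[b:]
```

With these integer cut points, 4,832,814 rows split into 4,736,157 / 48,328 / 48,329, which matches the Statistics table, although the card does not state how the cut was actually made. The shuffle order from `random.Random(42)` is not expected to reproduce the published row assignment.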
## Usage
```python
from datasets import load_dataset

ds = load_dataset("himalaya-ai/cc100-nepali")

# Filter for high-quality sentences
high_q = ds["train"].filter(
    lambda x: x["devanagari_ratio"] > 0.8 and x["word_count"] >= 5
)
```
## Citation
```bibtex
@inproceedings{conneau-etal-2020-unsupervised,
title = {Unsupervised Cross-lingual Representation Learning at Scale},
author = {Conneau, Alexis and others},
booktitle = {Proceedings of ACL},
year = {2020}
}
```
Provider: himalaya-ai



