Titung/cc100-nepali-cleaned
Source: Hugging Face · Updated: 2026-04-02 · Indexed: 2026-04-12
Download link:
https://hf-mirror.com/datasets/Titung/cc100-nepali-cleaned
Description:
---
language:
- ne
license: cc0-1.0
task_categories:
- text-generation
- fill-mask
task_ids:
- language-modeling
pretty_name: CC-100 Nepali (Cleaned)
size_categories:
- 1M<n<10M
tags:
- nepali
- devanagari
- low-resource
- cc100
- monolingual
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*.parquet
  - split: validation
    path: data/validation-*.parquet
  - split: test
    path: data/test-*.parquet
dataset_info:
  features:
  - name: id
    dtype: string
  - name: text
    dtype: string
  - name: char_length
    dtype: int32
  - name: word_count
    dtype: int32
  - name: devanagari_ratio
    dtype: float32
  - name: source
    dtype: string
  splits:
  - name: train
    num_examples: 4736157
  - name: validation
    num_examples: 48328
  - name: test
    num_examples: 48329
---
# CC-100 Nepali — Cleaned & Deduplicated
Cleaned, language-filtered, and deduplicated Nepali monolingual text from
[CC-100](https://data.statmt.org/cc-100/), suitable for transformer pretraining.
## Statistics
| Split | Sentences |
|---|---|
| train | 4,736,157 |
| validation | 48,328 |
| test | 48,329 |
| **total** | **4,832,814** |
Created: 2026-04-02
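
As a quick sanity check, the split sizes in the table can be summed and compared with the 98/1/1 ratio stated in the pipeline. This is only arithmetic on the numbers reported in this card; nothing is downloaded.

```python
# Split sizes as reported in the statistics table of this card.
splits = {"train": 4_736_157, "validation": 48_328, "test": 48_329}

# Total number of sentences across all splits.
total = sum(splits.values())

# Fraction of the corpus that falls into each split.
shares = {name: count / total for name, count in splits.items()}

print(f"total = {total:,}")
for name, share in shares.items():
    print(f"{name:<10} {share:.2%}")
```

The sum agrees with the reported total of 4,832,814, and the shares round to 98.00%, 1.00%, and 1.00%.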
## Pipeline
1. Unicode normalisation (NFC + ftfy)
2. Rule-based filters (length, Devanagari ratio ≥ 0.5, boilerplate)
3. Language ID — fastText lid.176.bin, confidence ≥ 0.7
4. Exact deduplication (MD5)
5. Near-deduplication (char 13-gram bloom filter)
6. 98/1/1 train/val/test split, seed 42
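
The following is a minimal, standard-library sketch of steps 1, 2, 4, and 6, not the script that produced this dataset. Several things are assumptions: the function names are hypothetical, `devanagari_ratio` is computed over non-whitespace characters in the range U+0900 to U+097F (the card does not define it), and `min_chars=10` is a placeholder for the unspecified length rule. The ftfy pass, boilerplate rules, fastText language ID, and bloom-filter near-deduplication are left out.

```python
import hashlib
import random
import unicodedata


def normalise(text: str) -> str:
    """Step 1 (partial): NFC normalisation only; the ftfy pass is omitted."""
    return unicodedata.normalize("NFC", text).strip()


def devanagari_ratio(text: str) -> float:
    """Share of non-whitespace characters in the Devanagari block.

    Assumption: the card does not define this ratio; this is one plausible reading.
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum("\u0900" <= c <= "\u097f" for c in chars) / len(chars)


def clean(lines, min_ratio=0.5, min_chars=10):
    """Steps 1, 2 and 4: normalise, filter, then drop exact duplicates by MD5.

    min_chars is a hypothetical threshold; the real length rule is not stated.
    """
    seen = set()
    for raw in lines:
        text = normalise(raw)
        if len(text) < min_chars or devanagari_ratio(text) < min_ratio:
            continue
        digest = hashlib.md5(text.encode("utf-8")).digest()
        if digest in seen:
            continue  # exact duplicate of an earlier sentence
        seen.add(digest)
        yield text


def split_98_1_1(items, seed=42):
    """Step 6: shuffle with a fixed seed, then cut 98% / 1% / remainder."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 98 // 100
    n_val = len(items) // 100
    return (
        items[:n_train],
        items[n_train:n_train + n_val],
        items[n_train + n_val:],
    )
```

Applied to 4,832,814 items, this integer cut rule gives 4,736,157 / 48,328 / 48,329, which happens to match the reported split sizes, although the shuffling and rounding actually used for the dataset are not documented here.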
## Usage
```python
from datasets import load_dataset

# Load all three splits (train / validation / test)
ds = load_dataset("Titung/cc100-nepali-cleaned")

# Keep only sentences that are mostly Devanagari and at least 5 words long
high_q = ds["train"].filter(
    lambda x: x["devanagari_ratio"] > 0.8 and x["word_count"] >= 5
)
```
## Citation
```bibtex
@inproceedings{conneau-etal-2020-unsupervised,
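  title = {Unsupervised Cross-lingual Representation Learning at Scale},
  author = {Conneau, Alexis and others},
  booktitle = {Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
  year = {2020}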
}
```
Provided by:
Titung



