Titung/cc100-nepali-cleaned

Name: Titung/cc100-nepali-cleaned
Creator: Titung
Published: 2026-04-02 12:38:30
License: not listed here (the dataset card declares cc0-1.0)

Source: Hugging Face. Updated 2026-04-02. Indexed 2026-04-12.

Download link:

https://hf-mirror.com/datasets/Titung/cc100-nepali-cleaned

Description:

---
language:
- ne
license: cc0-1.0
task_categories:
- text-generation
- fill-mask
task_ids:
- language-modeling
pretty_name: CC-100 Nepali (Cleaned)
size_categories:
- 1M<n<10M
tags:
- nepali
- devanagari
- low-resource
- cc100
- monolingual
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*.parquet
  - split: validation
    path: data/validation-*.parquet
  - split: test
    path: data/test-*.parquet
dataset_info:
  features:
  - name: id
    dtype: string
  - name: text
    dtype: string
  - name: char_length
    dtype: int32
  - name: word_count
    dtype: int32
  - name: devanagari_ratio
    dtype: float32
  - name: source
    dtype: string
  splits:
  - name: train
    num_examples: 4736157
  - name: validation
    num_examples: 48328
  - name: test
    num_examples: 48329
---

# CC-100 Nepali: Cleaned & Deduplicated

Cleaned, language-filtered, and deduplicated Nepali monolingual text from [CC-100](https://data.statmt.org/cc-100/), suitable for transformer pretraining.

## Statistics

| Split | Sentences |
|---|---|
| train | 4,736,157 |
| validation | 48,328 |
| test | 48,329 |
| **total** | **4,832,814** |

Created: 2026-04-02

## Pipeline

1. Unicode normalisation (NFC + ftfy)
2. Rule-based filters (length, Devanagari ratio ≥ 0.5, boilerplate)
3. Language ID: fastText lid.176.bin, confidence ≥ 0.7
4. Exact deduplication (MD5)
5. Near-deduplication (char 13-gram bloom filter)
6. 98/1/1 train/val/test split, seed 42

## Usage

```python
from datasets import load_dataset

# Load all three splits (train, validation, test)
ds = load_dataset("Titung/cc100-nepali-cleaned")

# Filter high-quality sentences
high_q = ds["train"].filter(
    lambda x: x["devanagari_ratio"] > 0.8 and x["word_count"] >= 5
)
```

## Citation

```bibtex
@inproceedings{conneau-etal-2020-unsupervised,
  title  = {Unsupervised Cross-lingual Representation Learning at Scale},
  author = {Conneau, Alexis et al.},
  year   = {2020}
}
```
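
## Pipeline sketch (illustrative)

The pipeline steps listed in the card can be partly approximated with the Python standard library. The sketch below is not the author's code: it covers only NFC normalisation, a length and Devanagari-ratio filter, MD5 exact deduplication, and a seeded 98/1/1 split. The function names, the `min_chars` threshold, and the definition of the ratio (Devanagari characters over non-whitespace characters) are assumptions made for illustration. ftfy repair, boilerplate rules, fastText language ID, and bloom-filter near-deduplication are left out.

```python
import hashlib
import random
import unicodedata

# Devanagari Unicode block (U+0900 to U+097F).
DEVANAGARI_START, DEVANAGARI_END = 0x0900, 0x097F


def normalise(text: str) -> str:
    """NFC-normalise and trim whitespace (the card also applies ftfy; omitted here)."""
    return unicodedata.normalize("NFC", text).strip()


def devanagari_ratio(text: str) -> float:
    """Share of non-whitespace characters in the Devanagari block (assumed definition)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(DEVANAGARI_START <= ord(c) <= DEVANAGARI_END for c in chars)
    return hits / len(chars)


def clean(lines, min_ratio=0.5, min_chars=10):
    """Normalise, filter by length and script ratio, then drop exact duplicates.

    min_ratio follows the card (>= 0.5); min_chars is a hypothetical threshold.
    """
    seen = set()
    kept = []
    for line in lines:
        text = normalise(line)
        if len(text) < min_chars or devanagari_ratio(text) < min_ratio:
            continue
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of an earlier sentence
        seen.add(digest)
        kept.append(text)
    return kept


def split(items, seed=42, val_frac=0.01, test_frac=0.01):
    """Shuffle with a fixed seed and cut into train/validation/test."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_val = int(len(items) * val_frac)
    n_test = int(len(items) * test_frac)
    return {
        "validation": items[:n_val],
        "test": items[n_val:n_val + n_test],
        "train": items[n_val + n_test:],
    }
```

Calling `clean(raw_lines)` and then `split(cleaned)` yields a dictionary keyed by the same split names as the dataset. Hashing the normalised text instead of the raw line means sentences that differ only in Unicode composition collapse to one entry.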

Provided by:

Titung
