Titung/cc100-nepali-cleaned

Source: Hugging Face. Updated 2026-04-02; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/Titung/cc100-nepali-cleaned
Resource description:
---
language:
- ne
license: cc0-1.0
task_categories:
- text-generation
- fill-mask
task_ids:
- language-modeling
pretty_name: CC-100 Nepali (Cleaned)
size_categories:
- 1M<n<10M
tags:
- nepali
- devanagari
- low-resource
- cc100
- monolingual
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*.parquet
  - split: validation
    path: data/validation-*.parquet
  - split: test
    path: data/test-*.parquet
dataset_info:
  features:
  - name: id
    dtype: string
  - name: text
    dtype: string
  - name: char_length
    dtype: int32
  - name: word_count
    dtype: int32
  - name: devanagari_ratio
    dtype: float32
  - name: source
    dtype: string
  splits:
  - name: train
    num_examples: 4736157
  - name: validation
    num_examples: 48328
  - name: test
    num_examples: 48329
---

# CC-100 Nepali: Cleaned & Deduplicated

Cleaned, language-filtered, and deduplicated Nepali monolingual text from [CC-100](https://data.statmt.org/cc-100/), suitable for transformer pretraining.

## Statistics

| Split | Sentences |
|---|---|
| train | 4,736,157 |
| validation | 48,328 |
| test | 48,329 |
| **total** | **4,832,814** |

Created: 2026-04-02

## Pipeline

1. Unicode normalisation (NFC + ftfy)
2. Rule-based filters (length, Devanagari ratio ≥ 0.5, boilerplate)
3. Language ID: fastText lid.176.bin, confidence ≥ 0.7
4. Exact deduplication (MD5)
5. Near-deduplication (char 13-gram bloom filter)
6. 98/1/1 train/val/test split, seed 42

## Usage

```python
from datasets import load_dataset

ds = load_dataset("Titung/cc100-nepali-cleaned")

# Filter high-quality sentences
high_q = ds["train"].filter(lambda x: x["devanagari_ratio"] > 0.8 and x["word_count"] >= 5)
```

## Citation

```bibtex
@inproceedings{conneau-etal-2020-unsupervised,
  title = {Unsupervised Cross-lingual Representation Learning at Scale},
  author = {Conneau, Alexis et al.},
  year = {2020}
}
```
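
## Pipeline sketch

The card lists the pipeline steps but not their code. The following is a minimal, standard-library-only sketch of steps 1, 2, 4 and 6 (NFC normalisation, length and Devanagari-ratio filters, MD5 exact deduplication, seeded 98/1/1 split). It is not the dataset author's implementation. The function names, the `min_chars` threshold, and the exact definition of the ratio (Devanagari characters over non-whitespace characters) are assumptions made for illustration. The ftfy, fastText, and bloom-filter steps are omitted.

```python
import hashlib
import random
import unicodedata

# Devanagari Unicode block (U+0900 to U+097F).
DEVANAGARI_START, DEVANAGARI_END = 0x0900, 0x097F


def devanagari_ratio(text: str) -> float:
    """Share of non-whitespace characters inside the Devanagari block.

    Assumption: the card does not say how the ratio is computed.
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(DEVANAGARI_START <= ord(c) <= DEVANAGARI_END for c in chars)
    return hits / len(chars)


def clean(lines, min_chars=10, min_ratio=0.5):
    """Normalise to NFC, filter by length and script, drop exact duplicates.

    min_chars is a hypothetical threshold; min_ratio follows the card (0.5).
    """
    seen = set()
    kept = []
    for line in lines:
        text = unicodedata.normalize("NFC", line).strip()
        if len(text) < min_chars or devanagari_ratio(text) < min_ratio:
            continue
        # Exact deduplication: compare MD5 digests of the normalised text.
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(text)
    return kept


def split_98_1_1(items, seed=42):
    """Shuffle with a fixed seed, then cut roughly 98/1/1 into train/val/test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_holdout = len(shuffled) // 100  # about 1% each for validation and test
    val = shuffled[:n_holdout]
    test = shuffled[n_holdout:2 * n_holdout]
    train = shuffled[2 * n_holdout:]
    return train, val, test
```

Calling `clean(raw_lines)` and then `split_98_1_1(cleaned)` yields three lists. The rounding here will not necessarily reproduce the published split sizes, since the card does not state how the original split was cut.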
Provider:
Titung