himalaya-ai/cc100-nepali

Name: himalaya-ai/cc100-nepali
Creator: himalaya-ai
Published: 2026-04-02 18:11:47
License: No description provided

Hugging Face · Updated 2026-04-02 · Indexed 2026-04-12

Download link:

https://hf-mirror.com/datasets/himalaya-ai/cc100-nepali


Resource description:

---
language:
- ne
license: cc0-1.0
task_categories:
- text-generation
- fill-mask
task_ids:
- language-modeling
tags:
- nepali
- devanagari
- low-resource
- cc100
- monolingual
size_categories:
- 1M<n<10M
---

# CC-100 Nepali: Cleaned & Deduplicated

Cleaned, language-filtered, and deduplicated Nepali monolingual text derived from [CC-100](https://data.statmt.org/cc-100/), suitable for transformer pretraining.

Originally published at `himalaya-ai/cc100-nepali`. Dataset contents replaced with the cleaned version from `Titung/cc100-nepali-cleaned`.

## Statistics

| Split      | Sentences     |
|------------|---------------|
| train      | 4,736,157     |
| validation | 48,328        |
| test       | 48,329        |
| **total**  | **4,832,814** |

### Token Statistics (train split)

| Metric                  | Value             |
|-------------------------|-------------------|
| Total tokens            | 1,003,628,468     |
| Average tokens per row  | 211.91            |
| Max tokens in a row     | 1,675             |
| Min tokens in a row     | 19                |

> Token counts estimated using whitespace tokenization on the train split.

## Columns

| Column | Type | Description |
|---|---|---|
| `id` | string | Unique sentence ID |
| `text` | string | Nepali sentence |
| `char_length` | int32 | Character count |
| `word_count` | int32 | Whitespace-split word count |
| `devanagari_ratio` | float32 | Fraction of Devanagari characters |
| `source` | string | Always `cc100-ne` |

## Cleaning Pipeline

1. Unicode normalisation (NFC + ftfy)
2. Rule-based filters (length, Devanagari ratio ≥ 0.5, boilerplate removal)
3. Language ID: fastText `lid.176.bin`, confidence ≥ 0.7
4. Exact deduplication (MD5)
5. Near-deduplication (char 13-gram Bloom filter)
6. 98/1/1 train/validation/test split, seed 42

## Usage

```python
from datasets import load_dataset

ds = load_dataset("himalaya-ai/cc100-nepali")

# Filter for high-quality sentences
high_q = ds["train"].filter(
    lambda x: x["devanagari_ratio"] > 0.8 and x["word_count"] >= 5
)
```

## Citation

```bibtex
@inproceedings{conneau-etal-2020-unsupervised,
  title     = {Unsupervised Cross-lingual Representation Learning at Scale},
  author    = {Conneau, Alexis and others},
  booktitle = {Proceedings of ACL},
  year      = {2020}
}
```
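
## Cleaning Pipeline Sketch

The cleaning steps listed in the dataset card can be illustrated with a minimal, standard-library-only sketch. It covers NFC normalisation, the Devanagari ratio filter (≥ 0.5), and exact MD5 deduplication only; ftfy repair, boilerplate removal, fastText language ID, and Bloom-filter near-deduplication are omitted. The function names, the `min_words` length threshold, and the exact definition of the ratio (Devanagari code points over non-whitespace characters) are assumptions made for illustration, not the dataset authors' actual code.

```python
import hashlib
import unicodedata

# Devanagari Unicode block (U+0900 to U+097F).
DEVANAGARI_START, DEVANAGARI_END = 0x0900, 0x097F


def normalise(text: str) -> str:
    """Apply NFC normalisation and trim surrounding whitespace."""
    return unicodedata.normalize("NFC", text).strip()


def devanagari_ratio(text: str) -> float:
    """Fraction of non-whitespace characters in the Devanagari block.

    Assumed definition; the card does not say how the ratio is computed.
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(DEVANAGARI_START <= ord(c) <= DEVANAGARI_END for c in chars)
    return hits / len(chars)


def clean(lines, min_ratio=0.5, min_words=3):
    """Yield normalised lines that pass the filters, skipping exact duplicates.

    min_ratio follows the card (>= 0.5); min_words is a hypothetical
    length threshold, since the card does not state one.
    """
    seen = set()
    for raw in lines:
        text = normalise(raw)
        if len(text.split()) < min_words:
            continue  # length filter
        if devanagari_ratio(text) < min_ratio:
            continue  # script filter
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate
        seen.add(digest)
        yield text
```

Calling `list(clean(lines))` on an iterable of raw strings returns the surviving lines in their original order. Storing MD5 digests instead of full strings keeps the `seen` set compact, which matters at the scale of several million rows.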

Provider:

himalaya-ai
