
himalaya-ai/cc100-nepali

Source: Hugging Face. Updated 2026-04-02; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/himalaya-ai/cc100-nepali
Resource description:
---
language:
- ne
license: cc0-1.0
task_categories:
- text-generation
- fill-mask
task_ids:
- language-modeling
tags:
- nepali
- devanagari
- low-resource
- cc100
- monolingual
size_categories:
- 1M<n<10M
---

# CC-100 Nepali: Cleaned & Deduplicated

Cleaned, language-filtered, and deduplicated Nepali monolingual text derived from [CC-100](https://data.statmt.org/cc-100/), suitable for transformer pretraining.

Originally published at `himalaya-ai/cc100-nepali`. Dataset contents replaced with the cleaned version from `Titung/cc100-nepali-cleaned`.

## Statistics

| Split      | Sentences     |
|------------|---------------|
| train      | 4,736,157     |
| validation | 48,328        |
| test       | 48,329        |
| **total**  | **4,832,814** |

### Token Statistics (train split)

| Metric                  | Value             |
|-------------------------|-------------------|
| Total tokens            | 1,003,628,468     |
| Average tokens per row  | 211.91            |
| Max tokens in a row     | 1,675             |
| Min tokens in a row     | 19                |

> Token counts estimated using whitespace tokenization on the train split.

## Columns

| Column | Type | Description |
|---|---|---|
| `id` | string | Unique sentence ID |
| `text` | string | Nepali sentence |
| `char_length` | int32 | Character count |
| `word_count` | int32 | Whitespace-split word count |
| `devanagari_ratio` | float32 | Fraction of Devanagari characters |
| `source` | string | Always `cc100-ne` |

## Cleaning Pipeline

1. Unicode normalisation (NFC + ftfy)
2. Rule-based filters (length, Devanagari ratio ≥ 0.5, boilerplate removal)
3. Language ID using fastText `lid.176.bin`, confidence ≥ 0.7
4. Exact deduplication (MD5)
5. Near-deduplication (char 13-gram Bloom filter)
6. 98/1/1 train/validation/test split, seed 42

## Usage

```python
from datasets import load_dataset

ds = load_dataset("himalaya-ai/cc100-nepali")

# Filter for high-quality sentences
high_q = ds["train"].filter(
    lambda x: x["devanagari_ratio"] > 0.8 and x["word_count"] >= 5
)
```

## Citation

```bibtex
@inproceedings{conneau-etal-2020-unsupervised,
  title     = {Unsupervised Cross-lingual Representation Learning at Scale},
  author    = {Conneau, Alexis and others},
  booktitle = {Proceedings of ACL},
  year      = {2020}
}
```
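
Steps 1, 2, and 4 of the cleaning pipeline (NFC normalisation, the Devanagari ratio filter at 0.5, and MD5 exact deduplication) can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the dataset authors' code: the function names `devanagari_ratio` and `clean_and_dedupe` are hypothetical, the ratio is assumed to be computed over non-whitespace characters in the Unicode block U+0900 to U+097F, and the ftfy, length, boilerplate, fastText, and near-deduplication stages are omitted.

```python
import hashlib
import unicodedata


def devanagari_ratio(text: str) -> float:
    """Fraction of non-whitespace characters in the Devanagari block.

    Assumption: whitespace is excluded from the denominator. The dataset
    card does not specify how the published column is computed.
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(1 for c in chars if "\u0900" <= c <= "\u097f")
    return hits / len(chars)


def clean_and_dedupe(lines, min_ratio=0.5):
    """Yield NFC-normalised lines that pass the ratio filter, once each."""
    seen = set()  # MD5 digests of lines already emitted
    for line in lines:
        text = unicodedata.normalize("NFC", line).strip()
        if not text or devanagari_ratio(text) < min_ratio:
            continue
        digest = hashlib.md5(text.encode("utf-8")).digest()
        if digest in seen:
            continue  # exact duplicate of an earlier line
        seen.add(digest)
        yield text


sample = [
    "नेपाल एक सुन्दर देश हो।",
    "नेपाल एक सुन्दर देश हो।",        # exact duplicate, dropped
    "Click here to subscribe",          # no Devanagari, dropped
    "काठमाडौं Nepal को राजधानी हो।",   # mixed script, mostly Devanagari, kept
]
cleaned = list(clean_and_dedupe(sample))
```

Storing only the 16-byte digests rather than the full strings keeps the memory cost of the `seen` set bounded per line, which matters at the scale of several million sentences.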
Provided by:
himalaya-ai