d3b4g/dhivehi-corpus

Name: d3b4g/dhivehi-corpus
Creator: d3b4g
Published: 2026-03-25 17:21:29
License: CC-BY-4.0 (per the dataset card)

Hugging Face · Updated 2026-03-25 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/d3b4g/dhivehi-corpus


Description:

---
language:
- dv
license: cc-by-4.0
task_categories:
- text-generation
- text-classification
- token-classification
tags:
- dhivehi
- maldives
- thaana
- low-resource
- nlp
- news
size_categories:
- 100K<n<1M
---

# ދިވެހި Corpus: Dhivehi Text Corpus

Clean text corpus for the Dhivehi (Maldivian) language. Built for NLP research and language model training.

## Dataset Summary

| Split | Docs | Tokens |
|-------|------|--------|
| Train | 430,695 | ~81.6M |
| Validation | 23,924 | ~4.5M |
| Test | 23,924 | ~4.6M |
| **Total** | **478,543** | **~90.6M** |

- **Language**: Dhivehi (`dv`), written in Thaana script (Unicode U+0780–U+07BF)
- **License**: CC-BY-4.0
- **Avg quality score**: 0.958 / 1.0
- **Duplicates**: 0 (MinHash LSH deduplication at 80% threshold)

## Sources

| Source | Type | Docs |
|--------|------|------|
| sun.mv | News | 152,815 |
| vaguthu.mv | News | 134,898 |
| mihaaru.com | News | 103,012 |
| avas.mv | News | 50,828 |
| adhadhu.com | News | 33,567 |
| dv.wikipedia.org | Reference | 3,423 |

## Domain Distribution (train)

| Domain | Docs | % |
|--------|------|---|
| news | 373,991 | 86% |
| sports | 21,965 | 5% |
| business | 11,980 | 2% |
| entertainment | 9,306 | 2% |
| lifestyle | 6,966 | 1% |
| reference | 3,081 | 0.7% |
| government | 2,232 | 0.5% |
| religious | 720 | 0.2% |
| literary | 454 | 0.1% |

## Fields

| Field | Type | Description |
|-------|------|-------------|
| `body` | string | Main article text in Thaana script |
| `title` | string | Article headline |
| `source` | string | Source site (mihaaru, sun, vaguthu, avas, adhadhu, wikipedia) |
| `domain` | string | Content domain (news, sports, business, etc.) |
| `date` | string | Publication date (YYYY-MM-DD where available) |
| `author` | string | Author name where available |
| `quality_score` | float | Composite quality score 0–1 |
| `token_count` | int | Word count |
| `thaana_ratio` | float | Fraction of characters in Thaana Unicode range |
| `url` | string | Source URL |
| `doc_id` | string | MD5 hash of URL (unique identifier) |

## Quality Score

Each document is scored 0–1 based on:

- **Thaana ratio** (35%): fraction of Thaana script characters
- **Length score** (30%): log-normal score peaking at ~200 tokens
- **Sentence score** (20%): presence of multiple complete sentences
- **Encoding score** (15%): absence of replacement/corrupted characters

Minimum quality score in corpus: **0.663**. Average: **0.958**.

## Usage

```python
from datasets import load_dataset

# Load full corpus
dataset = load_dataset("d3b4g/dhivehi-corpus")

# Load only training split
train = load_dataset("d3b4g/dhivehi-corpus", split="train")

# Filter by domain
news_only = train.filter(lambda x: x["domain"] == "news")

# Filter by quality
high_quality = train.filter(lambda x: x["quality_score"] >= 0.95)

# Get just the text for language model training
texts = train.select_columns(["body"])
```

## Use Cases

- **LLM pretraining / fine-tuning**: Fine-tune multilingual models (mBERT, XLM-R, mT5) on Dhivehi
- **Tokenizer training**: Train a Dhivehi-specific BPE/SentencePiece tokenizer
- **Text classification**: Domain labels ready for news category classification
- **NER bootstrapping**: Rich source of Maldivian person names, island names, organizations
- **Sentiment analysis**: Political and social news with strong sentiment signal
- **Machine translation**: Dhivehi side of a parallel corpus
- **Spell checking / autocorrect**: Language model for Thaana keyboard input
- **Information retrieval**: Build Dhivehi search or RAG systems

## Limitations

- ~86% news domain: limited literary, legal, and colloquial register coverage
- Date coverage varies by source; many articles are missing publication dates
- No manual annotation: domain tags and quality scores are automatically assigned

## Citation

```bibtex
@dataset{dhivehi_corpus_2025,
  author    = {d3b4g},
  title     = {Dhivehi Corpus: A Large-Scale Text Corpus for the Maldivian Language},
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/datasets/d3b4g/dhivehi-corpus}
}
```

## License

CC-BY-4.0. You are free to use, share, and adapt this dataset with attribution. Original article content copyright remains with respective publishers.
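
The card describes `thaana_ratio`, `doc_id`, and the weighted `quality_score` but does not publish the code behind them. The sketch below illustrates how such values could be computed. It is not the author's pipeline: the function names are hypothetical, the ratio is assumed to count every character (spaces and punctuation included) in the denominator, the URL is assumed to be UTF-8 encoded and hex-digested for the MD5 identifier, and the four component scores are taken as inputs because their exact formulas are not given.

```python
import hashlib

# Thaana Unicode block, as stated in the card.
THAANA_START, THAANA_END = 0x0780, 0x07BF

# Component weights as listed in the Quality Score section.
WEIGHTS = {"thaana": 0.35, "length": 0.30, "sentence": 0.20, "encoding": 0.15}


def thaana_ratio(text: str) -> float:
    """Fraction of characters that fall in the Thaana block.

    Assumption: all characters, including whitespace and punctuation,
    count toward the denominator.
    """
    if not text:
        return 0.0
    in_block = sum(1 for ch in text if THAANA_START <= ord(ch) <= THAANA_END)
    return in_block / len(text)


def composite_quality(thaana: float, length: float,
                      sentence: float, encoding: float) -> float:
    """Weighted sum of four component scores, each expected in [0, 1]."""
    return (WEIGHTS["thaana"] * thaana
            + WEIGHTS["length"] * length
            + WEIGHTS["sentence"] * sentence
            + WEIGHTS["encoding"] * encoding)


def doc_id_from_url(url: str) -> str:
    """MD5 hex digest of the URL (assumption: UTF-8 bytes, lowercase hex)."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()
```

Recomputing `thaana_ratio(row["body"])` on a few rows and comparing it with the stored `thaana_ratio` field is one way to check whether these assumptions match the published values.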

