d3b4g/dhivehi-corpus
Hugging Face · Updated 2026-03-25 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/d3b4g/dhivehi-corpus
---
language:
- dv
license: cc-by-4.0
task_categories:
- text-generation
- text-classification
- token-classification
tags:
- dhivehi
- maldives
- thaana
- low-resource
- nlp
- news
size_categories:
- 100K<n<1M
---
# ދިވެހި Corpus — Dhivehi Text Corpus
A cleaned text corpus for the Dhivehi (Maldivian) language, built for NLP research and language model training.
## Dataset Summary
| Split | Docs | Tokens |
|-------|------|--------|
| Train | 430,695 | ~81.6M |
| Validation | 23,924 | ~4.5M |
| Test | 23,924 | ~4.6M |
| **Total** | **478,543** | **~90.6M** |
- **Language**: Dhivehi (`dv`) — written in Thaana script (Unicode U+0780–U+07BF)
- **License**: CC-BY-4.0
- **Avg quality score**: 0.958 / 1.0
- **Duplicates**: 0 near-duplicates remaining after MinHash LSH deduplication at an 80% similarity threshold
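
To make the deduplication criterion concrete, here is a minimal, standard-library-only sketch of how MinHash estimates the similarity of two documents against an 80% threshold. It is not the pipeline used to build this corpus: the shingle size, the number of hash functions, and all function names are assumptions chosen for illustration, and the LSH banding step that makes the comparison scale to hundreds of thousands of documents is omitted.

```python
import hashlib


def shingles(text: str, k: int = 5) -> set[str]:
    """Return the set of character k-grams of a whitespace-normalised text."""
    normalised = " ".join(text.split())
    if len(normalised) <= k:
        return {normalised}
    return {normalised[i:i + k] for i in range(len(normalised) - k + 1)}


def _hash(item: str, seed: int) -> int:
    """64-bit hash of `item`; each seed acts as a different hash function."""
    salt = seed.to_bytes(4, "big")
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(items: set[str], num_perm: int = 128) -> list[int]:
    """For each hash function, keep the minimum hash value over all items."""
    return [min(_hash(item, seed) for item in items) for seed in range(num_perm)]


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def is_near_duplicate(text_a: str, text_b: str, threshold: float = 0.8) -> bool:
    """True if the estimated similarity reaches the threshold (0.8 assumed here)."""
    sig_a = minhash_signature(shingles(text_a))
    sig_b = minhash_signature(shingles(text_b))
    return estimated_jaccard(sig_a, sig_b) >= threshold
```

At corpus scale, a library implementation with LSH indexing would normally replace the pairwise comparison shown here.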
## Sources
| Source | Type | Docs |
|--------|------|------|
| sun.mv | News | 152,815 |
| vaguthu.mv | News | 134,898 |
| mihaaru.com | News | 103,012 |
| avas.mv | News | 50,828 |
| adhadhu.com | News | 33,567 |
| dv.wikipedia.org | Reference | 3,423 |
## Domain Distribution (train)
| Domain | Docs | % |
|--------|------|---|
| news | 373,991 | 86.8% |
| sports | 21,965 | 5.1% |
| business | 11,980 | 2.8% |
| entertainment | 9,306 | 2.2% |
| lifestyle | 6,966 | 1.6% |
| reference | 3,081 | 0.7% |
| government | 2,232 | 0.5% |
| religious | 720 | 0.2% |
| literary | 454 | 0.1% |
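
Because the domains are heavily imbalanced, a classifier trained on the `domain` labels may benefit from class weighting. The sketch below recomputes each domain's share from the counts in the table and derives inverse-frequency weights. The weighting formula (`total / (n_classes * count)`) is a common heuristic, not something this dataset prescribes, and the function names are hypothetical.

```python
# Train-split document counts, copied from the table above.
TRAIN_DOMAIN_COUNTS = {
    "news": 373_991,
    "sports": 21_965,
    "business": 11_980,
    "entertainment": 9_306,
    "lifestyle": 6_966,
    "reference": 3_081,
    "government": 2_232,
    "religious": 720,
    "literary": 454,
}


def domain_shares(counts: dict[str, int]) -> dict[str, float]:
    """Fraction of documents per domain (values sum to 1.0)."""
    total = sum(counts.values())
    return {domain: n / total for domain, n in counts.items()}


def balanced_class_weights(counts: dict[str, int]) -> dict[str, float]:
    """Inverse-frequency weights: rare domains get weights above 1.0.

    ASSUMPTION: this 'balanced' heuristic is one common choice among several.
    """
    total = sum(counts.values())
    n_classes = len(counts)
    return {domain: total / (n_classes * n) for domain, n in counts.items()}


if __name__ == "__main__":
    shares = domain_shares(TRAIN_DOMAIN_COUNTS)
    weights = balanced_class_weights(TRAIN_DOMAIN_COUNTS)
    for domain in TRAIN_DOMAIN_COUNTS:
        print(f"{domain:14s} share={shares[domain]:.3%} weight={weights[domain]:.3f}")
```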
## Fields
| Field | Type | Description |
|-------|------|-------------|
| `body` | string | Main article text in Thaana script |
| `title` | string | Article headline |
| `source` | string | Source site (mihaaru, sun, vaguthu, avas, adhadhu, wikipedia) |
| `domain` | string | Content domain (news, sports, business, etc.) |
| `date` | string | Publication date (YYYY-MM-DD where available) |
| `author` | string | Author name where available |
| `quality_score` | float | Composite quality score 0–1 |
| `token_count` | int | Token count, measured as a word count |
| `thaana_ratio` | float | Fraction of characters in Thaana Unicode range |
| `url` | string | Source URL |
| `doc_id` | string | MD5 hash of URL (unique identifier) |
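
As an illustration of the two derived fields `thaana_ratio` and `doc_id`, the sketch below computes a Thaana character ratio over the Unicode block U+0780–U+07BF and an MD5-based identifier. Several details are assumptions, since the card does not specify them: whitespace is excluded from the ratio's denominator, the URL is hashed as UTF-8 without any normalisation, and the function names are hypothetical.

```python
import hashlib

# Thaana Unicode block, as stated in the dataset summary.
THAANA_START, THAANA_END = 0x0780, 0x07BF


def thaana_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that fall in the Thaana block.

    ASSUMPTION: whitespace is ignored; the corpus may count it differently.
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    thaana = sum(THAANA_START <= ord(c) <= THAANA_END for c in chars)
    return thaana / len(chars)


def make_doc_id(url: str) -> str:
    """Hex MD5 digest of the URL (32 characters).

    ASSUMPTION: the URL is hashed as-is, encoded as UTF-8.
    """
    return hashlib.md5(url.encode("utf-8")).hexdigest()
```

A check such as `make_doc_id(row["url"]) == row["doc_id"]` would show whether these assumptions match the published identifiers.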
## Quality Score
Each document receives a composite score from 0 to 1, computed as a weighted combination of:
- **Thaana ratio** (35%): fraction of characters in the Thaana script
- **Length score** (30%): log-normal score peaking at ~200 tokens
- **Sentence score** (20%): presence of multiple complete sentences
- **Encoding score** (15%): absence of replacement or corrupted characters
Minimum quality score in corpus: **0.663**. Average: **0.958**.
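
The weighting can be sketched as follows. Only the four weights and the ~200-token peak come from this card; the shape of the length curve (a bell curve over the logarithm of the token count), its `sigma` parameter, and the function names are assumptions, so the scores this sketch produces will not necessarily match the `quality_score` values stored in the dataset.

```python
import math

# Weights as listed above; they sum to 1.0.
WEIGHTS = {"thaana": 0.35, "length": 0.30, "sentence": 0.20, "encoding": 0.15}


def length_score(token_count: int, peak: int = 200, sigma: float = 1.0) -> float:
    """Bell curve over log(token_count), equal to 1.0 at `peak` tokens.

    ASSUMPTION: `sigma` and the exact curve shape are illustrative only.
    """
    if token_count <= 0:
        return 0.0
    z = (math.log(token_count) - math.log(peak)) / sigma
    return math.exp(-0.5 * z * z)


def composite_quality(thaana: float, length: float,
                      sentence: float, encoding: float) -> float:
    """Weighted sum of four sub-scores, each expected to lie in [0, 1]."""
    parts = {
        "thaana": thaana,
        "length": length,
        "sentence": sentence,
        "encoding": encoding,
    }
    return sum(WEIGHTS[name] * value for name, value in parts.items())
```

Under these weights, a document with perfect Thaana, sentence, and encoding sub-scores but a length score of 0 would receive 0.35 + 0.20 + 0.15 = 0.70.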
## Usage
```python
from datasets import load_dataset
# Load full corpus
dataset = load_dataset("d3b4g/dhivehi-corpus")
# Load only training split
train = load_dataset("d3b4g/dhivehi-corpus", split="train")
# Filter by domain
news_only = train.filter(lambda x: x["domain"] == "news")
# Filter by quality
high_quality = train.filter(lambda x: x["quality_score"] >= 0.95)
# Get just the text for language model training
texts = train.select_columns(["body"])
```
## Use Cases
- **LLM pretraining / fine-tuning**: fine-tune multilingual models (mBERT, XLM-R, mT5) on Dhivehi
- **Tokenizer training**: train a Dhivehi-specific BPE/SentencePiece tokenizer
- **Text classification**: domain labels are ready to use for news category classification
- **NER bootstrapping**: rich source of Maldivian person names, island names, and organizations
- **Sentiment analysis**: political and social news with strong sentiment signal (no sentiment labels are included)
- **Machine translation**: monolingual Dhivehi text that can serve as the Dhivehi side when building a parallel corpus
- **Spell checking / autocorrect**: language model for Thaana keyboard input
- **Information retrieval**: build Dhivehi search or RAG systems
## Limitations
- About 86% of documents are in the news domain, so literary, legal, and colloquial registers are underrepresented
- Date coverage varies by source; many articles are missing publication dates
- No manual annotation: domain tags and quality scores are assigned automatically
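
Since many records lack a publication date, downstream code should treat the `date` field as optional. The helper below is a minimal sketch that returns `None` for missing or malformed values instead of raising. How missing dates are actually encoded in the dataset (empty string, `None`, or something else) is not documented here, so that handling is an assumption, as is the function name.

```python
from datetime import date, datetime
from typing import Optional


def parse_publication_date(value: Optional[str]) -> Optional[date]:
    """Parse a YYYY-MM-DD string; return None if missing or malformed.

    ASSUMPTION: missing dates appear as None or an empty string.
    """
    if not value:
        return None
    try:
        return datetime.strptime(value.strip(), "%Y-%m-%d").date()
    except ValueError:
        return None
```

For example, `train.filter(lambda x: parse_publication_date(x["date"]) is not None)` would keep only records with a usable date.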
## Citation
```bibtex
@dataset{dhivehi_corpus_2025,
author = {d3b4g},
title = {Dhivehi Corpus: A Large-Scale Text Corpus for the Maldivian Language},
year = {2025},
publisher = {HuggingFace},
url = {https://huggingface.co/datasets/d3b4g/dhivehi-corpus}
}
```
## License
CC-BY-4.0. You are free to use, share, and adapt this dataset with attribution.
Copyright in the original article content remains with the respective publishers.
Provider:
d3b4g



