
d3b4g/dhivehi-corpus

Hugging Face. Updated 2026-03-25; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/d3b4g/dhivehi-corpus
Resource description:
---
language:
- dv
license: cc-by-4.0
task_categories:
- text-generation
- text-classification
- token-classification
tags:
- dhivehi
- maldives
- thaana
- low-resource
- nlp
- news
size_categories:
- 100K<n<1M
---

# ދިވެހި Corpus: Dhivehi Text Corpus

Clean text corpus for the Dhivehi (Maldivian) language. Built for NLP research and language model training.

## Dataset Summary

| Split | Docs | Tokens |
|-------|------|--------|
| Train | 430,695 | ~81.6M |
| Validation | 23,924 | ~4.5M |
| Test | 23,924 | ~4.6M |
| **Total** | **478,543** | **~90.6M** |

- **Language**: Dhivehi (`dv`), written in Thaana script (Unicode U+0780–U+07BF)
- **License**: CC-BY-4.0
- **Avg quality score**: 0.958 / 1.0
- **Duplicates**: 0 (MinHash LSH deduplication at 80% threshold)

## Sources

| Source | Type | Docs |
|--------|------|------|
| sun.mv | News | 152,815 |
| vaguthu.mv | News | 134,898 |
| mihaaru.com | News | 103,012 |
| avas.mv | News | 50,828 |
| adhadhu.com | News | 33,567 |
| dv.wikipedia.org | Reference | 3,423 |

## Domain Distribution (train)

Percentages are computed from the document counts over the 430,695 training documents.

| Domain | Docs | % |
|--------|------|---|
| news | 373,991 | 86.8% |
| sports | 21,965 | 5.1% |
| business | 11,980 | 2.8% |
| entertainment | 9,306 | 2.2% |
| lifestyle | 6,966 | 1.6% |
| reference | 3,081 | 0.7% |
| government | 2,232 | 0.5% |
| religious | 720 | 0.2% |
| literary | 454 | 0.1% |

## Fields

| Field | Type | Description |
|-------|------|-------------|
| `body` | string | Main article text in Thaana script |
| `title` | string | Article headline |
| `source` | string | Source site (mihaaru, sun, vaguthu, avas, adhadhu, wikipedia) |
| `domain` | string | Content domain (news, sports, business, etc.) |
| `date` | string | Publication date (YYYY-MM-DD where available) |
| `author` | string | Author name where available |
| `quality_score` | float | Composite quality score 0–1 |
| `token_count` | int | Word count |
| `thaana_ratio` | float | Fraction of characters in Thaana Unicode range |
| `url` | string | Source URL |
| `doc_id` | string | MD5 hash of URL (unique identifier) |

## Quality Score

Each document is scored 0–1 as a weighted combination of four components:

- **Thaana ratio** (35%): fraction of Thaana script characters
- **Length score** (30%): log-normal score peaking at ~200 tokens
- **Sentence score** (20%): presence of multiple complete sentences
- **Encoding score** (15%): absence of replacement/corrupted characters

Minimum quality score in corpus: **0.663**. Average: **0.958**.

## Usage

```python
from datasets import load_dataset

# Load the full corpus (all splits)
dataset = load_dataset("d3b4g/dhivehi-corpus")

# Load only the training split
train = load_dataset("d3b4g/dhivehi-corpus", split="train")

# Filter by domain
news_only = train.filter(lambda x: x["domain"] == "news")

# Filter by quality
high_quality = train.filter(lambda x: x["quality_score"] >= 0.95)

# Keep just the text column for language model training
texts = train.select_columns(["body"])
```

## Use Cases

- **LLM pretraining / fine-tuning**: Fine-tune multilingual models (mBERT, XLM-R, mT5) on Dhivehi
- **Tokenizer training**: Train a Dhivehi-specific BPE/SentencePiece tokenizer
- **Text classification**: Domain labels ready for news category classification
- **NER bootstrapping**: Rich source of Maldivian person names, island names, organizations
- **Sentiment analysis**: Political and social news with strong sentiment signal
- **Machine translation**: Dhivehi side of a parallel corpus
- **Spell checking / autocorrect**: Language model for Thaana keyboard input
- **Information retrieval**: Build Dhivehi search or RAG systems

## Limitations

- About 87% of the training split is news (373,991 of 430,695 documents), so coverage of literary, legal, and colloquial registers is limited
- Date coverage varies by source; many articles are missing publication dates
- No manual annotation: domain tags and quality scores are automatically assigned

## Citation

```bibtex
@dataset{dhivehi_corpus_2025,
  author    = {d3b4g},
  title     = {Dhivehi Corpus: A Large-Scale Text Corpus for the Maldivian Language},
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/datasets/d3b4g/dhivehi-corpus}
}
```

## License

CC-BY-4.0. You are free to use, share, and adapt this dataset with attribution. Copyright in the original article content remains with the respective publishers.
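The weighting listed under Quality Score can be illustrated with a minimal sketch. Only the four weights, the ~200-token peak, and the Thaana range U+0780–U+07BF come from the card. Everything else is an assumption made for illustration: the function names, ignoring whitespace in the Thaana ratio, the log-space spread `sigma`, the sentence-terminator heuristic, and measuring encoding damage by U+FFFD alone. It is not the corpus's actual scoring code.

```python
import math

THAANA_START, THAANA_END = 0x0780, 0x07BF  # Thaana block, as stated in the card
# Component weights, as stated in the card
WEIGHTS = {"thaana": 0.35, "length": 0.30, "sentence": 0.20, "encoding": 0.15}


def thaana_ratio(text: str) -> float:
    """Fraction of non-whitespace characters in the Thaana block.
    Assumption: whitespace is ignored; the card does not say."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    return sum(THAANA_START <= ord(ch) <= THAANA_END for ch in chars) / len(chars)


def length_score(n_tokens: int, peak: int = 200, sigma: float = 1.0) -> float:
    """Bell curve in log(token count), equal to 1.0 at `peak`.
    Assumption: `sigma` is a made-up spread; the card gives only the peak."""
    if n_tokens <= 0:
        return 0.0
    return math.exp(-(math.log(n_tokens / peak) ** 2) / (2 * sigma ** 2))


def sentence_score(text: str) -> float:
    """Hypothetical heuristic: full credit for two or more terminators."""
    n_terminators = sum(text.count(mark) for mark in (".", "!", "?", "\u061f"))
    return min(n_terminators, 2) / 2


def encoding_score(text: str) -> float:
    """Share of characters that are not U+FFFD (replacement character)."""
    if not text:
        return 0.0
    return 1.0 - text.count("\ufffd") / len(text)


def quality_score(text: str) -> float:
    """Weighted sum of the four components, in the range 0 to 1."""
    parts = {
        "thaana": thaana_ratio(text),
        "length": length_score(len(text.split())),  # whitespace tokens (assumed)
        "sentence": sentence_score(text),
        "encoding": encoding_score(text),
    }
    return sum(WEIGHTS[name] * value for name, value in parts.items())
```

Because each component lies between 0 and 1 and the weights sum to 1, the composite also stays between 0 and 1. Scores from this sketch will not reproduce the `quality_score` values shipped in the dataset.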
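The Dataset Summary mentions MinHash LSH deduplication at an 80% threshold. The sketch below shows that general technique in pure Python. The card does not describe the actual pipeline, so the shingle size, the signature length, the band layout, and all function names here are assumptions chosen for illustration.

```python
import hashlib

NUM_PERM = 128    # signature length (assumed)
BANDS = 16        # LSH bands of 8 rows each (assumed)
THRESHOLD = 0.8   # similarity threshold, as stated in the card


def shingles(text: str, k: int = 5) -> set[str]:
    """Character k-grams over whitespace-normalized text (k is assumed)."""
    text = " ".join(text.split())
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}


def _hash(shingle: str, seed: int) -> int:
    """64-bit hash of a shingle; the seed selects one hash function."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash(text: str) -> tuple[int, ...]:
    """MinHash signature: the minimum hash value under each seeded function."""
    grams = shingles(text)
    return tuple(min(_hash(g, seed) for g in grams) for seed in range(NUM_PERM))


def estimated_jaccard(sig_a: tuple[int, ...], sig_b: tuple[int, ...]) -> float:
    """Fraction of matching signature slots, which estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(docs: list[str]) -> list[int]:
    """Return indices of documents to keep, dropping near-duplicates of earlier ones."""
    rows = NUM_PERM // BANDS
    buckets: dict[tuple, list[int]] = {}
    signatures: dict[int, tuple[int, ...]] = {}
    kept = []
    for idx, text in enumerate(docs):
        sig = minhash(text)
        band_keys = [(b, sig[b * rows:(b + 1) * rows]) for b in range(BANDS)]
        # Candidates are kept documents that share at least one band with this one.
        candidates = {j for key in band_keys for j in buckets.get(key, ())}
        if any(estimated_jaccard(sig, signatures[j]) >= THRESHOLD for j in candidates):
            continue  # near-duplicate of an earlier document
        signatures[idx] = sig
        kept.append(idx)
        for key in band_keys:
            buckets.setdefault(key, []).append(idx)
    return kept
```

Banding keeps the comparison count low: only documents that collide in at least one band are compared, and the estimated Jaccard similarity then decides whether the later document is dropped. At corpus scale a dedicated library would normally replace this loop, since hashing every shingle 128 times in pure Python is slow.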

Provider:
d3b4g