d3b4g/maldivian-legal-corpus
Source: Hugging Face | Updated 2026-03-26 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/d3b4g/maldivian-legal-corpus
Dataset description:
---
language:
- dv
license: cc-by-4.0
task_categories:
- text-generation
- translation
- summarization
- question-answering
pretty_name: Maldivian Legal Corpus
size_categories:
- n<1K
tags:
- legal
- dhivehi
- maldives
- law
- legislation
- government
- low-resource
configs:
- config_name: default
  data_files:
  - split: train
    path: data/laws/train.jsonl
  - split: test
    path: data/laws/test.jsonl
  default: true
- config_name: sections
  data_files:
  - split: train
    path: data/sections/train.jsonl
  - split: test
    path: data/sections/test.jsonl
---
# Maldivian Legal Corpus (V1.1)
An open, structured corpus of Maldivian laws, published on Hugging Face and sourced from the official [MVLaw portal](https://mvlaw.gov.mv) maintained by the Attorney General's Office of the Maldives.
## Dataset summary
235 published Maldivian laws in Dhivehi (ދިވެހި), spanning from 1932 to 2025.
| Stat | Value |
|---|---|
| Total laws | 235 (215 Train / 20 Test) |
| Total sections | 13,988 (12,742 Train / 1,246 Test) |
| Estimated Dhivehi tokens | ~3,000,000 |
| Average chars per law | ~50,000 |
| Average sections per law | ~59 |
| Language | Dhivehi (`dv`) |
| Source | [mvlaw.gov.mv](https://mvlaw.gov.mv/dv) |
| License | CC-BY-4.0 |
## Why this dataset matters
Dhivehi is a **critically low-resource language** spoken by ~340,000 people in the Maldives. This dataset enables:
- **LLM pretraining / fine-tuning** on Dhivehi legal text.
- **Legal RAG systems** for Maldivian law (granular section-level chunking).
- **Translation research** — Dhivehi ↔ English legal translation.
- **Legal NLP benchmarking** for a low-resource language.
- **Civic tech** — searchable Maldivian law for citizens and lawyers.
## Dataset structure
### Configurations
1. **`default` (Laws)**: Full legal documents (1 row = 1 Law).
2. **`sections` (Granular)**: Broken down by Articles/Chapters (1 row = 1 Section).
### Fields (default config)
| Field | Type | Description |
|---|---|---|
| `id` | int | Unique legislation ID from MVLaw |
| `title` | string | Law title in Dhivehi |
| `slug` | string | URL-safe identifier |
| `legislation_no` | string | Official law number (e.g. `17/2021`) |
| `status` | string | Always `published` in this release |
| `implemented_at` | string | Enactment date (ISO 8601) |
| `pdf_url` | string | Direct URL to official PDF |
| `web_url` | string | URL to law on MVLaw portal |
| `categories` | list[string] | Category tags (e.g. `["ޤާނޫނު", "ވޮލިއުމް VIII"]`) |
| `text_content` | string | Full law text in Dhivehi with `[section heading]` markers |
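The `[section heading]` markers in `text_content` make it possible to split a full law into sections without loading the `sections` config. The following is a minimal sketch, assuming each heading sits on its own line wrapped in square brackets; the exact marker layout is not specified above, so check a few rows before relying on it. `split_sections` is a hypothetical helper, not part of the dataset or the `datasets` library.

```python
import re

# Assumption: a heading occupies a whole line and is wrapped in square brackets.
HEADING_RE = re.compile(r"^\[(?P<heading>[^\[\]\n]+)\][ \t]*$", re.MULTILINE)


def split_sections(text_content):
    """Split a law's text into (heading, body) pairs at bracketed heading lines."""
    matches = list(HEADING_RE.finditer(text_content))
    sections = []

    # Any text before the first heading is kept under an empty heading.
    preamble = text_content[: matches[0].start()] if matches else text_content
    if preamble.strip():
        sections.append(("", preamble.strip()))

    for i, match in enumerate(matches):
        # A section body runs until the next heading or the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text_content)
        body = text_content[match.end():end].strip()
        sections.append((match.group("heading").strip(), body))
    return sections
```

For retrieval work, the prebuilt `sections` config is likely the better starting point; a splitter like this is mainly useful for inspecting how the full text and the section rows line up.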
## How to use
### 1. Load full laws
```python
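from datasets import load_dataset

# The `default` config holds one row per law; this loads its train split.
ds = load_dataset("d3b4g/maldivian-legal-corpus", split="train")
print(ds[0]["title"])  # Dhivehi title of the first law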
```
### 2. Load granular sections (for RAG)
```python
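from datasets import load_dataset

# The `sections` config holds one row per section, which suits retrieval chunking.
ds_sections = load_dataset("d3b4g/maldivian-legal-corpus", "sections", split="train")
print(ds_sections[0]["section_title"])  # heading of the first section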
```
### 3. Load test split for evaluation
```python
from datasets import load_dataset

# Held-out test split of the `default` config (20 laws).
ds_test = load_dataset("d3b4g/maldivian-legal-corpus", split="test")
```
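Since `implemented_at` is documented as an ISO 8601 string, rows can be filtered by enactment year once a split is loaded. This is a rough sketch, assuming the value starts with a four-digit year (for example `2021-12-05`); `enactment_year` and `enacted_between` are hypothetical helpers that work on any iterable of row dicts.

```python
def enactment_year(row):
    """Return the year from `implemented_at`, or None if it is missing or malformed."""
    value = row.get("implemented_at") or ""
    # Assumption: ISO 8601 values begin with "YYYY".
    prefix = value[:4]
    return int(prefix) if len(prefix) == 4 and prefix.isdigit() else None


def enacted_between(rows, first_year, last_year):
    """Keep rows whose enactment year falls within [first_year, last_year]."""
    kept = []
    for row in rows:
        year = enactment_year(row)
        if year is not None and first_year <= year <= last_year:
            kept.append(row)
    return kept
```

Iterating over a loaded split yields one dict per row, so a call such as `enacted_between(ds, 2000, 2025)` should work on the `ds` object from the first example.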
## Known limitations
- **18 laws excluded**: scanned, image-only PDFs (mostly pre-1970). OCR versions will be added in V2.
- **2 laws excluded**: failed to fetch at scrape time; will be re-scraped.
- **No English translations**: an English translation is planned for V2.
- **Arabic-Indic numerals in law numbers**: some very old laws write their law numbers in Arabic-Indic digits (e.g. `.٨/ ٣ ١`).
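If you need to compare or sort law numbers, one option is to map Arabic-Indic digits to ASCII digits first. The sketch below assumes the digits fall in the Unicode range U+0660 to U+0669; `normalize_digits` is a hypothetical helper. It only swaps digits and does not fix spacing, punctuation, or digit order in the raw strings.

```python
# Map Arabic-Indic digits (U+0660 to U+0669) to the ASCII digits 0-9.
ARABIC_INDIC_TO_ASCII = {0x0660 + i: str(i) for i in range(10)}


def normalize_digits(text):
    """Replace Arabic-Indic digits with ASCII digits; leave other characters unchanged."""
    return text.translate(ARABIC_INDIC_TO_ASCII)
```

Applied to a field such as `legislation_no`, this leaves values like `17/2021` untouched and rewrites only the older entries.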
## Data source & provenance
All laws are sourced from [mvlaw.gov.mv](https://mvlaw.gov.mv/dv), the official Maldivian laws portal maintained by the Attorney General's Office of the Republic of Maldives. Data was collected via web scraping in 2025.
## License
This dataset is released under [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/). Underlying laws are public government documents. Please cite this dataset if you use it in research.
## Citation
```bibtex
@dataset{maldivian_legal_corpus_2025,
author = {d3b4g},
title = {Maldivian Legal Corpus},
year = {2025},
publisher = {HuggingFace},
url = {https://huggingface.co/datasets/d3b4g/maldivian-legal-corpus},
note = {235 published Maldivian laws in Dhivehi, sourced from mvlaw.gov.mv}
}
```
## Roadmap
- [x] v1.1: Add `sections` config with one row per article
- [ ] v2: Add English translations
- [ ] v2: Add OCR versions of the 18 scanned, image-only laws
Provider:
d3b4g



