
d3b4g/maldivian-legal-corpus

Hugging Face, updated 2026-03-26, indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/d3b4g/maldivian-legal-corpus
Resource description:
---
language:
- dv
license: cc-by-4.0
task_categories:
- text-generation
- translation
- summarization
- question-answering
pretty_name: Maldivian Legal Corpus
size_categories:
- n<1K
tags:
- legal
- dhivehi
- maldives
- law
- legislation
- government
- low-resource
configs:
- config_name: default
  data_files:
  - split: train
    path: data/laws/train.jsonl
  - split: test
    path: data/laws/test.jsonl
  default: true
- config_name: sections
  data_files:
  - split: train
    path: data/sections/train.jsonl
  - split: test
    path: data/sections/test.jsonl
---

# Maldivian Legal Corpus (V1.1)

Open, structured corpus of Maldivian laws published on HuggingFace, sourced from the official [MVLaw portal](https://mvlaw.gov.mv) maintained by the Attorney General's Office of the Maldives.

## Dataset summary

235 published Maldivian laws in Dhivehi (ދިވެހި), spanning from 1932 to 2025.

| Stat | Value |
|---|---|
| Total laws | 235 (215 Train / 20 Test) |
| Total sections | ~14,000 (12,742 Train / 1,246 Test) |
| Estimated Dhivehi tokens | ~3,000,000 |
| Average chars per law | ~50,000 |
| Average sections per law | ~59 |
| Language | Dhivehi (`dv`) |
| Source | [mvlaw.gov.mv](https://mvlaw.gov.mv/dv) |
| License | CC-BY-4.0 |

## Why this dataset matters

Dhivehi is a **critically low-resource language** spoken by ~340,000 people in the Maldives. This dataset enables:

- **LLM pretraining / fine-tuning** on Dhivehi legal text.
- **Legal RAG systems** for Maldivian law (granular section-level chunking).
- **Translation research**: Dhivehi ↔ English legal translation.
- **Legal NLP benchmarking** for a low-resource language.
- **Civic tech**: searchable Maldivian law for citizens and lawyers.

## Dataset structure

### Configurations

1. **`default` (Laws)**: Full legal documents (1 row = 1 Law).
2. **`sections` (Granular)**: Broken down by Articles/Chapters (1 row = 1 Section).

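As a quick consistency check on the dataset summary table, the per-split counts can be added up and compared with the rounded totals. The sketch below uses only figures copied from that table; the variable names are illustrative, and the character and token values are the card's own estimates, so the derived ratio is rough at best.

```python
# Figures copied from the dataset summary table.
LAWS = {"train": 215, "test": 20}
SECTIONS = {"train": 12_742, "test": 1_246}
AVG_CHARS_PER_LAW = 50_000     # approximate, as stated in the table
ESTIMATED_TOKENS = 3_000_000   # approximate, as stated in the table

total_laws = sum(LAWS.values())
total_sections = sum(SECTIONS.values())

# Derived quantities: mean sections per law and the share held out for testing.
avg_sections_per_law = total_sections / total_laws
test_share = LAWS["test"] / total_laws

# Rough characters-per-token ratio implied by the two estimates above.
chars_per_token = AVG_CHARS_PER_LAW * total_laws / ESTIMATED_TOKENS

print(f"laws: {total_laws}")
print(f"sections: {total_sections}")
print(f"sections per law: {avg_sections_per_law:.1f}")
print(f"test share of laws: {test_share:.1%}")
print(f"implied chars per token: {chars_per_token:.2f}")
```

The split counts sum to 235 laws and 13,988 sections, which matches the table's rounded "~14,000" and gives roughly 59.5 sections per law.
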
### Fields (default config)

| Field | Type | Description |
|---|---|---|
| `id` | int | Unique legislation ID from MVLaw |
| `title` | string | Law title in Dhivehi |
| `slug` | string | URL-safe identifier |
| `legislation_no` | string | Official law number (e.g. `17/2021`) |
| `status` | string | Always `published` in this release |
| `implemented_at` | string | Enactment date (ISO 8601) |
| `pdf_url` | string | Direct URL to official PDF |
| `web_url` | string | URL to law on MVLaw portal |
| `categories` | list[string] | Category tags (e.g. `["ޤާނޫނު", "ވޮލިއުމް VIII"]`) |
| `text_content` | string | Full law text in Dhivehi with `[section heading]` markers |

## How to use

### 1. Load full laws

```python
from datasets import load_dataset

# The default config holds one row per law.
ds = load_dataset("d3b4g/maldivian-legal-corpus", split="train")
print(ds[0]["title"])
```

### 2. Load granular sections (for RAG)

```python
from datasets import load_dataset

# The "sections" config holds one row per section.
ds_sections = load_dataset("d3b4g/maldivian-legal-corpus", "sections", split="train")
print(ds_sections[0]["section_title"])
```

### 3. Load test split for evaluation

```python
from datasets import load_dataset

# Held-out laws from the default config.
ds_test = load_dataset("d3b4g/maldivian-legal-corpus", split="test")
```

## Known limitations

- **18 laws excluded**: scanned image-only PDFs (mostly pre-1970). OCR versions will be added in V2.
- **2 laws excluded**: failed to fetch at scrape time; will be re-scraped.
- **No English translations**: an English translation is planned for V2.
- **Arabic-Indic numerals in law numbers**: some very old laws write their law numbers in Arabic-Indic numerals (e.g. `.٨/ ٣ ١`) rather than ASCII digits.

## Data source & provenance

All laws are sourced from [mvlaw.gov.mv](https://mvlaw.gov.mv/dv), the official Maldivian laws portal maintained by the Attorney General's Office of the Republic of Maldives. Data was collected via web scraping in 2025.

## License

This dataset is released under [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/). Underlying laws are public government documents.
Please cite this dataset if you use it in research.

## Citation

```bibtex
@dataset{maldivian_legal_corpus_2025,
  author    = {d3b4g},
  title     = {Maldivian Legal Corpus},
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/datasets/d3b4g/maldivian-legal-corpus},
  note      = {235 published Maldivian laws in Dhivehi, sourced from mvlaw.gov.mv}
}
```

## Roadmap

- [x] v1.1: Add `sections` config with one row per article (DONE)
- [ ] v2: Add English translations
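
The known limitations mention that some very old laws write their law numbers in Arabic-Indic numerals. If you need to sort or match those numbers against ASCII ones such as `17/2021`, one option is to map the digits before comparing. The following is a minimal sketch, not part of the dataset tooling: the helper name `normalize_law_number` and the sample value are hypothetical, and it assumes the affected values live in `legislation_no`.

```python
# Map the ten Arabic-Indic digits (U+0660 to U+0669) to ASCII "0" to "9".
ARABIC_INDIC_TO_ASCII = {0x0660 + i: str(i) for i in range(10)}


def normalize_law_number(raw: str) -> str:
    """Return `raw` with Arabic-Indic digits replaced by ASCII digits.

    Only digits are rewritten. Spacing, punctuation, and character order
    are left untouched, because the dataset card does not say how those
    should be interpreted.
    """
    return raw.translate(ARABIC_INDIC_TO_ASCII)


# Hypothetical sample value, written with escapes to avoid bidi confusion.
sample = "\u0668/\u0663\u0661"
print(normalize_law_number(sample))
# ASCII law numbers pass through unchanged.
print(normalize_law_number("17/2021"))
```

Because the mapping touches digits only, it is safe to apply to every row; whether the resulting digit order matches the intended law number for right-to-left entries would still need checking against the source PDFs.
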
Provider:
d3b4g