dienmoc/vietnamese-legal-documents
Source: Hugging Face. Updated 2026-03-22; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/dienmoc/vietnamese-legal-documents
Resource description:

---
language:
- vi
license: cc-by-4.0
pretty_name: Vietnamese Legal Documents
size_categories:
- 100K<n<1M
task_categories:
- text-classification
- text-generation
- question-answering
- summarization
tags:
- legal
- vietnamese
- law
- government
- NLP
- text-mining
configs:
- config_name: metadata
  data_files:
  - split: data
    path: metadata/data-*.parquet
- config_name: content
  data_files:
  - split: data
    path: content/data-*.parquet
dataset_info:
- config_name: metadata
  features:
  - name: id
    dtype: int64
  - name: document_number
    dtype: string
  - name: title
    dtype: string
  - name: url
    dtype: string
  - name: legal_type
    dtype: string
  - name: legal_sectors
    dtype: string
  - name: issuing_authority
    dtype: string
  - name: issuance_date
    dtype: string
  - name: signers
    dtype: string
  num_rows: 518255
- config_name: content
  features:
  - name: id
    dtype: int64
  - name: content
    dtype: string
  num_rows: 518255
---
# Vietnamese Legal Documents
A comprehensive dataset of **518,255 Vietnamese legal documents** sourced from
[thuvienphapluat.vn](https://thuvienphapluat.vn) — the largest Vietnamese legal
document repository. The dataset covers laws, decrees, circulars, decisions, and
other official documents issued by Vietnamese government bodies, spanning from
**1924 to 2026**.
---
## At a Glance
| | |
|---|---|
| 🗂️ **Total documents** | 518,255 |
| 📅 **Date range** | 1924 – 2026 |
| 🏛️ **Issuing authorities** | 1,335 unique bodies |
| 📋 **Document types** | 36 unique types |
| 🌐 **Language** | Vietnamese |
| 💾 **Content size** | ~3.6 GB (parquet) |
---
## Dataset Structure
This dataset is split into **two configs** to allow fast metadata access without loading the full text.
| Config | Split | Rows | Size | Description |
|---|---|---|---|---|
| `metadata` | `data` | 518,255 | ~82 MB | 9 metadata columns, no text content |
| `content` | `data` | 518,255 | ~3.6 GB | `id` + full markdown document text |
Join on the `id` column to get both metadata and content.
---
## Load the Dataset
```python
from datasets import load_dataset

REPO_ID = "th1nhng0/vietnamese-legal-documents"

# Load metadata only (fast, ~82 MB)
meta = load_dataset(REPO_ID, "metadata")["data"].to_pandas()
print(meta.head())

# Load full text content (~3.6 GB)
text = load_dataset(REPO_ID, "content")["data"].to_pandas()

# Join metadata + content on the shared `id` column
df = meta.merge(text, on="id")
print(df.columns.tolist())
```
---
## Schema
### `metadata` config
| Column | Type | Description |
|---|---|---|
| `id` | int64 | Unique numeric document ID |
| `document_number` | string | Official document number (e.g. `115/NQ-HĐBCQG`) |
| `title` | string | Full Vietnamese title |
| `url` | string | Source URL on thuvienphapluat.vn |
| `legal_type` | string | Document type (Quyết định, Công văn, Nghị quyết, …) |
| `legal_sectors` | string | Pipe-separated sector/topic tags |
| `issuing_authority` | string | Name of the issuing government body |
| `issuance_date` | string | Issue date in `DD/MM/YYYY` format |
| `signers` | string | Pipe-separated `name:id` pairs of signatories |
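
Three of these columns pack structured values into plain strings: `issuance_date` (`DD/MM/YYYY`), `legal_sectors` (pipe-separated tags), and `signers` (pipe-separated `name:id` pairs). The sketch below shows one way to unpack them with the standard library. The helper names and the sample values are hypothetical, and the card does not say how missing values are encoded, so the helpers simply treat empty or malformed input as "no value".

```python
from datetime import date, datetime
from typing import List, Optional, Tuple


def parse_issuance_date(value: Optional[str]) -> Optional[date]:
    """Parse a DD/MM/YYYY string; return None if it is empty or malformed."""
    try:
        return datetime.strptime(value.strip(), "%d/%m/%Y").date()
    except (ValueError, AttributeError):
        return None


def parse_sectors(value: Optional[str]) -> List[str]:
    """Split pipe-separated sector tags, dropping empty entries."""
    return [part.strip() for part in (value or "").split("|") if part.strip()]


def parse_signers(value: Optional[str]) -> List[Tuple[str, str]]:
    """Split pipe-separated `name:id` pairs into (name, id) tuples."""
    signers = []
    for pair in (value or "").split("|"):
        pair = pair.strip()
        if not pair:
            continue
        if ":" not in pair:
            # Assumption: a pair without a colon is a bare name with no ID.
            signers.append((pair, ""))
            continue
        # rpartition splits on the last colon, so a colon inside a name survives.
        name, _, signer_id = pair.rpartition(":")
        signers.append((name.strip(), signer_id.strip()))
    return signers


# Hypothetical sample values, not taken from the dataset.
print(parse_issuance_date("22/03/2026"))
print(parse_sectors("Thuế|Tài chính"))
print(parse_signers("Nguyễn Văn A:123|Trần Thị B:456"))
```

With pandas, the same helpers can be applied column-wise, for example `df["issuance_date"].map(parse_issuance_date)`.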
### `content` config
| Column | Type | Description |
|---|---|---|
| `id` | int64 | Document ID — join key with the `metadata` config |
| `content` | string | Full document text converted to Markdown |
---
## Statistics
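### Documents by Year
[Figure: number of documents per year]
### Top 15 Document Types
[Figure: the 15 most frequent document types]
### Top 15 Legal Sectors
[Figure: the 15 most frequent legal sectors]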
---
## Use Cases
- 🔍 **Legal information retrieval** — build search engines over Vietnamese law
- 🤖 **LLM fine-tuning** — train or fine-tune language models on legal Vietnamese
- 📊 **Legal NLP research** — NER, classification, summarization, QA
- 📈 **Policy analysis** — track legislative trends over time
- 🌏 **Low-resource NLP** — Vietnamese legal text is underrepresented in existing datasets
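
As a small illustration of the policy-analysis use case, the sketch below counts documents per year from `DD/MM/YYYY` date strings such as those in the `issuance_date` column. The function name and the sample values are hypothetical; malformed or missing dates are skipped rather than guessed.

```python
from collections import Counter
from typing import Dict, Iterable


def documents_per_year(dates: Iterable[object]) -> Dict[int, int]:
    """Count documents per year from DD/MM/YYYY strings, skipping malformed values."""
    counts: Counter = Counter()
    for value in dates:
        if not isinstance(value, str):
            continue  # e.g. None or NaN coming from a DataFrame column
        parts = value.strip().split("/")
        if len(parts) == 3 and parts[2].isdigit():
            counts[int(parts[2])] += 1
    # Return the years in ascending order for easier reading and plotting.
    return dict(sorted(counts.items()))


# Hypothetical sample values, not taken from the dataset.
sample = ["01/07/2015", "15/03/2015", "30/12/2020", "", "unknown"]
print(documents_per_year(sample))
```

The same function accepts a pandas column directly, for example `documents_per_year(df["issuance_date"])`.
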
---
## Data Collection
This is an independent personal research project. Documents were collected from
[thuvienphapluat.vn](https://thuvienphapluat.vn) — a public legal document portal
— via their sitemap. This project has **no affiliation with
thuvienphapluat.vn**.
HTML content was converted to Markdown using BeautifulSoup. Only Vietnamese-language
documents were retained; English versions and technical standards (Tiêu chuẩn) were
excluded.
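
The card does not describe the conversion rules, so the following is only a rough sketch of what an HTML-to-Markdown pass can look like. It deliberately swaps BeautifulSoup for the standard library's `html.parser` so that it runs without dependencies, and it handles just a few tags (headings, paragraphs, bold, list items). The class name, function name, and sample HTML are all hypothetical; this is not the author's pipeline.

```python
import re
from html.parser import HTMLParser


class MarkdownSketch(HTMLParser):
    """Convert a small subset of HTML to Markdown (illustrative only)."""

    HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self.parts.append(self.HEADINGS[tag])
        elif tag == "li":
            self.parts.append("- ")
        elif tag in ("b", "strong"):
            self.parts.append("**")

    def handle_endtag(self, tag):
        if tag in self.HEADINGS or tag == "p":
            self.parts.append("\n\n")
        elif tag == "li":
            self.parts.append("\n")
        elif tag in ("b", "strong"):
            self.parts.append("**")

    def handle_data(self, data):
        # Collapse whitespace runs to one space so source indentation does not leak through.
        self.parts.append(re.sub(r"\s+", " ", data))


def html_to_markdown(html: str) -> str:
    parser = MarkdownSketch()
    parser.feed(html)
    parser.close()
    # Trim each line, then cap consecutive blank lines at one.
    lines = [line.strip() for line in "".join(parser.parts).split("\n")]
    return re.sub(r"\n{3,}", "\n\n", "\n".join(lines)).strip()


# Hypothetical sample input, not taken from the dataset.
sample = "<h1>Điều 1</h1>\n<p>Hello <b>world</b>.</p><ul><li>one</li><li>two</li></ul>"
print(html_to_markdown(sample))
```
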
---
## License & Legal Basis
Vietnamese legal documents (laws, decrees, circulars, decisions, and other normative
acts) are **public domain by Vietnamese law**. Under the
[Law on Access to Information (Luật Tiếp cận thông tin, No. 104/2016/QH13)](https://thuvienphapluat.vn/van-ban/Bo-may-hanh-chinh/Luat-tiep-can-thong-tin-2016-280116.aspx)
and the [Law on Promulgation of Legal Documents (No. 64/2025/QH15)](https://thuvienphapluat.vn/van-ban/Bo-may-hanh-chinh/Luat-ban-hanh-van-ban-quy-pham-phap-luat-2025-so-64-2025-QH15-639239.aspx),
official legal normative documents issued by state agencies must be made publicly
accessible free of charge.
The **compiled dataset** (collection, processing, metadata schema, and Markdown
conversion) is released under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/).
**Intended for research purposes only.**
---
## Citation
```bibtex
@dataset{ngo_thinh_2026_vietnamese_legal,
title = {Vietnamese Legal Documents},
author = {Ngô, Thịnh},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/datasets/th1nhng0/vietnamese-legal-documents},
note = {518,255 Vietnamese legal documents compiled for research purposes}
}
```
Provider:
dienmoc



