Huymaeco/vietnamese-legal-documents
Hugging Face, updated 2026-03-22, indexed 2026-03-29
Download link: https://hf-mirror.com/datasets/Huymaeco/vietnamese-legal-documents
Resource description:
---
language:
- vi
license: cc-by-4.0
pretty_name: Vietnamese Legal Documents
size_categories:
- 100K<n<1M
task_categories:
- text-classification
- text-generation
- question-answering
- summarization
tags:
- legal
- vietnamese
- law
- government
- NLP
- text-mining
configs:
- config_name: metadata
  data_files:
  - split: data
    path: metadata/data-*.parquet
- config_name: content
  data_files:
  - split: data
    path: content/data-*.parquet
dataset_info:
- config_name: metadata
  features:
  - name: id
    dtype: int64
  - name: document_number
    dtype: string
  - name: title
    dtype: string
  - name: url
    dtype: string
  - name: legal_type
    dtype: string
  - name: legal_sectors
    dtype: string
  - name: issuing_authority
    dtype: string
  - name: issuance_date
    dtype: string
  - name: signers
    dtype: string
  num_rows: 518255
- config_name: content
  features:
  - name: id
    dtype: int64
  - name: content
    dtype: string
  num_rows: 518255
---
# Vietnamese Legal Documents
A comprehensive dataset of **518,255 Vietnamese legal documents** sourced from
[thuvienphapluat.vn](https://thuvienphapluat.vn) — the largest Vietnamese legal
document repository. The dataset covers laws, decrees, circulars, decisions, and
other official documents issued by Vietnamese government bodies, spanning from
**1924 to 2026**.
---
## At a Glance
| | |
|---|---|
| 🗂️ **Total documents** | 518,255 |
| 📅 **Date range** | 1924 – 2026 |
| 🏛️ **Issuing authorities** | 1,335 unique bodies |
| 📋 **Document types** | 36 unique types |
| 🌐 **Language** | Vietnamese |
| 💾 **Content size** | ~3.6 GB (parquet) |
---
## Dataset Structure
This dataset is split into **two configs** to allow fast metadata access without loading the full text.
| Config | Split | Rows | Size | Description |
|---|---|---|---|---|
| `metadata` | `data` | 518,255 | ~82 MB | 9 metadata columns, no text content |
| `content` | `data` | 518,255 | ~3.6 GB | `id` + full markdown document text |
Join on the `id` column to get both metadata and content.
---
## Load the Dataset
```python
from datasets import load_dataset

# Load metadata only (fast, ~82 MB)
ds = load_dataset("th1nhng0/vietnamese-legal-documents", "metadata")
df = ds["data"].to_pandas()
print(df.head())

# Load full text content (~3.6 GB)
ds_content = load_dataset("th1nhng0/vietnamese-legal-documents", "content")

# Join metadata + content on the shared `id` column,
# reusing the two datasets loaded above
meta = ds["data"].to_pandas()
text = ds_content["data"].to_pandas()
df = meta.merge(text, on="id")
print(df.columns.tolist())
```
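If holding the ~3.6 GB `content` config in a DataFrame is impractical, the same `id` join can be done row by row. The sketch below is one possible approach, not part of the dataset's tooling: `join_rows` is a hypothetical helper that indexes the metadata rows by `id` and attaches them to content rows as they are iterated. It assumes each row is a plain dict, which is what iterating a `datasets` split yields. The toy rows are made up for illustration.

```python
from typing import Dict, Iterable, Iterator


def join_rows(meta_rows: Iterable[dict], content_rows: Iterable[dict]) -> Iterator[dict]:
    """Yield content rows enriched with their metadata, matched on `id`.

    Hypothetical helper: metadata is small enough to index in memory,
    while content rows are consumed lazily, one at a time.
    """
    meta_by_id: Dict[int, dict] = {row["id"]: row for row in meta_rows}
    for row in content_rows:
        meta = meta_by_id.get(row["id"])
        if meta is None:
            continue  # inner join: skip content rows without metadata
        yield {**meta, "content": row["content"]}


# Toy rows standing in for the two configs (values are made up).
meta_rows = [
    {"id": 1, "title": "Document A", "legal_type": "Quyết định"},
    {"id": 2, "title": "Document B", "legal_type": "Công văn"},
]
content_rows = [
    {"id": 2, "content": "Text of B"},
    {"id": 3, "content": "Orphan text"},
]

joined = list(join_rows(meta_rows, content_rows))
print(joined)
```

With the real data, `content_rows` could be the result of `load_dataset("th1nhng0/vietnamese-legal-documents", "content", split="data", streaming=True)`, which iterates rows without downloading the whole config first.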
---
## Schema
### `metadata` config
| Column | Type | Description |
|---|---|---|
| `id` | int64 | Unique numeric document ID |
| `document_number` | string | Official document number (e.g. `115/NQ-HĐBCQG`) |
| `title` | string | Full Vietnamese title |
| `url` | string | Source URL on thuvienphapluat.vn |
| `legal_type` | string | Document type (Quyết định, Công văn, Nghị quyết, …) |
| `legal_sectors` | string | Pipe-separated sector/topic tags |
| `issuing_authority` | string | Name of the issuing government body |
| `issuance_date` | string | Issue date in `DD/MM/YYYY` format |
| `signers` | string | Pipe-separated `name:id` pairs of signatories |
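
Three of these columns pack structure into plain strings: `issuance_date` (`DD/MM/YYYY`), `legal_sectors` (pipe-separated tags), and `signers` (pipe-separated `name:id` pairs). A minimal parsing sketch follows, assuming the formats are exactly as described in the table. The helper names and the sample values are hypothetical, and how missing values are actually encoded in the dataset is not stated, so empty or malformed input is simply mapped to `None` or an empty list.

```python
from datetime import date, datetime
from typing import List, Optional, Tuple


def parse_issuance_date(value: Optional[str]) -> Optional[date]:
    """Parse a `DD/MM/YYYY` string; return None for empty or malformed values."""
    if not value:
        return None
    try:
        return datetime.strptime(value.strip(), "%d/%m/%Y").date()
    except ValueError:
        return None


def split_pipes(value: Optional[str]) -> List[str]:
    """Split a pipe-separated string, dropping empty entries."""
    if not value:
        return []
    return [part.strip() for part in value.split("|") if part.strip()]


def parse_signers(value: Optional[str]) -> List[Tuple[str, str]]:
    """Split pipe-separated `name:id` pairs into (name, id) tuples.

    Assumption: the id is whatever follows the last colon; entries
    without a colon are kept with an empty id.
    """
    pairs = []
    for entry in split_pipes(value):
        name, sep, signer_id = entry.rpartition(":")
        pairs.append((name.strip(), signer_id.strip()) if sep else (entry, ""))
    return pairs


# Made-up sample values in the documented formats.
print(parse_issuance_date("05/03/2021"))
print(split_pipes("Thuế|Doanh nghiệp"))
print(parse_signers("Nguyễn Văn A:101|Trần Thị B:202"))
```

Applied to a DataFrame, these can be mapped column by column, for example `df["legal_sectors"].map(split_pipes)`.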
### `content` config
| Column | Type | Description |
|---|---|---|
| `id` | int64 | Document ID — join key with the `metadata` config |
| `content` | string | Full document text converted to Markdown |
---
## Statistics
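The breakdowns listed in this section (documents per year, top document types, top legal sectors) can be recomputed from the `metadata` config. The sketch below is one way to do it with the standard library, assuming `issuance_date` is `DD/MM/YYYY` and `legal_sectors` is pipe-separated as described in the schema. `summarize` and `year_of` are hypothetical helper names, and the toy rows are made up.

```python
from collections import Counter
from typing import Iterable, Optional


def year_of(issuance_date: Optional[str]) -> str:
    """Return the YYYY part of a `DD/MM/YYYY` string, or "" if absent."""
    parts = (issuance_date or "").split("/")
    return parts[2] if len(parts) == 3 else ""


def summarize(rows: Iterable[dict], top_n: int = 15) -> dict:
    """Count documents per year, per document type, and per sector tag."""
    by_year, by_type, by_sector = Counter(), Counter(), Counter()
    for row in rows:
        year = year_of(row.get("issuance_date"))
        if year:
            by_year[year] += 1
        if row.get("legal_type"):
            by_type[row["legal_type"]] += 1
        # A document can carry several sector tags; each one is counted.
        for tag in (row.get("legal_sectors") or "").split("|"):
            if tag.strip():
                by_sector[tag.strip()] += 1
    return {
        "by_year": sorted(by_year.items()),
        "top_types": by_type.most_common(top_n),
        "top_sectors": by_sector.most_common(top_n),
    }


# Toy rows in the metadata schema (values are made up).
rows = [
    {"issuance_date": "01/02/2020", "legal_type": "Quyết định", "legal_sectors": "Thuế|Đầu tư"},
    {"issuance_date": "15/07/2020", "legal_type": "Công văn", "legal_sectors": "Thuế"},
    {"issuance_date": "30/12/2021", "legal_type": "Quyết định", "legal_sectors": ""},
]
stats = summarize(rows)
print(stats)
```

Passing `ds["data"]` from the loading example as `rows` would run the same counts over the full metadata split.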
### Documents by Year

### Top 15 Document Types

### Top 15 Legal Sectors

---
## Use Cases
- 🔍 **Legal information retrieval** — build search engines over Vietnamese law
- 🤖 **LLM fine-tuning** — train or fine-tune language models on legal Vietnamese
- 📊 **Legal NLP research** — NER, classification, summarization, QA
- 📈 **Policy analysis** — track legislative trends over time
- 🌏 **Low-resource NLP** — Vietnamese legal text is underrepresented in existing datasets
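
For retrieval or fine-tuning, long documents usually need to be split into smaller passages first. One heuristic for Vietnamese normative texts is to cut at article markers (`Điều 1`, `Điều 2`, and so on). The sketch below assumes that such markers appear at the start of a line, optionally preceded by Markdown heading or bold markers. This has not been verified against the dataset's `content` column, and a line that merely begins with a reference such as "Điều 5 của Luật này" would be treated as a new article. `split_into_articles` is a hypothetical helper.

```python
import re
import unicodedata
from typing import List

# Assumed pattern: a line starting with "Điều <number>", optionally preceded
# by Markdown heading (#) or bold (*) markers. Not verified on the dataset.
ARTICLE_START = re.compile(r"^[#*\t ]*Điều\s+\d+", re.MULTILINE)


def split_into_articles(text: str) -> List[str]:
    """Split a document into a preamble plus one chunk per article marker."""
    # Normalize so composed and decomposed Vietnamese diacritics match alike.
    text = unicodedata.normalize("NFC", text)
    starts = [m.start() for m in ARTICLE_START.finditer(text)]
    if not starts:
        return [text.strip()] if text.strip() else []
    bounds = [0] + starts + [len(text)]
    chunks = [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    return [chunk for chunk in chunks if chunk]


# Made-up sample in the assumed layout.
sample = (
    "QUYẾT ĐỊNH\n\n"
    "**Điều 1.** Phạm vi điều chỉnh\nNội dung một.\n\n"
    "**Điều 2.** Hiệu lực thi hành\nNội dung hai.\n"
)
chunks = split_into_articles(sample)
for chunk in chunks:
    print(repr(chunk))
```

Documents without any marker come back as a single chunk, so the function can be mapped over every document type without special cases.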
---
## Data Collection
This is an independent personal research project. Documents were collected from
[thuvienphapluat.vn](https://thuvienphapluat.vn) — a public legal document portal
— via their sitemap and mobile API. This project has **no affiliation with
thuvienphapluat.vn**.
HTML content was converted to Markdown using BeautifulSoup. Only Vietnamese-language
documents were retained; English versions and technical standards (Tiêu chuẩn) were
excluded.
---
## License & Legal Basis
Vietnamese legal documents (laws, decrees, circulars, decisions, and other normative
acts) are **public domain by Vietnamese law**. Under the
[Law on Access to Information (Luật Tiếp cận thông tin, No. 104/2016/QH13)](https://thuvienphapluat.vn/van-ban/Bo-may-hanh-chinh/Luat-tiep-can-thong-tin-2016-280116.aspx)
and the [Law on Promulgation of Legal Documents (No. 64/2025/QH15)](https://thuvienphapluat.vn/van-ban/Bo-may-hanh-chinh/Luat-ban-hanh-van-ban-quy-pham-phap-luat-2025-so-64-2025-QH15-639239.aspx),
official legal normative documents issued by state agencies must be made publicly
accessible free of charge.
The **compiled dataset** (collection, processing, metadata schema, and Markdown
conversion) is released under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/).
**Intended for research purposes only.**
---
## Citation
```bibtex
@dataset{ngo_thinh_2026_vietnamese_legal,
title = {Vietnamese Legal Documents},
author = {Ngô, Thịnh},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/datasets/th1nhng0/vietnamese-legal-documents},
note = {518,255 Vietnamese legal documents compiled for research purposes}
}
```
Provided by: Huymaeco



