Huymaeco/vietnamese-legal-documents

Name: Huymaeco/vietnamese-legal-documents
Creator: Huymaeco
Published: 2026-03-22 08:34:00
License: No description provided

Source: Hugging Face. Updated 2026-03-22; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/Huymaeco/vietnamese-legal-documents


Resource overview:

---
language:
- vi
license: cc-by-4.0
pretty_name: Vietnamese Legal Documents
size_categories:
- 100K<n<1M
task_categories:
- text-classification
- text-generation
- question-answering
- summarization
tags:
- legal
- vietnamese
- law
- government
- NLP
- text-mining
configs:
- config_name: metadata
  data_files:
  - split: data
    path: metadata/data-*.parquet
- config_name: content
  data_files:
  - split: data
    path: content/data-*.parquet
dataset_info:
- config_name: metadata
  features:
  - name: id
    dtype: int64
  - name: document_number
    dtype: string
  - name: title
    dtype: string
  - name: url
    dtype: string
  - name: legal_type
    dtype: string
  - name: legal_sectors
    dtype: string
  - name: issuing_authority
    dtype: string
  - name: issuance_date
    dtype: string
  - name: signers
    dtype: string
  num_rows: 518255
- config_name: content
  features:
  - name: id
    dtype: int64
  - name: content
    dtype: string
  num_rows: 518255
---

# Vietnamese Legal Documents

A comprehensive dataset of **518,255 Vietnamese legal documents** sourced from [thuvienphapluat.vn](https://thuvienphapluat.vn), the largest Vietnamese legal document repository. The dataset covers laws, decrees, circulars, decisions, and other official documents issued by Vietnamese government bodies, spanning from **1924 to 2026**.

---

## At a Glance

| | |
|---|---|
| 🗂️ **Total documents** | 518,255 |
| 📅 **Date range** | 1924 – 2026 |
| 🏛️ **Issuing authorities** | 1,335 unique bodies |
| 📋 **Document types** | 36 unique types |
| 🌐 **Language** | Vietnamese |
| 💾 **Content size** | ~3.6 GB (parquet) |

---

## Dataset Structure

This dataset is split into **two configs** to allow fast metadata access without loading the full text.

| Config | Split | Rows | Size | Description |
|---|---|---|---|---|
| `metadata` | `data` | 518,255 | ~82 MB | 9 metadata columns, no text content |
| `content` | `data` | 518,255 | ~3.6 GB | `id` + full markdown document text |

Join on the `id` column to get both metadata and content.
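Because the two configs share the `id` column, one option is to filter on the small `metadata` config first and then pull only the matching rows from `content`. The following is a minimal sketch, assuming the streaming mode of the `datasets` library and the repository ID used in this card's loading example; `select_content` and `stream_content_for` are hypothetical helper names, not part of the dataset or the library.

```python
from typing import Iterable, Iterator, Set


def select_content(rows: Iterable[dict], wanted_ids: Set[int]) -> Iterator[dict]:
    """Yield only the rows whose `id` is in `wanted_ids`."""
    for row in rows:
        if row["id"] in wanted_ids:
            yield row


def stream_content_for(
    wanted_ids: Set[int],
    repo_id: str = "th1nhng0/vietnamese-legal-documents",  # assumed: ID from the card's example
) -> Iterator[dict]:
    """Stream the `content` config and keep only the requested documents."""
    # Imported lazily so the pure helper above works without `datasets` installed.
    from datasets import load_dataset

    stream = load_dataset(repo_id, "content", split="data", streaming=True)
    return select_content(stream, wanted_ids)
```

For example, after filtering the metadata frame to a single `legal_type`, `set(df["id"])` can be passed as `wanted_ids`. The stream is still scanned from start to end, but the full ~3.6 GB is never held in memory at once.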
---

## Load the Dataset

```python
import pandas as pd  # pandas must be installed for .to_pandas()
from datasets import load_dataset

# Load metadata only (fast, ~82 MB)
ds = load_dataset("th1nhng0/vietnamese-legal-documents", "metadata")
df = ds["data"].to_pandas()
print(df.head())

# Load full text content (~3.6 GB)
ds_content = load_dataset("th1nhng0/vietnamese-legal-documents", "content")

# Join metadata + content
meta = load_dataset("th1nhng0/vietnamese-legal-documents", "metadata")["data"].to_pandas()
text = load_dataset("th1nhng0/vietnamese-legal-documents", "content")["data"].to_pandas()
df = meta.merge(text, on="id")
print(df.columns.tolist())
```

---

## Schema

### `metadata` config

| Column | Type | Description |
|---|---|---|
| `id` | int64 | Unique numeric document ID |
| `document_number` | string | Official document number (e.g. `115/NQ-HĐBCQG`) |
| `title` | string | Full Vietnamese title |
| `url` | string | Source URL on thuvienphapluat.vn |
| `legal_type` | string | Document type (Quyết định, Công văn, Nghị quyết, …) |
| `legal_sectors` | string | Pipe-separated sector/topic tags |
| `issuing_authority` | string | Name of the issuing government body |
| `issuance_date` | string | Issue date in `DD/MM/YYYY` format |
| `signers` | string | Pipe-separated `name:id` pairs of signatories |

### `content` config

| Column | Type | Description |
|---|---|---|
| `id` | int64 | Document ID; join key with the `metadata` config |
| `content` | string | Full document text converted to Markdown |

---

## Statistics

### Documents by Year

![Documents by year](charts/docs_by_year.png)

### Top 15 Document Types

![Document type distribution](charts/legal_type_distribution.png)

### Top 15 Legal Sectors

![Top sectors](charts/top_sectors.png)

---

## Use Cases

- 🔍 **Legal information retrieval**: build search engines over Vietnamese law
- 🤖 **LLM fine-tuning**: train or fine-tune language models on legal Vietnamese
- 📊 **Legal NLP research**: NER, classification, summarization, QA
- 📈 **Policy analysis**: track legislative trends over time
- 🌏 **Low-resource NLP**: Vietnamese legal text is underrepresented in existing datasets

---

## Data Collection

This is an independent personal research project. Documents were collected from [thuvienphapluat.vn](https://thuvienphapluat.vn), a public legal document portal, via their sitemap and mobile API. This project has **no affiliation with thuvienphapluat.vn**.

HTML content was converted to Markdown using BeautifulSoup. Only Vietnamese-language documents were retained; English versions and technical standards (Tiêu chuẩn) were excluded.

---

## License & Legal Basis

Vietnamese legal documents (laws, decrees, circulars, decisions, and other normative acts) are **public domain by Vietnamese law**. Under the [Law on Access to Information (Luật Tiếp cận thông tin, No. 104/2016/QH13)](https://thuvienphapluat.vn/van-ban/Bo-may-hanh-chinh/Luat-tiep-can-thong-tin-2016-280116.aspx) and the [Law on Promulgation of Legal Documents (No. 64/2025/QH15)](https://thuvienphapluat.vn/van-ban/Bo-may-hanh-chinh/Luat-ban-hanh-van-ban-quy-pham-phap-luat-2025-so-64-2025-QH15-639239.aspx), official legal normative documents issued by state agencies must be made publicly accessible free of charge.

The **compiled dataset** (collection, processing, metadata schema, and Markdown conversion) is released under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/).

**Intended for research purposes only.**

---

## Citation

```bibtex
@dataset{ngo_thinh_2026_vietnamese_legal,
  title     = {Vietnamese Legal Documents},
  author    = {Ngô, Thịnh},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/th1nhng0/vietnamese-legal-documents},
  note      = {518,255 Vietnamese legal documents compiled for research purposes}
}
```
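## Parsing the String-Encoded Columns

The schema stores `issuance_date` as a `DD/MM/YYYY` string and `legal_sectors` and `signers` as pipe-separated strings, so most analyses need a small parsing step first. The sketch below is one possible approach using only the standard library. The helper names are hypothetical, and the handling of empty, malformed, or ID-less values is an assumption, since the card does not say how such cases are encoded.

```python
from datetime import date, datetime
from typing import List, Optional, Tuple


def parse_issuance_date(value: Optional[str]) -> Optional[date]:
    """Parse a DD/MM/YYYY string; return None for empty or malformed input (assumed policy)."""
    if not value:
        return None
    try:
        return datetime.strptime(value.strip(), "%d/%m/%Y").date()
    except ValueError:
        return None


def split_pipe(value: Optional[str]) -> List[str]:
    """Split a pipe-separated field into trimmed, non-empty items."""
    if not value:
        return []
    return [part.strip() for part in value.split("|") if part.strip()]


def parse_signers(value: Optional[str]) -> List[Tuple[str, Optional[str]]]:
    """Turn 'name:id|name:id' into (name, id) pairs.

    Splits on the last colon of each item so a colon inside a name survives;
    items without a colon get an id of None (assumed fallback).
    """
    pairs: List[Tuple[str, Optional[str]]] = []
    for item in split_pipe(value):
        name, sep, signer_id = item.rpartition(":")
        if sep:
            pairs.append((name.strip(), signer_id.strip()))
        else:
            pairs.append((item, None))
    return pairs
```

With pandas, these helpers can be applied column-wise, for instance `df["issuance_date"].map(parse_issuance_date)`, after loading the `metadata` config.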

