dienmoc/vietnamese-legal-documents

Name: dienmoc/vietnamese-legal-documents
Creator: dienmoc
Published: 2026-03-22 13:11:19
License: No description available

Source: Hugging Face. Updated 2026-03-22; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/dienmoc/vietnamese-legal-documents


Resource description:

---
language:
- vi
license: cc-by-4.0
pretty_name: Vietnamese Legal Documents
size_categories:
- 100K<n<1M
task_categories:
- text-classification
- text-generation
- question-answering
- summarization
tags:
- legal
- vietnamese
- law
- government
- NLP
- text-mining
configs:
- config_name: metadata
  data_files:
  - split: data
    path: metadata/data-*.parquet
- config_name: content
  data_files:
  - split: data
    path: content/data-*.parquet
dataset_info:
- config_name: metadata
  features:
  - name: id
    dtype: int64
  - name: document_number
    dtype: string
  - name: title
    dtype: string
  - name: url
    dtype: string
  - name: legal_type
    dtype: string
  - name: legal_sectors
    dtype: string
  - name: issuing_authority
    dtype: string
  - name: issuance_date
    dtype: string
  - name: signers
    dtype: string
  num_rows: 518255
- config_name: content
  features:
  - name: id
    dtype: int64
  - name: content
    dtype: string
  num_rows: 518255
---

# Vietnamese Legal Documents

A comprehensive dataset of **518,255 Vietnamese legal documents** sourced from [thuvienphapluat.vn](https://thuvienphapluat.vn), the largest Vietnamese legal document repository. The dataset covers laws, decrees, circulars, decisions, and other official documents issued by Vietnamese government bodies, spanning from **1924 to 2026**.

---

## At a Glance

| | |
|---|---|
| 🗂️ **Total documents** | 518,255 |
| 📅 **Date range** | 1924 – 2026 |
| 🏛️ **Issuing authorities** | 1,335 unique bodies |
| 📋 **Document types** | 36 unique types |
| 🌐 **Language** | Vietnamese |
| 💾 **Content size** | ~3.6 GB (parquet) |

---

## Dataset Structure

This dataset is split into **two configs** to allow fast metadata access without loading the full text.

| Config | Split | Rows | Size | Description |
|---|---|---|---|---|
| `metadata` | `data` | 518,255 | ~82 MB | 9 metadata columns, no text content |
| `content` | `data` | 518,255 | ~3.6 GB | `id` + full Markdown document text |

Join on the `id` column to get both metadata and content.
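
The two-config layout makes it possible to filter on the small metadata table first and join text only for the rows that are kept. The following is a minimal sketch of that pattern, not the author's code: it uses small hypothetical in-memory frames (made-up titles and text) in place of the real `metadata` and `content` configs, and assumes `id` is unique in both tables.

```python
import pandas as pd

# Hypothetical stand-ins for the two configs; the real tables have 518,255 rows each.
meta = pd.DataFrame({
    "id": [1, 2, 3],
    "legal_type": ["Quyết định", "Công văn", "Nghị quyết"],
    "title": ["Title A", "Title B", "Title C"],
})
content = pd.DataFrame({
    "id": [1, 2, 3],
    "content": ["# Text A", "# Text B", "# Text C"],
})

# Filter on the lightweight metadata first, then join only the surviving rows.
wanted = meta[meta["legal_type"] == "Nghị quyết"]

# validate="one_to_one" raises if `id` is duplicated on either side
# (an assumption about the data, checked here rather than taken on trust).
joined = wanted.merge(content, on="id", how="inner", validate="one_to_one")

print(joined[["id", "title", "content"]])
```

With the real data, `meta` and `content` would come from the corresponding configs; the filter-then-merge step stays the same.
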

---

## Load the Dataset

```python
from datasets import load_dataset

# Load metadata only (fast, ~82 MB)
ds = load_dataset("th1nhng0/vietnamese-legal-documents", "metadata")
df = ds["data"].to_pandas()  # to_pandas() requires pandas to be installed
print(df.head())

# Load full text content (~3.6 GB)
ds_content = load_dataset("th1nhng0/vietnamese-legal-documents", "content")

# Join metadata + content on the shared `id` column
meta = load_dataset("th1nhng0/vietnamese-legal-documents", "metadata")["data"].to_pandas()
text = load_dataset("th1nhng0/vietnamese-legal-documents", "content")["data"].to_pandas()
df = meta.merge(text, on="id")
print(df.columns.tolist())
```

---

## Schema

### `metadata` config

| Column | Type | Description |
|---|---|---|
| `id` | int64 | Unique numeric document ID |
| `document_number` | string | Official document number (e.g. `115/NQ-HĐBCQG`) |
| `title` | string | Full Vietnamese title |
| `url` | string | Source URL on thuvienphapluat.vn |
| `legal_type` | string | Document type (Quyết định, Công văn, Nghị quyết, …) |
| `legal_sectors` | string | Pipe-separated sector/topic tags |
| `issuing_authority` | string | Name of the issuing government body |
| `issuance_date` | string | Issue date in `DD/MM/YYYY` format |
| `signers` | string | Pipe-separated `name:id` pairs of signatories |

### `content` config

| Column | Type | Description |
|---|---|---|
| `id` | int64 | Document ID; join key with the `metadata` config |
| `content` | string | Full document text converted to Markdown |

---

## Statistics

### Documents by Year

![Documents by year](charts/docs_by_year.png)

### Top 15 Document Types

![Document type distribution](charts/legal_type_distribution.png)

### Top 15 Legal Sectors

![Top sectors](charts/top_sectors.png)

---

## Use Cases

- 🔍 **Legal information retrieval**: build search engines over Vietnamese law
- 🤖 **LLM fine-tuning**: train or fine-tune language models on legal Vietnamese
- 📊 **Legal NLP research**: NER, classification, summarization, QA
- 📈 **Policy analysis**: track legislative trends over time
- 🌏 **Low-resource NLP**: Vietnamese legal text is underrepresented in existing datasets

---

## Data Collection

This is an independent personal research project. Documents were collected from [thuvienphapluat.vn](https://thuvienphapluat.vn), a public legal document portal, via their sitemap. This project has **no affiliation with thuvienphapluat.vn**.

HTML content was converted to Markdown using BeautifulSoup. Only Vietnamese-language documents were retained; English versions and technical standards (Tiêu chuẩn) were excluded.

---

## License & Legal Basis

Vietnamese legal documents (laws, decrees, circulars, decisions, and other normative acts) are **public domain by Vietnamese law**. Under the [Law on Access to Information (Luật Tiếp cận thông tin, No. 104/2016/QH13)](https://thuvienphapluat.vn/van-ban/Bo-may-hanh-chinh/Luat-tiep-can-thong-tin-2016-280116.aspx) and the [Law on Promulgation of Legal Documents (No. 64/2025/QH15)](https://thuvienphapluat.vn/van-ban/Bo-may-hanh-chinh/Luat-ban-hanh-van-ban-quy-pham-phap-luat-2025-so-64-2025-QH15-639239.aspx), official legal normative documents issued by state agencies must be made publicly accessible free of charge.

The **compiled dataset** (collection, processing, metadata schema, and Markdown conversion) is released under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/).

**Intended for research purposes only.**

---

## Citation

```bibtex
@dataset{ngo_thinh_2026_vietnamese_legal,
  title     = {Vietnamese Legal Documents},
  author    = {Ngô, Thịnh},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/th1nhng0/vietnamese-legal-documents},
  note      = {518,255 Vietnamese legal documents compiled for research purposes}
}
```
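
The schema stores `issuance_date` as a `DD/MM/YYYY` string and packs multi-valued fields (`legal_sectors`, `signers`) into pipe-separated strings, so most analyses will likely need a parsing step first. The sketch below shows one way to do that with pandas. The sample rows, the derived column names (`issued`, `sector_list`, `signer_list`), and the helper `parse_signers` are all hypothetical; only the column formats come from the schema tables.

```python
import pandas as pd

# Hypothetical rows that follow the documented column formats (values are made up).
meta = pd.DataFrame({
    "id": [1, 2],
    "issuance_date": ["22/03/2026", "05/11/1998"],
    "legal_sectors": ["Thuế|Doanh nghiệp", "Giáo dục"],
    "signers": ["Nguyễn Văn A:101|Trần Thị B:202", "Lê Văn C:303"],
})

# DD/MM/YYYY strings -> datetime64; unparseable values become NaT instead of raising.
meta["issued"] = pd.to_datetime(meta["issuance_date"], format="%d/%m/%Y", errors="coerce")

# Pipe-separated tags -> Python lists (a single-character pattern is split literally).
meta["sector_list"] = meta["legal_sectors"].str.split("|")


def parse_signers(raw):
    """Split 'name:id|name:id' into a list of (name, id) string tuples."""
    if not isinstance(raw, str) or not raw:
        return []  # missing or empty field
    pairs = []
    for item in raw.split("|"):
        # rpartition splits on the last colon, so a colon inside a name is preserved.
        name, _, signer_id = item.rpartition(":")
        pairs.append((name, signer_id))
    return pairs


meta["signer_list"] = meta["signers"].map(parse_signers)

print(meta[["id", "issued", "sector_list", "signer_list"]])
```

Once `issued` is a datetime column, expressions such as `meta["issued"].dt.year` give the year for trend analysis, and `meta.explode("sector_list")` yields one row per sector tag.
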


Provider: dienmoc
