five

vanhthefirst/vietnamese-legal-documents

收藏
Hugging Face2026-04-11 更新2026-03-29 收录
下载链接:
https://hf-mirror.com/datasets/vanhthefirst/vietnamese-legal-documents
下载链接
链接失效反馈
官方服务:
资源简介:
--- language: - vi license: cc-by-4.0 pretty_name: Vietnamese Legal Documents size_categories: - 100K<n<1M task_categories: - text-classification - text-generation - question-answering - summarization tags: - legal - vietnamese - law - government configs: - config_name: metadata data_files: - split: data path: data/metadata.parquet - config_name: relationships data_files: - split: data path: data/relationships.parquet - config_name: content data_files: - split: data path: data/content.parquet dataset_info: - config_name: metadata features: - name: id dtype: int64 - name: title dtype: string - name: so_ky_hieu dtype: string - name: ngay_ban_hanh dtype: string - name: loai_van_ban dtype: string - name: ngay_co_hieu_luc dtype: string - name: ngay_het_hieu_luc dtype: string - name: nguon_thu_thap dtype: string - name: ngay_dang_cong_bao dtype: string - name: nganh dtype: string - name: linh_vuc dtype: string - name: co_quan_ban_hanh dtype: string - name: chuc_danh dtype: string - name: nguoi_ky dtype: string - name: pham_vi dtype: string - name: thong_tin_ap_dung dtype: string - name: tinh_trang_hieu_luc dtype: string num_rows: 153420 - config_name: relationships features: - name: doc_id dtype: int64 - name: other_doc_id dtype: string - name: relationship dtype: string num_rows: 897890 - config_name: content features: - name: id dtype: string - name: content_html dtype: string num_rows: 178665 --- # Vietnamese Legal Documents A comprehensive collection of Vietnamese legal documents — laws, decrees, circulars, decisions, and other normative acts — sourced from [vbpl.vn](https://vbpl.vn), the official Government Legal Document Portal operated by the Ministry of Justice. The dataset includes structured metadata for every document, raw HTML full-text content, and a rich graph of cross-document legal relationships (amendments, citations, repeals, etc.). - **Curated by:** [Thịnh Ngô](https://huggingface.co/th1nhng0) - **Source:** [vbpl.vn](https://vbpl.vn) - **Language:** Vietnamese - **License:** [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/) ## Quick Start ```python from datasets import load_dataset # Metadata for all 153k documents meta = load_dataset("th1nhng0/vietnamese-legal-documents", "metadata", split="data") print(meta.to_pandas().head()) # Cross-document relationships (amendments, citations, repeals, …) rels = load_dataset("th1nhng0/vietnamese-legal-documents", "relationships", split="data") print(rels.to_pandas().head()) # Full-text HTML content for ~149k documents content = load_dataset("th1nhng0/vietnamese-legal-documents", "content", split="data") print(content.to_pandas().head()) ``` Join the two on `id` (metadata) ↔ `doc_id` (relationships): ```python import pandas as pd df = meta.to_pandas() rel = rels.to_pandas() # Find all documents that cite document 10420 citing = rel[rel["other_doc_id"] == "10420"].merge(df, left_on="doc_id", right_on="id") print(citing[["id", "title", "relationship"]]) ``` ## Dataset Structure The dataset has three configs: | Config | Rows | Description | |---|---|---| | `metadata` | 153,420 | One row per document — 16 metadata fields | | `content` | 178,665 | Raw HTML full-text content | | `relationships` | 897,890 | Directed edges between documents | ### `metadata` | Column | Description | |---|---| | `id` | Unique document ID (int) | | `title` | Full Vietnamese title | | `so_ky_hieu` | Official number, e.g. `115/NQ-HĐBCQG` | | `ngay_ban_hanh` | Issuance date (`DD/MM/YYYY`) | | `loai_van_ban` | Type — Quyết định, Nghị quyết, Thông tư, … | | `ngay_co_hieu_luc` | Effective date | | `ngay_het_hieu_luc` | Expiry date (empty if still in effect) | | `nguon_thu_thap` | Collection source (e.g. Công báo) | | `ngay_dang_cong_bao` | Official Gazette publication date | | `nganh` | Sector — Tài chính, Y tế, … | | `linh_vuc` | Legal field / sub-domain | | `co_quan_ban_hanh` | Issuing authority (551 unique bodies) | | `chuc_danh` | Signatory title — Chủ tịch, Bộ trưởng, … | | `nguoi_ky` | Signatory name | | `pham_vi` | Geographical scope | | `thong_tin_ap_dung` | Implementation note | | `tinh_trang_hieu_luc` | Effect status — Còn hiệu lực, Hết hiệu lực toàn bộ, … | ### `content` | Column | Description | |---|---| | `id` | Document ID (join key → `metadata.id`) | | `content_html` | Raw HTML body of the document | > **Note:** Some documents in `metadata` do not have a corresponding entry in `content` because the portal only provides PDF scans for those documents (no HTML version available). ### `relationships` | Column | Description | |---|---| | `doc_id` | Source document ID (join key → `metadata.id`) | | `other_doc_id` | Target document ID | | `relationship` | Edge label | ## Statistics ![Documents by year](charts/docs_by_year.png) ![Document type distribution](charts/legal_type_distribution.png) ![Top issuing authorities](charts/top_authorities.png) ## Data Collection All data was scraped from [vbpl.vn](https://vbpl.vn) using a [Scrapy](https://scrapy.org/) crawler (included under [`crawler/`](crawler/)). Metadata and cross-document relationships were extracted directly from the portal's structured pages. ```bash cd crawler scrapy crawl vbpl -a seed_file=data/ids.txt # basic scrapy crawl vbpl -a seed_file=data/ids.txt -a proxy_file=proxies.txt # with proxies scrapy crawl vbpl -a seed_file=data/ids.txt -a resume=1 # resume ``` Output: `data/raw.jsonl` ## Limitations - Coverage depends on what [vbpl.vn](https://vbpl.vn) has indexed; older or undigitized documents may be missing. - Effect status reflects the portal at crawl time and may lag behind real-world changes. - This is a snapshot, not a live mirror. Always cross-check with the portal for authoritative status. ## Privacy The dataset contains names of document signatories (public officials acting in their official capacity). No private citizen data is included. ## Citation ```bibtex @dataset{ngo_thinh_2026_vietnamese_legal, title = {Vietnamese Legal Documents}, author = {Thịnh Ngô}, year = {2026}, publisher = {Hugging Face}, url = {https://huggingface.co/datasets/th1nhng0/vietnamese-legal-documents}, } ``` ## License Vietnamese legal documents are **public domain** under the [Law on Access to Information (No. 104/2016/QH13)](https://chinhphu.vn/default.aspx?pageid=27160&docid=184568) and the [Law on Promulgation of Legal Documents (No. 64/2025/QH15)](https://chinhphu.vn/?pageid=27160&docid=213327&classid=1&typegroupid=3). The compiled dataset (schema, processing, curation) is released under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/). Not a substitute for legal advice. > **Note:** The previous version (sourced from thuvienphapluat.vn) was discontinued following a content removal request. This is a fully rebuilt dataset sourced exclusively from [vbpl.vn](https://vbpl.vn), the official Government Legal Document Portal. Full-text document content in Markdown is coming in a future release.
提供机构:
vanhthefirst
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作