duyet/vietnamese-legal-instruct

Name: duyet/vietnamese-legal-instruct
Creator: duyet
Published: 2026-04-10 07:59:11
License: not stated on the listing page (the dataset card declares CC BY 4.0)

Hugging Face | Updated 2026-04-10 | Indexed 2026-04-12

Download link:

https://hf-mirror.com/datasets/duyet/vietnamese-legal-instruct

Description:

---
language:
- vi
license: cc-by-4.0
task_categories:
- text-generation
- question-answering
tags:
- vietnamese
- legal
- instruction-tuning
- unsloth
- fine-tuning
- law
size_categories:
- 100K<n<1M
---

# Vietnamese Legal Instruction Dataset

**Dataset**: [huggingface.co/datasets/duyet/vietnamese-legal-instruct](https://huggingface.co/datasets/duyet/vietnamese-legal-instruct) | **Source code**: [github.com/duyet/vietnamese-legal-documents-dataset](https://github.com/duyet/vietnamese-legal-documents-dataset)

An instruction-following dataset built from [th1nhng0/vietnamese-legal-documents](https://huggingface.co/datasets/th1nhng0/vietnamese-legal-documents), a collection of 127K Vietnamese legal documents from [vbpl.vn](https://vbpl.vn) (Government Legal Document Portal, Ministry of Justice).

The dataset contains **467,732 training pairs** across 14 QA types, with knowledge of the Vietnamese legal document hierarchy built into the system prompts. Every document has a `full_text` pair for content recall; five short metadata recall types are included for memorization.

## Statistics

| Metric | Value |
|--------|-------|
| Total records | 467,732 |
| Source documents | 116,933 (from 127,271 unique, filtered by length) |
| QA types | 14 |
| Train split | 444,346 (95%) |
| Test split | 23,386 (5%) |

### QA Type Distribution

| Type | Count | % | Description |
|------|------:|--:|-------------|
| `full_text` | 116,933 | 25.0 | Full document content (1 per doc for content recall) |
| `scope` | 35,715 | 7.6 | Scope, applicability, effective dates |
| `classify` | 35,401 | 7.6 | Document type & hierarchy position |
| `summarize` | 35,226 | 7.5 | 3-5 sentence structured summary |
| `meta_date` | 29,247 | 6.3 | Issue date & effective date (short) |
| `explain_simple` | 29,176 | 6.2 | Plain language for non-lawyers |
| `meta_issuer` | 29,059 | 6.2 | Issuing authority (short) |
| `meta_title` | 29,049 | 6.2 | Title & subject (short) |
| `meta_type` | 29,013 | 6.2 | Document type & hierarchy level (short) |
| `qa_practical` | 28,982 | 6.2 | Practical compliance Q&A |
| `meta_status` | 22,201 | 4.7 | Current legal status (short) |
| `key_provisions` | 22,061 | 4.7 | Key articles and provisions |
| `legal_basis` | 17,867 | 3.8 | Legal basis chain analysis |
| `amounts` | 7,802 | 1.7 | Monetary amounts & percentages |

### Document Type Distribution (top 10)

| Document Type | Count |
|---------------|------:|
| Quyết định (Decisions) | 137,524 |
| Nghị quyết (Resolutions) | 40,404 |
| Thông tư (Circulars) | 23,544 |
| Chỉ thị (Directives) | 16,198 |
| Nghị định (Decrees) | 6,610 |
| Thông tư liên tịch (Joint Circulars) | 4,834 |
| Sắc lệnh (Ordinances) | 1,950 |
| Nghị Quyết | 864 |
| Lệnh (Orders) | 750 |
| Pháp lệnh (Ordinances) | 404 |

## Format

Unsloth-compatible conversation format (`conversations` column with `role`/`content`):

```json
{
  "conversations": [
    {"role": "system", "content": "Bạn là chuyên gia pháp luật Việt Nam..."},
    {"role": "user", "content": "Tóm tắt văn bản sau..."},
    {"role": "assistant", "content": "Văn bản này quy định về..."}
  ],
  "source_id": "12345",
  "document_type": "Nghị định",
  "qa_type": "summarize"
}
```

## Vietnamese Legal Hierarchy

The dataset encodes knowledge of the Vietnamese legal document hierarchy (per Luật ban hành VBQPPL 2015):

```
1. Hiến pháp (Constitution)
2. Luật, Bộ luật (Laws, Codes) - Quốc hội
3. Pháp lệnh, Lệnh (Ordinances, Orders) - UBTVQH / Chủ tịch nước
4. Nghị định, Nghị quyết (Decrees, Resolutions) - Chính phủ
5. Thông tư, Quyết định (Circulars, Decisions) - Bộ trưởng / UBND
6. Chỉ thị (Directives) - Thủ tướng / Chủ tịch UBND
```

## Quick Start with Unsloth

```python
# Import unsloth first so its patches apply before trl is loaded.
from unsloth import FastModel
from unsloth.chat_templates import get_chat_template
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig

# Load the base model in 4-bit to reduce GPU memory use.
model, tokenizer = FastModel.from_pretrained(
    model_name="unsloth/gemma-3-4B-it",
    max_seq_length=4096,
    load_in_4bit=True,
)
tokenizer = get_chat_template(tokenizer, chat_template="gemma")

dataset = load_dataset("duyet/vietnamese-legal-instruct", split="train")


def formatting_prompts_func(examples):
    # Render each conversation into a single training string.
    texts = [
        tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False)
        for convo in examples["conversations"]
    ]
    return {"text": texts}


dataset = dataset.map(formatting_prompts_func, batched=True)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=SFTConfig(
        per_device_train_batch_size=2,
        learning_rate=2e-4,
        packing=True,  # 2-5x speedup for mixed-length data
    ),
)
trainer.train()
```

## Generation

GitHub: [duyet/vietnamese-legal-documents-dataset](https://github.com/duyet/vietnamese-legal-documents-dataset) | Source code: [`generate.py`](https://huggingface.co/datasets/duyet/vietnamese-legal-instruct/blob/main/generate.py)

Built with 14 local QA generators (no LLM API calls needed):

- 9 analysis types: summarize, key_provisions, qa_practical, explain_simple, scope, classify, legal_basis, amounts, full_text
- 5 short metadata recall types: meta_type, meta_issuer, meta_date, meta_title, meta_status
- Vietnamese legal hierarchy knowledge baked into system prompts
- Quality-filtered: min 60 char responses, no within-doc duplicates
- Every doc gets 1 full_text + 3 random QA types = 4 records per doc
- DuckDB-backed cache for memory-efficient processing (~145 MB RSS)

```bash
# Reproduce the dataset
pip install requests python-dotenv beautifulsoup4 lxml pyarrow datasets duckdb
python generate.py --fresh --qa-types 3 --upload duyet/vietnamese-legal-instruct
```

## Source

- **Original dataset**: [th1nhng0/vietnamese-legal-documents](https://huggingface.co/datasets/th1nhng0/vietnamese-legal-documents) by Thịnh Ngô
- **Data source**: [vbpl.vn](https://vbpl.vn) (Vietnamese Ministry of Justice)
- **License**: CC BY 4.0

## Citation

```bibtex
@dataset{vietnamese_legal_instruct_2026,
  title = {Vietnamese Legal Instruction Dataset},
  author = {Duyet Le},
  year = {2026},
  publisher = {Hugging Face},
  doi = {10.57967/hf/8343},
  url = {https://huggingface.co/datasets/duyet/vietnamese-legal-instruct},
  note = {468K instruction pairs from 127K Vietnamese legal documents, 14 QA types}
}
```
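
## Example: Selecting Records by QA Type

The record format described in the Format section (a `conversations` list plus `source_id`, `document_type`, and `qa_type`) lends itself to simple filtering, for example to train only on a few QA types. The following is a minimal sketch on in-memory records that mirror that schema. The helper names `select_by_qa_type` and `to_prompt_response` are hypothetical and not part of the dataset's tooling, and the sample rows are placeholders rather than rows copied from the dataset.

```python
from collections import Counter

# Placeholder records following the documented schema (assumption: every
# row has one system, one user, and one assistant turn, as in the example).
records = [
    {
        "conversations": [
            {"role": "system", "content": "Bạn là chuyên gia pháp luật Việt Nam..."},
            {"role": "user", "content": "Tóm tắt văn bản sau..."},
            {"role": "assistant", "content": "Văn bản này quy định về..."},
        ],
        "source_id": "12345",
        "document_type": "Nghị định",
        "qa_type": "summarize",
    },
    {
        "conversations": [
            {"role": "system", "content": "(placeholder system prompt)"},
            {"role": "user", "content": "(placeholder question)"},
            {"role": "assistant", "content": "(placeholder answer)"},
        ],
        "source_id": "12345",
        "document_type": "Nghị định",
        "qa_type": "meta_type",
    },
    {
        "conversations": [
            {"role": "system", "content": "(placeholder system prompt)"},
            {"role": "user", "content": "(placeholder question)"},
            {"role": "assistant", "content": "(placeholder full text)"},
        ],
        "source_id": "67890",
        "document_type": "Thông tư",
        "qa_type": "full_text",
    },
]


def select_by_qa_type(rows, wanted):
    """Return the rows whose qa_type is in `wanted`, preserving order."""
    wanted = set(wanted)
    return [row for row in rows if row["qa_type"] in wanted]


def to_prompt_response(row):
    """Split one record into (system, user, assistant) strings."""
    by_role = {turn["role"]: turn["content"] for turn in row["conversations"]}
    return by_role.get("system", ""), by_role["user"], by_role["assistant"]


# Keep only two QA types and count how many rows of each type exist.
subset = select_by_qa_type(records, {"summarize", "meta_type"})
qa_counts = Counter(row["qa_type"] for row in records)

for row in subset:
    system, user, assistant = to_prompt_response(row)
    print(row["qa_type"], "|", user)
```

On the loaded Hugging Face dataset, the same selection could likely be expressed with `dataset.filter(lambda row: row["qa_type"] in wanted)`, which avoids materializing all rows in a Python list.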

