duyet/vietnamese-legal-instruct
Source: Hugging Face | Updated: 2026-04-10 | Indexed: 2026-04-12
Download link: https://hf-mirror.com/datasets/duyet/vietnamese-legal-instruct
---
language:
- vi
license: cc-by-4.0
task_categories:
- text-generation
- question-answering
tags:
- vietnamese
- legal
- instruction-tuning
- unsloth
- fine-tuning
- law
size_categories:
- 100K<n<1M
---
# Vietnamese Legal Instruction Dataset
**Dataset**: [huggingface.co/datasets/duyet/vietnamese-legal-instruct](https://huggingface.co/datasets/duyet/vietnamese-legal-instruct) | **Source code**: [github.com/duyet/vietnamese-legal-documents-dataset](https://github.com/duyet/vietnamese-legal-documents-dataset)
Instruction-following dataset built from [th1nhng0/vietnamese-legal-documents](https://huggingface.co/datasets/th1nhng0/vietnamese-legal-documents) — 127K Vietnamese legal documents from [vbpl.vn](https://vbpl.vn) (Government Legal Document Portal, Ministry of Justice).
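**467,732 training pairs** across 14 QA types, with knowledge of the Vietnamese legal document hierarchy built into the system prompts. Every document contributes one `full_text` pair for content recall plus three pairs of randomly chosen other types; five of the 14 types are short metadata-recall pairs intended for memorization.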
## Statistics
| Metric | Value |
|--------|-------|
| Total records | 467,732 |
| Source documents | 116,933 (from 127,271 unique, filtered by length) |
| QA types | 14 |
| Train split | 444,346 (95%) |
| Test split | 23,386 (5%) |
### QA Type Distribution
| Type | Count | % | Description |
|------|------:|--:|-------------|
| `full_text` | 116,933 | 25.0 | Full document content (1 per doc for content recall) |
| `scope` | 35,715 | 7.6 | Scope, applicability, effective dates |
| `classify` | 35,401 | 7.6 | Document type & hierarchy position |
| `summarize` | 35,226 | 7.5 | 3-5 sentence structured summary |
| `meta_date` | 29,247 | 6.3 | Issue date & effective date (short) |
| `explain_simple` | 29,176 | 6.2 | Plain language for non-lawyers |
| `meta_issuer` | 29,059 | 6.2 | Issuing authority (short) |
| `meta_title` | 29,049 | 6.2 | Title & subject (short) |
| `meta_type` | 29,013 | 6.2 | Document type & hierarchy level (short) |
| `qa_practical` | 28,982 | 6.2 | Practical compliance Q&A |
| `meta_status` | 22,201 | 4.7 | Current legal status (short) |
| `key_provisions` | 22,061 | 4.7 | Key articles and provisions |
| `legal_basis` | 17,867 | 3.8 | Legal basis chain analysis |
| `amounts` | 7,802 | 1.7 | Monetary amounts & percentages |
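The percentage column follows directly from the counts. As a quick sanity check, here is a minimal sketch that recomputes it; the dictionary is copied by hand from the table above, and the variable names are illustrative only.

```python
# Counts copied from the QA type distribution table.
QA_TYPE_COUNTS = {
    "full_text": 116_933,
    "scope": 35_715,
    "classify": 35_401,
    "summarize": 35_226,
    "meta_date": 29_247,
    "explain_simple": 29_176,
    "meta_issuer": 29_059,
    "meta_title": 29_049,
    "meta_type": 29_013,
    "qa_practical": 28_982,
    "meta_status": 22_201,
    "key_provisions": 22_061,
    "legal_basis": 17_867,
    "amounts": 7_802,
}

total = sum(QA_TYPE_COUNTS.values())

# Share of each QA type in percent, rounded to one decimal place as in the table.
shares = {
    name: round(100 * count / total, 1)
    for name, count in QA_TYPE_COUNTS.items()
}

for name, share in shares.items():
    print(f"{name:<15} {QA_TYPE_COUNTS[name]:>8,} {share:>5.1f}%")
print(f"{'total':<15} {total:>8,}")
```

The counts sum to 467,732, which is exactly four records for each of the 116,933 source documents.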
### Document Type Distribution (top 10)
| Document Type | Count |
|---------------|------:|
| Quyết định (Decisions) | 137,524 |
| Nghị quyết (Resolutions) | 40,404 |
| Thông tư (Circulars) | 23,544 |
| Chỉ thị (Directives) | 16,198 |
| Nghị định (Decrees) | 6,610 |
| Thông tư liên tịch (Joint Circulars) | 4,834 |
| Sắc lệnh (Ordinances) | 1,950 |
| Nghị Quyết (Resolutions; capitalization variant of Nghị quyết, counted separately) | 864 |
| Lệnh (Orders) | 750 |
| Pháp lệnh (Ordinances) | 404 |
## Format
Unsloth-compatible conversation format (`conversations` column with `role`/`content`):
```json
{
  "conversations": [
    {"role": "system", "content": "Bạn là chuyên gia pháp luật Việt Nam..."},
    {"role": "user", "content": "Tóm tắt văn bản sau..."},
    {"role": "assistant", "content": "Văn bản này quy định về..."}
  ],
  "source_id": "12345",
  "document_type": "Nghị định",
  "qa_type": "summarize"
}
```
```
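Before training, it can help to check that records actually match this shape. The sketch below is a hypothetical validator, not part of the dataset tooling; it assumes every record holds exactly one system, one user, and one assistant turn, as in the example above.

```python
from typing import List

# Assumption: each record has exactly these three turns, in this order.
EXPECTED_ROLES = ["system", "user", "assistant"]
REQUIRED_KEYS = {"conversations", "source_id", "document_type", "qa_type"}


def validate_record(record: dict) -> List[str]:
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    missing = REQUIRED_KEYS - set(record)
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    turns = record.get("conversations", [])
    roles = [turn.get("role") for turn in turns]
    if roles != EXPECTED_ROLES:
        problems.append(f"unexpected role sequence: {roles}")
    for turn in turns:
        if not str(turn.get("content", "")).strip():
            problems.append(f"empty content for role {turn.get('role')!r}")
    return problems


example = {
    "conversations": [
        {"role": "system", "content": "Bạn là chuyên gia pháp luật Việt Nam..."},
        {"role": "user", "content": "Tóm tắt văn bản sau..."},
        {"role": "assistant", "content": "Văn bản này quy định về..."},
    ],
    "source_id": "12345",
    "document_type": "Nghị định",
    "qa_type": "summarize",
}

print(validate_record(example))  # prints []
```

The same function can be applied to a loaded split to select a single task, for example by keeping only records whose `qa_type` is `"summarize"` and whose problem list is empty.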
## Vietnamese Legal Hierarchy
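The dataset encodes knowledge of the Vietnamese legal document hierarchy, following Luật ban hành văn bản quy phạm pháp luật 2015 (the 2015 Law on Promulgation of Legal Normative Documents, abbreviated VBQPPL):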
```
1. Hiến pháp (Constitution)
2. Luật, Bộ luật (Laws, Codes) — Quốc hội
3. Pháp lệnh, Lệnh (Ordinances, Orders) — UBTVQH / Chủ tịch nước
4. Nghị định, Nghị quyết (Decrees, Resolutions) — Chính phủ
5. Thông tư, Quyết định (Circulars, Decisions) — Bộ trưởng / UBND
6. Chỉ thị (Directives) — Thủ tướng / Chủ tịch UBND
```
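For filtering or analysis, the hierarchy above can be expressed as a lookup table. This is an illustrative sketch only: the mapping mirrors the simplified six-level list in this card, the function and constant names are hypothetical, and document types that the list does not mention (such as Thông tư liên tịch) deliberately return `None` rather than a guessed level.

```python
import unicodedata
from typing import Optional

# Levels copied from the hierarchy list above (1 = highest authority).
HIERARCHY_LEVELS = {
    "Hiến pháp": 1,
    "Luật": 2,
    "Bộ luật": 2,
    "Pháp lệnh": 3,
    "Lệnh": 3,
    "Nghị định": 4,
    "Nghị quyết": 4,
    "Thông tư": 5,
    "Quyết định": 5,
    "Chỉ thị": 6,
}


def _normalize(name: str) -> str:
    """Normalize Unicode form, whitespace, and case so label variants match."""
    return " ".join(unicodedata.normalize("NFC", name).split()).casefold()


_LEVEL_BY_NORMALIZED_NAME = {
    _normalize(name): level for name, level in HIERARCHY_LEVELS.items()
}


def hierarchy_level(document_type: str) -> Optional[int]:
    """Return the level for a document type, or None if it is not in the list."""
    return _LEVEL_BY_NORMALIZED_NAME.get(_normalize(document_type))


print(hierarchy_level("Nghị định"))           # prints 4
print(hierarchy_level("Nghị Quyết"))          # prints 4 (capitalization variant)
print(hierarchy_level("Thông tư liên tịch"))  # prints None
```

Case-insensitive matching is used because the `document_type` values include capitalization variants such as `Nghị Quyết`.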
## Quick Start with Unsloth
```python
from unsloth import FastModel
from unsloth.chat_templates import get_chat_template
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Load the 4-bit quantized base model and its tokenizer.
model, tokenizer = FastModel.from_pretrained(
    model_name="unsloth/gemma-3-4B-it",
    max_seq_length=4096,
    load_in_4bit=True,
)

# A 4-bit model needs LoRA adapters to be trainable.
# The rank and alpha values here are illustrative, not from the original card.
model = FastModel.get_peft_model(model, r=8, lora_alpha=8)

# Apply the Gemma chat template so conversations render as training text.
tokenizer = get_chat_template(tokenizer, chat_template="gemma")

dataset = load_dataset("duyet/vietnamese-legal-instruct", split="train")


def formatting_prompts_func(examples):
    """Render each conversation into one string in the `text` column."""
    texts = [
        tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False)
        for convo in examples["conversations"]
    ]
    return {"text": texts}


dataset = dataset.map(formatting_prompts_func, batched=True)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=SFTConfig(
        per_device_train_batch_size=2,
        learning_rate=2e-4,
        packing=True,  # 2-5x speedup for mixed-length data
    ),
)
trainer.train()
```
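To evaluate a fine-tuned model on the test split, each conversation has to be separated into the prompt and the reference answer. The helper below is a hypothetical sketch (its name is not from the dataset's code) and assumes each conversation ends with the assistant turn, as in the Format section.

```python
from typing import Dict, List, Tuple

Message = Dict[str, str]


def split_prompt_and_reference(conversation: List[Message]) -> Tuple[List[Message], str]:
    """Split a conversation into prompt messages and the final assistant reply."""
    if not conversation or conversation[-1]["role"] != "assistant":
        raise ValueError("expected the conversation to end with an assistant turn")
    return conversation[:-1], conversation[-1]["content"]


convo = [
    {"role": "system", "content": "Bạn là chuyên gia pháp luật Việt Nam..."},
    {"role": "user", "content": "Tóm tắt văn bản sau..."},
    {"role": "assistant", "content": "Văn bản này quy định về..."},
]

prompt_messages, reference = split_prompt_and_reference(convo)
print([m["role"] for m in prompt_messages])  # prints ['system', 'user']
print(reference)
```

The prompt messages can then be rendered with `tokenizer.apply_chat_template(prompt_messages, tokenize=False, add_generation_prompt=True)` and the generated text compared against `reference`.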
## Generation
GitHub: [duyet/vietnamese-legal-documents-dataset](https://github.com/duyet/vietnamese-legal-documents-dataset) | Source code: [`generate.py`](https://huggingface.co/datasets/duyet/vietnamese-legal-instruct/blob/main/generate.py)
Built with 14 local QA generators (no LLM API calls needed):
- 9 analysis types: summarize, key_provisions, qa_practical, explain_simple, scope, classify, legal_basis, amounts, full_text
- 5 short metadata recall types: meta_type, meta_issuer, meta_date, meta_title, meta_status
- Vietnamese legal hierarchy knowledge baked into system prompts
- Quality-filtered: min 60 char responses, no within-doc duplicates
- Every doc gets 1 full_text + 3 random QA types = 4 records per doc
- DuckDB-backed cache for memory-efficient processing (~145 MB RSS)
```bash
# Reproduce the dataset
pip install requests python-dotenv beautifulsoup4 lxml pyarrow datasets duckdb
python generate.py --fresh --qa-types 3 --upload duyet/vietnamese-legal-instruct
```
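The "1 full_text + 3 random QA types" rule can be pictured as a small sampling step. The sketch below is not the logic in `generate.py`; it is a hypothetical illustration in which `choose_qa_types` and its parameters are invented names, and the caller decides which types are eligible for a given document.

```python
import random
from typing import List, Optional

# The 13 QA types other than full_text, as listed above.
OTHER_QA_TYPES = [
    "summarize", "key_provisions", "qa_practical", "explain_simple",
    "scope", "classify", "legal_basis", "amounts",
    "meta_type", "meta_issuer", "meta_date", "meta_title", "meta_status",
]


def choose_qa_types(
    eligible: List[str],
    k: int = 3,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Return full_text plus up to k distinct types sampled from `eligible`."""
    rng = rng or random.Random()
    extra = rng.sample(eligible, k=min(k, len(eligible)))
    return ["full_text"] + extra


# A seeded generator makes the choice reproducible across runs.
picked = choose_qa_types(OTHER_QA_TYPES, rng=random.Random(0))
print(picked)

# Four records per document matches the published total.
print(116_933 * 4)  # prints 467732
```

Sampling without replacement keeps the three extra types distinct, in line with the "no within-doc duplicates" filter.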
## Source
- **Original dataset**: [th1nhng0/vietnamese-legal-documents](https://huggingface.co/datasets/th1nhng0/vietnamese-legal-documents) by Thịnh Ngô
- **Data source**: [vbpl.vn](https://vbpl.vn) — Vietnamese Ministry of Justice
- **License**: CC BY 4.0
## Citation
```bibtex
@dataset{vietnamese_legal_instruct_2026,
title = {Vietnamese Legal Instruction Dataset},
author = {Duyet Le},
year = {2026},
publisher = {Hugging Face},
doi = {10.57967/hf/8343},
url = {https://huggingface.co/datasets/duyet/vietnamese-legal-instruct},
note = {468K instruction pairs from 127K Vietnamese legal documents, 14 QA types}
}
```
Provider: duyet



