
duyet/vietnamese-legal-instruct

Hugging Face · Updated 2026-04-10 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/duyet/vietnamese-legal-instruct
---
language:
- vi
license: cc-by-4.0
task_categories:
- text-generation
- question-answering
tags:
- vietnamese
- legal
- instruction-tuning
- unsloth
- fine-tuning
- law
size_categories:
- 100K<n<1M
---

# Vietnamese Legal Instruction Dataset

**Dataset**: [huggingface.co/datasets/duyet/vietnamese-legal-instruct](https://huggingface.co/datasets/duyet/vietnamese-legal-instruct) | **Source code**: [github.com/duyet/vietnamese-legal-documents-dataset](https://github.com/duyet/vietnamese-legal-documents-dataset)

Instruction-following dataset built from [th1nhng0/vietnamese-legal-documents](https://huggingface.co/datasets/th1nhng0/vietnamese-legal-documents): 127K Vietnamese legal documents from [vbpl.vn](https://vbpl.vn) (Government Legal Document Portal, Ministry of Justice).

**467,732 instruction pairs** (train and test combined) across 14 QA types, with knowledge of the Vietnamese legal hierarchy embedded in the system prompts. Every document has a `full_text` pair for content recall; the other 13 types include 5 short metadata-recall types for memorization.

## Statistics

| Metric | Value |
|--------|-------|
| Total records | 467,732 |
| Source documents | 116,933 (from 127,271 unique, filtered by length) |
| QA types | 14 |
| Train split | 444,346 (95%) |
| Test split | 23,386 (5%) |

### QA Type Distribution

| Type | Count | % | Description |
|------|------:|--:|-------------|
| `full_text` | 116,933 | 25.0 | Full document content (1 per doc for content recall) |
| `scope` | 35,715 | 7.6 | Scope, applicability, effective dates |
| `classify` | 35,401 | 7.6 | Document type & hierarchy position |
| `summarize` | 35,226 | 7.5 | 3-5 sentence structured summary |
| `meta_date` | 29,247 | 6.3 | Issue date & effective date (short) |
| `explain_simple` | 29,176 | 6.2 | Plain language for non-lawyers |
| `meta_issuer` | 29,059 | 6.2 | Issuing authority (short) |
| `meta_title` | 29,049 | 6.2 | Title & subject (short) |
| `meta_type` | 29,013 | 6.2 | Document type & hierarchy level (short) |
| `qa_practical` | 28,982 | 6.2 | Practical compliance Q&A |
| `meta_status` | 22,201 | 4.7 | Current legal status (short) |
| `key_provisions` | 22,061 | 4.7 | Key articles and provisions |
| `legal_basis` | 17,867 | 3.8 | Legal basis chain analysis |
| `amounts` | 7,802 | 1.7 | Monetary amounts & percentages |

### Document Type Distribution (top 10)

| Document Type | Count |
|---------------|------:|
| Quyết định (Decisions) | 137,524 |
| Nghị quyết (Resolutions) | 40,404 |
| Thông tư (Circulars) | 23,544 |
| Chỉ thị (Directives) | 16,198 |
| Nghị định (Decrees) | 6,610 |
| Thông tư liên tịch (Joint Circulars) | 4,834 |
| Sắc lệnh (Presidential Decrees) | 1,950 |
| Nghị Quyết (Resolutions; variant capitalization) | 864 |
| Lệnh (Orders) | 750 |
| Pháp lệnh (Ordinances) | 404 |

## Format

Unsloth-compatible conversation format (`conversations` column with `role`/`content`):

```json
{
  "conversations": [
    {"role": "system", "content": "Bạn là chuyên gia pháp luật Việt Nam..."},
    {"role": "user", "content": "Tóm tắt văn bản sau..."},
    {"role": "assistant", "content": "Văn bản này quy định về..."}
  ],
  "source_id": "12345",
  "document_type": "Nghị định",
  "qa_type": "summarize"
}
```

## Vietnamese Legal Hierarchy

The dataset encodes knowledge of the Vietnamese legal document hierarchy (per Luật ban hành VBQPPL 2015):

```
1. Hiến pháp (Constitution)
2. Luật, Bộ luật (Laws, Codes) - Quốc hội
3. Pháp lệnh, Lệnh (Ordinances, Orders) - UBTVQH / Chủ tịch nước
4. Nghị định, Nghị quyết (Decrees, Resolutions) - Chính phủ
5. Thông tư, Quyết định (Circulars, Decisions) - Bộ trưởng / UBND
6. Chỉ thị (Directives) - Thủ tướng / Chủ tịch UBND
```

## Quick Start with Unsloth

```python
from unsloth import FastModel
from unsloth.chat_templates import get_chat_template
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig

# Load the base model in 4-bit to reduce GPU memory use.
model, tokenizer = FastModel.from_pretrained(
    model_name="unsloth/gemma-3-4B-it",
    max_seq_length=4096,
    load_in_4bit=True,
)
tokenizer = get_chat_template(tokenizer, chat_template="gemma")

dataset = load_dataset("duyet/vietnamese-legal-instruct", split="train")


def formatting_prompts_func(examples):
    # Render each conversation into one training string using the chat template.
    texts = [
        tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False)
        for convo in examples["conversations"]
    ]
    return {"text": texts}


# Adds a "text" column, which SFTTrainer reads by default.
dataset = dataset.map(formatting_prompts_func, batched=True)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=SFTConfig(
        per_device_train_batch_size=2,
        learning_rate=2e-4,
        packing=True,  # 2-5x speedup for mixed-length data
    ),
)
trainer.train()
```

## Generation

GitHub: [duyet/vietnamese-legal-documents-dataset](https://github.com/duyet/vietnamese-legal-documents-dataset) | Source code: [`generate.py`](https://huggingface.co/datasets/duyet/vietnamese-legal-instruct/blob/main/generate.py)

Built with 14 local QA generators (no LLM API calls needed):

- 9 analysis types: summarize, key_provisions, qa_practical, explain_simple, scope, classify, legal_basis, amounts, full_text
- 5 short metadata recall types: meta_type, meta_issuer, meta_date, meta_title, meta_status
- Vietnamese legal hierarchy knowledge baked into system prompts
- Quality-filtered: responses of at least 60 characters, no within-doc duplicates
- Every doc gets 1 full_text + 3 random QA types = 4 records per doc
- DuckDB-backed cache for memory-efficient processing (~145 MB RSS)

```bash
# Reproduce the dataset
pip install requests python-dotenv beautifulsoup4 lxml pyarrow datasets duckdb
python generate.py --fresh --qa-types 3 --upload duyet/vietnamese-legal-instruct
```

## Source

- **Original dataset**: [th1nhng0/vietnamese-legal-documents](https://huggingface.co/datasets/th1nhng0/vietnamese-legal-documents) by Thịnh Ngô
- **Data source**: [vbpl.vn](https://vbpl.vn), Vietnamese Ministry of Justice
- **License**: CC BY 4.0

## Citation

```bibtex
@dataset{vietnamese_legal_instruct_2026,
  title = {Vietnamese Legal Instruction Dataset},
  author = {Duyet Le},
  year = {2026},
  publisher = {Hugging Face},
  doi = {10.57967/hf/8343},
  url = {https://huggingface.co/datasets/duyet/vietnamese-legal-instruct},
  note = {468K instruction pairs from 127K Vietnamese legal documents, 14 QA types}
}
```
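
## Cross-checking the statistics

The numbers in the Statistics and QA Type Distribution tables can be checked against the "1 full_text + 3 random QA types = 4 records per doc" rule from the Generation section. The sketch below only restates figures from those tables; the function names `check_statistics` and `qa_type_percentages` are illustrative and are not part of `generate.py`.

```python
# Figures copied from the Statistics and QA Type Distribution tables.
SOURCE_DOCS = 116_933
TOTAL_RECORDS = 467_732
TRAIN_RECORDS = 444_346
TEST_RECORDS = 23_386

QA_COUNTS = {
    "full_text": 116_933, "scope": 35_715, "classify": 35_401,
    "summarize": 35_226, "meta_date": 29_247, "explain_simple": 29_176,
    "meta_issuer": 29_059, "meta_title": 29_049, "meta_type": 29_013,
    "qa_practical": 28_982, "meta_status": 22_201, "key_provisions": 22_061,
    "legal_basis": 17_867, "amounts": 7_802,
}


def check_statistics():
    """Compare the published totals against each other."""
    records_per_doc = 1 + 3  # one full_text pair plus three random QA types
    return {
        "per_doc_rule_matches": SOURCE_DOCS * records_per_doc == TOTAL_RECORDS,
        "type_counts_sum_matches": sum(QA_COUNTS.values()) == TOTAL_RECORDS,
        "splits_sum_matches": TRAIN_RECORDS + TEST_RECORDS == TOTAL_RECORDS,
        "test_share_pct": round(100 * TEST_RECORDS / TOTAL_RECORDS, 1),
    }


def qa_type_percentages():
    """Recompute the % column of the QA type table, rounded to one decimal."""
    return {name: round(100 * n / TOTAL_RECORDS, 1) for name, n in QA_COUNTS.items()}


if __name__ == "__main__":
    print(check_statistics())
    print(qa_type_percentages())
```

All three totals agree: 116,933 documents times 4 records gives 467,732, which is also the sum of the 14 per-type counts and of the two splits.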

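
## Working with individual records

Because every record carries a `qa_type` field next to its `conversations` list, it is easy to inspect or subset the data before training. The following is a minimal sketch that runs on in-memory dictionaries shaped like the example in the Format section. The helper names (`split_turns`, `count_by_qa_type`, `select_qa_types`) and the second sample record are hypothetical and exist only for illustration.

```python
from collections import Counter

# Two records in the documented format. The first mirrors the Format example;
# the second is a made-up placeholder, not real dataset content.
SAMPLE_RECORDS = [
    {
        "conversations": [
            {"role": "system", "content": "Bạn là chuyên gia pháp luật Việt Nam..."},
            {"role": "user", "content": "Tóm tắt văn bản sau..."},
            {"role": "assistant", "content": "Văn bản này quy định về..."},
        ],
        "source_id": "12345",
        "document_type": "Nghị định",
        "qa_type": "summarize",
    },
    {
        "conversations": [
            {"role": "system", "content": "Bạn là chuyên gia pháp luật Việt Nam..."},
            {"role": "user", "content": "(hypothetical metadata question)"},
            {"role": "assistant", "content": "(hypothetical short answer)"},
        ],
        "source_id": "67890",
        "document_type": "Thông tư",
        "qa_type": "meta_type",
    },
]


def split_turns(record):
    """Map each role (system, user, assistant) to its message content."""
    return {turn["role"]: turn["content"] for turn in record["conversations"]}


def count_by_qa_type(records):
    """Count how many records belong to each QA type."""
    return Counter(record["qa_type"] for record in records)


def select_qa_types(records, qa_types):
    """Keep only the records whose qa_type is in the given collection."""
    wanted = set(qa_types)
    return [record for record in records if record["qa_type"] in wanted]


if __name__ == "__main__":
    print(split_turns(SAMPLE_RECORDS[0])["user"])
    print(count_by_qa_type(SAMPLE_RECORDS))
    print(len(select_qa_types(SAMPLE_RECORDS, ["meta_type"])))
```

On the full dataset loaded with `load_dataset`, the same selection can be expressed with `dataset.filter(lambda r: r["qa_type"] in wanted)`. Dropping `full_text` this way is one option when long documents exceed the chosen `max_seq_length`, though whether that helps depends on the training goal.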
Provider:
duyet