datht/vlegal-train
Hugging Face · Updated 2026-04-15 · Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/datht/vlegal-train
Resource description:
---
dataset_info:
  features:
  - name: conversations
    list:
    - name: role
      dtype: string
    - name: content
      dtype: string
  - name: metadata
    struct:
    - name: source
      dtype: string
    - name: task
      dtype: string
    - name: task_type
      dtype: string
    - name: language
      dtype: string
  splits:
  - name: train
    num_examples: 11025
  - name: validation
    num_examples: 1225
configs:
- config_name: default
  data_files:
  - split: train
    path: data/combined_vi_train.jsonl
  - split: validation
    path: data/combined_vi_val.jsonl
- config_name: legal-chat
  data_files:
  - split: train
    path: data/legal-chat_vi.jsonl
- config_name: legal-documents
  data_files:
  - split: train
    path: data/legal-documents_vi.jsonl
language:
- vi
license: apache-2.0
task_categories:
- text-generation
- question-answering
tags:
- legal
- vietnamese
- sft
- chatml
- training-data
size_categories:
- 10K<n<100K
---
# Vietnamese Legal SFT Training Data
Supervised fine-tuning (SFT) data for Vietnamese legal small language models (SLMs), processed into a standardized ChatML conversation format.
> **For evaluation, use [datht/vlegal](https://huggingface.co/datasets/datht/vlegal) (VLegal-Bench).**
> This dataset is for TRAINING ONLY. No overlap with VLegal-Bench.
## Sources
| Source | Samples | Type | License |
|--------|---------|------|---------|
| [luanngo/Vietnamese-Legal-Chat-Dataset](https://huggingface.co/datasets/luanngo/Vietnamese-Legal-Chat-Dataset) | 3,537 | Legal QA conversations | VLSP research |
| [th1nhng0/vietnamese-legal-documents](https://huggingface.co/datasets/th1nhng0/vietnamese-legal-documents) | 8,713 | Document summarization | CC BY 4.0 |
## Splits
| Split | Samples |
|-------|---------|
| train | 11,025 |
| validation | 1,225 |
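As a quick consistency check, the counts in the Sources and Splits tables can be compared directly. The sketch below only redoes the arithmetic from the tables; the dictionary keys are labels borrowed from the config names, and nothing here says how the split was actually produced.

```python
# Sample counts copied from the Sources and Splits tables above.
# The keys are illustrative labels, not values read from the dataset.
source_counts = {"legal-chat": 3537, "legal-documents": 8713}
split_counts = {"train": 11025, "validation": 1225}

total_from_sources = sum(source_counts.values())
total_from_splits = sum(split_counts.values())
validation_fraction = split_counts["validation"] / total_from_splits

print(total_from_sources, total_from_splits)  # 12250 12250
print(f"{validation_fraction:.2%}")           # 10.00%
```

The two totals agree, and the validation split holds 10% of the combined samples.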
## Format
```json
{
  "conversations": [
    {"role": "system", "content": "Vietnamese legal assistant prompt"},
    {"role": "user", "content": "Legal question or instruction"},
    {"role": "assistant", "content": "Answer"}
  ],
  "metadata": {"source": "legal-chat", "task": "legal_chat", "task_type": "qa", "language": "vi"}
}
```
```
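To illustrate how one record maps to ChatML text, here is a minimal sketch. It assumes the common `<|im_start|>` / `<|im_end|>` delimiters; this card does not state the exact template used in training, and the helper name `to_chatml` is hypothetical. In practice, a tokenizer's own `apply_chat_template` method would normally take the `conversations` list instead.

```python
def to_chatml(conversations):
    """Render a list of {"role", "content"} turns as ChatML-style text.

    Assumption: turns are delimited with <|im_start|> and <|im_end|>;
    the template actually used for training may differ.
    """
    parts = []
    for turn in conversations:
        parts.append(f"<|im_start|>{turn['role']}\n{turn['content']}<|im_end|>\n")
    return "".join(parts)


# Placeholder record mirroring the format shown above.
record = {
    "conversations": [
        {"role": "system", "content": "Vietnamese legal assistant prompt"},
        {"role": "user", "content": "Legal question or instruction"},
        {"role": "assistant", "content": "Answer"},
    ],
    "metadata": {"source": "legal-chat", "task": "legal_chat",
                 "task_type": "qa", "language": "vi"},
}

text = to_chatml(record["conversations"])
print(text)
```

Each turn becomes one delimited segment, so a three-turn record yields three `<|im_start|>` ... `<|im_end|>` blocks in order.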
## Usage
```python
from datasets import load_dataset

# Load the combined training and validation data (default config)
train = load_dataset("datht/vlegal-train", split="train")
val = load_dataset("datht/vlegal-train", split="validation")

# Load a specific source config
chat_data = load_dataset("datht/vlegal-train", "legal-chat", split="train")
doc_data = load_dataset("datht/vlegal-train", "legal-documents", split="train")
```
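Once loaded, the `metadata` struct can be used to inspect the mix of sources or task types. The sketch below runs on a small in-memory sample so that it needs no download; the helper `count_by` and the sample values (other than the `legal-chat` record shown under Format) are hypothetical. Iterating over a loaded `datasets.Dataset` also yields dictionaries, so the same function should accept `train` directly.

```python
from collections import Counter


def count_by(records, field):
    """Count records by one key of their `metadata` struct."""
    return Counter(r["metadata"][field] for r in records)


# Hypothetical sample records; only the metadata matters for this example.
sample = [
    {"conversations": [], "metadata": {"source": "legal-chat", "task_type": "qa"}},
    {"conversations": [], "metadata": {"source": "legal-chat", "task_type": "qa"}},
    {"conversations": [], "metadata": {"source": "legal-documents", "task_type": "summarization"}},
]

by_source = count_by(sample, "source")
by_task_type = count_by(sample, "task_type")
print(by_source)
print(by_task_type)
```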
## Training Pipeline
```bash
# Using the nlp-trainer framework
cd module/sft
bash scripts/train.sh --model qwen3-1.7b --push --hub-name "datht/viet-legal-1.7B"
```
Processed with [nlp-trainer](https://github.com/datht4889/nlp-trainer).
Provider:
datht



