
datht/vlegal-train

Source: Hugging Face. Updated 2026-04-15. Indexed 2026-04-26.
Download link: https://hf-mirror.com/datasets/datht/vlegal-train
Resource description:
---
dataset_info:
  features:
    - name: conversations
      list:
        - name: role
          dtype: string
        - name: content
          dtype: string
    - name: metadata
      struct:
        - name: source
          dtype: string
        - name: task
          dtype: string
        - name: task_type
          dtype: string
        - name: language
          dtype: string
  splits:
    - name: train
      num_examples: 11025
    - name: validation
      num_examples: 1225
configs:
  - config_name: default
    data_files:
      - split: train
        path: data/combined_vi_train.jsonl
      - split: validation
        path: data/combined_vi_val.jsonl
  - config_name: legal-chat
    data_files:
      - split: train
        path: data/legal-chat_vi.jsonl
  - config_name: legal-documents
    data_files:
      - split: train
        path: data/legal-documents_vi.jsonl
language:
  - vi
license: apache-2.0
task_categories:
  - text-generation
  - question-answering
tags:
  - legal
  - vietnamese
  - sft
  - chatml
  - training-data
size_categories:
  - 10K<n<100K
---

# Vietnamese Legal SFT Training Data

Training data for Vietnamese Legal SLMs. Processed into standardized ChatML conversation format.

> **For evaluation, use [datht/vlegal](https://huggingface.co/datasets/datht/vlegal) (VLegal-Bench).**
> This dataset is for TRAINING ONLY. No overlap with VLegal-Bench.

## Sources

| Source | Samples | Type | License |
|--------|---------|------|---------|
| [luanngo/Vietnamese-Legal-Chat-Dataset](https://huggingface.co/datasets/luanngo/Vietnamese-Legal-Chat-Dataset) | 3,537 | Legal QA conversations | VLSP research |
| [th1nhng0/vietnamese-legal-documents](https://huggingface.co/datasets/th1nhng0/vietnamese-legal-documents) | 8,713 | Document summarization | CC BY 4.0 |

## Splits

| Split | Samples |
|-------|---------|
| train | 11,025 |
| validation | 1,225 |

## Format

```json
{
  "conversations": [
    {"role": "system", "content": "Vietnamese legal assistant prompt"},
    {"role": "user", "content": "Legal question or instruction"},
    {"role": "assistant", "content": "Answer"}
  ],
  "metadata": {"source": "legal-chat", "task": "legal_chat", "task_type": "qa", "language": "vi"}
}
```

## Usage

```python
from datasets import load_dataset

# Load combined training data
train = load_dataset("datht/vlegal-train", split="train")

# Load specific source
chat_data = load_dataset("datht/vlegal-train", "legal-chat", split="train")
doc_data = load_dataset("datht/vlegal-train", "legal-documents", split="train")
```

## Training Pipeline

```bash
# Using nlp-trainer framework
cd module/sft
bash scripts/train.sh --model qwen3-1.7b --push --hub-name "datht/viet-legal-1.7B"
```

Processed with [nlp-trainer](https://github.com/datht4889/nlp-trainer).
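As a rough illustration of the record layout shown under Format, the sketch below renders one `conversations` list as ChatML-style text using the common `<|im_start|>` / `<|im_end|>` delimiters. The helper name `to_chatml` is hypothetical, and whether nlp-trainer applies this exact template is an assumption; in practice the tokenizer's own chat template would normally do this step.

```python
from typing import Dict, List


def to_chatml(conversations: List[Dict[str, str]]) -> str:
    """Render {"role", "content"} turns as ChatML-style text (hypothetical helper)."""
    return "".join(
        f"<|im_start|>{turn['role']}\n{turn['content']}<|im_end|>\n"
        for turn in conversations
    )


# Placeholder record copied from the Format section, not real dataset content.
record = {
    "conversations": [
        {"role": "system", "content": "Vietnamese legal assistant prompt"},
        {"role": "user", "content": "Legal question or instruction"},
        {"role": "assistant", "content": "Answer"},
    ],
    "metadata": {
        "source": "legal-chat",
        "task": "legal_chat",
        "task_type": "qa",
        "language": "vi",
    },
}

text = to_chatml(record["conversations"])
print(text)
```

The `metadata` struct is left out of the rendered text, which leaves its `source` and `task_type` fields free for filtering or stratified sampling before training.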
Provider: datht