SharathReddy/Indian-Legal-SFT-Dataset
Source: Hugging Face. Updated 2026-03-24; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/SharathReddy/Indian-Legal-SFT-Dataset
Resource description:
---
language:
- en
license: mit
size_categories:
- 10K<n<100K
task_categories:
- question-answering
- text-generation
tags:
- legal
- indian-law
- instruction-tuning
- vidhaan
- constitution
- justice
configs:
- config_name: default
  data_files:
  - split: train
    path: train.jsonl
dataset_info:
  features:
  - name: instruction
    dtype: string
    description: "The legal question or task."
  - name: context
    dtype: string
    description: "The verbatim statutory text/section used for grounding."
  - name: response
    dtype: string
    description: "The professional legal answer with specific section citations."
  splits:
  - name: train
    num_examples: 20690
  download_size: 8300000
  dataset_size: 8300000
---
# Vidhaan: High-Density Indian Legal Instruction Dataset
**Vidhaan** is a comprehensive, high-precision instruction-tuning dataset containing **20,690 QA pairs** derived from 113 Central Acts of India. It was built specifically to address the "context-splitting" problem found in standard legal RAG datasets, where fixed-size chunking separates a rule from its exceptions.
## 🛠 Dataset Structure & Format
- **Primary File:** `vidhaan_training_v1.jsonl`
- **Format:** JSON Lines (JSONL)
- **Schema:**
  - `instruction`: (String) A precise legal query.
  - `context`: (String) The specific legal text or section header from the source file.
  - `response`: (String) A grounded answer starting with formal citations (e.g., "As per Section X...").
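The schema above can be checked mechanically before training. The following is a minimal sketch, not the author's audit script: the function name `validate_jsonl` and the sample lines are hypothetical, and it only assumes that each JSONL line is an object with the three string fields listed above.

```python
import json

# Field names taken from the schema above.
REQUIRED_FIELDS = ("instruction", "context", "response")

def validate_jsonl(lines):
    """Return (valid_records, errors) for an iterable of JSONL lines."""
    records, errors = [], []
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append((line_no, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            errors.append((line_no, "not a JSON object"))
            continue
        # A field fails if it is absent or is not a string.
        bad = [f for f in REQUIRED_FIELDS if not isinstance(record.get(f), str)]
        if bad:
            errors.append((line_no, f"missing or non-string fields: {bad}"))
        else:
            records.append(record)
    return records, errors

# Made-up placeholder lines, not real dataset rows.
sample_lines = [
    '{"instruction": "What does Section X provide?", '
    '"context": "X. Placeholder text.", "response": "As per Section X, ..."}',
    '{"instruction": "Incomplete record"}',
    'not json',
]
records, errors = validate_jsonl(sample_lines)
```

To run the same check on a downloaded copy, pass an open file handle (for example `open(path, encoding="utf-8")`) instead of `sample_lines`; an empty `errors` list means every line parsed and carried all three string fields.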
## 📂 Source Composition
The dataset spans 113 Markdown files organized into four primary domains:
- **Constitution of India:** 1 Comprehensive File.
- **Department of Justice:** 27 Files (e.g., Judges Inquiry Act, Family Courts Act).
- **Department of Legal Affairs:** 11 Files (e.g., Advocates Act, Notaries Act).
- **Legislative Department:** 74 Files (e.g., Indian Contract Act, Transfer of Property Act).
## 🧠 Methodology & Training Pipeline
### 1. High-Fidelity Conversion
Original government PDFs were converted to Markdown using **Docling**. This ensured that critical structural elements like **State Amendment boxes** (e.g., Bihar/Assam specific changes) and **Footnotes** were captured as text rather than being discarded or mangled by standard OCR.
### 2. Semantic Logic Splitting
To prevent a rule from being separated from its "Provided that" exception, we abandoned fixed-character chunking. We used **Regex-based Semantic Splitting** (`\n(?=\d+\.\s|##\s|CHAPTER\s)`) to ensure every training instance contains a complete, intact legal section.
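As an illustration of how the pattern quoted above behaves, here is a minimal sketch using Python's `re` module. The helper `split_sections` and the sample text are hypothetical (the text is invented, not real statute), and the surrounding pipeline is not shown in the source.

```python
import re

# The boundary pattern quoted above: split at a newline that is followed by
# a numbered section ("12. "), a Markdown heading ("## "), or "CHAPTER ".
SECTION_BOUNDARY = re.compile(r"\n(?=\d+\.\s|##\s|CHAPTER\s)")

def split_sections(markdown_text):
    """Split Markdown into chunks that each begin at a section boundary."""
    parts = SECTION_BOUNDARY.split(markdown_text)
    return [part.strip() for part in parts if part.strip()]

# Invented sample text for illustration only.
sample = (
    "CHAPTER I\n"
    "1. Short title. This Act may be called the Example Act.\n"
    "2. Power to appoint. The Government may appoint an officer:\n"
    "Provided that no officer shall be appointed without notice.\n"
    "## Schedule\n"
    "Form of notice."
)
chunks = split_sections(sample)
for chunk in chunks:
    print(repr(chunk))
```

Because the lookahead only fires on a section number, heading, or chapter marker, the line beginning "Provided that" stays in the same chunk as section 2 instead of being cut off at a character limit.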
### 3. Exhaustive QA Generation
Using `gpt-4o-mini`, we performed **Exhaustive Content Mapping**. Instead of a fixed number of questions per chunk, the model was mandated to generate a pair for **every** distinct sub-section, definition, and procedural timeline found in the text.
## ✅ Quality Assurance & Validation
- **Total Audited Pairs:** 20,690
- **Malformed/Skipped Lines:** 0 (Verified via post-processing audit).
- **Section Coverage:**
  - **112/113 files:** Achieved **100% verified coverage** of all section headers.
  - **Code of Civil Procedure (CPC):** Achieved **98% coverage**. Note: The 2% flagged as missing were false positives: years such as 1870 and 1883 mentioned in the text were picked up as section numbers, so no actual sections were missing.
- **Citation Integrity:** 100% of responses contain verified statutory citations.
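A section-coverage check of the kind described above could look like the following sketch. It is not the author's audit script: the header pattern `HEADER`, the function `section_coverage`, and the sample text are assumptions made for illustration. The sample also shows how a year at the start of a wrapped line can be reported as a missing section.

```python
import re

# Hypothetical header pattern: a line starting with a section number
# (optionally suffixed with letters, e.g. "12A") followed by a period.
HEADER = re.compile(r"^(\d+[A-Z]*)\.\s", re.MULTILINE)

def section_coverage(source_markdown, contexts):
    """Return (coverage_ratio, missing_numbers) for one source file."""
    expected = set(HEADER.findall(source_markdown))
    covered = set()
    for ctx in contexts:
        covered.update(HEADER.findall(ctx))
    missing = sorted(expected - covered)
    if not expected:
        return 1.0, missing
    return (len(expected) - len(missing)) / len(expected), missing

# Invented source text: the wrapped line "1870. It extends..." looks like
# a section header to the pattern even though 1870 is a year.
source = (
    "1. Short title.\n"
    "This Act replaces the Act of\n"
    "1870. It extends to the whole territory.\n"
    "2. Definitions.\n"
    "In this Act, terms have their ordinary meaning."
)
contexts = ["1. Short title.", "2. Definitions."]
ratio, missing = section_coverage(source, contexts)
print(ratio, missing)
```

Here both real sections are covered, yet the check reports "1870" as missing, which is the kind of false positive the CPC note describes. Reviewing the `missing` list by hand is one way to separate such cases from real gaps.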
## 🚀 Use Cases
- **Fine-tuning LLMs** for the Indian Judicial System.
- **Evaluating Legal RAG** systems on statutory accuracy.
- **Procedural Law Automation** (identifying limitation periods and appeal timelines).
---
**Author:** Sharath Reddy
**Project:** Vidhaan AI
**Data Integrity:** Verified 100% string-type for all features.
Provider:
SharathReddy



