SharathReddy/Indian-Legal-SFT-Dataset

Name: SharathReddy/Indian-Legal-SFT-Dataset
Creator: SharathReddy
Published: 2026-03-24 05:01:37
License: No description available

Source: Hugging Face. Updated 2026-03-24; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/SharathReddy/Indian-Legal-SFT-Dataset


Resource description:

---
language:
- en
license: mit
size_categories:
- 10K<n<100K
task_categories:
- question-answering
- text-generation
tags:
- legal
- indian-law
- instruction-tuning
- vidhaan
- constitution
- justice
configs:
- config_name: default
  data_files:
  - split: train
    path: train.jsonl
dataset_info:
  features:
  - name: instruction
    dtype: string
    description: "The legal question or task."
  - name: context
    dtype: string
    description: "The verbatim statutory text/section used for grounding."
  - name: response
    dtype: string
    description: "The professional legal answer with specific section citations."
  splits:
  - name: train
    num_examples: 20690
  download_size: 8300000
  dataset_size: 8300000
---

# Vidhaan: High-Density Indian Legal Instruction Dataset

**Vidhaan** is a comprehensive, high-precision instruction-tuning dataset containing **20,690 QA pairs** derived from 113 Central Acts of India. It was built specifically to solve the "context-splitting" problem found in standard legal RAG datasets.

## 🛠 Dataset Structure & Format

- **Primary File:** `vidhaan_training_v1.jsonl`
- **Format:** JSON Lines (JSONL)
- **Schema:**
  - `instruction`: (String) A precise legal query.
  - `context`: (String) The specific legal text or section header from the source file.
  - `response`: (String) A grounded answer starting with formal citations (e.g., "As per Section X...").

## 📂 Source Composition

The dataset spans 113 Markdown files organized into four primary domains:

- **Constitution of India:** 1 Comprehensive File.
- **Department of Justice:** 27 Files (e.g., Judges Inquiry Act, Family Courts Act).
- **Department of Legal Affairs:** 11 Files (e.g., Advocates Act, Notaries Act).
- **Legislative Department:** 74 Files (e.g., Indian Contract Act, Transfer of Property Act).

## 🧠 Methodology & Training Pipeline

### 1. High-Fidelity Conversion

Original government PDFs were converted to Markdown using **Docling**. This ensured that critical structural elements like **State Amendment boxes** (e.g., Bihar/Assam specific changes) and **Footnotes** were captured as text rather than being discarded or mangled by standard OCR.

### 2. Semantic Logic Splitting

To prevent a rule from being separated from its "Provided that" exception, we abandoned fixed-character chunking. We used **Regex-based Semantic Splitting** (`\n(?=\d+\.\s|##\s|CHAPTER\s)`) to ensure every training instance contains a complete, intact legal section.

### 3. Exhaustive QA Generation

Using `gpt-4o-mini`, we performed **Exhaustive Content Mapping**. Instead of a fixed number of questions per chunk, the model was mandated to generate a pair for **every** distinct sub-section, definition, and procedural timeline found in the text.

## ✅ Quality Assurance & Validation

- **Total Audited Pairs:** 20,690
- **Malformed/Skipped Lines:** 0 (Verified via post-processing audit).
- **Section Coverage:**
  - **112/113 files:** Achieved **100% verified coverage** of all section headers.
  - **Code of Civil Procedure (CPC):** Achieved **98% coverage**. Note: The 2% "missing" were identified as false positives (years like 1870/1883 mentioned in text rather than missing section numbers).
- **Citation Integrity:** 100% of responses contain verified statutory citations.

## 🚀 Use Cases

- **Fine-tuning LLMs** for the Indian Judicial System.
- **Evaluating Legal RAG** systems on statutory accuracy.
- **Procedural Law Automation** (identifying limitation periods and appeal timelines).

---

**Author:** Sharath Reddy

**Project:** Vidhaan AI

**Data Integrity:** Verified 100% string-type for all features.
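
The regex-based semantic splitting described in step 2 of the methodology can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's actual pipeline: only the pattern is taken from the card, while the function name `split_sections` and the sample statute text are hypothetical and invented for the example.

```python
import re

# Pattern quoted in the dataset card: split at a newline that is immediately
# followed by a numbered section ("12. "), a Markdown "## " heading, or a
# "CHAPTER " line. The lookahead keeps the section marker with its chunk.
SECTION_BOUNDARY = re.compile(r"\n(?=\d+\.\s|##\s|CHAPTER\s)")


def split_sections(markdown_text: str) -> list[str]:
    """Split Markdown statute text into section-level chunks (hypothetical helper)."""
    chunks = SECTION_BOUNDARY.split(markdown_text)
    # Drop empty pieces and surrounding whitespace.
    return [chunk.strip() for chunk in chunks if chunk.strip()]


# Invented sample text, NOT real statutory language.
sample = (
    "CHAPTER I\n"
    "1. Short title.\n"
    "This Act may be called the Sample Act.\n"
    "2. Definitions.\n"
    'In this Act, "court" means a civil court:\n'
    "Provided that a tribunal is not a court.\n"
)

sections = split_sections(sample)
for section in sections:
    print(repr(section))
```

Because the split happens only where a new section marker begins, the "Provided that" proviso stays in the same chunk as the rule it qualifies, which fixed-character chunking cannot guarantee. One caveat the card itself hints at: any line that starts with digits followed by a period (for example a year at the start of a line) would also match `\d+\.\s`.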

Provider:

SharathReddy
