antonhome/indian-legal-supervised-fine-tuning-data
收藏Hugging Face2025-12-21 更新2026-03-29 收录
下载链接:
https://hf-mirror.com/datasets/antonhome/indian-legal-supervised-fine-tuning-data
下载链接
链接失效反馈官方服务:
资源简介:
---
dataset_info:
features:
- name: context
dtype: string
- name: question
dtype: string
- name: response
dtype: string
splits:
- name: train
num_bytes: 16325944571
num_examples: 6055371
download_size: 9101140222
dataset_size: 16325944571
configs:
- config_name: default
data_files:
- split: train
path: data/train-*
license: apache-2.0
language:
- hi
- en
- ta
- mr
- bn
- or
- te
---
# 🇮🇳 LegalBrain Indic Legal Corpus
A large-scale **multilingual Indian legal dataset** curated to support research in:
- Domain-specific **LLM training**
- **Legal question answering**
- **Policy reasoning & case retrieval**
- Agentic systems for **legal workflow automation**
This dataset contains text drawn from publicly available legal sources across multiple Indian languages, including:
**English, Hindi, Marathi, Bengali, Kannada, Tamil, Telugu, Odia**, and others.
The corpus is structured and processed to be directly usable for **supervised fine-tuning**, **RAG pipelines**, and **conversational legal assistants**.
---
## 📦 Dataset Structure
After preprocessing and supervised alignment, the dataset is provided in the format:
| context | question | response |
|--------|----------|-----------|
| Multi-paragraph legal text (case summary, statute, commentary) | A legal or interpretation-type query derived from the context | Answer grounded in the specific information contained in *context* |
This enables training of:
- Legal chatbots
- Agentic reasoning systems
- Legal retrieval-augmented QA models
- Court case summarizers
- Argumentation-based LLM pipelines
---
## 🏛️ **Data Sources**
Data was collected only from **publicly and legally accessible sources**, including:
- Supreme Court judgments
- High Court decisions
- Law Commission reports
- Public legal textbooks & commentaries
- Open legal news archives
- Public domain legal Q&A portals
- Government acts, rules, and notifications
**No proprietary or licensed content** was used.
---
## 🧹 Cleaning & Normalization Pipeline
Large-scale legal data is noisy. The following steps were used:
1. **HTML + Boilerplate Removal**
Removal of menus, footers, ads, repeated headers, legal boilerplate markers.
2. **OCR + Text Correction**
OCR applied to scanned PDFs using **Tesseract + custom normalization**, followed by regex-based cleanup for:
- Section markers
- Citations
- Case line references
3. **Language Detection & Segmentation**
Auto-sharding by language → Sentence & clause-level segmentation using spaCy + Indic NLP.
4. **De-duplication**
Removed near-duplicate clauses across multiple case reports using MinHash (LSH).
---
## ✨ Argilla-Based Supervised Dataset Construction
To transform unstructured text into **(context, question, response)** triplets, the dataset was processed using **Argilla** for human feedback and model-assisted annotation.
### Workflow:
1. Select meaningful legal text chunks (150–600 words).
2. Use a prompting pipeline to generate **candidate questions and answers**.
3. Load candidate examples into **Argilla workspace**.
4. Curate + refine:
- Fix hallucinations
- Improve citation grounding
- Ensure all responses **strictly reference context**
5. Export final reviewed dataset to HF.
**This ensures the dataset trains models that cite the law instead of hallucinating it.**
---
## 🏗️ Example Entry
```json
{
"context": "The right to constitutional remedies allows citizens to approach the Supreme Court under Article 32...",
"question": "Which constitutional provision allows an individual to directly move the Supreme Court?",
"response": "Article 32 of the Constitution grants the right to constitutional remedies, enabling citizens to directly approach the Supreme Court."
}
提供机构:
antonhome



