lakshyatest/india-case-legal-rag
Source: Hugging Face. Updated 2026-04-06; indexed 2026-04-12.
Download link: https://hf-mirror.com/datasets/lakshyatest/india-case-legal-rag
---
language:
- en
tags:
- rag
- legal-corpus
- indian-law
- supreme-court
- legal-documents
- information-retrieval
---
# Dataset Card for Indian Legal Documents (Chunked for RAG)
## Dataset Details
### Dataset Description
This dataset consists of **Indian Supreme Court judgments** extracted from publicly available court documents and processed into **textual chunks suitable for Retrieval-Augmented Generation (RAG)** and legal information retrieval tasks.
Each document is split into semantically coherent text chunks and stored in a JSONL format with associated metadata such as source document name, chunk index, and total page count. The dataset is designed to support experimentation with legal-domain retrieval, embedding models, hybrid search (BM25 + dense), reranking, and downstream question-answering systems.
The content primarily includes:
- Raw Supreme Court judgment text with case identifiers and party names embedded in the prose (e.g., petitioners, respondents, court name, dates, and citations where present)
- Narrative descriptions of the factual background, as discussed within the judgment body
- Judicial reasoning, legal analysis, and final holdings authored by the court
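The card does not say which splitter or chunk size produced the released chunks. As an illustration only, the following is a minimal sketch of recursive character splitting in general: try coarse separators first, then fall back to finer ones. The function name `recursive_split`, the `max_chars` value, and the separator list are assumptions made for this example, not the authors' settings.

```python
def recursive_split(text, max_chars=1000, separators=("\n\n", "\n", " ")):
    """Split text into chunks of at most max_chars characters.

    Coarser separators are tried first; pieces that are still too long
    are split again with the remaining, finer separators.
    """
    text = text.strip()
    if not text:
        return []
    if len(text) <= max_chars:
        return [text]
    for i, sep in enumerate(separators):
        if sep not in text:
            continue
        pieces, current = [], ""
        for part in text.split(sep):
            # Greedily pack parts into the current piece while it fits.
            candidate = part if not current else current + sep + part
            if len(candidate) <= max_chars:
                current = candidate
            else:
                if current:
                    pieces.append(current)
                current = part  # may still be too long; handled below
        if current:
            pieces.append(current)
        chunks = []
        for piece in pieces:
            chunks.extend(recursive_split(piece, max_chars, separators[i + 1:]))
        return chunks
    # No separator left: fall back to a hard cut.
    return [text[j:j + max_chars] for j in range(0, len(text), max_chars)]


# Made-up sample text, not taken from the dataset.
sample = "para one.\n\npara two is a bit longer. It has two sentences."
chunks = recursive_split(sample, max_chars=30)
```

The separator at a chunk boundary is dropped in this sketch, so rejoining the chunks does not reproduce the input exactly.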
---
## Getting Started
To help you get started with building legal-domain assistants, we have implemented and evaluated several RAG architectures using this dataset. They are described in our [Medium Article](https://medium.com/@engineering_13123/building-rag-systems-for-legal-documents-understanding-the-challenge-2e67fd4cce86), and the code is available in our [GitHub repository](https://github.com/DevDolphins/legal-bench-rag).
You can use this dataset to experiment with the following strategies:
- Baseline Retrieval: Utilizing standard Recursive Character Text Splitting (RCTS) for chunking paired with dense retrieval.
- Summary Indexing: Generating and indexing summaries of the legal chunks to improve semantic matching and context capture.
- Summary Indexing with Reranking: Enhancing the summary-based retrieval pipeline by applying a cross-encoder or reranker to reorder the top retrieved documents.
- Contextual Embedding: Appending broader document-level context to individual chunks before generating embeddings to preserve legal nuance.
- Contextual Embedding + Hybrid Retrieval: Combining contextual embeddings with sparse retrieval (e.g., BM25) to capture both semantic meaning and exact legal keyword matches.
- Contextual Embedding + Reranking: Applying a final reranking step over the hybrid or dense results retrieved via contextual embeddings to improve precision at the top of the ranking.
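Two of the strategies above can be sketched without any external library. The card does not state how document-level context is attached to a chunk or how sparse and dense results are merged, so both choices here are assumptions: context is prepended as plain text, and rankings are combined with reciprocal rank fusion (RRF), one common fusion method. The names `contextualize` and `reciprocal_rank_fusion`, and the sample rankings, are hypothetical.

```python
from collections import defaultdict


def contextualize(chunk_text: str, doc_context: str) -> str:
    """Prepend document-level context to a chunk before it is embedded."""
    return f"{doc_context.strip()}\n\n{chunk_text.strip()}"


def reciprocal_rank_fusion(rankings: list[list[int]], k: int = 60) -> list[int]:
    """Fuse several ranked lists of chunk ids into a single ranking.

    Each id scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k = 60 is a conventional default.
    """
    scores: dict[int, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical per-query rankings (best first), e.g. from BM25 and a dense index.
bm25_ranking = [3, 1, 7]
dense_ranking = [1, 4, 3]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)  # [1, 3, 4, 7]
```

RRF only needs ranks, not scores, which is convenient here because BM25 scores and embedding similarities are on different scales.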
### Dataset Sources
- **Source:** Publicly available Supreme Court of India judgment documents
## Uses
### Direct Use
This dataset is suitable for:
- Retrieval-Augmented Generation (RAG) systems in the legal domain
- Dense and sparse retrieval benchmarking (BM25, embeddings, hybrid search)
- Chunking strategy evaluation for long legal documents
- Legal question answering and case law exploration
- Legal NLP research and academic experimentation
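For retrieval benchmarking, two commonly used metrics are recall@k and reciprocal rank. The sketch below assumes you have built your own set of queries with relevant chunk ids; the card does not describe any relevance labels shipped with the dataset, so the ids and function names here are hypothetical.

```python
def recall_at_k(retrieved: list[int], relevant: set[int], k: int) -> float:
    """Fraction of the relevant chunk ids found in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def reciprocal_rank(retrieved: list[int], relevant: set[int]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical results for a single query.
retrieved_ids = [5, 2, 9, 1]
relevant_ids = {2, 1}
r_at_3 = recall_at_k(retrieved_ids, relevant_ids, k=3)  # 0.5
rr = reciprocal_rank(retrieved_ids, relevant_ids)       # 0.5
```

Averaging `reciprocal_rank` over all queries gives mean reciprocal rank (MRR), which makes sparse, dense, and hybrid retrievers directly comparable on the same query set.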
### Out-of-Scope Use
This dataset is **not suitable** for:
- Providing legal advice or real-world legal decision-making
- Training models intended to replace qualified legal professionals
- Tasks requiring up-to-date or jurisdiction-wide legal completeness
- Predictive legal analytics without further validation and augmentation
## Dataset Structure
The dataset is stored in **JSON Lines (`.jsonl`) format**, where each line represents a single text chunk.
### Fields
Each record contains the following fields:
- `id` *(int)*: Unique identifier for the chunk
- `text` *(string)*: Extracted text content from the judgment
- `metadata` *(object)*:
  - `source` *(string)*: Original PDF filename
  - `chunk_index` *(int)*: Position of the chunk within the document
  - `total_pages` *(int)*: Total number of pages in the source document
### Example
```json
{
  "id": 1,
  "text": "http://JUDIS.NIC.IN SUPREME COURT OF INDIA Page 1 of 6...",
  "metadata": {
    "source": "6482417cc33c75ac1d880101.pdf",
    "chunk_index": 1,
    "total_pages": 6
  }
}
```
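Records in this shape can be read with the standard library alone. The following is a minimal sketch that parses JSONL lines and groups chunks by `metadata.source`, ordered by `chunk_index`, so that a judgment can be read back in order. The helper names and the two sample lines are made up for illustration; only the field names come from the structure described above.

```python
import json
from collections import defaultdict


def load_chunks(lines):
    """Parse an iterable of JSONL lines into dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def group_by_source(records):
    """Map each source PDF filename to its chunks, sorted by chunk_index."""
    docs = defaultdict(list)
    for rec in records:
        docs[rec["metadata"]["source"]].append(rec)
    for chunks in docs.values():
        chunks.sort(key=lambda r: r["metadata"]["chunk_index"])
    return dict(docs)


# Made-up sample lines that follow the documented fields.
sample_lines = [
    '{"id": 2, "text": "second chunk", "metadata": {"source": "a.pdf", "chunk_index": 2, "total_pages": 6}}',
    '{"id": 1, "text": "first chunk", "metadata": {"source": "a.pdf", "chunk_index": 1, "total_pages": 6}}',
]
docs = group_by_source(load_chunks(sample_lines))
ordered_ids = [rec["id"] for rec in docs["a.pdf"]]  # [1, 2]
```

An open file object can be passed to `load_chunks` in place of the list, since iterating over a text file yields one line at a time.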
## References
```bibtex
@misc{india_case_legal_rag,
  author = {Hruthika S and Ajinkya T},
  title = {Indian Legal Corpus Dataset for RAG},
  publisher = {DevDolphins / HuggingFace Hub},
  year = {2026},
  url = {https://huggingface.co/datasets/dedol-hf/india-case-legal-rag},
  version = {1.0.0},
  note = {Accessed: 2026-02-14}
}
```
For further details and context, please refer to:
- [Medium Article](https://medium.com/@engineering_13123/building-rag-systems-for-legal-documents-understanding-the-challenge-2e67fd4cce86)
- [Code Base - GitHub](https://github.com/DevDolphins/legal-bench-rag)
---
## Dataset Card Contact
[DevDolphins](https://www.devdolphins.com)
Provider: lakshyatest



