lakshyatest/india-case-legal-rag

Name: lakshyatest/india-case-legal-rag
Creator: lakshyatest
Published: 2026-04-06 18:44:09
License: No description available

Hugging Face · Updated 2026-04-06 · Indexed 2026-04-12

Download link:

https://hf-mirror.com/datasets/lakshyatest/india-case-legal-rag

Description:

---
language:
- en
tags:
- rag
- legal-corpus
- indian-law
- supreme-court
- legal-documents
- information-retrieval
---

# Dataset Card for Indian legal documents (Chunked for RAG)

## Dataset Details

### Dataset Description

This dataset consists of **Indian Supreme Court judgments** extracted from publicly available court documents and processed into **textual chunks suitable for Retrieval-Augmented Generation (RAG)** and legal information retrieval tasks.

Each document is split into semantically coherent text chunks and stored in a JSONL format with associated metadata such as source document name, chunk index, and total page count. The dataset is designed to support experimentation with legal-domain retrieval, embedding models, hybrid search (BM25 + dense), reranking, and downstream question-answering systems.

The content primarily includes:

- Raw Supreme Court judgment text with case identifiers and party names embedded in the prose (e.g., petitioners, respondents, court name, dates, and citations where present)
- Narrative descriptions of the factual background, as discussed within the judgment body
- Judicial reasoning, legal analysis, and final holdings authored by the court

---

## Getting Started

To help you get started with building legal domain assistants, we have implemented and evaluated several RAG architectures using this dataset. They are detailed in our [Medium Article](https://medium.com/@engineering_13123/building-rag-systems-for-legal-documents-understanding-the-challenge-2e67fd4cce86) and available to explore in our [GitHub repository](https://github.com/DevDolphins/legal-bench-rag).

You can use this dataset to experiment with the following strategies:

- Baseline Retrieval: Utilizing standard Recursive Character Text Splitting (RCTS) for chunking paired with dense retrieval.
- Summary Indexing: Generating and indexing summaries of the legal chunks to improve semantic matching and context capture.
- Summary Indexing with Reranking: Enhancing the summary-based retrieval pipeline by applying a cross-encoder or reranker to reorder the top retrieved documents.
- Contextual Embedding: Appending broader document-level context to individual chunks before generating embeddings to preserve legal nuance.
- Contextual Embedding + Hybrid Retrieval: Combining contextual embeddings with sparse retrieval (e.g., BM25) to capture both semantic meaning and exact legal keyword matches.
- Contextual Embedding + Reranking: Applying a final reranking step over the hybrid or dense results retrieved via contextual embeddings for maximum precision.

### Dataset Sources

- **Source:** Publicly available Supreme Court of India judgment documents

## Uses

### Direct Use

This dataset is suitable for:

- Retrieval-Augmented Generation (RAG) systems in the legal domain
- Dense and sparse retrieval benchmarking (BM25, embeddings, hybrid search)
- Chunking strategy evaluation for long legal documents
- Legal question answering and case law exploration
- Legal NLP research and academic experimentation

### Out-of-Scope Use

This dataset is **not suitable** for:

- Providing legal advice or real-world legal decision-making
- Training models intended to replace qualified legal professionals
- Tasks requiring up-to-date or jurisdiction-wide legal completeness
- Predictive legal analytics without further validation and augmentation

## Dataset Structure

The dataset is stored in **JSON Lines (`.jsonl`) format**, where each line represents a single text chunk.
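
Because each line is an independent JSON object, a file in this format can be read one line at a time. The following is a minimal sketch, not the authors' loader: the helper name `iter_chunks` and the two sample lines are hypothetical, and the sample only mimics the record layout documented in this card.

```python
import json
from typing import Iterable, Iterator


def iter_chunks(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed record per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines, e.g. a trailing newline at end of file
            yield json.loads(line)


# Hypothetical sample lines that imitate the card's record layout.
sample_lines = [
    '{"id": 1, "text": "first chunk", "metadata": '
    '{"source": "a.pdf", "chunk_index": 1, "total_pages": 6}}',
    '{"id": 2, "text": "second chunk", "metadata": '
    '{"source": "a.pdf", "chunk_index": 2, "total_pages": 6}}',
]

records = list(iter_chunks(sample_lines))
```

For a file on disk, the same helper accepts an open file handle, for example `with open(path, encoding="utf-8") as f: records = list(iter_chunks(f))`, where `path` points to a local copy of the `.jsonl` file (the actual filename is not stated in this card).
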
### Fields

Each record contains the following fields:

- `id` *(int)*: Unique identifier for the chunk
- `text` *(string)*: Extracted text content from the judgment
- `metadata` *(object)*:
  - `source` *(string)*: Original PDF filename
  - `chunk_index` *(int)*: Position of the chunk within the document
  - `total_pages` *(int)*: Total number of pages in the source document

### Example

```json
{
  "id": 1,
  "text": "http://JUDIS.NIC.IN SUPREME COURT OF INDIA Page 1 of 6...",
  "metadata": {
    "source": "6482417cc33c75ac1d880101.pdf",
    "chunk_index": 1,
    "total_pages": 6
  }
}
```

## References

```bibtex
@misc{india_case_legal_rag,
  author    = {Hruthika S, Ajinkya T},
  title     = {Indian Legal Corpus Dataset for RAG},
  publisher = {DevDolphins / HuggingFace Hub},
  year      = {2026},
  url       = {https://huggingface.co/datasets/dedol-hf/india-case-legal-rag},
  version   = {1.0.0},
  note      = {Accessed: 2026-02-14}
}
```

For further details and context, please refer to:

- [Medium Article](https://medium.com/@engineering_13123/building-rag-systems-for-legal-documents-understanding-the-challenge-2e67fd4cce86)
- [Code Base - GitHub](https://github.com/DevDolphins/legal-bench-rag)

---

## Dataset Card Contact

[DevDolphins](https://www.devdolphins.com)
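
The `source` and `chunk_index` metadata fields described in this card make it possible to regroup chunks by the judgment they came from. The sketch below is illustrative only: the function name `group_by_source` and the sample records are hypothetical, and it assumes records have already been parsed into dictionaries with the documented layout. The card does not say whether adjacent chunks overlap, so the ordered chunks are returned as a list rather than joined into a single text.

```python
from collections import defaultdict


def group_by_source(records):
    """Map each source filename to its chunk texts, ordered by chunk_index."""
    grouped = defaultdict(list)
    for rec in records:
        meta = rec["metadata"]
        grouped[meta["source"]].append((meta["chunk_index"], rec["text"]))
    # Sort on chunk_index only, so ties never fall back to comparing texts.
    return {
        src: [text for _, text in sorted(items, key=lambda pair: pair[0])]
        for src, items in grouped.items()
    }


# Hypothetical records, deliberately out of order.
sample_records = [
    {"id": 2, "text": "B", "metadata": {"source": "x.pdf", "chunk_index": 2, "total_pages": 3}},
    {"id": 3, "text": "C", "metadata": {"source": "y.pdf", "chunk_index": 1, "total_pages": 1}},
    {"id": 1, "text": "A", "metadata": {"source": "x.pdf", "chunk_index": 1, "total_pages": 3}},
]

by_source = group_by_source(sample_records)
```

Grouping this way is one option for strategies that need document-level context, such as prepending a per-judgment summary to each chunk before embedding.
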

Provider:

lakshyatest
