KanoonGPT/indian-case-laws
Source: Hugging Face. Updated 2026-04-20; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/KanoonGPT/indian-case-laws
Resource description:
---
pretty_name: Indian Case Laws
language:
- en
license: apache-2.0
tags:
- law
- legal
- india
- case-law
- judgments
- legal-tech
- text
- tabular
- datasets
---
# Indian Case Laws
**Open Indian case-law data for AI, search, and legal research.**
This dataset is part of the **KanoonGPT Open Legal Data Initiative** - an effort to make Indian legal data easier to access, trace, and build on for open-source research, legal tech, and production AI systems.
KanoonGPT is building structured Indian legal data and data infrastructure for open-source, research, and enterprise AI applications. Learn more at [kanoongpt.in](https://kanoongpt.in).
Repository: [KanoonGPT/indian-case-laws](https://huggingface.co/datasets/KanoonGPT/indian-case-laws)
## Coverage at a glance
- Timeframe covered: **1950-2026** (rolling; new judgments are added continuously).
- Courts covered: **Supreme Court of India + 25 High Courts**.
### Courts currently covered
| Code | Court |
|---|---|
| `SCI` | Supreme Court of India |
| `3~22` | High Court of Punjab and Haryana |
| `33~10` | Madras High Court |
| `36~29` | High Court for State of Telangana |
| `28~2` | High Court of Andhra Pradesh |
| `10~8` | Patna High Court |
| `32~4` | High Court of Kerala |
| `8~9` | High Court of Rajasthan |
| `21~11` | High Court of Orissa |
| `9~13` | Allahabad High Court |
| `22~18` | High Court of Chhattisgarh |
| `27~1` | Bombay High Court |
| `29~3` | High Court of Karnataka |
| `23~23` | High Court of Madhya Pradesh |
| `7~26` | High Court of Delhi |
| `20~7` | High Court of Jharkhand |
| `19~16` | Calcutta High Court |
| `24~17` | High Court of Gujarat |
| `2~5` | High Court of Himachal Pradesh |
| `18~6` | Gauhati High Court |
| `1~12` | High Court of Jammu and Kashmir |
| `5~15` | High Court of Uttarakhand |
| `16~20` | High Court of Tripura |
| `14~25` | High Court of Manipur |
| `17~21` | High Court of Meghalaya |
| `11~24` | High Court of Sikkim |
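As a rough illustration of how the `court_code` values in the table might be used, the sketch below keeps a few of the listed codes in a plain dictionary and filters rows by court. The rows are made-up placeholders, and `court_name` and `rows_for_court` are hypothetical helpers, not part of any KanoonGPT tooling.

```python
# A subset of the court codes listed in the table above.
COURT_NAMES = {
    "SCI": "Supreme Court of India",
    "7~26": "High Court of Delhi",
    "27~1": "Bombay High Court",
    "33~10": "Madras High Court",
}


def court_name(code: str) -> str:
    """Return the court name for a code, or a marker for codes not in the map."""
    return COURT_NAMES.get(code, f"unknown court code: {code}")


def rows_for_court(rows, code):
    """Keep only rows whose `court_code` equals `code` (hypothetical helper)."""
    return [row for row in rows if row.get("court_code") == code]


# Placeholder rows for illustration only; these are not real records.
example_rows = [
    {"id": "a", "court_code": "SCI"},
    {"id": "b", "court_code": "7~26"},
    {"id": "c", "court_code": "SCI"},
]

sci_rows = rows_for_court(example_rows, "SCI")
print(court_name("7~26"), len(sci_rows))
```

Because the codes contain a `~` character, comparing them as exact strings is likely the safest approach when filtering.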
## Why this dataset exists
Indian case law is public, but it is still difficult to work with at scale: much of it sits behind CAPTCHAs and inside PDFs.
Court documents are also often fragmented, inconsistently structured, and not packaged for modern ML, search, or analytics workflows. KanoonGPT standardizes this data into an **AI-ready, query-friendly, provenance-aware** format so builders can use it for:
- legal search and retrieval
- citation-aware RAG pipelines
- Indian legal NLP benchmarks
- case-law analytics
- metadata extraction
- legal copilots and agents
- downstream model training and evaluation
## What this dataset contains
This repository focuses on **structured metadata for Indian court judgments**, starting with **Supreme Court** and **High Court** records.
The dataset is designed to preserve both:
1. **clean flattened columns** for search, filtering, analytics, and ML pipelines
2. **full normalized source payloads** for auditability and reproducibility
Many records also include **public source links** to the underlying JSON and judgment PDF artifacts.
## Provenance
This dataset is built from public Indian court judgment data sourced from the **eCourts judgments ecosystem** and mirrored through the AWS Open Data releases for Indian High Court and Indian Supreme Court judgments. Those AWS pages describe the datasets, their scope, and access points.
Upstream open-data references:
- [Indian High Court Judgments — AWS Open Data](https://registry.opendata.aws/indian-high-court-judgments/)
- [Indian Supreme Court Judgments — AWS Open Data](https://registry.opendata.aws/indian-supreme-court-judgments/)
KanoonGPT packages, normalizes, and republishes this data in a Hugging Face-friendly format for research and AI use cases.
## Release variants
This dataset is being published in three variants:
| Variant | What it contains | Status |
|---|---|---|
| `sample` | Non-partitioned representative subset for quick exploration, demos, schema inspection, and integration tests. Current rule: for each `(source_path_year, court_code)` group, fetch up to 10,000 rows, then randomize and keep 20 rows. | Available |
| `structured` | Full flattened metadata for judgments, including parties, citations, court details, dates, provenance, and quality signals. | Rolling release |
| `full` | Structured metadata plus judgment text payloads for retrieval, fine-tuning, and text-heavy downstream tasks. | Coming soon |
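The `sample` rule in the table can be sketched roughly as follows. This is a minimal illustration on in-memory rows, not the actual KanoonGPT pipeline: the function name `build_sample`, the fixed seed, and the choice to cap each group in input order are all assumptions.

```python
import random
from collections import defaultdict


def build_sample(rows, fetch_limit=10_000, keep=20, seed=0):
    """Group rows by (source_path_year, court_code), cap each group at
    `fetch_limit` rows in input order, shuffle, and keep up to `keep` rows."""
    rng = random.Random(seed)  # fixed seed is an assumption, for reproducibility
    groups = defaultdict(list)
    for row in rows:
        key = (row["source_path_year"], row["court_code"])
        if len(groups[key]) < fetch_limit:
            groups[key].append(row)

    sample = []
    for key in sorted(groups):
        candidates = groups[key][:]  # copy so the grouped rows stay in order
        rng.shuffle(candidates)
        sample.extend(candidates[:keep])
    return sample


# Synthetic rows for illustration only: one large group and one small group.
rows = [{"id": i, "source_path_year": 2020, "court_code": "SCI"} for i in range(50)]
rows += [{"id": 100 + i, "source_path_year": 2021, "court_code": "7~26"} for i in range(5)]

sample = build_sample(rows)
print(len(sample))
```

Groups with fewer than 20 rows contribute all of their rows, so small courts or sparse years are not dropped from the sample.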
## Schema overview
The current structured release centers on flattened metadata for High Court (HC) and Supreme Court (SC) judgments.
### Identity and dataset fields
- `id`
- `case_metadata_id`
- `dataset_source`
- `parser_record_id`
- `ingestion_split`
### Parties and case caption
- `case_title`
- `party_petitioner`
- `party_respondent`
- `party_caption`
### Legal references
- `docket_number`
- `cnr_number`
- `neutral_citation`
- `law_report_citation`
### Court and adjudicators
- `court_name`
- `court_code`
- `bench_name`
- `presiding_judge`
- `coram_members`
- `coram_members_text`
### Dates and disposition
- `decision_date`
- `registration_date`
- `citation_year`
- `decision_year`
- `disposition_text`
### Source provenance and artifacts
- `source_relative_path`
- `source_path_year`
- `source_path_court_code`
- `source_path_bench`
- `source_filename`
- `source_json_s3_url`
- `source_pdf_s3_url`
- `language_codes`
### Search, parser, and quality fields
- `indexable_text`
- `headnote_text`
- `normalized_record_json`
- `parser_json`
- `quality_json`
- `created_at`
- `updated_at`
For Parquet publishing stability, `normalized_record_json`, `parser_json`, and `quality_json` are serialized as JSON strings in exported files, so they need to be parsed before their contents can be inspected.
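Since these three columns arrive as JSON strings, a consumer would typically decode them after loading. The sketch below uses only the standard library; `decode_json_columns` is a hypothetical helper, and the keys inside the example row (such as `warnings`) are invented for illustration, since the internal structure of these payloads is not described here.

```python
import json

# Column names taken from the schema overview above.
JSON_COLUMNS = ("normalized_record_json", "parser_json", "quality_json")


def decode_json_columns(row):
    """Return a copy of `row` with the JSON-string columns parsed.

    Empty or malformed strings become None instead of raising; values that
    are not strings (already decoded, or missing) are passed through.
    """
    decoded = dict(row)
    for column in JSON_COLUMNS:
        raw = row.get(column)
        if isinstance(raw, str):
            try:
                decoded[column] = json.loads(raw) if raw.strip() else None
            except json.JSONDecodeError:
                decoded[column] = None
        else:
            decoded[column] = raw
    return decoded


# Made-up row for illustration only; the payload keys are assumptions.
example_row = {
    "id": "example-1",
    "parser_json": '{"warnings": []}',
    "quality_json": "not valid json",
}

decoded = decode_json_columns(example_row)
print(decoded["parser_json"], decoded["quality_json"])
```

Returning `None` for unparseable payloads keeps a bulk pass over the data from failing on a single bad record; a stricter pipeline might prefer to log or raise instead.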
## Example record
A sample record may include fields such as:
- case title and party names
- docket number and CNR number
- neutral citation and law report citation
- court name and bench
- coram / adjudicators
- decision date and case disposition
- provenance URLs pointing to source JSON and judgment PDF
- parser diagnostics and quality flags
- a full `normalized_record_json` snapshot for traceability
This makes the dataset useful both for lightweight metadata workflows and for provenance-sensitive legal AI systems.
## Design principles
### AI-ready, not just archive-ready
This dataset is intended for **real downstream usage** — search, ranking, retrieval, analytics, evaluation, and model-building — not just passive storage.
### Structured first
Important legal signals such as parties, citations, dates, bench details, and outcomes are flattened into stable columns instead of remaining buried in raw blobs.
### Traceable to source
Records preserve source references, source-path derivations, and normalized JSON so users can validate extracted fields against upstream artifacts.
### Honest about quality
Legal data at scale is messy. Parser diagnostics and quality signals are included so downstream users can filter, inspect, or review records instead of assuming uniform quality.
## Why KanoonGPT is publishing this
We believe Indian legal AI needs **better open infrastructure**.
Open-source legal datasets help researchers, startups, and public-interest builders work from a common foundation instead of repeatedly rebuilding the same ingestion layer from scratch. KanoonGPT’s goal is to contribute usable, well-structured legal data that is easier to explore, benchmark, and build on.
## Responsible use
- Verify important legal facts against the original court record and judgment PDF before high-stakes use.
- Do not treat this dataset as legal advice.
- Source data may contain parser noise, missing fields, inconsistent formatting, or upstream errors.
- If you are building end-user systems, add your own validation, citations, and human review layers.
- Respect privacy, applicable law, and platform terms in downstream applications.
## Acknowledgements
This dataset stands on top of the broader Indian open legal data ecosystem. We are grateful to:
- [Pradeep Vanga](https://www.linkedin.com/in/pradeepvanga/) and contributors to the open court-data ecosystem
- [vanga/indian-high-court-judgments](https://github.com/vanga/indian-high-court-judgments)
- [vanga/indian-supreme-court-judgments](https://github.com/vanga/indian-supreme-court-judgments)
- AWS and the [Registry of Open Data on AWS](https://registry.opendata.aws/) for sponsoring public hosting and access infrastructure
## About KanoonGPT
KanoonGPT is building Indian legal datasets and data infrastructure for open-source, research, and enterprise AI applications.
**Coverage areas:**
- ⚖️ Case law
- 📜 Bare Acts
- 🏷️ Legal metadata
- 🧩 Structured legal corpora
Website: [kanoongpt.in](https://kanoongpt.in)
## Licensing
The dataset is released under the **`apache-2.0`** license.
## Citation
If you use this dataset, please cite the Hugging Face dataset and, where relevant, the upstream open-data sources.
```bibtex
@dataset{kanoongpt_indian_case_laws,
author = {KanoonGPT},
title = {Indian Case Laws},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/datasets/KanoonGPT/indian-case-laws},
note = {Company website: https://kanoongpt.in}
}
```
Provider:
KanoonGPT



