cold-cases
Source: ModelScope community. Updated: 2025-10-17. Indexed: 2025-08-23.
Download link:
https://modelscope.cn/datasets/Virgo-Internal/cold-cases
Description:
<img src="https://huggingface.co/datasets/harvard-lil/cold-cases/resolve/main/coldcases-banner.webp"/>
# Collaborative Open Legal Data (COLD) - Cases
COLD Cases is a dataset of 8.3 million United States legal decisions with text and metadata, formatted as compressed parquet files. If you'd like to view a sample of the dataset formatted as JSON Lines, you can view one [here](https://raw.githubusercontent.com/harvard-lil/cold-cases-export/main/sample.jsonl).
This dataset exists to support the open legal movement exemplified by projects like
[Pile of Law](https://huggingface.co/datasets/pile-of-law/pile-of-law) and
[LegalBench](https://hazyresearch.stanford.edu/legalbench/).
A key input to legal understanding projects is caselaw -- the published, precedential decisions of judges deciding legal disputes and explaining their reasoning.
United States caselaw is collected and published as open data by [CourtListener](https://www.courtlistener.com/), which maintains scrapers to aggregate data from
a wide range of public sources.
COLD Cases reformats CourtListener's [bulk data](https://www.courtlistener.com/help/api/bulk-data) so that all of the semantic information about each legal decision
(the authors and text of majority and dissenting opinions; head matter; and substantive metadata) is encoded in a single record per decision,
with extraneous data removed. Serving in the traditional role of libraries as a standardization steward, the Harvard Library Innovation Lab is maintaining
this [open source](https://github.com/harvard-lil/cold-cases-export) pipeline to consolidate the data engineering for preprocessing caselaw so downstream machine
learning and natural language processing projects can use consistent, high quality representations of cases for legal understanding tasks.
Prepared by the [Harvard Library Innovation Lab](https://lil.law.harvard.edu) in collaboration with the [Free Law Project](https://free.law/).
---
## Links
- [Data nutrition label](https://datanutrition.org/labels/v3/?id=c29976b2-858c-4f4e-b7d0-c8ef12ce7dbe) (DRAFT). ([Archive](https://perma.cc/YV5P-B8JL)).
- [Pipeline source code](https://github.com/harvard-lil/cold-cases-export)
---
## Summary
- [Format](#format)
- [Data dictionary](#data-dictionary)
- [Notes on appropriate use](#notes-on-appropriate-use)
---
## Format
[Apache Parquet](https://parquet.apache.org/) is a binary format that lays out data in columns, which makes filtering and retrieval faster because columns that a given query or workflow doesn't need are never read. Hugging Face's [Datasets](https://huggingface.co/docs/datasets/index) library is an easy way to get started with the entire dataset: it can load or stream the data, so you don't need to store it all locally or pay attention to how it's formatted on disk.
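As a minimal sketch of the streaming approach described above, the snippet below previews a few records without downloading the full dataset. It assumes the `datasets` package is installed, that the Hugging Face repository id is `harvard-lil/cold-cases` (inferred from the banner URL on this page), and that the data is exposed under a `train` split. The helper names `take` and `stream_cold_cases` are illustrative, not part of any library.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List


def take(records: Iterable[Dict[str, Any]], n: int) -> List[Dict[str, Any]]:
    """Return at most the first n records from any iterable."""
    return list(islice(records, n))


def stream_cold_cases() -> Iterator[Dict[str, Any]]:
    """Iterate over the dataset in streaming mode.

    Records are fetched on demand as you iterate rather than downloaded up front.
    Assumptions: repo id "harvard-lil/cold-cases" and split "train".
    """
    # Imported here so `take` stays usable even without `datasets` installed.
    from datasets import load_dataset

    dataset = load_dataset("harvard-lil/cold-cases", split="train", streaming=True)
    return iter(dataset)


# Usage (requires network access):
#   for case in take(stream_cold_cases(), 3):
#       print(case["case_name"], case["date_filed"])
```

Streaming suits a quick look at the data. For repeated analytical queries over a few columns, reading the parquet files directly with a columnar engine may be a better fit.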
[☝️ Go back to Summary](#summary)
---
## Data dictionary
Partial glossary of the fields in the data.
| Field name | Description |
| --- | --- |
| `judges` | Names of judges presiding over the case, extracted from the text. |
| `date_filed` | Date the case was filed. Formatted in ISO Date format. |
| `date_filed_is_approximate` | Boolean indicating whether the `date_filed` value is approximate rather than precise to the day. |
| `slug` | Short, human-readable unique string nickname for the case. |
| `case_name_short` | Short name for the case. |
| `case_name` | Fuller name for the case. |
| `case_name_full` | Full, formal name for the case. |
| `attorneys` | Names of attorneys arguing the case, extracted from the text. |
| `nature_of_suit` | Free text representing the type of suit, such as Civil, Tort, etc. |
| `syllabus` | Summary of the questions addressed in the decision, if provided by the reporter of decisions. |
| `headnotes` | Textual headnotes of the case. |
| `summary` | Textual summary of the case. |
| `disposition` | How the court disposed of the case in its final ruling. |
| `history` | Textual information about what happened to this case in later decisions. |
| `other_dates` | Other dates related to the case in free text. |
| `cross_reference` | Citations to related cases. |
| `citation_count` | Number of cases that cite this one. |
| `precedential_status` | Constrained to the values "Published", "Unknown", "Errata", "Unpublished", "Relating-to", "Separate", "In-chambers". |
| `citations` | Cases that cite this case. |
| `court_short_name` | Short name of court presiding over case. |
| `court_full_name` | Full name of court presiding over case. |
| `court_jurisdiction` | Code for type of court that presided over the case. See: [court_type field values](#court_type-field-values) |
| `opinions` | An array of subrecords. |
| `opinions.author_str` | Name of the author of an individual opinion. |
| `opinions.per_curiam` | Boolean representing whether the opinion was delivered by an entire court or a single judge. |
| `opinions.type` | One of `"010combined"`, `"015unamimous"`, `"020lead"`, `"025plurality"`, `"030concurrence"`, `"035concurrenceinpart"`, `"040dissent"`, `"050addendum"`, `"060remittitur"`, `"070rehearing"`, `"080onthemerits"`, `"090onmotiontostrike"`. |
| `opinions.opinion_text` | Actual full text of the opinion. |
| `opinions.ocr` | Whether the opinion was captured via optical character recognition or born-digital text. |
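
Because `opinions` is an array of subrecords, most per-opinion questions reduce to a filter over that array. The sketch below uses a hypothetical record with invented values, shaped like the fields listed above, and assumes each record's `opinions` value arrives as a list of dictionaries. The helper names are illustrative.

```python
from typing import Any, Dict, List

# A hypothetical record shaped like the data dictionary above (values invented).
SAMPLE_CASE: Dict[str, Any] = {
    "case_name": "Example v. Sample",
    "date_filed": "1999-04-12",
    "precedential_status": "Published",
    "opinions": [
        {"author_str": "Judge A", "per_curiam": False, "type": "020lead",
         "opinion_text": "Lead opinion text.", "ocr": False},
        {"author_str": "Judge B", "per_curiam": False, "type": "040dissent",
         "opinion_text": "Dissent text.", "ocr": True},
    ],
}


def opinions_of_type(case: Dict[str, Any], opinion_type: str) -> List[Dict[str, Any]]:
    """Return the opinion subrecords whose `type` equals opinion_type."""
    # `opinions` may be missing or None in sparse records, so default to empty.
    return [op for op in (case.get("opinions") or []) if op.get("type") == opinion_type]


def opinion_authors(case: Dict[str, Any]) -> List[str]:
    """List author names across all opinions, skipping empty values."""
    return [op["author_str"] for op in (case.get("opinions") or []) if op.get("author_str")]


dissents = opinions_of_type(SAMPLE_CASE, "040dissent")
print([op["author_str"] for op in dissents])  # → ['Judge B']
```

The numeric prefixes in the `opinions.type` values sort lexicographically, so sorting subrecords by `type` yields a stable ordering of opinions within a case.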
### court_type field values
| Value | Description |
| --- | --- |
| F | Federal Appellate |
| FD | Federal District |
| FB | Federal Bankruptcy |
| FBP | Federal Bankruptcy Panel |
| FS | Federal Special |
| S | State Supreme |
| SA | State Appellate |
| ST | State Trial |
| SS | State Special |
| TRS | Tribal Supreme |
| TRA | Tribal Appellate |
| TRT | Tribal Trial |
| TRX | Tribal Special |
| TS | Territory Supreme |
| TA | Territory Appellate |
| TT | Territory Trial |
| TSP | Territory Special |
| SAG | State Attorney General |
| MA | Military Appellate |
| MT | Military Trial |
| C | Committee |
| I | International |
| T | Testing |
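
The code table above can be carried into analysis code as a plain lookup. This is a minimal sketch: the mapping is copied from the table, while the helper names `describe_court_code` and `is_federal` are illustrative, and no assumption is made here about which record field carries the code.

```python
from typing import Dict

# Codes and descriptions copied from the table above.
COURT_CODES: Dict[str, str] = {
    "F": "Federal Appellate",
    "FD": "Federal District",
    "FB": "Federal Bankruptcy",
    "FBP": "Federal Bankruptcy Panel",
    "FS": "Federal Special",
    "S": "State Supreme",
    "SA": "State Appellate",
    "ST": "State Trial",
    "SS": "State Special",
    "TRS": "Tribal Supreme",
    "TRA": "Tribal Appellate",
    "TRT": "Tribal Trial",
    "TRX": "Tribal Special",
    "TS": "Territory Supreme",
    "TA": "Territory Appellate",
    "TT": "Territory Trial",
    "TSP": "Territory Special",
    "SAG": "State Attorney General",
    "MA": "Military Appellate",
    "MT": "Military Trial",
    "C": "Committee",
    "I": "International",
    "T": "Testing",
}


def describe_court_code(code: str) -> str:
    """Return the description for a court code, or 'Unknown' if it is not listed."""
    return COURT_CODES.get(code.strip().upper(), "Unknown")


def is_federal(code: str) -> bool:
    """True for the codes whose description starts with 'Federal'."""
    return describe_court_code(code).startswith("Federal")


print(describe_court_code("SA"))  # → State Appellate
```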
[☝️ Go back to Summary](#summary)
---
## Notes on appropriate use
When using this data, please keep in mind:
* All documents in this dataset are public information, published by courts within the United States to inform the public about the law. **You have a right to access them.**
* Nevertheless, **public court decisions frequently contain statements about individuals that are not true**. Court decisions often contain claims that are disputed,
or false claims taken as true based on a legal technicality, or claims taken as true but later found to be false. Legal decisions are designed to inform you about the law -- they are not
designed to inform you about individuals, and should not be used in place of credit databases, criminal records databases, news articles, or other sources intended
to provide factual personal information. Applications should carefully consider whether use of this data will inform about the law, or mislead about individuals.
* **Court decisions are not up-to-date statements of law**. Each decision provides a given judge's best understanding of the law as applied to the stated facts
at the time of the decision. Use of this data to generate statements about the law requires integration of a large amount of context --
the skill typically provided by lawyers -- rather than simple data retrieval.
To mitigate privacy risks, we have filtered out cases [blocked or deindexed by CourtListener](https://www.courtlistener.com/terms/#removal). Researchers who
require access to the full dataset without that filter may rerun our pipeline on CourtListener's raw data.
[☝️ Go back to Summary](#summary)
Provider: maas
Created: 2025-08-18



