LavanyaPobbathi/lamus-scotus-legal-arguments
Hugging Face · Updated 2026-01-31 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/LavanyaPobbathi/lamus-scotus-legal-arguments
Resource description:
---
language:
- en
license: mit
task_categories:
- text-classification
tags:
- legal
- argument-mining
- supreme-court
- legal-nlp
- scotus
- law
- nlp
- classification
pretty_name: "LAMUS: Legal Argument Mining from U.S. Supreme Court"
size_categories:
- 1M<n<10M
dataset_info:
  features:
  - name: row_id
    dtype: int64
  - name: sentence
    dtype: string
  - name: case_title
    dtype: string
  - name: citation
    dtype: string
  - name: docket_number
    dtype: string
  - name: source_field
    dtype: string
  - name: court
    dtype: string
  - name: year
    dtype: int64
  - name: source_file
    dtype: string
  - name: Predicted_Label
    dtype: string
  - name: source
    dtype: string
  - name: date_decided
    dtype: string
  - name: url
    dtype: string
  splits:
  - name: scotus_all_courts
    num_examples: 2900083
configs:
- config_name: default
  data_files:
  - split: scotus_all_courts
    path: data/scotus_all_courts.csv
---
# LAMUS: Legal Argument Mining from U.S. Supreme Court
## 📋 Dataset Description
This dataset contains **2,900,083 sentences** from U.S. Supreme Court opinions spanning **1921-2025**, each automatically labeled with one of six legal argument categories by a fine-tuned Llama-3-8B model. The labels are model predictions, not human annotations. The authors describe it as the **largest publicly available labeled dataset** for legal argument mining from U.S. caselaw.
### 🎯 Purpose
The dataset enables:
- **Legal Argument Mining** research
- **Legal Text Classification** model training
- **Temporal Analysis** of judicial writing styles
- **Cross-court Comparison** studies
### 📊 Dataset Statistics
| Metric | Value |
|--------|-------|
| **Total Sentences** | 2,900,083 |
| **Supreme Court Eras** | 8 (1921-2025) |
| **Label Categories** | 6 |
| **Labeling Model Accuracy** | 85.16% (on the held-out Texas criminal court test set) |
| **File Size** | ~987 MB |
## 🏛️ Supreme Court Eras Covered
| Court Era | Chief Justice | Years | Sentences | % of Dataset |
|-----------|---------------|-------|-----------|--------------|
| Burger Court | Warren Burger | 1969-1986 | 809,409 | 27.9% |
| Rehnquist Court | William Rehnquist | 1986-2005 | 673,564 | 23.2% |
| Warren Court | Earl Warren | 1953-1969 | 377,645 | 13.0% |
| Roberts Court | John Roberts | 2005-2025 | 362,891 | 12.5% |
| Hughes Court | Charles E. Hughes | 1930-1941 | 213,122 | 7.3% |
| Vinson Court | Fred Vinson | 1946-1953 | 170,975 | 5.9% |
| Taft Court | William H. Taft | 1921-1930 | 155,066 | 5.3% |
| Stone Court | Harlan F. Stone | 1941-1946 | 137,411 | 4.7% |
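For temporal analysis it can help to map a decision year back to an era. The sketch below copies the year ranges from the table above; `COURT_ERAS` and `candidate_eras` are hypothetical names, not part of the dataset. Because adjacent eras share a boundary year (for example 1969), a year alone can match two eras, so the helper returns every candidate rather than guessing. The `court` column remains the authoritative value for each row.

```python
# Year ranges copied from the era table; boundary years appear in two eras.
COURT_ERAS = {
    "Taft Court": (1921, 1930),
    "Hughes Court": (1930, 1941),
    "Stone Court": (1941, 1946),
    "Vinson Court": (1946, 1953),
    "Warren Court": (1953, 1969),
    "Burger Court": (1969, 1986),
    "Rehnquist Court": (1986, 2005),
    "Roberts Court": (2005, 2025),
}


def candidate_eras(year: int) -> list[str]:
    """Return every era whose listed range contains `year` (hypothetical helper)."""
    return [name for name, (start, end) in COURT_ERAS.items() if start <= year <= end]


print(candidate_eras(1975))  # ['Burger Court']
print(candidate_eras(1969))  # ['Warren Court', 'Burger Court'] (transition year)
```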
## 🏷️ Label Categories
| Label | Description | Count | Percentage |
|-------|-------------|-------|------------|
| **Analysis** | Legal reasoning, interpretation, argumentation | 799,921 | 27.6% |
| **Rule/Law/Holding** | Legal rules, statutes, precedents, holdings | 799,324 | 27.6% |
| **Facts** | Background details, case history, evidence | 763,106 | 26.3% |
| **Others** | Procedural text, citations, headers | 354,784 | 12.2% |
| **Conclusion** | Final decisions, judgments, outcomes | 123,137 | 4.2% |
| **Issue** | Legal questions being addressed | 59,811 | 2.1% |
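As a quick sanity check, the percentages in the table can be recomputed from the counts. This is a minimal sketch that only uses the numbers listed above; `LABEL_COUNTS` and `label_shares` are illustrative names.

```python
# Counts copied from the label table above.
LABEL_COUNTS = {
    "Analysis": 799_921,
    "Rule/Law/Holding": 799_324,
    "Facts": 763_106,
    "Others": 354_784,
    "Conclusion": 123_137,
    "Issue": 59_811,
}


def label_shares(counts: dict[str, int]) -> dict[str, float]:
    """Convert raw counts to percentages rounded to one decimal place."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


print(f"Total: {sum(LABEL_COUNTS.values()):,}")
for label, share in label_shares(LABEL_COUNTS).items():
    print(f"{label}: {share}%")
```

The counts sum to the dataset total of 2,900,083, and the two smallest classes (Conclusion and Issue) together make up under 7% of sentences, which is worth keeping in mind when training on these labels.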
## 📁 Dataset Structure
### Data Fields
| Field | Type | Description |
|-------|------|-------------|
| `row_id` | int64 | Unique identifier |
| `sentence` | string | The legal sentence text |
| `case_title` | string | Name of the case |
| `citation` | string | Legal citation (e.g., "410 U.S. 113") |
| `docket_number` | string | Court docket number |
| `source_field` | string | Part of opinion (syllabus, opinion, etc.) |
| `court` | string | Supreme Court era |
| `year` | int64 | Year of decision |
| `source_file` | string | Original source file |
| `Predicted_Label` | string | ML-predicted argument label |
| `source` | string | Data source |
| `date_decided` | string | Decision date |
| `url` | string | Link to original opinion |
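The field table can be turned into a lightweight row check, for example before feeding rows into a training pipeline. The sketch below is an assumption-laden illustration: `FIELD_TYPES`, `VALID_LABELS`, and `check_row` are hypothetical names, the label strings are assumed to match the category names in the label table exactly, and real rows may contain missing values that this check would report as type mismatches.

```python
# Field names and types copied from the Data Fields table.
FIELD_TYPES = {
    "row_id": int,
    "sentence": str,
    "case_title": str,
    "citation": str,
    "docket_number": str,
    "source_field": str,
    "court": str,
    "year": int,
    "source_file": str,
    "Predicted_Label": str,
    "source": str,
    "date_decided": str,
    "url": str,
}

# Assumption: label strings match the category names in the label table.
VALID_LABELS = {"Analysis", "Rule/Law/Holding", "Facts", "Others", "Conclusion", "Issue"}


def check_row(row: dict) -> list[str]:
    """Return a list of human-readable problems found in one row (empty if none)."""
    problems = []
    for field, expected in FIELD_TYPES.items():
        if field not in row:
            problems.append(f"missing field: {field}")
        elif not isinstance(row[field], expected):
            problems.append(f"{field}: expected {expected.__name__}")
    label = row.get("Predicted_Label")
    if isinstance(label, str) and label not in VALID_LABELS:
        problems.append(f"unknown label: {label}")
    return problems
```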
### Example Entry
```json
{
  "row_id": 0,
  "sentence": "The defendant was convicted of first-degree murder.",
  "case_title": "Smith v. United States",
  "citation": "500 U.S. 123",
  "court": "Rehnquist Court",
  "year": 1991,
  "Predicted_Label": "Facts"
}
```
## 🚀 Usage
### Loading the Dataset
```python
from datasets import load_dataset

# Load the full dataset (a DatasetDict with a single split)
dataset = load_dataset("LavanyaPobbathi/lamus-scotus-legal-arguments")

# Access the data
scotus_data = dataset["scotus_all_courts"]
print(f"Total sentences: {len(scotus_data):,}")

# Example: filter by court era. Dataset.filter returns a Dataset
# rather than a Python list of row dicts.
burger_court = scotus_data.filter(lambda row: row["court"] == "Burger Court")
```
### Loading as Pandas DataFrame
```python
import pandas as pd
from datasets import load_dataset

dataset = load_dataset("LavanyaPobbathi/lamus-scotus-legal-arguments")
df = dataset["scotus_all_courts"].to_pandas()

# Analyze label distribution
print(df["Predicted_Label"].value_counts())
```
### Filter by Court Era
```python
# Get only Roberts Court sentences (2005-2025); `df` comes from the previous example
roberts = df[df["court"] == "Roberts Court"]
print(f"Roberts Court sentences: {len(roberts):,}")
```
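For cross-court comparison, a per-era label distribution is a natural first step. The sketch below is one possible approach and not the authors' analysis code: `label_share_by_court` is a hypothetical helper that works on any iterable of row dicts with `court` and `Predicted_Label` keys, and `sample_rows` is made-up data used only to show the output shape.

```python
from collections import Counter, defaultdict
from typing import Iterable, Mapping


def label_share_by_court(rows: Iterable[Mapping]) -> dict[str, dict[str, float]]:
    """Return, for each court era, the fraction of sentences per predicted label."""
    counts: dict[str, Counter] = defaultdict(Counter)
    for row in rows:
        counts[row["court"]][row["Predicted_Label"]] += 1
    shares = {}
    for court, label_counts in counts.items():
        total = sum(label_counts.values())
        shares[court] = {label: n / total for label, n in label_counts.items()}
    return shares


# Made-up rows for illustration only; real rows come from the dataset.
sample_rows = [
    {"court": "Burger Court", "Predicted_Label": "Facts"},
    {"court": "Burger Court", "Predicted_Label": "Facts"},
    {"court": "Burger Court", "Predicted_Label": "Analysis"},
    {"court": "Burger Court", "Predicted_Label": "Issue"},
    {"court": "Roberts Court", "Predicted_Label": "Analysis"},
    {"court": "Roberts Court", "Predicted_Label": "Rule/Law/Holding"},
]

print(label_share_by_court(sample_rows))
```

Because the helper consumes rows one at a time, it should also work with a streamed dataset (`load_dataset(..., split="scotus_all_courts", streaming=True)`), which avoids holding the full file in memory.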
## 🔬 Methodology
### Labeling Model
The sentences were labeled using a **fine-tuned Llama-3-8B** model with the following specifications:
| Parameter | Value |
|-----------|-------|
| Base Model | Meta-Llama-3-8B-Instruct |
| Method | QLoRA (4-bit quantization) |
| Learning Rate | 2e-4 |
| LoRA Rank | 16 |
| Epochs | 3 |
| **Test Accuracy** | **85.16%** |
### Training Data
The model was trained on **2,585 manually annotated sentences** from Texas criminal court cases and evaluated on **647 held-out test sentences** from the same source.
### Validation
- Stratified train/test split (80/20)
- 6-class classification task
- Macro F1: 0.69, Weighted F1: 0.80
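The gap between macro F1 (0.69) and weighted F1 (0.80) suggests the model does worse on rarer classes, since macro F1 averages per-class scores equally while weighted F1 weights each class by its support. The toy example below illustrates the two averages in plain Python. The labels and predictions are invented for illustration and are not the model's actual outputs; `per_class_f1` and `macro_and_weighted_f1` are hypothetical helper names.

```python
from collections import Counter


def per_class_f1(y_true: list[str], y_pred: list[str]) -> dict[str, float]:
    """Compute F1 for each label as 2*TP / (2*TP + FP + FN)."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_and_weighted_f1(y_true: list[str], y_pred: list[str]) -> tuple[float, float]:
    """Return (macro F1, support-weighted F1)."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    macro = sum(scores.values()) / len(scores)
    weighted = sum(scores[label] * support[label] for label in scores) / len(y_true)
    return macro, weighted


# Invented toy data: a frequent class predicted well, a rare class predicted poorly.
y_true = ["Facts"] * 8 + ["Issue"] * 2
y_pred = ["Facts"] * 8 + ["Facts", "Issue"]

macro, weighted = macro_and_weighted_f1(y_true, y_pred)
print(f"macro F1 = {macro:.3f}, weighted F1 = {weighted:.3f}")
# macro F1 = 0.804, weighted F1 = 0.886
```

In the toy data the rare class drags the macro average down while barely affecting the weighted average, which is the same pattern as in the reported scores.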
## 📈 Key Research Findings
1. **Fine-tuning dramatically outperforms prompting** (+9.27% accuracy)
2. **General-domain LLMs outperform legal-specific models** (surprising finding)
3. **Few-shot prompting decreases accuracy** (important negative result)
4. **Significant domain shift** between trial courts (Facts-heavy) and SCOTUS (Rule/Law-heavy)
## 📚 Related Resources
- **Training Data**: Texas Criminal Cases (available upon request)
- **Model**: Fine-tuned Llama-3-8B (available upon request)
- **Paper**: [Coming Soon - ICAIL 2026]
## 📖 Citation
If you use this dataset in your research, please cite:
```bibtex
@dataset{lamus2026,
title={LAMUS: Legal Argument Mining from U.S. Supreme Court Using Large Language Models},
author={Pobbathi, Lavanya and Wang, Serene and Chen, Haihua},
year={2026},
publisher={Hugging Face},
url={https://huggingface.co/datasets/LavanyaPobbathi/lamus-scotus-legal-arguments}
}
```
## 📄 License
This dataset is released under the [MIT License](https://opensource.org/licenses/MIT).
The underlying Supreme Court opinions are in the public domain as U.S. government works.
## 👥 Authors
- **Lavanya Pobbathi** - University of North Texas
- **Serene Wang** - University of North Texas
- **Haihua Chen** - University of North Texas (Supervisor)
## 📧 Contact
For questions or feedback, please open an issue on this dataset repository or contact the authors.
---
Provided by: LavanyaPobbathi



