LavanyaPobbathi/lamus-scotus-legal-arguments

Name: LavanyaPobbathi/lamus-scotus-legal-arguments
Creator: LavanyaPobbathi
Published: 2026-01-31 08:53:57
License: No description available

Hugging Face · Updated 2026-01-31 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/LavanyaPobbathi/lamus-scotus-legal-arguments


Description:

---
language:
- en
license: mit
task_categories:
- text-classification
tags:
- legal
- argument-mining
- supreme-court
- legal-nlp
- scotus
- law
- nlp
- classification
pretty_name: "LAMUS: Legal Argument Mining from U.S. Supreme Court"
size_categories:
- 1M<n<10M
dataset_info:
  features:
  - name: row_id
    dtype: int64
  - name: sentence
    dtype: string
  - name: case_title
    dtype: string
  - name: citation
    dtype: string
  - name: docket_number
    dtype: string
  - name: source_field
    dtype: string
  - name: court
    dtype: string
  - name: year
    dtype: int64
  - name: source_file
    dtype: string
  - name: Predicted_Label
    dtype: string
  - name: source
    dtype: string
  - name: date_decided
    dtype: string
  - name: url
    dtype: string
  splits:
  - name: scotus_all_courts
    num_examples: 2900083
configs:
- config_name: default
  data_files:
  - split: scotus_all_courts
    path: data/scotus_all_courts.csv
---

# LAMUS: Legal Argument Mining from U.S. Supreme Court

<div align="center">

[![Dataset](https://img.shields.io/badge/Dataset-2.9M%20sentences-blue)](https://huggingface.co/datasets/LavanyaPobbathi/lamus-scotus-legal-arguments)
[![License](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT)
[![Task](https://img.shields.io/badge/Task-Text%20Classification-orange)](https://huggingface.co/tasks/text-classification)

</div>

## 📋 Dataset Description

This dataset contains **2,900,083 sentences** from U.S. Supreme Court opinions spanning **1921-2025**, automatically labeled with legal argument categories. This is the **largest publicly available labeled dataset** for legal argument mining from U.S. caselaw.
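The metadata above declares a single CSV-backed split with a fixed column list. As a minimal sketch, assuming a local copy of that CSV, the file can be streamed row by row with the standard library instead of being loaded into memory at once. The function name `tally_labels` and the sample rows are hypothetical and invented for illustration; only the column names come from the metadata.

```python
import csv
import io
from collections import Counter

# Column names copied from the dataset_info.features list in the metadata.
EXPECTED_COLUMNS = [
    "row_id", "sentence", "case_title", "citation", "docket_number",
    "source_field", "court", "year", "source_file", "Predicted_Label",
    "source", "date_decided", "url",
]


def tally_labels(csv_file):
    """Stream a CSV file object and count (court, Predicted_Label) pairs."""
    reader = csv.DictReader(csv_file)
    missing = set(EXPECTED_COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    counts = Counter()
    for row in reader:
        counts[(row["court"], row["Predicted_Label"])] += 1
    return counts


# Invented sample rows standing in for data/scotus_all_courts.csv.
sample = io.StringIO()
writer = csv.DictWriter(sample, fieldnames=EXPECTED_COLUMNS)
writer.writeheader()
for i, (court, label) in enumerate([
    ("Burger Court", "Facts"),
    ("Burger Court", "Analysis"),
    ("Roberts Court", "Facts"),
]):
    row = dict.fromkeys(EXPECTED_COLUMNS, "")
    row.update(row_id=i, sentence="placeholder", court=court, Predicted_Label=label)
    writer.writerow(row)
sample.seek(0)

counts = tally_labels(sample)
print(counts[("Burger Court", "Facts")])
```

With a real download, the same function could be called on `open("data/scotus_all_courts.csv", newline="", encoding="utf-8")`; the encoding is an assumption, since the card does not state it.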
### 🎯 Purpose

The dataset enables:

- **Legal Argument Mining** research
- **Legal Text Classification** model training
- **Temporal Analysis** of judicial writing styles
- **Cross-court Comparison** studies

### 📊 Dataset Statistics

| Metric | Value |
|--------|-------|
| **Total Sentences** | 2,900,083 |
| **Supreme Court Eras** | 8 (1921-2025) |
| **Label Categories** | 6 |
| **Labeling Model Accuracy** | 85.16% (held-out test set) |
| **File Size** | ~987 MB |

## 🏛️ Supreme Court Eras Covered

| Court Era | Chief Justice | Years | Sentences | % of Dataset |
|-----------|---------------|-------|-----------|--------------|
| Burger Court | Warren Burger | 1969-1986 | 809,409 | 27.9% |
| Rehnquist Court | William Rehnquist | 1986-2005 | 673,564 | 23.2% |
| Warren Court | Earl Warren | 1953-1969 | 377,645 | 13.0% |
| Roberts Court | John Roberts | 2005-2025 | 362,891 | 12.5% |
| Hughes Court | Charles E. Hughes | 1930-1941 | 213,122 | 7.3% |
| Vinson Court | Fred Vinson | 1946-1953 | 170,975 | 5.9% |
| Taft Court | William H. Taft | 1921-1930 | 155,066 | 5.3% |
| Stone Court | Harlan F. Stone | 1941-1946 | 137,411 | 4.7% |

## 🏷️ Label Categories

| Label | Description | Count | Percentage |
|-------|-------------|-------|------------|
| **Analysis** | Legal reasoning, interpretation, argumentation | 799,921 | 27.6% |
| **Rule/Law/Holding** | Legal rules, statutes, precedents, holdings | 799,324 | 27.6% |
| **Facts** | Background details, case history, evidence | 763,106 | 26.3% |
| **Others** | Procedural text, citations, headers | 354,784 | 12.2% |
| **Conclusion** | Final decisions, judgments, outcomes | 123,137 | 4.2% |
| **Issue** | Legal questions being addressed | 59,811 | 2.1% |

## 📁 Dataset Structure

### Data Fields

| Field | Type | Description |
|-------|------|-------------|
| `row_id` | int64 | Unique identifier |
| `sentence` | string | The legal sentence text |
| `case_title` | string | Name of the case |
| `citation` | string | Legal citation (e.g., "410 U.S. 113") |
| `docket_number` | string | Court docket number |
| `source_field` | string | Part of opinion (syllabus, opinion, etc.) |
| `court` | string | Supreme Court era |
| `year` | int64 | Year of decision |
| `source_file` | string | Original source file |
| `Predicted_Label` | string | ML-predicted argument label |
| `source` | string | Data source |
| `date_decided` | string | Decision date |
| `url` | string | Link to original opinion |

### Example Entry

```json
{
  "row_id": 0,
  "sentence": "The defendant was convicted of first-degree murder.",
  "case_title": "Smith v. United States",
  "citation": "500 U.S. 123",
  "court": "Rehnquist Court",
  "year": 1991,
  "Predicted_Label": "Facts"
}
```

## 🚀 Usage

### Loading the Dataset

```python
from datasets import load_dataset

# Load the full dataset
dataset = load_dataset("LavanyaPobbathi/lamus-scotus-legal-arguments")

# Access the data (the only split is "scotus_all_courts")
scotus_data = dataset["scotus_all_courts"]
print(f"Total sentences: {len(scotus_data):,}")

# Example: filter by court era; Dataset.filter avoids building a
# Python list of all 2.9M rows
burger_court = scotus_data.filter(lambda row: row["court"] == "Burger Court")
```

### Loading as Pandas DataFrame

```python
import pandas as pd
from datasets import load_dataset

dataset = load_dataset("LavanyaPobbathi/lamus-scotus-legal-arguments")
df = dataset["scotus_all_courts"].to_pandas()

# Analyze label distribution
print(df["Predicted_Label"].value_counts())
```

### Filter by Court Era

```python
# Uses the DataFrame `df` built in the previous snippet
# Get only Roberts Court (2005-2025)
roberts = df[df["court"] == "Roberts Court"]
print(f"Roberts Court sentences: {len(roberts):,}")
```

## 🔬 Methodology

### Labeling Model

The sentences were labeled using a **fine-tuned Llama-3-8B** model with the following specifications:

| Parameter | Value |
|-----------|-------|
| Base Model | Meta-Llama-3-8B-Instruct |
| Method | QLoRA (4-bit quantization) |
| Learning Rate | 2e-4 |
| LoRA Rank | 16 |
| Epochs | 3 |
| **Test Accuracy** | **85.16%** |

### Training Data

The model was trained on **2,585 manually annotated sentences** from Texas criminal court cases and evaluated on **647 held-out test sentences**.

### Validation

- Stratified train/test split (80/20)
- 6-class classification task
- Macro F1: 0.69, Weighted F1: 0.80

## 📈 Key Research Findings

1. **Fine-tuning dramatically outperforms prompting** (+9.27% accuracy)
2. **General-domain LLMs outperform legal-specific models** (surprising finding)
3. **Few-shot prompting decreases accuracy** (important negative result)
4. **Significant domain shift** between trial courts (Facts-heavy) and SCOTUS (Rule/Law-heavy)

## 📚 Related Resources

- **Training Data**: Texas Criminal Cases (available upon request)
- **Model**: Fine-tuned Llama-3-8B (available upon request)
- **Paper**: [Coming Soon - ICAIL 2026]

## 📖 Citation

If you use this dataset in your research, please cite:

```bibtex
@dataset{lamus2026,
  title={LAMUS: Legal Argument Mining from U.S. Supreme Court Using Large Language Models},
  author={Pobbathi, Lavanya and Wang, Serene and Chen, Haihua},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/datasets/LavanyaPobbathi/lamus-scotus-legal-arguments}
}
```

## 📄 License

This dataset is released under the [MIT License](https://opensource.org/licenses/MIT). The underlying Supreme Court opinions are in the public domain as U.S. government works.

## 👥 Authors

- **Lavanya Pobbathi** - University of North Texas
- **Serene Wang** - University of North Texas
- **Haihua Chen** - University of North Texas (Supervisor)

## 📧 Contact

For questions or feedback, please open an issue on this dataset repository or contact the authors.

---
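The Validation section reports a macro F1 (0.69) below the weighted F1 (0.80). A gap in that direction is what one would expect if the rarer classes score lower than the frequent ones, though the card does not report per-class scores. The following sketch shows how the two averages differ on an imbalanced toy example. The labels and predictions are invented for illustration and are not drawn from the dataset, and the helper names are hypothetical.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return a dict mapping each label to its F1 score."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_and_weighted_f1(y_true, y_pred):
    """Macro F1 averages classes equally; weighted F1 weights by support."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    macro = sum(scores.values()) / len(scores)
    weighted = sum(scores[label] * support[label] for label in scores) / len(y_true)
    return macro, weighted


# Invented toy data: a frequent class ("Facts") and a rare class ("Issue").
y_true = ["Facts"] * 8 + ["Issue"] * 2
y_pred = ["Facts"] * 8 + ["Facts", "Issue"]

macro, weighted = macro_and_weighted_f1(y_true, y_pred)
print(f"macro F1 = {macro:.3f}, weighted F1 = {weighted:.3f}")
```

Here the one misclassified "Issue" sentence pulls the macro average down more than the weighted average, because the rare class counts as much as the frequent one in the macro score.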

Provider:

LavanyaPobbathi
