multi_lexsum
ModelScope community · updated 2025-11-27 · indexed 2025-05-31
Download link:
https://modelscope.cn/datasets/allenai/multi_lexsum
# Dataset Card for Multi-LexSum
## Table of Contents
- [Dataset Card for Multi-LexSum](#dataset-card-for-multi-lexsum)
  - [Table of Contents](#table-of-contents)
  - [Dataset Description](#dataset-description)
    - [Dataset Summary](#dataset-summary)
    - [Languages](#languages)
  - [Dataset](#dataset)
    - [Data Fields](#data-fields)
    - [Data Splits](#data-splits)
  - [Dataset Sheet (Datasheet)](#dataset-sheet-datasheet)
  - [Additional Information](#additional-information)
    - [Dataset Curators](#dataset-curators)
    - [Licensing Information](#licensing-information)
    - [Citation Information](#citation-information)
  - [Release History](#release-history)
## Dataset Description
- **Homepage:** https://multilexsum.github.io
- **Repository:** https://github.com/multilexsum/dataset
- **Paper:** https://arxiv.org/abs/2206.10883
<p>
<a href="https://multilexsum.github.io" style="display: inline-block;">
<img src="https://img.shields.io/badge/-homepage-informational.svg?logo=jekyll" title="Multi-LexSum Homepage" style="margin-top: 0.25rem; margin-bottom: 0.25rem"></a>
<a href="https://github.com/multilexsum/dataset" style="display: inline-block;">
<img src="https://img.shields.io/badge/-multilexsum-lightgrey.svg?logo=github" title="Multi-LexSum Github Repo" style="margin-top: 0.25rem; margin-bottom: 0.25rem"></a>
<a href="https://arxiv.org/abs/2206.10883" style="display: inline-block;">
<img src="https://img.shields.io/badge/NeurIPS-2022-9cf" title="Multi-LexSum was accepted at NeurIPS 2022" style="margin-top: 0.25rem; margin-bottom: 0.25rem"></a>
</p>
### Talk @ NeurIPS 2022
[Watch the talk on YouTube](https://youtu.be/C-fwW_ZhkE8)
### Dataset Summary
The Multi-LexSum dataset is a collection of 9,280 legal case summaries. Multi-LexSum is distinct from other datasets in its **multiple target summaries, each at a different granularity** (ranging from one-sentence “extreme” summaries to multi-paragraph narrations of over five hundred words). It presents a challenging multi-document summarization task given **the long length of the source documents**, which often exceed two hundred pages per case. Unlike other summarization datasets that are (semi-)automatically curated, Multi-LexSum consists of **expert-authored summaries**: the experts (lawyers and law students) are trained to follow carefully created guidelines, and their work is reviewed by an additional expert to ensure quality.
### Languages
English
## Dataset
### Data Fields
The dataset contains a list of instances (cases); each instance contains the following data:
| Field | Description |
| ------------: | -------------------------------------------------------------------------------: |
| id | `(str)` The case ID |
| sources | `(List[str])` A list of strings for the text extracted from the source documents |
| summary/long | `(str)` The long (multi-paragraph) summary for this case |
| summary/short | `(Optional[str])` The short (one-paragraph) summary for this case |
| summary/tiny | `(Optional[str])` The tiny (one-sentence) summary for this case |
The following example shows how to load the data:
```python
from datasets import load_dataset

# Download multi_lexsum locally and load it as a DatasetDict (one Dataset per split)
multi_lexsum = load_dataset("allenai/multi_lexsum", name="v20230518")

example = multi_lexsum["validation"][0]  # The first instance of the dev set
example["sources"]  # A list of source document text for the case

for sum_len in ["long", "short", "tiny"]:
    print(example["summary/" + sum_len])  # Summaries of three lengths (short/tiny may be None)

print(example["case_metadata"])  # The corresponding metadata for a case in a dict
```
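Because `summary/short` and `summary/tiny` are optional, code that consumes the dataset usually needs to skip missing targets. The following is a minimal sketch of one way to do that, not part of the official loader: the helper name `iter_summary_pairs`, the paragraph separator used to join sources, and the toy instance are all assumptions made for illustration. It only relies on the field names listed in the table above.

```python
from typing import Iterator, Tuple

# Summary granularities, matching the "summary/<name>" fields in the table above.
GRANULARITIES = ("long", "short", "tiny")


def iter_summary_pairs(
    instance: dict, separator: str = "\n\n"
) -> Iterator[Tuple[str, str, str]]:
    """Yield (granularity, joined_source_text, summary) for each summary present.

    Hypothetical helper: joining all sources with a blank line is an
    illustrative choice, not something the dataset prescribes.
    """
    source_text = separator.join(instance["sources"])
    for granularity in GRANULARITIES:
        summary = instance.get("summary/" + granularity)
        if summary:  # skip None (or empty) optional summaries
            yield granularity, source_text, summary


# Toy instance with made-up text; real instances come from load_dataset(...).
toy_instance = {
    "id": "toy-case",
    "sources": ["Complaint text.", "Court order text."],
    "summary/long": "A long summary.",
    "summary/short": "A short summary.",
    "summary/tiny": None,
}

pairs = list(iter_summary_pairs(toy_instance))
print([granularity for granularity, _, _ in pairs])
```

With a real split, the same helper could be applied to each instance, for example `for instance in multi_lexsum["validation"]`, to build one training pair per available granularity.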
### Data Splits
| | Instances | Source Documents (D) | Long Summaries (L) | Short Summaries (S) | Tiny Summaries (T) | Total Summaries |
| ----------: | --------: | -------------------: | -----------------: | ------------------: | -----------------: | --------------: |
| Train (70%) | 3,177 | 28,557 | 3,177 | 2,210 | 1,130 | 6,517 |
| Test (20%) | 908 | 7,428 | 908 | 616 | 312 | 1,836 |
| Dev (10%) | 454 | 4,134 | 454 | 312 | 161 | 927 |
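As a quick consistency check, the counts in the table can be summed to confirm that they match the 9,280 summaries quoted in the Dataset Summary. The sketch below simply re-enters the table values by hand; the dictionary layout and variable names are illustrative assumptions.

```python
# Counts copied from the Data Splits table above.
SPLITS = {
    "train": {"instances": 3177, "long": 3177, "short": 2210, "tiny": 1130},
    "test": {"instances": 908, "long": 908, "short": 616, "tiny": 312},
    "dev": {"instances": 454, "long": 454, "short": 312, "tiny": 161},
}


def total_summaries(counts: dict) -> int:
    """Sum the long, short, and tiny summary counts for one split."""
    return counts["long"] + counts["short"] + counts["tiny"]


totals = {name: total_summaries(counts) for name, counts in SPLITS.items()}
grand_total = sum(totals.values())
total_instances = sum(counts["instances"] for counts in SPLITS.values())

# Fraction of cases in each split that have a short or a tiny summary.
coverage = {
    name: {
        kind: counts[kind] / counts["instances"] for kind in ("short", "tiny")
    }
    for name, counts in SPLITS.items()
}

print(totals)           # per-split totals, matching the "Total Summaries" column
print(grand_total)      # 9280
print(total_instances)  # 4539
```

Every case has a long summary, while the `coverage` ratios show that only a subset of cases in each split carries a short or tiny summary, which is why those fields are typed as optional.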
## Dataset Sheet (Datasheet)
Please see our [dataset sheet](https://multilexsum.github.io/datasheet) for details on dataset creation, source data, annotation, and usage considerations.
## Additional Information
### Dataset Curators
The dataset was created through a collaboration between the Civil Rights Litigation Clearinghouse (CRLC, at the University of Michigan) and the Allen Institute for AI. Multi-LexSum builds on the dataset used and posted by the Clearinghouse to inform the public about civil rights litigation.
### Licensing Information
The Multi-LexSum dataset is distributed under the [Open Data Commons Attribution License (ODC-By)](https://opendatacommons.org/licenses/by/1-0/).
The case summaries and metadata are licensed under the [Creative Commons Attribution-NonCommercial License (CC BY-NC)](https://creativecommons.org/licenses/by-nc/4.0/), and the source documents are already in the public domain.
Commercial users who want a license for the summaries and metadata can contact [info@clearinghouse.net](mailto:info@clearinghouse.net); the license allows free use but limits re-posting of summaries.
The corresponding code for downloading and loading the dataset is licensed under the Apache License 2.0.
### Citation Information
```
@article{Shen2022MultiLexSum,
author = {Zejiang Shen and
Kyle Lo and
Lauren Yu and
Nathan Dahlberg and
Margo Schlanger and
Doug Downey},
title = {Multi-LexSum: Real-World Summaries of Civil Rights Lawsuits at Multiple Granularities},
journal = {CoRR},
volume = {abs/2206.10883},
year = {2022},
url = {https://doi.org/10.48550/arXiv.2206.10883},
doi = {10.48550/arXiv.2206.10883}
}
```
## Release History
| Version | Description |
| ----------: | -----------------------------------------------------------: |
| `v20230518` | The v1.1 release including case and source document metadata |
| `v20220616` | The initial v1.0 release |
Provider: maas
Created: 2025-05-27



