multi_lexsum

ModelScope community · updated 2025-11-27 · indexed 2025-05-31
Download link:
https://modelscope.cn/datasets/allenai/multi_lexsum
Resource description:
# Dataset Card for Multi-LexSum

## Table of Contents

- [Dataset Card for Multi-LexSum](#dataset-card-for-multi-lexsum)
  - [Table of Contents](#table-of-contents)
  - [Dataset Description](#dataset-description)
    - [Dataset Summary](#dataset-summary)
    - [Languages](#languages)
  - [Dataset](#dataset)
    - [Data Fields](#data-fields)
    - [Data Splits](#data-splits)
  - [Dataset Sheet (Datasheet)](#dataset-sheet-datasheet)
  - [Additional Information](#additional-information)
    - [Dataset Curators](#dataset-curators)
    - [Licensing Information](#licensing-information)
    - [Citation Information](#citation-information)
  - [Release History](#release-history)

## Dataset Description

- **Homepage:** https://multilexsum.github.io
- **Repository:** https://github.com/multilexsum/dataset
- **Paper:** https://arxiv.org/abs/2206.10883

<p>
<a href="https://multilexsum.github.io" style="display: inline-block;">
<img src="https://img.shields.io/badge/-homepage-informational.svg?logo=jekyll" title="Multi-LexSum Paper" style="margin-top: 0.25rem; margin-bottom: 0.25rem"></a>
<a href="https://github.com/multilexsum/dataset" style="display: inline-block;">
<img src="https://img.shields.io/badge/-multilexsum-lightgrey.svg?logo=github" title="Multi-LexSum Github Repo" style="margin-top: 0.25rem; margin-bottom: 0.25rem"></a>
<a href="https://arxiv.org/abs/2206.10883" style="display: inline-block;">
<img src="https://img.shields.io/badge/NeurIPS-2022-9cf" title="Multi-LexSum is accepted in NeurIPS 2022" style="margin-top: 0.25rem; margin-bottom: 0.25rem"></a>
</p>

### Talk @ NeurIPS 2022

[![Watch the video](https://img.youtube.com/vi/C-fwW_ZhkE8/0.jpg)](https://youtu.be/C-fwW_ZhkE8)

### Dataset Summary

The Multi-LexSum dataset is a collection of 9,280 expert-authored summaries of legal cases. Multi-LexSum is distinct from other datasets in its **multiple target summaries, each at a different granularity** (ranging from one-sentence “extreme” summaries to multi-paragraph narrations of over five hundred words). It presents a challenging multi-document summarization task given **the long length of the source documents**, often exceeding two hundred pages per case. Unlike other summarization datasets that are (semi-)automatically curated, Multi-LexSum consists of **expert-authored summaries**: the experts (lawyers and law students) are trained to follow carefully created guidelines, and their work is reviewed by an additional expert to ensure quality.

### Languages

English

## Dataset

### Data Fields

The dataset contains a list of instances (cases); each instance contains the following data:

|         Field |                                                                      Description |
| ------------: | -------------------------------------------------------------------------------: |
|            id |                                                              `(str)` The case ID |
|       sources | `(List[str])` A list of strings for the text extracted from the source documents |
|  summary/long |                         `(str)` The long (multi-paragraph) summary for this case |
| summary/short |                `(Optional[str])` The short (one-paragraph) summary for this case |
|  summary/tiny |                  `(Optional[str])` The tiny (one-sentence) summary for this case |

The following example shows how to load the data:

```python
from datasets import load_dataset

# Download multi_lexsum locally and load it as a Dataset object
multi_lexsum = load_dataset("allenai/multi_lexsum", name="v20230518")

# The first instance of the dev set
example = multi_lexsum["validation"][0]

# A list of source document text for the case
example["sources"]

# Summaries of three lengths
for sum_len in ["long", "short", "tiny"]:
    print(example["summary/" + sum_len])

# The corresponding metadata for a case, as a dict
print(example["case_metadata"])
```

### Data Splits

|             | Instances | Source Documents (D) | Long Summaries (L) | Short Summaries (S) | Tiny Summaries (T) | Total Summaries |
| ----------: | --------: | -------------------: | -----------------: | ------------------: | -----------------: | --------------: |
| Train (70%) |     3,177 |               28,557 |              3,177 |               2,210 |              1,130 |           6,517 |
|  Test (20%) |       908 |                7,428 |                908 |                 616 |                312 |           1,836 |
|   Dev (10%) |       454 |                4,134 |                454 |                 312 |                161 |             927 |

## Dataset Sheet (Datasheet)

Please check our [dataset sheet](https://multilexsum.github.io/datasheet) for details regarding dataset creation, source data, annotation, and considerations for usage.

## Additional Information

### Dataset Curators

The dataset was created through a collaboration between the Civil Rights Litigation Clearinghouse (CRLC, at the University of Michigan) and the Allen Institute for AI. Multi-LexSum builds on the dataset used and posted by the Clearinghouse to inform the public about civil rights litigation.

### Licensing Information

The Multi-LexSum dataset is distributed under the [Open Data Commons Attribution License (ODC-By)](https://opendatacommons.org/licenses/by/1-0/). The case summaries and metadata are licensed under the [Creative Commons Attribution-NonCommercial License (CC BY-NC)](https://creativecommons.org/licenses/by-nc/4.0/), and the source documents are already in the public domain. Commercial users who want a license for the summaries and metadata can contact [info@clearinghouse.net](mailto:info@clearinghouse.net), which will allow free use but limit re-posting of the summaries. The corresponding code for downloading and loading the dataset is licensed under the Apache License 2.0.

### Citation Information

```
@article{Shen2022MultiLexSum,
  author  = {Zejiang Shen and Kyle Lo and Lauren Yu and Nathan Dahlberg and Margo Schlanger and Doug Downey},
  title   = {Multi-LexSum: Real-World Summaries of Civil Rights Lawsuits at Multiple Granularities},
  journal = {CoRR},
  volume  = {abs/2206.10883},
  year    = {2022},
  url     = {https://doi.org/10.48550/arXiv.2206.10883},
  doi     = {10.48550/arXiv.2206.10883}
}
```

## Release History

|     Version |                                                  Description |
| ----------: | -----------------------------------------------------------: |
| `v20230518` | The v1.1 release including case and source document metadata |
| `v20220616` |                                     The initial v1.0 release |
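
Because `summary/short` and `summary/tiny` are typed `Optional[str]` in the Data Fields table, code that iterates over all three granularities may encounter `None`. The sketch below shows one way to guard against that. It runs on hand-written toy records that only imitate the documented field names; the helper names `available_summaries` and `count_by_granularity` are hypothetical and not part of the dataset's loading code.

```python
from typing import Dict, Iterable, Mapping, Optional

# Field names as listed in the Data Fields table.
SUMMARY_FIELDS = ("summary/long", "summary/short", "summary/tiny")


def available_summaries(instance: Mapping[str, object]) -> Dict[str, str]:
    """Return the summaries that are actually present for one case.

    Keys are the granularity names ("long", "short", "tiny").
    """
    found: Dict[str, str] = {}
    for field in SUMMARY_FIELDS:
        text: Optional[object] = instance.get(field)
        if isinstance(text, str):  # skips None and missing keys
            found[field.split("/", 1)[1]] = text
    return found


def count_by_granularity(instances: Iterable[Mapping[str, object]]) -> Dict[str, int]:
    """Count how many cases carry a summary at each granularity."""
    counts = {"long": 0, "short": 0, "tiny": 0}
    for instance in instances:
        for name in available_summaries(instance):
            counts[name] += 1
    return counts


# Toy records (illustrative only, not real dataset content).
toy_cases = [
    {
        "id": "case-a",
        "sources": ["complaint text", "order text"],
        "summary/long": "A multi-paragraph summary.",
        "summary/short": "A one-paragraph summary.",
        "summary/tiny": "A one-sentence summary.",
    },
    {
        "id": "case-b",
        "sources": ["docket text"],
        "summary/long": "Another multi-paragraph summary.",
        "summary/short": None,
        "summary/tiny": None,
    },
]

print(count_by_granularity(toy_cases))  # {'long': 2, 'short': 1, 'tiny': 1}
```

The same helpers should work on instances returned by `load_dataset`, assuming missing summaries are represented as `None` there, as the `Optional[str]` type suggests.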
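
The figures in the Data Splits table can be cross-checked with a little arithmetic: each row's total should equal its long, short and tiny counts added together, the three totals should add up to the 9,280 summaries quoted in the Dataset Summary, and the instance counts should match the stated 70/20/10 split. The following is a minimal sketch that hard-codes the numbers from the table; the variable and function names are illustrative only.

```python
# Counts copied from the Data Splits table.
SPLITS = {
    "train": {"instances": 3177, "long": 3177, "short": 2210, "tiny": 1130, "total": 6517},
    "test": {"instances": 908, "long": 908, "short": 616, "tiny": 312, "total": 1836},
    "dev": {"instances": 454, "long": 454, "short": 312, "tiny": 161, "total": 927},
}


def row_total(row: dict) -> int:
    """Sum the long, short and tiny summary counts of one split."""
    return row["long"] + row["short"] + row["tiny"]


def split_shares(splits: dict) -> dict:
    """Percentage of all instances that falls in each split, to one decimal."""
    n_instances = sum(row["instances"] for row in splits.values())
    return {
        name: round(100 * row["instances"] / n_instances, 1)
        for name, row in splits.items()
    }


# Every row's "Total Summaries" column matches L + S + T.
for name, row in SPLITS.items():
    assert row_total(row) == row["total"], name

grand_total = sum(row["total"] for row in SPLITS.values())
print(grand_total)           # 9280
print(split_shares(SPLITS))  # {'train': 70.0, 'test': 20.0, 'dev': 10.0}
```

Every instance has a long summary, so the long-summary count equals the instance count in each split, while the short and tiny counts are smaller because those fields are optional.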

Provider: maas
Created: 2025-05-27