Download link:

https://modelscope.cn/datasets/allenai/multi_lexsum

Resource overview:

# Dataset Card for Multi-LexSum

## Table of Contents

- [Dataset Card for Multi-LexSum](#dataset-card-for-multi-lexsum)
  - [Table of Contents](#table-of-contents)
  - [Dataset Description](#dataset-description)
    - [Dataset Summary](#dataset-summary)
    - [Languages](#languages)
  - [Dataset](#dataset)
    - [Data Fields](#data-fields)
    - [Data Splits](#data-splits)
  - [Dataset Sheet (Datasheet)](#dataset-sheet-datasheet)
  - [Additional Information](#additional-information)
    - [Dataset Curators](#dataset-curators)
    - [Licensing Information](#licensing-information)
    - [Citation Information](#citation-information)
  - [Release History](#release-history)

## Dataset Description

- **Homepage:** https://multilexsum.github.io
- **Repository:** https://github.com/multilexsum/dataset
- **Paper:** https://arxiv.org/abs/2206.10883

<p>
<a href="https://multilexsum.github.io" style="display: inline-block;">
<img src="https://img.shields.io/badge/-homepage-informational.svg?logo=jekyll" title="Multi-LexSum Paper" style="margin-top: 0.25rem; margin-bottom: 0.25rem"></a>
<a href="https://github.com/multilexsum/dataset" style="display: inline-block;">
<img src="https://img.shields.io/badge/-multilexsum-lightgrey.svg?logo=github" title="Multi-LexSum Github Repo" style="margin-top: 0.25rem; margin-bottom: 0.25rem"></a>
<a href="https://arxiv.org/abs/2206.10883" style="display: inline-block;">
<img src="https://img.shields.io/badge/NeurIPS-2022-9cf" title="Multi-LexSum is accepted in NeurIPS 2022" style="margin-top: 0.25rem; margin-bottom: 0.25rem"></a>
</p>

### Talk @ NeurIPS 2022

[![Watch the video](https://img.youtube.com/vi/C-fwW_ZhkE8/0.jpg)](https://youtu.be/C-fwW_ZhkE8)

### Dataset Summary

The Multi-LexSum dataset is a collection of 9,280 summaries of civil rights lawsuits, covering 4,539 cases. Multi-LexSum is distinct from other datasets in its **multiple target summaries, each at a different granularity** (ranging from one-sentence “extreme” summaries to multi-paragraph narrations of over five hundred words). It presents a challenging multi-document summarization task given **the long length of the source documents**, often exceeding two hundred pages per case. Unlike other summarization datasets that are (semi-)automatically curated, Multi-LexSum consists of **expert-authored summaries**: the experts (lawyers and law students) are trained to follow carefully created guidelines, and their work is reviewed by an additional expert to ensure quality.

### Languages

English

## Dataset

### Data Fields

The dataset contains a list of instances (cases); each instance contains the following data:

|         Field |                                                                      Description |
| ------------: | -------------------------------------------------------------------------------: |
|            id |                                                              `(str)` The case ID |
|       sources | `(List[str])` A list of strings for the text extracted from the source documents |
|  summary/long |                         `(str)` The long (multi-paragraph) summary for this case |
| summary/short |                `(Optional[str])` The short (one-paragraph) summary for this case |
|  summary/tiny |                  `(Optional[str])` The tiny (one-sentence) summary for this case |

Please check the exemplar usage below for loading the data:

```python
from datasets import load_dataset

multi_lexsum = load_dataset("allenai/multi_lexsum", name="v20230518")
# Download multi_lexsum locally and load it as a Dataset object

example = multi_lexsum["validation"][0]  # The first instance of the dev set
example["sources"]  # A list of source document text for the case

for sum_len in ["long", "short", "tiny"]:
    print(example["summary/" + sum_len])  # Summaries of three lengths

# The corresponding metadata for a case in a dict (included in the v20230518 release)
print(example["case_metadata"])
```

### Data Splits

|             | Instances | Source Documents (D) | Long Summaries (L) | Short Summaries (S) | Tiny Summaries (T) | Total Summaries |
| ----------: | --------: | -------------------: | -----------------: | ------------------: | -----------------: | --------------: |
| Train (70%) |     3,177 |               28,557 |              3,177 |               2,210 |              1,130 |           6,517 |
|  Test (20%) |       908 |                7,428 |                908 |                 616 |                312 |           1,836 |
|   Dev (10%) |       454 |                4,134 |                454 |                 312 |                161 |             927 |

## Dataset Sheet (Datasheet)

Please check our [dataset sheet](https://multilexsum.github.io/datasheet) for details regarding dataset creation, source data, annotation, and considerations for the usage.

## Additional Information

### Dataset Curators

The dataset was created through a collaboration between the Civil Rights Litigation Clearinghouse (CRLC, at the University of Michigan) and the Allen Institute for AI. Multi-LexSum builds on the dataset used and posted by the Clearinghouse to inform the public about civil rights litigation.

### Licensing Information

The Multi-LexSum dataset is distributed under the [Open Data Commons Attribution License (ODC-By)](https://opendatacommons.org/licenses/by/1-0/). The case summaries and metadata are licensed under the [Creative Commons Attribution-NonCommercial License (CC BY-NC)](https://creativecommons.org/licenses/by-nc/4.0/), and the source documents are already in the public domain. Commercial users who want a license for the summaries and metadata can contact [info@clearinghouse.net](mailto:info@clearinghouse.net); that license will allow free use but limit re-posting of the summaries. The corresponding code for downloading and loading the dataset is licensed under the Apache License 2.0.

### Citation Information

```
@article{Shen2022MultiLexSum,
  author  = {Zejiang Shen and Kyle Lo and Lauren Yu and Nathan Dahlberg and Margo Schlanger and Doug Downey},
  title   = {Multi-LexSum: Real-World Summaries of Civil Rights Lawsuits at Multiple Granularities},
  journal = {CoRR},
  volume  = {abs/2206.10883},
  year    = {2022},
  url     = {https://doi.org/10.48550/arXiv.2206.10883},
  doi     = {10.48550/arXiv.2206.10883}
}
```

## Release History

|     Version |                                                  Description |
| ----------: | -----------------------------------------------------------: |
| `v20230518` | The v1.1 release including case and source document metadata |
| `v20220616` |                                     The initial v1.0 release |
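As the Data Fields table notes, `summary/short` and `summary/tiny` are optional, so code that loops over all three granularities should expect missing values. The sketch below shows one way to guard against that. `available_summaries` is a hypothetical helper (it is not part of the dataset's loading code), `toy_case` is a made-up instance for illustration, and the code assumes that a missing summary is stored as `None` or that its key is absent.

```python
from typing import Dict, Mapping, Optional

GRANULARITIES = ("long", "short", "tiny")


def available_summaries(example: Mapping[str, Optional[object]]) -> Dict[str, str]:
    """Return {granularity: summary text} for the summaries present in one instance.

    Assumption: an absent summary is stored as None (or its key is missing).
    """
    found = {}
    for name in GRANULARITIES:
        text = example.get("summary/" + name)
        if text:  # skips None and empty strings
            found[name] = text
    return found


# Made-up instance that mimics the documented fields; not real data.
toy_case = {
    "id": "toy-0001",
    "sources": ["Complaint text ...", "Order text ..."],
    "summary/long": "A multi-paragraph narrative of the case.",
    "summary/short": "A one-paragraph summary.",
    "summary/tiny": None,
}

for name, text in available_summaries(toy_case).items():
    print(name, "->", len(text.split()), "words")
```

In practice the same helper could be applied to an instance such as `multi_lexsum["validation"][0]`, since indexing a loaded split returns a plain dictionary.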

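The counts in the Data Splits table can be cross-checked with a few lines of arithmetic. The snippet below hard-codes the numbers from that table (the `SPLITS` name and tuple layout are illustrative only) and checks that each row's total equals L + S + T, that the totals add up to the 9,280 summaries quoted in the Dataset Summary, and what share of cases falls in each split.

```python
# Counts copied from the Data Splits table: (instances, long, short, tiny, total)
SPLITS = {
    "train": (3177, 3177, 2210, 1130, 6517),
    "test": (908, 908, 616, 312, 1836),
    "dev": (454, 454, 312, 161, 927),
}

total_cases = sum(row[0] for row in SPLITS.values())
total_summaries = sum(row[4] for row in SPLITS.values())

for name, (cases, long_n, short_n, tiny_n, total) in SPLITS.items():
    # Every case has a long summary; short and tiny summaries are optional.
    assert cases == long_n
    assert long_n + short_n + tiny_n == total
    print(f"{name}: {cases / total_cases:.1%} of cases")

print("cases:", total_cases)          # 4539
print("summaries:", total_summaries)  # 9280
```

The check makes explicit that 9,280 counts summaries rather than cases: the three splits together hold 4,539 cases.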

Application scenarios: