allenai/multi_lexsum

Name: allenai/multi_lexsum
Creator: allenai
Published: 2023-05-18 21:41:22
License: (no description available)

Source: Hugging Face | Updated: 2023-05-18 | Indexed: 2024-03-04

Download link:

https://hf-mirror.com/datasets/allenai/multi_lexsum

Resource description:

---
annotations_creators:
- expert-generated
language:
- en
language_creators:
- found
license:
- odc-by
multilinguality:
- monolingual
pretty_name: Multi-LexSum
size_categories:
- 1K<n<10K
- 10K<n<100K
source_datasets:
- original
tags: []
task_categories:
- summarization
task_ids: []
---

# Dataset Card for Multi-LexSum

## Table of Contents

- [Dataset Card for Multi-LexSum](#dataset-card-for-multi-lexsum)
  - [Table of Contents](#table-of-contents)
  - [Dataset Description](#dataset-description)
    - [Dataset Summary](#dataset-summary)
    - [Languages](#languages)
  - [Dataset](#dataset)
    - [Data Fields](#data-fields)
    - [Data Splits](#data-splits)
  - [Dataset Sheet (Datasheet)](#dataset-sheet-datasheet)
  - [Additional Information](#additional-information)
    - [Dataset Curators](#dataset-curators)
    - [Licensing Information](#licensing-information)
    - [Citation Information](#citation-information)
  - [Release History](#release-history)

## Dataset Description

- **Homepage:** https://multilexsum.github.io
- **Repository:** https://github.com/multilexsum/dataset
- **Paper:** https://arxiv.org/abs/2206.10883

<p>
<a href="https://multilexsum.github.io" style="display: inline-block;">
<img src="https://img.shields.io/badge/-homepage-informational.svg?logo=jekyll" title="Multi-LexSum Paper" style="margin-top: 0.25rem; margin-bottom: 0.25rem"></a>
<a href="https://github.com/multilexsum/dataset" style="display: inline-block;">
<img src="https://img.shields.io/badge/-multilexsum-lightgrey.svg?logo=github" title="Multi-LexSum Github Repo" style="margin-top: 0.25rem; margin-bottom: 0.25rem"></a>
<a href="https://arxiv.org/abs/2206.10883" style="display: inline-block;">
<img src="https://img.shields.io/badge/NeurIPS-2022-9cf" title="Multi-LexSum is accepted in NeurIPS 2022" style="margin-top: 0.25rem; margin-bottom: 0.25rem"></a>
</p>

### Talk @ NeurIPS 2022

[![Watch the video](https://img.youtube.com/vi/C-fwW_ZhkE8/0.jpg)](https://youtu.be/C-fwW_ZhkE8)

### Dataset Summary

The Multi-LexSum dataset is a collection of 9,280 expert-authored summaries of civil rights lawsuits. Multi-LexSum is distinct from other datasets in its **multiple target summaries, each at a different granularity** (ranging from one-sentence “extreme” summaries to multi-paragraph narrations of over five hundred words). It presents a challenging multi-document summarization task given **the long length of the source documents**, often exceeding two hundred pages per case. Unlike other summarization datasets that are (semi-)automatically curated, Multi-LexSum consists of **expert-authored summaries**: the experts (lawyers and law students) are trained to follow carefully created guidelines, and their work is reviewed by an additional expert to ensure quality.

### Languages

English

## Dataset

### Data Fields

The dataset contains a list of instances (cases); each instance contains the following data:

|         Field |                                                                      Description |
| ------------: | -------------------------------------------------------------------------------: |
|            id |                                                              `(str)` The case ID |
|       sources | `(List[str])` A list of strings for the text extracted from the source documents |
|  summary/long |                         `(str)` The long (multi-paragraph) summary for this case |
| summary/short |                `(Optional[str])` The short (one-paragraph) summary for this case |
|  summary/tiny |                  `(Optional[str])` The tiny (one-sentence) summary for this case |

The example below shows how to load the data:

```python
from datasets import load_dataset

# Download multi_lexsum locally and load it as a Dataset object
multi_lexsum = load_dataset("allenai/multi_lexsum", name="v20230518")

example = multi_lexsum["validation"][0]  # The first instance of the dev set
example["sources"]  # A list of source document text for the case

for sum_len in ["long", "short", "tiny"]:
    print(example["summary/" + sum_len])  # Summaries of three lengths

print(example["case_metadata"])  # The corresponding metadata for a case in a dict
```

### Data Splits

|             | Instances | Source Documents (D) | Long Summaries (L) | Short Summaries (S) | Tiny Summaries (T) | Total Summaries |
| ----------: | --------: | -------------------: | -----------------: | ------------------: | -----------------: | --------------: |
| Train (70%) |     3,177 |               28,557 |              3,177 |               2,210 |              1,130 |           6,517 |
|  Test (20%) |       908 |                7,428 |                908 |                 616 |                312 |           1,836 |
|   Dev (10%) |       454 |                4,134 |                454 |                 312 |                161 |             927 |

## Dataset Sheet (Datasheet)

Please see our [dataset sheet](https://multilexsum.github.io/datasheet) for details on dataset creation, source data, annotation, and considerations for use.

## Additional Information

### Dataset Curators

The dataset was created through a collaboration between the Civil Rights Litigation Clearinghouse (CRLC, at the University of Michigan) and the Allen Institute for AI. Multi-LexSum builds on the dataset used and posted by the Clearinghouse to inform the public about civil rights litigation.

### Licensing Information

The Multi-LexSum dataset is distributed under the [Open Data Commons Attribution License (ODC-By)](https://opendatacommons.org/licenses/by/1-0/). The case summaries and metadata are licensed under the [Creative Commons Attribution-NonCommercial License (CC BY-NC)](https://creativecommons.org/licenses/by-nc/4.0/), and the source documents are already in the public domain. Commercial users who want a license for the summaries and metadata can contact [info@clearinghouse.net](mailto:info@clearinghouse.net); such a license will allow free use but limit summary re-posting. The corresponding code for downloading and loading the dataset is licensed under the Apache License 2.0.

### Citation Information

```
@article{Shen2022MultiLexSum,
  author  = {Zejiang Shen and Kyle Lo and Lauren Yu and Nathan Dahlberg and Margo Schlanger and Doug Downey},
  title   = {Multi-LexSum: Real-World Summaries of Civil Rights Lawsuits at Multiple Granularities},
  journal = {CoRR},
  volume  = {abs/2206.10883},
  year    = {2022},
  url     = {https://doi.org/10.48550/arXiv.2206.10883},
  doi     = {10.48550/arXiv.2206.10883}
}
```

## Release History

|     Version |                                                  Description |
| ----------: | -----------------------------------------------------------: |
| `v20230518` | The v1.1 release including case and source document metadata |
| `v20220616` |                                     The initial v1.0 release |

Provided by:

allenai

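Summary of original information

Dataset overview

Dataset name

Multi-LexSum

Language

English

Dataset size

1K<n<10K
10K<n<100K

Data source

Original data

Task category

Summarization

Dataset contents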

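Data fields
- id: the case ID (string).
- sources: the text extracted from the source documents (list of strings).
- summary/long: the long (multi-paragraph) summary (string).
- summary/short: the short (one-paragraph) summary (optional string).
- summary/tiny: the tiny (one-sentence) summary (optional string).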
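
Because `summary/short` and `summary/tiny` are optional, code that consumes an instance may need to skip summaries that are missing. The following is a minimal sketch, assuming a missing summary is represented as `None` (or the key is absent); the helper name `available_summaries` and the sample instance are hypothetical and only mirror the field names listed above.

```python
from typing import Any, Dict

# Field names taken from the data fields list above.
SUMMARY_FIELDS = ("summary/long", "summary/short", "summary/tiny")


def available_summaries(instance: Dict[str, Any]) -> Dict[str, str]:
    """Return only the summaries that are present for one case.

    Assumption: a missing optional summary is stored as None or the
    key is absent. This helper is illustrative, not part of the dataset.
    """
    found = {}
    for field in SUMMARY_FIELDS:
        text = instance.get(field)
        if text:  # skips None and empty strings
            found[field] = text
    return found


# Hypothetical instance shaped like the fields described above.
sample = {
    "id": "EXAMPLE-0001",
    "sources": ["text of document 1", "text of document 2"],
    "summary/long": "A multi-paragraph summary.",
    "summary/short": "A one-paragraph summary.",
    "summary/tiny": None,
}

print(sorted(available_summaries(sample)))
```

The same check could be applied per split to select only the cases that have a given summary length before training or evaluation.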
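Data splits
- Training set: 3,177 instances, 3,177 long summaries, 2,210 short summaries, 1,130 tiny summaries.
- Test set: 908 instances, 908 long summaries, 616 short summaries, 312 tiny summaries.
- Validation (dev) set: 454 instances, 454 long summaries, 312 short summaries, 161 tiny summaries.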
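
The split figures above can be cross-checked with a few lines of arithmetic. This is a small sketch that only re-adds the counts listed here; the dictionary layout and variable names are illustrative choices, not part of the dataset.

```python
# Counts copied from the split figures above: (instances, long, short, tiny).
splits = {
    "train": (3177, 3177, 2210, 1130),
    "test": (908, 908, 616, 312),
    "dev": (454, 454, 312, 161),
}

total_instances = sum(counts[0] for counts in splits.values())
total_summaries = sum(sum(counts[1:]) for counts in splits.values())

for name, counts in splits.items():
    share = counts[0] / total_instances  # fraction of all cases in this split
    print(f"{name}: {counts[0]} cases ({share:.1%}), {sum(counts[1:])} summaries")

print("total cases:", total_instances)
print("total summaries:", total_summaries)
```

The totals come to 4,539 cases and 9,280 summaries, which matches the 9,280 summaries quoted in the dataset summary and the 70% / 20% / 10% split proportions.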

Licensing information

Dataset: Open Data Commons Attribution License (ODC-By).
Case summaries and metadata: Creative Commons Attribution-NonCommercial License (CC BY-NC).
Source documents: public domain.

Dataset creators

Civil Rights Litigation Clearinghouse (CRLC, University of Michigan) and Allen Institute for AI.

Citation information

@article{Shen2022MultiLexSum,
  author  = {Zejiang Shen and Kyle Lo and Lauren Yu and Nathan Dahlberg and Margo Schlanger and Doug Downey},
  title   = {Multi-LexSum: Real-World Summaries of Civil Rights Lawsuits at Multiple Granularities},
  journal = {CoRR},
  volume  = {abs/2206.10883},
  year    = {2022},
  url     = {https://doi.org/10.48550/arXiv.2206.10883},
  doi     = {10.48550/arXiv.2206.10883}
}
