five

allenai/multi_lexsum

收藏
Hugging Face2023-05-18 更新2024-03-04 收录
下载链接:
https://hf-mirror.com/datasets/allenai/multi_lexsum
下载链接
链接失效反馈
官方服务:
资源简介:
--- annotations_creators: - expert-generated language: - en language_creators: - found license: - odc-by multilinguality: - monolingual pretty_name: Multi-LexSum size_categories: - 1K<n<10K - 10K<n<100K source_datasets: - original tags: [] task_categories: - summarization task_ids: [] --- # Dataset Card for Multi-LexSum ## Table of Contents - [Dataset Card for Multi-LexSum](#dataset-card-for-multi-lexsum) - [Table of Contents](#table-of-contents) - [Dataset Description](#dataset-description) - [Dataset Summary](#dataset-summary) - [Languages](#languages) - [Dataset](#dataset) - [Data Fields](#data-fields) - [Data Splits](#data-splits) - [Dataset Sheet (Datasheet)](#dataset-sheet-datasheet) - [Additional Information](#additional-information) - [Dataset Curators](#dataset-curators) - [Licensing Information](#licensing-information) - [Citation Information](#citation-information) - [Release History](#release-history) ## Dataset Description - **Homepage:** https://multilexsum.github.io - **Repository:** https://github.com/multilexsum/dataset - **Paper:** https://arxiv.org/abs/2206.10883 <p> <a href="https://multilexsum.github.io" style="display: inline-block;"> <img src="https://img.shields.io/badge/-homepage-informational.svg?logo=jekyll" title="Multi-LexSum Paper" style="margin-top: 0.25rem; margin-bottom: 0.25rem"></a> <a href="https://github.com/multilexsum/dataset" style="display: inline-block;"> <img src="https://img.shields.io/badge/-multilexsum-lightgrey.svg?logo=github" title="Multi-LexSum Github Repo" style="margin-top: 0.25rem; margin-bottom: 0.25rem"></a> <a href="https://arxiv.org/abs/2206.10883" style="display: inline-block;"> <img src="https://img.shields.io/badge/NeurIPS-2022-9cf" title="Multi-LexSum is accepted in NeurIPS 2022" style="margin-top: 0.25rem; margin-bottom: 0.25rem"></a> </p> ### Talk @ NeurIPS 2022 [![Watch the video](https://img.youtube.com/vi/C-fwW_ZhkE8/0.jpg)](https://youtu.be/C-fwW_ZhkE8) ### Dataset Summary The Multi-LexSum dataset is a collection of 9,280 such legal case summaries. Multi-LexSum is distinct from other datasets in its **multiple target summaries, each at a different granularity** (ranging from one-sentence “extreme” summaries to multi-paragraph narrations of over five hundred words). It presents a challenging multi-document summarization task given **the long length of the source documents**, often exceeding two hundred pages per case. Unlike other summarization datasets that are (semi-)automatically curated, Multi-LexSum consists of **expert-authored summaries**: the experts—lawyers and law students—are trained to follow carefully created guidelines, and their work is reviewed by an additional expert to ensure quality. ### Languages English ## Dataset ### Data Fields The dataset contains a list of instances (cases); each instance contains the following data: | Field | Description | | ------------: | -------------------------------------------------------------------------------: | | id | `(str)` The case ID | | sources | `(List[str])` A list of strings for the text extracted from the source documents | | summary/long | `(str)` The long (multi-paragraph) summary for this case | | summary/short | `(Optional[str])` The short (one-paragraph) summary for this case | | summary/tiny | `(Optional[str])` The tiny (one-sentence) summary for this case | Please check the exemplar usage below for loading the data: ```python from datasets import load_dataset multi_lexsum = load_dataset("allenai/multi_lexsum", name="v20230518") # Download multi_lexsum locally and load it as a Dataset object example = multi_lexsum["validation"][0] # The first instance of the dev set example["sources"] # A list of source document text for the case for sum_len in ["long", "short", "tiny"]: print(example["summary/" + sum_len]) # Summaries of three lengths print(example['case_metadata']) # The corresponding metadata for a case in a dict ``` ### Data Splits | | Instances | Source Documents (D) | Long Summaries (L) | Short Summaries (S) | Tiny Summaries (T) | Total Summaries | | ----------: | --------: | -------------------: | -----------------: | ------------------: | -----------------: | --------------: | | Train (70%) | 3,177 | 28,557 | 3,177 | 2,210 | 1,130 | 6,517 | | Test (20%) | 908 | 7,428 | 908 | 616 | 312 | 1,836 | | Dev (10%) | 454 | 4,134 | 454 | 312 | 161 | 927 | ## Dataset Sheet (Datasheet) Please check our [dataset sheet](https://multilexsum.github.io/datasheet) for details regarding dataset creation, source data, annotation, and considerations for the usage. ## Additional Information ### Dataset Curators The dataset is created by the collaboration between Civil Rights Litigation Clearinghouse (CRLC, from University of Michigan) and Allen Institute for AI. Multi-LexSum builds on the dataset used and posted by the Clearinghouse to inform the public about civil rights litigation. ### Licensing Information The Multi-LexSum dataset is distributed under the [Open Data Commons Attribution License (ODC-By)](https://opendatacommons.org/licenses/by/1-0/). The case summaries and metadata are licensed under the [Creative Commons Attribution License (CC BY-NC)](https://creativecommons.org/licenses/by-nc/4.0/), and the source documents are already in the public domain. Commercial users who desire a license for summaries and metadata can contact [info@clearinghouse.net](mailto:info@clearinghouse.net), which will allow free use but limit summary re-posting. The corresponding code for downloading and loading the dataset is licensed under the Apache License 2.0. ### Citation Information ``` @article{Shen2022MultiLexSum, author = {Zejiang Shen and Kyle Lo and Lauren Yu and Nathan Dahlberg and Margo Schlanger and Doug Downey}, title = {Multi-LexSum: Real-World Summaries of Civil Rights Lawsuits at Multiple Granularities}, journal = {CoRR}, volume = {abs/2206.10883}, year = {2022},**** url = {https://doi.org/10.48550/arXiv.2206.10883}, doi = {10.48550/arXiv.2206.10883} } ``` ## Release History | Version | Description | | ----------: | -----------------------------------------------------------: | | `v20230518` | The v1.1 release including case and source document metadata | | `v20220616` | The initial v1.0 release |
提供机构:
allenai
原始信息汇总

数据集概述

数据集名称

  • Multi-LexSum

语言

  • 英语

数据集大小

  • 1K<n<10K
  • 10K<n<100K

数据集来源

  • 原始数据

任务类别

  • 摘要生成

数据集内容

  • 数据字段

    • id:案件ID,类型为字符串。
    • sources:源文档文本列表,类型为字符串列表。
    • summary/long:长(多段落)摘要,类型为字符串。
    • summary/short:短(一段)摘要,类型为可选字符串。
    • summary/tiny:极短(一句)摘要,类型为可选字符串。
  • 数据分割

    • 训练集:3,177个实例,3,177个长摘要,2,210个短摘要,1,130个极短摘要。
    • 测试集:908个实例,908个长摘要,616个短摘要,312个极短摘要。
    • 验证集:454个实例,454个长摘要,312个短摘要,161个极短摘要。

许可证信息

数据集创建者

  • Civil Rights Litigation Clearinghouse (CRLC, University of Michigan)Allen Institute for AI

引用信息

@article{Shen2022MultiLexSum, author = {Zejiang Shen and Kyle Lo and Lauren Yu and Nathan Dahlberg and Margo Schlanger and Doug Downey}, title = {Multi-LexSum: Real-World Summaries of Civil Rights Lawsuits at Multiple Granularities}, journal = {CoRR}, volume = {abs/2206.10883}, year = {2022}, url = {https://doi.org/10.48550/arXiv.2206.10883}, doi = {10.48550/arXiv.2206.10883} }

5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作