
rcds/MultiLegalNeg

Hugging Face | Updated 2023-10-25 | Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/rcds/MultiLegalNeg
Resource description:
---
license: cc-by-nd-4.0
viewer: true
task_categories:
- token-classification
tags:
- legal
pretty_name: Multilingual Negation Scope Resolution
size_categories:
- 1K<n<10K
---

# Dataset Card for MultiLegalNeg

### Dataset Summary

This dataset consists of German, French, and Italian court documents annotated for negation cues and negation scopes. It also includes reformatted versions of ConanDoyle-neg ([Morante and Blanco. 2012](https://aclanthology.org/S12-1035/)), SFU Review ([Konstantinova et al. 2012](http://www.lrec-conf.org/proceedings/lrec2012/pdf/533_Paper.pdf)), BioScope ([Szarvas et al. 2008](https://aclanthology.org/W08-0606/)) and Dalloux ([Dalloux et al. 2020](https://clementdalloux.fr/?page_id=28)).

### Languages

| Language             | Subset             | Number of sentences | Negated sentences |
|----------------------|--------------------|---------------------|-------------------|
| French               | **fr**             | 1059                | 382               |
| Italian              | **it**             | 1001                | 418               |
| German (Germany)     | **de(DE)**         | 1068                | 1098              |
| German (Switzerland) | **de(CH)**         | 206                 | 208               |
| English              | **SFU Review**     | 17672               | 3528              |
| English              | **BioScope**       | 14700               | 2095              |
| English              | **ConanDoyle-neg** | 5714                | 5714              |
| French               | **Dalloux**        | 11032               | 1817              |

## Dataset Structure

### Data Fields

- text (string): full sentence
- spans (list): list of annotated cues and scopes
  - start (int): offset of the beginning of the annotation
  - end (int): offset of the end of the annotation
  - token_start (int): id of the first token in the annotation
  - token_end (int): id of the last token in the annotation
  - label (string): CUE or SCOPE
- tokens (list): list of tokens in the sentence
  - text (string): token text
  - start (int): offset of the first character
  - end (int): offset of the last character
  - id (int): token id
  - ws (boolean): indicates whether the token is followed by a white space

### Data Splits

For each subset, a train (70%), test (20%), and validation (10%) split is available.
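As a minimal sketch of how the `text` and `spans` fields fit together, the snippet below builds one hypothetical record (the sentence and its offsets are invented for illustration, not taken from the dataset) and recovers the annotated text of each span from its character offsets. It assumes that `end` is an exclusive offset, as in Python slicing; the card does not state this, so it is worth checking against a real record. The helper name `span_texts` and its `exclusive_end` switch are likewise illustrative.

```python
# Hypothetical record following the documented schema; the sentence and
# offsets are invented for illustration and do not come from the dataset.
record = {
    "text": "The court did not accept the appeal.",
    "spans": [
        {"start": 14, "end": 17, "token_start": 3, "token_end": 3, "label": "CUE"},
        {"start": 18, "end": 35, "token_start": 4, "token_end": 6, "label": "SCOPE"},
    ],
}


def span_texts(record, exclusive_end=True):
    """Return (label, text) pairs for every annotated span in a record.

    ASSUMPTION: `end` is exclusive by default. Pass exclusive_end=False
    if real records turn out to use an inclusive end offset.
    """
    extra = 0 if exclusive_end else 1
    return [
        (span["label"], record["text"][span["start"]:span["end"] + extra])
        for span in record["spans"]
    ]


print(span_texts(record))
```

If the extracted strings come out one character short on real data, the offsets are probably inclusive and `exclusive_end=False` is the setting to try.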

#### How to use this dataset

To load all data use `'all_all'`, or specify which dataset to load as the second argument. The available configurations are `'de', 'fr', 'it', 'swiss', 'fr_dalloux', 'fr_all', 'en_bioscope', 'en_sherlock', 'en_sfu', 'en_all', 'all_all'`.

```
from datasets import load_dataset

# The second argument selects the configuration; "all_all" loads every subset.
dataset = load_dataset("rcds/MultiLegalNeg", "all_all")
dataset
```

```
DatasetDict({
    train: Dataset({
        features: ['text', 'spans', 'tokens'],
        num_rows: 26440
    })
    test: Dataset({
        features: ['text', 'spans', 'tokens'],
        num_rows: 7593
    })
    validation: Dataset({
        features: ['text', 'spans', 'tokens'],
        num_rows: 4053
    })
})
```

### Source Data

| Subset             | Source |
|--------------------|--------|
| **fr**             | [Niklaus et al. 2021](https://aclanthology.org/2021.nllp-1.3/), [Niklaus et al. 2023](https://arxiv.org/abs/2306.02069) |
| **it**             | [Niklaus et al. 2021](https://aclanthology.org/2021.nllp-1.3/), [Niklaus et al. 2023](https://arxiv.org/abs/2306.02069) |
| **de(DE)**         | [Glaser et al. 2021](https://www.scitepress.org/Link.aspx?doi=10.5220/0010246308120821) |
| **de(CH)**         | [Niklaus et al. 2021](https://aclanthology.org/2021.nllp-1.3/) |
| **SFU Review**     | [Konstantinova et al. 2012](http://www.lrec-conf.org/proceedings/lrec2012/pdf/533_Paper.pdf) |
| **BioScope**       | [Szarvas et al. 2008](https://aclanthology.org/W08-0606/) |
| **ConanDoyle-neg** | [Morante and Blanco. 2012](https://aclanthology.org/S12-1035/) |
| **Dalloux**        | [Dalloux et al. 2020](https://clementdalloux.fr/?page_id=28) |

### Annotations

The data is annotated for negation cues and their scopes.
Annotation guidelines are available [here](https://github.com/RamonaChristen/Multilingual_Negation_Scope_Resolution_on_Legal_Data/blob/main/Annotation_Guidelines.pdf).

#### Annotation process

Each language was annotated by one native-speaking annotator following strict annotation guidelines.

### Citation Information

Please cite the following preprint:

```
@misc{christen2023resolving,
      title={Resolving Legalese: A Multilingual Exploration of Negation Scope Resolution in Legal Documents},
      author={Ramona Christen and Anastassia Shaitarova and Matthias Stürmer and Joel Niklaus},
      year={2023},
      eprint={2309.08695},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
Provider:
rcds
Summary of original information

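Dataset card: MultiLegalNeg

Dataset overview

The dataset contains German, French, and Italian court documents annotated for negation cues and negation scopes. It also includes reformatted versions of the following datasets:

  • ConanDoyle-neg (Morante and Blanco. 2012)
  • SFU Review (Konstantinova et al. 2012)
  • BioScope (Szarvas et al. 2008)
  • Dalloux (Dalloux et al. 2020)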

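Languages

| Language             | Subset         | Number of sentences | Negated sentences |
|----------------------|----------------|---------------------|-------------------|
| French               | fr             | 1059                | 382               |
| Italian              | it             | 1001                | 418               |
| German (Germany)     | de(DE)         | 1068                | 1098              |
| German (Switzerland) | de(CH)         | 206                 | 208               |
| English              | SFU Review     | 17672               | 3528              |
| English              | BioScope       | 14700               | 2095              |
| English              | ConanDoyle-neg | 5714                | 5714              |
| French               | Dalloux        | 11032               | 1817              |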

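Dataset structure

Data fields

  • text (string): full sentence
  • spans (list): list of annotated cues and scopes
    • start (int): offset of the beginning of the annotation
    • end (int): offset of the end of the annotation
    • token_start (int): ID of the first token in the annotation
    • token_end (int): ID of the last token in the annotation
    • label (string): CUE or SCOPE
  • tokens (list): list of tokens in the sentence
    • text (string): token text
    • start (int): offset of the first character
    • end (int): offset of the last character
    • id (int): token ID
    • ws (boolean): indicates whether the token is followed by a white space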
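
Since the dataset is tagged for token classification, one common step is to turn the span annotations into one label per token. The sketch below does this on a hypothetical record (invented for illustration, not taken from the dataset). It assumes that `token_end` is inclusive, as the field description suggests, and that token IDs match the `id` values in `tokens`. The `"O"` label for unannotated tokens and the rule that a CUE overrides an enclosing SCOPE are choices made for this example, not something the dataset prescribes.

```python
# Hypothetical record following the documented schema (invented example).
record = {
    "text": "He did not sign.",
    "spans": [
        {"start": 7, "end": 10, "token_start": 2, "token_end": 2, "label": "CUE"},
        {"start": 0, "end": 15, "token_start": 0, "token_end": 3, "label": "SCOPE"},
    ],
    "tokens": [
        {"text": "He", "start": 0, "end": 2, "id": 0, "ws": True},
        {"text": "did", "start": 3, "end": 6, "id": 1, "ws": True},
        {"text": "not", "start": 7, "end": 10, "id": 2, "ws": True},
        {"text": "sign", "start": 11, "end": 15, "id": 3, "ws": False},
        {"text": ".", "start": 15, "end": 16, "id": 4, "ws": False},
    ],
}


def token_labels(record, outside="O"):
    """Assign one label per token from the token_start/token_end span bounds."""
    labels = {tok["id"]: outside for tok in record["tokens"]}
    # Write SCOPE spans first so that a CUE inside a scope keeps the CUE label.
    for span in sorted(record["spans"], key=lambda s: s["label"] == "CUE"):
        # ASSUMPTION: token_end is inclusive, hence the + 1.
        for tok_id in range(span["token_start"], span["token_end"] + 1):
            labels[tok_id] = span["label"]
    return [labels[tok["id"]] for tok in record["tokens"]]


print(list(zip([t["text"] for t in record["tokens"]], token_labels(record))))
```

A model that needs cue and scope as separate label sequences could call the same loop twice, once per label, instead of merging them.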

Data splits

Each subset is available as a train (70%), test (20%), and validation (10%) split.
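
As a quick sanity check on these proportions, the sketch below computes the share of each split from the row counts that the dataset card reports for the `all_all` configuration (26440 train, 7593 test, 4053 validation). Only the arithmetic is new here; the counts come from the card.

```python
# Row counts reported in the dataset card for the "all_all" configuration.
split_rows = {"train": 26440, "test": 7593, "validation": 4053}

total = sum(split_rows.values())
shares = {name: rows / total for name, rows in split_rows.items()}

for name, share in shares.items():
    print(f"{name}: {rows_fmt}" if False else f"{name}: {share:.1%}")
```

The shares come out close to, but not exactly, 70/20/10, which is what one would expect if each subset is split separately and the results are then combined.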

How to use this dataset

The dataset can be loaded with any of the following configurations:

  • de, fr, it, swiss, fr_dalloux, fr_all, en_bioscope, en_sherlock, en_sfu, en_all, all_all

```python
from datasets import load_dataset

dataset = load_dataset("rcds/MultiLegalNeg", "all_all")
```
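
To load a single subset rather than everything, the configuration name is passed as the second argument. The sketch below wraps that call in a small helper that rejects names outside the list above before touching the network. The helper name `load_subset` is hypothetical, and the `datasets` import is deferred into the function so that the name check works even where the library is not installed.

```python
# Configuration names as listed in the dataset card.
CONFIGS = (
    "de", "fr", "it", "swiss", "fr_dalloux", "fr_all",
    "en_bioscope", "en_sherlock", "en_sfu", "en_all", "all_all",
)


def load_subset(config="all_all"):
    """Load one configuration of rcds/MultiLegalNeg after validating its name.

    `load_subset` is an illustrative helper, not part of the dataset or of
    the `datasets` library.
    """
    if config not in CONFIGS:
        raise ValueError(f"unknown configuration {config!r}; expected one of {CONFIGS}")
    # Imported lazily so the validation above has no third-party dependency.
    from datasets import load_dataset
    return load_dataset("rcds/MultiLegalNeg", config)


# Example usage (requires the `datasets` package and network access):
# swiss = load_subset("swiss")
# print(swiss["train"][0]["text"])
```

Failing early on a misspelled name gives a clearer error than whatever the hub lookup would otherwise report.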

Source data

| Subset         | Source                    |
|----------------|---------------------------|
| fr             | Niklaus et al. 2021, 2023 |
| it             | Niklaus et al. 2021, 2023 |
| de(DE)         | Glaser et al. 2021        |
| de(CH)         | Niklaus et al. 2021       |
| SFU Review     | Konstantinova et al. 2012 |
| BioScope       | Szarvas et al. 2008       |
| ConanDoyle-neg | Morante and Blanco. 2012  |
| Dalloux        | Dalloux et al. 2020       |

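Annotations

The dataset is annotated for negation cues and their scopes. The annotation guidelines are available here: [Annotation Guidelines](https://github.com/RamonaChristen/Multilingual_Negation_Scope_Resolution_on_Legal_Data/blob/main/Annotation_Guidelines.pdf)

Annotation process

Each language was annotated by one native-speaking annotator following strict annotation guidelines.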

Citation information

Please cite the following preprint:

@misc{christen2023resolving,
      title={Resolving Legalese: A Multilingual Exploration of Negation Scope Resolution in Legal Documents},
      author={Ramona Christen and Anastassia Shaitarova and Matthias Stürmer and Joel Niklaus},
      year={2023},
      eprint={2309.08695},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
