rcds/MultiLegalNeg

Name: rcds/MultiLegalNeg
Creator: rcds
Published: 2023-10-25 17:59:53
License: cc-by-nd-4.0

Hugging Face · Updated 2023-10-25 · Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/rcds/MultiLegalNeg

Resource description:

---
license: cc-by-nd-4.0
viewer: true
task_categories:
- token-classification
tags:
- legal
pretty_name: Multilingual Negation Scope Resolution
size_categories:
- 1K<n<10K
---

# Dataset Card for MultiLegalNeg

### Dataset Summary

This dataset consists of German, French, and Italian court documents annotated for negation cues and negation scopes. It also includes reformatted versions of ConanDoyle-neg ([Morante and Blanco 2012](https://aclanthology.org/S12-1035/)), SFU Review ([Konstantinova et al. 2012](http://www.lrec-conf.org/proceedings/lrec2012/pdf/533_Paper.pdf)), BioScope ([Szarvas et al. 2008](https://aclanthology.org/W08-0606/)) and Dalloux ([Dalloux et al. 2020](https://clementdalloux.fr/?page_id=28)).

### Languages

| Language             | Subset             | Number of sentences | Negated sentences |
|----------------------|--------------------|---------------------|-------------------|
| French               | **fr**             | 1059                | 382               |
| Italian              | **it**             | 1001                | 418               |
| German (Germany)     | **de(DE)**         | 1068                | 1098              |
| German (Switzerland) | **de(CH)**         | 206                 | 208               |
| English              | **SFU Review**     | 17672               | 3528              |
| English              | **BioScope**       | 14700               | 2095              |
| English              | **ConanDoyle-neg** | 5714                | 5714              |
| French               | **Dalloux**        | 11032               | 1817              |

## Dataset Structure

### Data Fields

- text (string): full sentence
- spans (list): list of annotated cues and scopes
  - start (int): offset of the beginning of the annotation
  - end (int): offset of the end of the annotation
  - token_start (int): id of the first token in the annotation
  - token_end (int): id of the last token in the annotation
  - label (string): CUE or SCOPE
- tokens (list): list of tokens in the sentence
  - text (string): token text
  - start (int): offset of the first character
  - end (int): offset of the last character
  - id (int): token id
  - ws (boolean): indicates whether the token is followed by a white space

### Data Splits

For each subset, a train (70%), test (20%), and validation (10%) split is available.
#### How to use this dataset

Pass a configuration name as the second argument to `load_dataset`: use ```'all_all'``` to load all data, or name a single subset. The available configurations are ```'de', 'fr', 'it', 'swiss', 'fr_dalloux', 'fr_all', 'en_bioscope', 'en_sherlock', 'en_sfu', 'en_all', 'all_all'```

```
from datasets import load_dataset

dataset = load_dataset("rcds/MultiLegalNeg", "all_all")
dataset
```

```
DatasetDict({
    train: Dataset({
        features: ['text', 'spans', 'tokens'],
        num_rows: 26440
    })
    test: Dataset({
        features: ['text', 'spans', 'tokens'],
        num_rows: 7593
    })
    validation: Dataset({
        features: ['text', 'spans', 'tokens'],
        num_rows: 4053
    })
})
```

### Source Data

| Subset             | Source |
|--------------------|--------|
| **fr**             | [Niklaus et al. 2021](https://aclanthology.org/2021.nllp-1.3/), [Niklaus et al. 2023](https://arxiv.org/abs/2306.02069) |
| **it**             | [Niklaus et al. 2021](https://aclanthology.org/2021.nllp-1.3/), [Niklaus et al. 2023](https://arxiv.org/abs/2306.02069) |
| **de(DE)**         | [Glaser et al. 2021](https://www.scitepress.org/Link.aspx?doi=10.5220/0010246308120821) |
| **de(CH)**         | [Niklaus et al. 2021](https://aclanthology.org/2021.nllp-1.3/) |
| **SFU Review**     | [Konstantinova et al. 2012](http://www.lrec-conf.org/proceedings/lrec2012/pdf/533_Paper.pdf) |
| **BioScope**       | [Szarvas et al. 2008](https://aclanthology.org/W08-0606/) |
| **ConanDoyle-neg** | [Morante and Blanco 2012](https://aclanthology.org/S12-1035/) |
| **Dalloux**        | [Dalloux et al. 2020](https://clementdalloux.fr/?page_id=28) |

### Annotations

The data is annotated for negation cues and their scopes.
Annotation guidelines are available [here](https://github.com/RamonaChristen/Multilingual_Negation_Scope_Resolution_on_Legal_Data/blob/main/Annotation_Guidelines.pdf).

#### Annotation process

Each language was annotated by one native-speaking annotator following strict annotation guidelines.

### Citation Information

Please cite the following preprint:

```
@misc{christen2023resolving,
      title={Resolving Legalese: A Multilingual Exploration of Negation Scope Resolution in Legal Documents},
      author={Ramona Christen and Anastassia Shaitarova and Matthias Stürmer and Joel Niklaus},
      year={2023},
      eprint={2309.08695},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

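Provider:

rcds

Original information summary

Dataset card: MultiLegalNeg

Dataset overview

This dataset contains German, French, and Italian court documents annotated for negation cues and negation scopes. It also includes reformatted versions of the following datasets:

- ConanDoyle-neg (Morante and Blanco 2012)
- SFU Review (Konstantinova et al. 2012)
- BioScope (Szarvas et al. 2008)
- Dalloux (Dalloux et al. 2020)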

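Languages

| Language             | Subset         | Number of sentences | Negated sentences |
|----------------------|----------------|---------------------|-------------------|
| French               | fr             | 1059                | 382               |
| Italian              | it             | 1001                | 418               |
| German (Germany)     | de(DE)         | 1068                | 1098              |
| German (Switzerland) | de(CH)         | 206                 | 208               |
| English              | SFU Review     | 17672               | 3528              |
| English              | BioScope       | 14700               | 2095              |
| English              | ConanDoyle-neg | 5714                | 5714              |
| French               | Dalloux        | 11032               | 1817              |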

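Dataset structure

Data fields

- text (string): full sentence
- spans (list): list of annotated cues and scopes
  - start (int): offset of the beginning of the annotation
  - end (int): offset of the end of the annotation
  - token_start (int): id of the first token in the annotation
  - token_end (int): id of the last token in the annotation
  - label (string): CUE or SCOPE
- tokens (list): list of tokens in the sentence
  - text (string): token text
  - start (int): offset of the first character
  - end (int): offset of the last character
  - id (int): token id
  - ws (boolean): indicates whether the token is followed by a white space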
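
As a rough illustration of how these fields fit together, the sketch below builds one hand-written record in the schema above and reads it three ways: slicing `text` by span offsets, tagging tokens by `token_start`/`token_end`, and rebuilding the sentence from `ws`. The sentence, offsets, and helper names (`span_texts`, `token_labels`, `detokenize`) are invented for illustration and are not taken from the dataset. It assumes that `end` is an exclusive character offset and that token ids match list positions; the card does not state either, so check against real records before relying on this.

```python
# Hypothetical record in the card's schema (not an actual dataset row).
record = {
    "text": "The court did not accept the appeal.",
    "spans": [
        {"start": 14, "end": 17, "token_start": 3, "token_end": 3, "label": "CUE"},
        {"start": 18, "end": 35, "token_start": 4, "token_end": 6, "label": "SCOPE"},
    ],
    "tokens": [
        {"text": "The", "start": 0, "end": 3, "id": 0, "ws": True},
        {"text": "court", "start": 4, "end": 9, "id": 1, "ws": True},
        {"text": "did", "start": 10, "end": 13, "id": 2, "ws": True},
        {"text": "not", "start": 14, "end": 17, "id": 3, "ws": True},
        {"text": "accept", "start": 18, "end": 24, "id": 4, "ws": True},
        {"text": "the", "start": 25, "end": 28, "id": 5, "ws": True},
        {"text": "appeal", "start": 29, "end": 35, "id": 6, "ws": False},
        {"text": ".", "start": 35, "end": 36, "id": 7, "ws": False},
    ],
}


def span_texts(rec):
    """Group annotated substrings by label (assumes exclusive `end` offsets)."""
    grouped = {}
    for span in rec["spans"]:
        grouped.setdefault(span["label"], []).append(rec["text"][span["start"]:span["end"]])
    return grouped


def token_labels(rec):
    """One label per token: CUE, SCOPE, or O (assumes token id == list index)."""
    labels = ["O"] * len(rec["tokens"])
    # Apply SCOPE spans first so that a CUE wins where the two overlap.
    for span in sorted(rec["spans"], key=lambda s: s["label"] == "CUE"):
        for i in range(span["token_start"], span["token_end"] + 1):
            labels[i] = span["label"]
    return labels


def detokenize(rec):
    """Rebuild the sentence from token texts and their `ws` flags."""
    return "".join(t["text"] + (" " if t["ws"] else "") for t in rec["tokens"])
```

The per-token labels are the form a token-classification model would typically consume, while the character offsets are convenient for highlighting cues and scopes in the original sentence.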

Data splits

Each subset has a training split (70%), a test split (20%), and a validation split (10%).
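
As a quick sanity check on these proportions, the sketch below computes the share of each split from the row counts that the dataset card reports for the `all_all` configuration (26440 train, 7593 test, 4053 validation). The helper name `split_fractions` is illustrative only.

```python
# Row counts as reported in the dataset card's `all_all` example output.
ALL_ALL_ROWS = {"train": 26440, "test": 7593, "validation": 4053}


def split_fractions(rows):
    """Return each split's share of the total row count, rounded to 3 places."""
    total = sum(rows.values())
    return {name: round(count / total, 3) for name, count in rows.items()}


fractions = split_fractions(ALL_ALL_ROWS)
for name, share in fractions.items():
    print(f"{name}: {share:.1%}")
```

For the combined configuration the shares come out close to, though not exactly, the nominal 70/20/10.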

How to use this dataset

The dataset can be loaded with any of the following configurations:

de, fr, it, swiss, fr_dalloux, fr_all, en_bioscope, en_sherlock, en_sfu, en_all, all_all

```python
from datasets import load_dataset

dataset = load_dataset("rcds/MultiLegalNeg", "all_all")
```
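
Because the configuration names differ from the subset labels used in the tables (for example, the list contains `swiss` but not `de(CH)`), a thin wrapper that rejects unknown names before any download starts can be handy. This is a minimal sketch: `load_subset` and `CONFIGS` are hypothetical names, and the only library call used is `datasets.load_dataset` with the repository id and configuration name shown above.

```python
# Configuration names as listed in the dataset card.
CONFIGS = (
    "de", "fr", "it", "swiss", "fr_dalloux", "fr_all",
    "en_bioscope", "en_sherlock", "en_sfu", "en_all", "all_all",
)


def load_subset(name="all_all"):
    """Load one configuration of rcds/MultiLegalNeg after validating its name."""
    if name not in CONFIGS:
        raise ValueError(f"unknown configuration {name!r}; expected one of {CONFIGS}")
    # Imported lazily so the name check works even without `datasets` installed.
    from datasets import load_dataset
    return load_dataset("rcds/MultiLegalNeg", name)
```

Calling `load_subset("it")` should return a `DatasetDict` from which individual splits can be selected by key, such as `["train"]`; the call needs network access to the Hugging Face Hub or a local cache.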

Source data

| Subset         | Source                    |
|----------------|---------------------------|
| fr             | Niklaus et al. 2021, 2023 |
| it             | Niklaus et al. 2021, 2023 |
| de(DE)         | Glaser et al. 2021        |
| de(CH)         | Niklaus et al. 2021       |
| SFU Review     | Konstantinova et al. 2012 |
| BioScope       | Szarvas et al. 2008       |
| ConanDoyle-neg | Morante and Blanco 2012   |
| Dalloux        | Dalloux et al. 2020       |

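Annotations

The dataset is annotated for negation cues and their scopes. The annotation guidelines are available at https://github.com/RamonaChristen/Multilingual_Negation_Scope_Resolution_on_Legal_Data/blob/main/Annotation_Guidelines.pdf

Annotation process

Each language was annotated by one native-speaking annotator following strict annotation guidelines.

Citation information

Please cite the following preprint:

@misc{christen2023resolving,
      title={Resolving Legalese: A Multilingual Exploration of Negation Scope Resolution in Legal Documents},
      author={Ramona Christen and Anastassia Shaitarova and Matthias Stürmer and Joel Niklaus},
      year={2023},
      eprint={2309.08695},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}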
