rcds/MultiLegalNeg
Source: Hugging Face. Updated 2023-10-25; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/rcds/MultiLegalNeg
Resource description:
---
license: cc-by-nd-4.0
viewer: true
task_categories:
- token-classification
tags:
- legal
pretty_name: Multilingual Negation Scope Resolution
size_categories:
- 1K<n<10K
---
# Dataset Card for MultiLegalNeg
### Dataset Summary
This dataset consists of German, French, and Italian court documents annotated for negation cues and negation scopes. It also includes reformatted versions of ConanDoyle-neg ([Morante and Blanco 2012](https://aclanthology.org/S12-1035/)), SFU Review ([Konstantinova et al. 2012](http://www.lrec-conf.org/proceedings/lrec2012/pdf/533_Paper.pdf)), BioScope ([Szarvas et al. 2008](https://aclanthology.org/W08-0606/)), and Dalloux ([Dalloux et al. 2020](https://clementdalloux.fr/?page_id=28)).
### Languages
| Language | Subset | Number of sentences | Negated sentences |
|----------------------|-----------------|----------------------|-------------------|
| French | **fr** | 1059 | 382 |
| Italian | **it** | 1001 | 418 |
| German (Germany) | **de(DE)** | 1068 | 1098 |
| German (Switzerland) | **de(CH)** | 206 | 208 |
| English | **SFU Review** | 17672 | 3528 |
| English | **BioScope** | 14700 | 2095 |
| English | **ConanDoyle-neg**| 5714 | 5714 |
| French | **Dalloux** | 11032 | 1817 |
## Dataset Structure
### Data Fields
- text (string): full sentence
- spans (list): list of annotated cues and scopes
  - start (int): offset of the beginning of the annotation
  - end (int): offset of the end of the annotation
  - token_start (int): id of the first token in the annotation
  - token_end (int): id of the last token in the annotation
  - label (string): CUE or SCOPE
- tokens (list): list of tokens in the sentence
  - text (string): token text
  - start (int): offset of the first character
  - end (int): offset of the last character
  - id (int): token id
  - ws (boolean): indicates if the token is followed by a white space
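
To make the field layout concrete, here is a minimal sketch that rebuilds the text covered by each span from its token ids. The record is a made-up example (not taken from the dataset), the helper name `span_text` is hypothetical, and the code assumes that a record arrives as a list of span dicts plus a list of token dicts, as the field list above suggests.

```python
# Hypothetical record that follows the field layout described above.
# Character offsets here use an exclusive end; that convention is an assumption
# and should be checked against real records.
record = {
    "text": "He did not sign the contract.",
    "tokens": [
        {"text": "He", "start": 0, "end": 2, "id": 0, "ws": True},
        {"text": "did", "start": 3, "end": 6, "id": 1, "ws": True},
        {"text": "not", "start": 7, "end": 10, "id": 2, "ws": True},
        {"text": "sign", "start": 11, "end": 15, "id": 3, "ws": True},
        {"text": "the", "start": 16, "end": 19, "id": 4, "ws": True},
        {"text": "contract", "start": 20, "end": 28, "id": 5, "ws": False},
        {"text": ".", "start": 28, "end": 29, "id": 6, "ws": False},
    ],
    "spans": [
        {"start": 7, "end": 10, "token_start": 2, "token_end": 2, "label": "CUE"},
        {"start": 11, "end": 28, "token_start": 3, "token_end": 5, "label": "SCOPE"},
    ],
}


def span_text(record, span):
    """Rebuild the surface text of a span from its first to its last token (inclusive)."""
    covered = [
        tok for tok in record["tokens"]
        if span["token_start"] <= tok["id"] <= span["token_end"]
    ]
    # `ws` says whether a token is followed by white space.
    pieces = [tok["text"] + (" " if tok["ws"] else "") for tok in covered]
    return "".join(pieces).rstrip()


for span in record["spans"]:
    print(f"{span['label']}: {span_text(record, span)}")
# CUE: not
# SCOPE: sign the contract
```

Working from `token_start` and `token_end` rather than from the character offsets sidesteps the question of whether `end` is inclusive or exclusive.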
### Data Splits
For each subset a train (70%), test (20%), and validation (10%) split is available.
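
As a rough check of these proportions, the sketch below computes the share of each split from the row counts that this card reports for the `all_all` configuration. The variable names are illustrative only.

```python
# Row counts reported on this card for the "all_all" configuration.
split_rows = {"train": 26440, "test": 7593, "validation": 4053}

total = sum(split_rows.values())
split_share = {name: rows / total for name, rows in split_rows.items()}

for name, share in split_share.items():
    print(f"{name}: {share:.1%}")
# train: 69.4%
# test: 19.9%
# validation: 10.6%
```

The combined splits are close to, but not exactly, 70/20/10, which is expected when each subset is split separately and the results are then merged.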
#### How to use this dataset
To load all data, pass `'all_all'` as the second argument to `load_dataset`. To load a single subset or language group, pass its configuration name instead. The available configurations are:
`'de'`, `'fr'`, `'it'`, `'swiss'`, `'fr_dalloux'`, `'fr_all'`, `'en_bioscope'`, `'en_sherlock'`, `'en_sfu'`, `'en_all'`, `'all_all'`
```python
from datasets import load_dataset

# Load every subset at once; replace "all_all" with any other configuration name.
dataset = load_dataset("rcds/MultiLegalNeg", "all_all")
dataset
```
```
DatasetDict({
    train: Dataset({
        features: ['text', 'spans', 'tokens'],
        num_rows: 26440
    })
    test: Dataset({
        features: ['text', 'spans', 'tokens'],
        num_rows: 7593
    })
    validation: Dataset({
        features: ['text', 'spans', 'tokens'],
        num_rows: 4053
    })
})
```
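
To load a single subset instead of everything, a small wrapper can check the configuration name before calling `load_dataset`. This is only a sketch: `CONFIGS` and `load_subset` are hypothetical names, and the list of configurations is copied from this card rather than queried from the Hub.

```python
# Configuration names as listed on this card.
CONFIGS = (
    "de", "fr", "it", "swiss",
    "fr_dalloux", "fr_all",
    "en_bioscope", "en_sherlock", "en_sfu", "en_all",
    "all_all",
)


def load_subset(config="all_all"):
    """Load one configuration of rcds/MultiLegalNeg after validating its name."""
    if config not in CONFIGS:
        raise ValueError(f"Unknown configuration {config!r}; expected one of {CONFIGS}")
    # Imported lazily so that name validation works without `datasets` installed.
    from datasets import load_dataset
    return load_dataset("rcds/MultiLegalNeg", config)


# Example (downloads data, so it is left commented out):
# french_legal = load_subset("fr")
# print(french_legal["train"][0]["text"])
```

Validating the name up front gives a readable error message for typos such as `"ch"` instead of `"swiss"`.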
### Source Data
| Subset | Source |
|-------------------|----------------------|
| **fr** | [Niklaus et al. 2021](https://aclanthology.org/2021.nllp-1.3/), [Niklaus et al. 2023](https://arxiv.org/abs/2306.02069) |
| **it** | [Niklaus et al. 2021](https://aclanthology.org/2021.nllp-1.3/), [Niklaus et al. 2023](https://arxiv.org/abs/2306.02069) |
| **de(DE)** | [Glaser et al. 2021](https://www.scitepress.org/Link.aspx?doi=10.5220/0010246308120821) |
| **de(CH)** | [Niklaus et al. 2021](https://aclanthology.org/2021.nllp-1.3/) |
| **SFU Review** | [Konstantinova et al. 2012](http://www.lrec-conf.org/proceedings/lrec2012/pdf/533_Paper.pdf) |
| **BioScope** | [Szarvas et al. 2008](https://aclanthology.org/W08-0606/) |
| **ConanDoyle-neg**| [Morante and Blanco 2012](https://aclanthology.org/S12-1035/) |
| **Dalloux** | [Dalloux et al. 2020](https://clementdalloux.fr/?page_id=28) |
### Annotations
The data is annotated for negation cues and their scopes. Annotation guidelines are available [here](https://github.com/RamonaChristen/Multilingual_Negation_Scope_Resolution_on_Legal_Data/blob/main/Annotation_Guidelines.pdf).
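
For token classification, span annotations are typically turned into one label per token. The sketch below shows one way to do that. The label set (`O`, `CUE`, `SCOPE`), the rule that a cue overrides an overlapping scope, and the helper name `token_labels` are assumptions made for illustration, not the scheme used by the dataset authors. The record is a made-up example reduced to the fields the helper needs.

```python
# Hypothetical record with only the fields used below.
record = {
    "tokens": [
        {"text": "He", "id": 0},
        {"text": "did", "id": 1},
        {"text": "not", "id": 2},
        {"text": "sign", "id": 3},
        {"text": "the", "id": 4},
        {"text": "contract", "id": 5},
        {"text": ".", "id": 6},
    ],
    "spans": [
        {"token_start": 2, "token_end": 2, "label": "CUE"},
        {"token_start": 3, "token_end": 5, "label": "SCOPE"},
    ],
}


def token_labels(record):
    """Return one label per token: "CUE", "SCOPE", or "O" for unannotated tokens."""
    labels = ["O"] * len(record["tokens"])
    # Apply scopes first, then cues, so a cue wins where the two overlap (assumption).
    for wanted in ("SCOPE", "CUE"):
        for span in record["spans"]:
            if span["label"] != wanted:
                continue
            for position, tok in enumerate(record["tokens"]):
                # token_end is the id of the last token, so the range is inclusive.
                if span["token_start"] <= tok["id"] <= span["token_end"]:
                    labels[position] = wanted
    return labels


print(list(zip([t["text"] for t in record["tokens"]], token_labels(record))))
```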
#### Annotation process
Each language was annotated by one native-speaking annotator following strict annotation guidelines.
### Citation Information
Please cite the following preprint:
```
@misc{christen2023resolving,
title={Resolving Legalese: A Multilingual Exploration of Negation Scope Resolution in Legal Documents},
author={Ramona Christen and Anastassia Shaitarova and Matthias Stürmer and Joel Niklaus},
year={2023},
eprint={2309.08695},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
Provider:
rcds



