HiTZ/multilingual-abstrct
Source: Hugging Face. Updated 2024-04-12; indexed 2024-06-11.
Download link:
https://hf-mirror.com/datasets/HiTZ/multilingual-abstrct
Resource description:
---
dataset_info:
- config_name: en
  features:
  - name: id
    dtype: int64
  - name: tokens
    sequence: string
  - name: labels_txt
    sequence: string
  - name: labels
    sequence: int64
  splits:
  - name: neoplasm_train
    num_bytes: 3140715
    num_examples: 4404
  - name: neoplasm_dev
    num_bytes: 476131
    num_examples: 679
  - name: neoplasm_test
    num_bytes: 893795
    num_examples: 1251
  - name: glaucoma_test
    num_bytes: 821598
    num_examples: 1247
  - name: mixed_test
    num_bytes: 847284
    num_examples: 1147
  download_size: 787800
  dataset_size: 6179523
- config_name: es
  features:
  - name: id
    dtype: int64
  - name: tokens
    sequence: string
  - name: labels_txt
    sequence: string
  - name: labels
    sequence: int64
  splits:
  - name: neoplasm_train
    num_bytes: 3409630
    num_examples: 4404
  - name: neoplasm_dev
    num_bytes: 508674
    num_examples: 679
  - name: neoplasm_test
    num_bytes: 959509
    num_examples: 1251
  - name: glaucoma_test
    num_bytes: 884585
    num_examples: 1247
  - name: mixed_test
    num_bytes: 906728
    num_examples: 1147
  download_size: 910927
  dataset_size: 6669126
- config_name: fr
  features:
  - name: id
    dtype: int64
  - name: tokens
    sequence: string
  - name: labels_txt
    sequence: string
  - name: labels
    sequence: int64
  splits:
  - name: neoplasm_train
    num_bytes: 3555470
    num_examples: 4404
  - name: neoplasm_dev
    num_bytes: 537948
    num_examples: 679
  - name: neoplasm_test
    num_bytes: 1011572
    num_examples: 1251
  - name: glaucoma_test
    num_bytes: 912823
    num_examples: 1247
  - name: mixed_test
    num_bytes: 946807
    num_examples: 1147
  download_size: 929512
  dataset_size: 6964620
- config_name: it
  features:
  - name: id
    dtype: int64
  - name: tokens
    sequence: string
  - name: labels_txt
    sequence: string
  - name: labels
    sequence: int64
  splits:
  - name: neoplasm_train
    num_bytes: 3279617
    num_examples: 4405
  - name: neoplasm_dev
    num_bytes: 495956
    num_examples: 679
  - name: neoplasm_test
    num_bytes: 934068
    num_examples: 1251
  - name: glaucoma_test
    num_bytes: 862835
    num_examples: 1247
  - name: mixed_test
    num_bytes: 877966
    num_examples: 1147
  download_size: 897597
  dataset_size: 6450442
configs:
- config_name: en
  data_files:
  - split: neoplasm_train
    path: en/neoplasm_train-*
  - split: neoplasm_dev
    path: en/neoplasm_dev-*
  - split: neoplasm_test
    path: en/neoplasm_test-*
  - split: glaucoma_test
    path: en/glaucoma_test-*
  - split: mixed_test
    path: en/mixed_test-*
- config_name: es
  data_files:
  - split: neoplasm_train
    path: es/neoplasm_train-*
  - split: neoplasm_dev
    path: es/neoplasm_dev-*
  - split: neoplasm_test
    path: es/neoplasm_test-*
  - split: glaucoma_test
    path: es/glaucoma_test-*
  - split: mixed_test
    path: es/mixed_test-*
- config_name: fr
  data_files:
  - split: neoplasm_train
    path: fr/neoplasm_train-*
  - split: neoplasm_dev
    path: fr/neoplasm_dev-*
  - split: neoplasm_test
    path: fr/neoplasm_test-*
  - split: glaucoma_test
    path: fr/glaucoma_test-*
  - split: mixed_test
    path: fr/mixed_test-*
- config_name: it
  data_files:
  - split: neoplasm_train
    path: it/neoplasm_train-*
  - split: neoplasm_dev
    path: it/neoplasm_dev-*
  - split: neoplasm_test
    path: it/neoplasm_test-*
  - split: glaucoma_test
    path: it/glaucoma_test-*
  - split: mixed_test
    path: it/mixed_test-*
license: cc-by-nc-sa-4.0
task_categories:
- token-classification
language:
- en
- es
- fr
- it
tags:
- biology
- medical
pretty_name: Multilingual AbstRCT
---
<p align="center">
<br>
<img src="http://www.ixa.eus/sites/default/files/anitdote.png" style="width: 30%;">
<h2 align="center">Multilingual AbstRCT</h2>
<br>
We translated the [AbstRCT English Argument Mining Dataset](https://gitlab.com/tomaye/abstrct) into French, Italian and Spanish with the NLLB200 3B-parameter model to produce parallel versions of the corpus.
The English labels were then projected onto the translations using word alignment tools, and the projections were manually corrected.
For more information about the original English AbstRCT dataset, [read the original paper](https://hal.archives-ouvertes.fr/hal-03264761/file/2020Journal_AI_in_Medicine_ArgMiningClinicalTrials_forhal.pdf).
For the translation and projection data, see [https://github.com/ragerri/abstrct-projections/tree/final](https://github.com/ragerri/abstrct-projections/tree/final).
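To make the projection step concrete, here is a minimal sketch of how BIO labels can be carried over a word alignment. It is an illustration only, not the authors' pipeline: the function names, the alignment format (a list of `(source_index, target_index)` pairs) and the toy sentence pair are all assumptions made for this example, and the real projections were additionally corrected by hand.

```python
from typing import List, Tuple

Span = Tuple[str, int, int]  # (component type, start, end exclusive)


def bio_to_spans(labels: List[str]) -> List[Span]:
    """Collect (type, start, end) spans from a BIO label sequence."""
    spans: List[Span] = []
    start, kind = None, None
    for i, lab in enumerate(labels + ["O"]):  # sentinel closes a trailing span
        inside = lab.startswith("I-") and kind == lab[2:]
        if start is not None and not inside:
            spans.append((kind, start, i))
            start, kind = None, None
        if lab.startswith("B-") or (lab.startswith("I-") and start is None):
            start, kind = i, lab[2:]
    return spans


def project_labels(
    src_labels: List[str],
    alignment: List[Tuple[int, int]],  # hypothetical format: (src_idx, tgt_idx)
    tgt_len: int,
) -> List[str]:
    """Project each source span onto the contiguous range of target
    tokens covered by its aligned words."""
    tgt = ["O"] * tgt_len
    for kind, start, end in bio_to_spans(src_labels):
        hits = [t for s, t in alignment if start <= s < end]
        if not hits:
            continue  # no aligned target tokens: span is left unlabeled
        lo, hi = min(hits), max(hits)
        tgt[lo] = f"B-{kind}"
        for j in range(lo + 1, hi + 1):
            tgt[j] = f"I-{kind}"
    return tgt


# Toy example (invented sentence pair, not taken from the dataset):
# source: "Treatment was effective" -> target: "El tratamiento fue eficaz"
projected = project_labels(
    ["B-Claim", "I-Claim", "I-Claim"],
    [(0, 1), (1, 2), (2, 3)],
    tgt_len=4,
)
# projected == ["O", "B-Claim", "I-Claim", "I-Claim"]
```

Taking the minimum and maximum aligned index keeps each projected span contiguous, but it also shows why manual correction matters: in this toy case the unaligned article "El" falls outside the span.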
- 📖 Paper: [Medical mT5: An Open-Source Multilingual Text-to-Text LLM for The Medical Domain](https://arxiv.org/abs/2404.07613)
- 🌐 Project Website: [https://univ-cotedazur.eu/antidote](https://univ-cotedazur.eu/antidote)
- Code: [https://github.com/ragerri/abstrct-projections/tree/final](https://github.com/ragerri/abstrct-projections/tree/final)
- Original Dataset: [https://gitlab.com/tomaye/abstrct](https://gitlab.com/tomaye/abstrct)
- Funding: CHIST-ERA XAI 2019 call. Antidote (PCI2020-120717-2) funded by MCIN/AEI/10.13039/501100011033 and by European Union NextGenerationEU/PRTR
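The configurations and splits declared in the metadata can be loaded with the Hugging Face `datasets` library. The sketch below assumes that `datasets` is installed and that the repository is reachable under the ID `HiTZ/multilingual-abstrct`; the helper name `load_abstrct` is hypothetical and only wraps `load_dataset` with a check against the config and split names listed on this card.

```python
CONFIGS = ("en", "es", "fr", "it")
SPLITS = (
    "neoplasm_train",
    "neoplasm_dev",
    "neoplasm_test",
    "glaucoma_test",
    "mixed_test",
)


def load_abstrct(lang: str, split: str):
    """Load one language/split pair (hypothetical helper for this card)."""
    if lang not in CONFIGS:
        raise ValueError(f"unknown config {lang!r}; expected one of {CONFIGS}")
    if split not in SPLITS:
        raise ValueError(f"unknown split {split!r}; expected one of {SPLITS}")
    # Imported lazily so the name checks above work without the library.
    from datasets import load_dataset

    return load_dataset("HiTZ/multilingual-abstrct", lang, split=split)
```

For example, `load_abstrct("es", "neoplasm_train")` would return the Spanish training split, whose records carry the `id`, `tokens`, `labels_txt` and `labels` fields. Only the neoplasm domain has train and dev splits; the glaucoma and mixed data are test-only.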
## Labels
```python
{
"O": 0,
"B-Claim": 1,
"I-Claim": 2,
"B-Premise": 3,
"I-Premise": 4,
}
```
A `claim` is a concluding statement made by the author about the outcome of the study. In the medical domain it may be an assertion of a diagnosis or a treatment. A `premise` is an observation or measurement in the study (ground truth) that supports or attacks another argument component, usually a claim. Premises must be observed facts, and are therefore credible without further evidence.
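As an illustration of how the label IDs relate to argument components, the following sketch groups a record's `tokens` and `labels` into claim and premise text spans. The function name and the example sentence are invented for this illustration; only the label map comes from the card.

```python
from typing import List, Tuple

LABEL2ID = {"O": 0, "B-Claim": 1, "I-Claim": 2, "B-Premise": 3, "I-Premise": 4}
ID2LABEL = {i: name for name, i in LABEL2ID.items()}


def extract_components(tokens: List[str], label_ids: List[int]) -> List[Tuple[str, str]]:
    """Return (component type, text) pairs decoded from BIO label IDs."""
    components: List[Tuple[str, str]] = []
    current_type, current_tokens = None, []
    for token, label_id in zip(tokens, label_ids):
        label = ID2LABEL[label_id]
        continues = label.startswith("I-") and label[2:] == current_type
        if not continues:
            # Close the open component, if any, before starting a new one.
            if current_type is not None:
                components.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
            if label != "O":
                current_type = label[2:]
        if current_type is not None:
            current_tokens.append(token)
    if current_type is not None:  # component running to the end of the text
        components.append((current_type, " ".join(current_tokens)))
    return components


# Invented example, not taken from the dataset:
tokens = ["Survival", "improved", "significantly", ".",
          "Thus", "drug", "A", "is", "effective", "."]
labels = [3, 4, 4, 0, 0, 1, 2, 2, 2, 0]
components = extract_components(tokens, labels)
# components == [("Premise", "Survival improved significantly"),
#                ("Claim", "drug A is effective")]
```

An `I-` tag that does not continue a component of the same type is treated as the start of a new one, which is a lenient design choice for handling noisy predictions.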
## Citation
If you use the **original English AbstRCT** please cite the following paper:
```bibtex
@article{mayer2021enhancing,
title={Enhancing evidence-based medicine with natural language argumentative analysis of clinical trials},
author={Mayer, Tobias and Marro, Santiago and Cabrio, Elena and Villata, Serena},
journal={Artificial Intelligence in Medicine},
volume={118},
pages={102098},
year={2021},
publisher={Elsevier}
}
```
If you use the **French, Italian and Spanish** versions then add the following reference:
```bibtex
@misc{garcíaferrero2024medical,
title={Medical mT5: An Open-Source Multilingual Text-to-Text LLM for The Medical Domain},
author={Iker García-Ferrero and Rodrigo Agerri and Aitziber Atutxa Salazar and Elena Cabrio and Iker de la Iglesia and Alberto Lavelli and Bernardo Magnini and Benjamin Molinet and Johana Ramirez-Romero and German Rigau and Jose Maria Villa-Gonzalez and Serena Villata and Andrea Zaninello},
year={2024},
eprint={2404.07613},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
Provider:
HiTZ
Summary of original information
Dataset overview
Dataset configurations
- English (en)
  - Features: `id` (int64), `tokens` (sequence of string), `labels_txt` (sequence of string), `labels` (sequence of int64)
  - Splits:
    - neoplasm_train: 3140715 bytes, 4404 examples
    - neoplasm_dev: 476131 bytes, 679 examples
    - neoplasm_test: 893795 bytes, 1251 examples
    - glaucoma_test: 821598 bytes, 1247 examples
    - mixed_test: 847284 bytes, 1147 examples
  - Download size: 787800 bytes
  - Dataset size: 6179523 bytes
- Spanish (es)
  - Features: `id` (int64), `tokens` (sequence of string), `labels_txt` (sequence of string), `labels` (sequence of int64)
  - Splits:
    - neoplasm_train: 3409630 bytes, 4404 examples
    - neoplasm_dev: 508674 bytes, 679 examples
    - neoplasm_test: 959509 bytes, 1251 examples
    - glaucoma_test: 884585 bytes, 1247 examples
    - mixed_test: 906728 bytes, 1147 examples
  - Download size: 910927 bytes
  - Dataset size: 6669126 bytes
- French (fr)
  - Features: `id` (int64), `tokens` (sequence of string), `labels_txt` (sequence of string), `labels` (sequence of int64)
  - Splits:
    - neoplasm_train: 3555470 bytes, 4404 examples
    - neoplasm_dev: 537948 bytes, 679 examples
    - neoplasm_test: 1011572 bytes, 1251 examples
    - glaucoma_test: 912823 bytes, 1247 examples
    - mixed_test: 946807 bytes, 1147 examples
  - Download size: 929512 bytes
  - Dataset size: 6964620 bytes
- Italian (it)
  - Features: `id` (int64), `tokens` (sequence of string), `labels_txt` (sequence of string), `labels` (sequence of int64)
  - Splits:
    - neoplasm_train: 3279617 bytes, 4405 examples
    - neoplasm_dev: 495956 bytes, 679 examples
    - neoplasm_test: 934068 bytes, 1251 examples
    - glaucoma_test: 862835 bytes, 1247 examples
    - mixed_test: 877966 bytes, 1147 examples
  - Download size: 897597 bytes
  - Dataset size: 6450442 bytes
Data file paths
- English (en)
  - neoplasm_train: `en/neoplasm_train-*`
  - neoplasm_dev: `en/neoplasm_dev-*`
  - neoplasm_test: `en/neoplasm_test-*`
  - glaucoma_test: `en/glaucoma_test-*`
  - mixed_test: `en/mixed_test-*`
- Spanish (es)
  - neoplasm_train: `es/neoplasm_train-*`
  - neoplasm_dev: `es/neoplasm_dev-*`
  - neoplasm_test: `es/neoplasm_test-*`
  - glaucoma_test: `es/glaucoma_test-*`
  - mixed_test: `es/mixed_test-*`
- French (fr)
  - neoplasm_train: `fr/neoplasm_train-*`
  - neoplasm_dev: `fr/neoplasm_dev-*`
  - neoplasm_test: `fr/neoplasm_test-*`
  - glaucoma_test: `fr/glaucoma_test-*`
  - mixed_test: `fr/mixed_test-*`
- Italian (it)
  - neoplasm_train: `it/neoplasm_train-*`
  - neoplasm_dev: `it/neoplasm_dev-*`
  - neoplasm_test: `it/neoplasm_test-*`
  - glaucoma_test: `it/glaucoma_test-*`
  - mixed_test: `it/mixed_test-*`
License
- cc-by-nc-sa-4.0
Task categories
- token-classification
Languages
- English (en)
- Spanish (es)
- French (fr)
- Italian (it)
Labels
`{"O": 0, "B-Claim": 1, "I-Claim": 2, "B-Premise": 3, "I-Premise": 4}`
- claim: a concluding statement by the author about the outcome of the study; in the medical domain it may be an assertion of a diagnosis or a treatment.
- premise: an observation or measurement in the study (a fact) that supports or attacks another argument component, usually a claim. Premises are observed facts and are therefore credible without further evidence.



