HiTZ/multilingual-abstrct

Name: HiTZ/multilingual-abstrct
Creator: HiTZ
Published: 2024-04-12 14:49:20
License: cc-by-nc-sa-4.0

Hugging Face · Updated 2024-04-12 · Indexed 2024-06-11

Download link:

https://hf-mirror.com/datasets/HiTZ/multilingual-abstrct

Resource description:

---
dataset_info:
- config_name: en
  features:
  - name: id
    dtype: int64
  - name: tokens
    sequence: string
  - name: labels_txt
    sequence: string
  - name: labels
    sequence: int64
  splits:
  - name: neoplasm_train
    num_bytes: 3140715
    num_examples: 4404
  - name: neoplasm_dev
    num_bytes: 476131
    num_examples: 679
  - name: neoplasm_test
    num_bytes: 893795
    num_examples: 1251
  - name: glaucoma_test
    num_bytes: 821598
    num_examples: 1247
  - name: mixed_test
    num_bytes: 847284
    num_examples: 1147
  download_size: 787800
  dataset_size: 6179523
- config_name: es
  features:
  - name: id
    dtype: int64
  - name: tokens
    sequence: string
  - name: labels_txt
    sequence: string
  - name: labels
    sequence: int64
  splits:
  - name: neoplasm_train
    num_bytes: 3409630
    num_examples: 4404
  - name: neoplasm_dev
    num_bytes: 508674
    num_examples: 679
  - name: neoplasm_test
    num_bytes: 959509
    num_examples: 1251
  - name: glaucoma_test
    num_bytes: 884585
    num_examples: 1247
  - name: mixed_test
    num_bytes: 906728
    num_examples: 1147
  download_size: 910927
  dataset_size: 6669126
- config_name: fr
  features:
  - name: id
    dtype: int64
  - name: tokens
    sequence: string
  - name: labels_txt
    sequence: string
  - name: labels
    sequence: int64
  splits:
  - name: neoplasm_train
    num_bytes: 3555470
    num_examples: 4404
  - name: neoplasm_dev
    num_bytes: 537948
    num_examples: 679
  - name: neoplasm_test
    num_bytes: 1011572
    num_examples: 1251
  - name: glaucoma_test
    num_bytes: 912823
    num_examples: 1247
  - name: mixed_test
    num_bytes: 946807
    num_examples: 1147
  download_size: 929512
  dataset_size: 6964620
- config_name: it
  features:
  - name: id
    dtype: int64
  - name: tokens
    sequence: string
  - name: labels_txt
    sequence: string
  - name: labels
    sequence: int64
  splits:
  - name: neoplasm_train
    num_bytes: 3279617
    num_examples: 4405
  - name: neoplasm_dev
    num_bytes: 495956
    num_examples: 679
  - name: neoplasm_test
    num_bytes: 934068
    num_examples: 1251
  - name: glaucoma_test
    num_bytes: 862835
    num_examples: 1247
  - name: mixed_test
    num_bytes: 877966
    num_examples: 1147
  download_size: 897597
  dataset_size: 6450442
configs:
- config_name: en
  data_files:
  - split: neoplasm_train
    path: en/neoplasm_train-*
  - split: neoplasm_dev
    path: en/neoplasm_dev-*
  - split: neoplasm_test
    path: en/neoplasm_test-*
  - split: glaucoma_test
    path: en/glaucoma_test-*
  - split: mixed_test
    path: en/mixed_test-*
- config_name: es
  data_files:
  - split: neoplasm_train
    path: es/neoplasm_train-*
  - split: neoplasm_dev
    path: es/neoplasm_dev-*
  - split: neoplasm_test
    path: es/neoplasm_test-*
  - split: glaucoma_test
    path: es/glaucoma_test-*
  - split: mixed_test
    path: es/mixed_test-*
- config_name: fr
  data_files:
  - split: neoplasm_train
    path: fr/neoplasm_train-*
  - split: neoplasm_dev
    path: fr/neoplasm_dev-*
  - split: neoplasm_test
    path: fr/neoplasm_test-*
  - split: glaucoma_test
    path: fr/glaucoma_test-*
  - split: mixed_test
    path: fr/mixed_test-*
- config_name: it
  data_files:
  - split: neoplasm_train
    path: it/neoplasm_train-*
  - split: neoplasm_dev
    path: it/neoplasm_dev-*
  - split: neoplasm_test
    path: it/neoplasm_test-*
  - split: glaucoma_test
    path: it/glaucoma_test-*
  - split: mixed_test
    path: it/mixed_test-*
license: cc-by-nc-sa-4.0
task_categories:
- token-classification
language:
- en
- es
- fr
- it
tags:
- biology
- medical
pretty_name: Multilingual AbstRCT
---

<p align="center">
    <br>
    <img src="http://www.ixa.eus/sites/default/files/anitdote.png" style="width: 30%;">
<h2 align="center">Multilingual AbstRCT</h2>
<br>

We translate the [AbstRCT English Argument Mining Dataset](https://gitlab.com/tomaye/abstrct) into parallel French, Italian and Spanish versions using the NLLB200 3B parameter model, and project the annotations onto the translations using word alignment tools. The projections have been manually corrected. For more information about the original English AbstRCT dataset, [read the original paper](https://hal.archives-ouvertes.fr/hal-03264761/file/2020Journal_AI_in_Medicine_ArgMiningClinicalTrials_forhal.pdf).
For the translation and projection data see [https://github.com/ragerri/abstrct-projections/tree/final](https://github.com/ragerri/abstrct-projections/tree/final).

- 📖 Paper: [Medical mT5: An Open-Source Multilingual Text-to-Text LLM for The Medical Domain]()
- 🌐 Project Website: [https://univ-cotedazur.eu/antidote](https://univ-cotedazur.eu/antidote)
- Code: [https://github.com/ragerri/abstrct-projections/tree/final](https://github.com/ragerri/abstrct-projections/tree/final)
- Original Dataset: [https://gitlab.com/tomaye/abstrct](https://gitlab.com/tomaye/abstrct)
- Funding: CHIST-ERA XAI 2019 call. Antidote (PCI2020-120717-2) funded by MCIN/AEI /10.13039/501100011033 and by European Union NextGenerationEU/PRTR

## Labels

```python
{
    "O": 0,
    "B-Claim": 1,
    "I-Claim": 2,
    "B-Premise": 3,
    "I-Premise": 4,
}
```

A `claim` is a concluding statement made by the author about the outcome of the study. In the medical domain it may be an assertion of a diagnosis or a treatment.

A `premise` corresponds to an observation or measurement in the study (ground truth) that supports or attacks another argument component, usually a claim. Premises must be observed facts and are therefore credible without further evidence.
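The label scheme above is a BIO tagging of argument components. As a minimal sketch of how the integer `labels` of one example could be turned back into claim and premise spans, the snippet below groups consecutive tokens by tag. The function name `decode_spans` and the sample sentence are illustrative assumptions, not part of the dataset or its code.

```python
from typing import List, Tuple

# Label mapping copied from the dataset card.
LABEL2ID = {"O": 0, "B-Claim": 1, "I-Claim": 2, "B-Premise": 3, "I-Premise": 4}
ID2LABEL = {i: name for name, i in LABEL2ID.items()}


def decode_spans(tokens: List[str], labels: List[int]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (component_type, text) spans.

    Hypothetical helper for illustration; a stray I- tag with no open span
    of the same type is treated as the start of a new span.
    """
    spans: List[Tuple[str, str]] = []
    current_type, current_tokens = None, []
    for token, label_id in zip(tokens, labels):
        prefix, _, kind = ID2LABEL[label_id].partition("-")
        # An I- tag of the same type extends the open span.
        if prefix == "I" and kind == current_type:
            current_tokens.append(token)
            continue
        # Any other tag closes the open span, if there is one.
        if current_type is not None:
            spans.append((current_type, " ".join(current_tokens)))
        if prefix == "O":
            current_type, current_tokens = None, []
        else:
            current_type, current_tokens = kind, [token]
    if current_type is not None:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


# Made-up example sentence (not taken from the dataset).
tokens = ["Treatment", "A", "improved", "survival", ".", "We", "conclude", "it", "works"]
labels = [3, 4, 4, 4, 0, 1, 2, 2, 2]
spans = decode_spans(tokens, labels)
```

Joining tokens with a single space is a simplification; the original spacing of the abstract is not recoverable from the token list alone.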
## Citation

If you use the **original English AbstRCT**, please cite the following paper:

```bibtex
@article{mayer2021enhancing,
  title={Enhancing evidence-based medicine with natural language argumentative analysis of clinical trials},
  author={Mayer, Tobias and Marro, Santiago and Cabrio, Elena and Villata, Serena},
  journal={Artificial Intelligence in Medicine},
  volume={118},
  pages={102098},
  year={2021},
  publisher={Elsevier}
}
```

If you use the **French, Italian and Spanish** versions, add the following reference:

```bibtex
@misc{garcíaferrero2024medical,
  title={Medical mT5: An Open-Source Multilingual Text-to-Text LLM for The Medical Domain},
  author={Iker García-Ferrero and Rodrigo Agerri and Aitziber Atutxa Salazar and Elena Cabrio and Iker de la Iglesia and Alberto Lavelli and Bernardo Magnini and Benjamin Molinet and Johana Ramirez-Romero and German Rigau and Jose Maria Villa-Gonzalez and Serena Villata and Andrea Zaninello},
  year={2024},
  eprint={2404.07613},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```

Provider:

HiTZ

Summary of original information

Dataset overview

Dataset configurations

English (en)
- Features:
  - id: data type int64
  - tokens: sequence of string
  - labels_txt: sequence of string
  - labels: sequence of int64
- Splits:
  - neoplasm_train: 3140715 bytes, 4404 examples
  - neoplasm_dev: 476131 bytes, 679 examples
  - neoplasm_test: 893795 bytes, 1251 examples
  - glaucoma_test: 821598 bytes, 1247 examples
  - mixed_test: 847284 bytes, 1147 examples
- Download size: 787800 bytes
- Dataset size: 6179523 bytes
Spanish (es)
- Features:
  - id: data type int64
  - tokens: sequence of string
  - labels_txt: sequence of string
  - labels: sequence of int64
- Splits:
  - neoplasm_train: 3409630 bytes, 4404 examples
  - neoplasm_dev: 508674 bytes, 679 examples
  - neoplasm_test: 959509 bytes, 1251 examples
  - glaucoma_test: 884585 bytes, 1247 examples
  - mixed_test: 906728 bytes, 1147 examples
- Download size: 910927 bytes
- Dataset size: 6669126 bytes
French (fr)
- Features:
  - id: data type int64
  - tokens: sequence of string
  - labels_txt: sequence of string
  - labels: sequence of int64
- Splits:
  - neoplasm_train: 3555470 bytes, 4404 examples
  - neoplasm_dev: 537948 bytes, 679 examples
  - neoplasm_test: 1011572 bytes, 1251 examples
  - glaucoma_test: 912823 bytes, 1247 examples
  - mixed_test: 946807 bytes, 1147 examples
- Download size: 929512 bytes
- Dataset size: 6964620 bytes
Italian (it)
- Features:
  - id: data type int64
  - tokens: sequence of string
  - labels_txt: sequence of string
  - labels: sequence of int64
- Splits:
  - neoplasm_train: 3279617 bytes, 4405 examples
  - neoplasm_dev: 495956 bytes, 679 examples
  - neoplasm_test: 934068 bytes, 1251 examples
  - glaucoma_test: 862835 bytes, 1247 examples
  - mixed_test: 877966 bytes, 1147 examples
- Download size: 897597 bytes
- Dataset size: 6450442 bytes
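The per-split figures listed above can be cross-checked against the reported dataset sizes. The following is a small sketch that sums the split byte counts and example counts for each configuration, using only numbers copied from the card metadata; the names `SPLIT_STATS`, `DATASET_SIZE` and `totals` are illustrative assumptions.

```python
# (num_bytes, num_examples) per split, copied from the card metadata.
SPLIT_STATS = {
    "en": {
        "neoplasm_train": (3140715, 4404), "neoplasm_dev": (476131, 679),
        "neoplasm_test": (893795, 1251), "glaucoma_test": (821598, 1247),
        "mixed_test": (847284, 1147),
    },
    "es": {
        "neoplasm_train": (3409630, 4404), "neoplasm_dev": (508674, 679),
        "neoplasm_test": (959509, 1251), "glaucoma_test": (884585, 1247),
        "mixed_test": (906728, 1147),
    },
    "fr": {
        "neoplasm_train": (3555470, 4404), "neoplasm_dev": (537948, 679),
        "neoplasm_test": (1011572, 1251), "glaucoma_test": (912823, 1247),
        "mixed_test": (946807, 1147),
    },
    "it": {
        "neoplasm_train": (3279617, 4405), "neoplasm_dev": (495956, 679),
        "neoplasm_test": (934068, 1251), "glaucoma_test": (862835, 1247),
        "mixed_test": (877966, 1147),
    },
}

# dataset_size values reported in the card metadata.
DATASET_SIZE = {"en": 6179523, "es": 6669126, "fr": 6964620, "it": 6450442}


def totals(config: str) -> tuple:
    """Return (total_bytes, total_examples) summed over all splits of a config."""
    stats = SPLIT_STATS[config].values()
    return sum(b for b, _ in stats), sum(n for _, n in stats)


for config in SPLIT_STATS:
    total_bytes, total_examples = totals(config)
    # The summed split sizes should equal the reported dataset_size.
    assert total_bytes == DATASET_SIZE[config]
    print(config, total_bytes, total_examples)
```

The byte totals agree with the reported dataset sizes for all four configurations. The example counts are identical across languages except for the Italian `neoplasm_train` split, which the metadata lists with one more example (4405) than the others (4404).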

Data file paths

English (en)
- neoplasm_train: en/neoplasm_train-*
- neoplasm_dev: en/neoplasm_dev-*
- neoplasm_test: en/neoplasm_test-*
- glaucoma_test: en/glaucoma_test-*
- mixed_test: en/mixed_test-*
Spanish (es)
- neoplasm_train: es/neoplasm_train-*
- neoplasm_dev: es/neoplasm_dev-*
- neoplasm_test: es/neoplasm_test-*
- glaucoma_test: es/glaucoma_test-*
- mixed_test: es/mixed_test-*
French (fr)
- neoplasm_train: fr/neoplasm_train-*
- neoplasm_dev: fr/neoplasm_dev-*
- neoplasm_test: fr/neoplasm_test-*
- glaucoma_test: fr/glaucoma_test-*
- mixed_test: fr/mixed_test-*
Italian (it)
- neoplasm_train: it/neoplasm_train-*
- neoplasm_dev: it/neoplasm_dev-*
- neoplasm_test: it/neoplasm_test-*
- glaucoma_test: it/glaucoma_test-*
- mixed_test: it/mixed_test-*
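Each language is exposed as a configuration and each path pattern above as a named split. A minimal sketch of loading one split with the Hugging Face `datasets` library follows, assuming the library is installed and the Hub is reachable. The wrapper `load_split` and its validation are illustrative assumptions; only `datasets.load_dataset` is a library call.

```python
# Configuration and split names as listed in the card metadata.
REPO_ID = "HiTZ/multilingual-abstrct"
CONFIGS = ("en", "es", "fr", "it")
SPLITS = (
    "neoplasm_train",
    "neoplasm_dev",
    "neoplasm_test",
    "glaucoma_test",
    "mixed_test",
)


def load_split(config: str, split: str):
    """Load one split of one language configuration (hypothetical helper)."""
    if config not in CONFIGS:
        raise ValueError(f"unknown config {config!r}; expected one of {CONFIGS}")
    if split not in SPLITS:
        raise ValueError(f"unknown split {split!r}; expected one of {SPLITS}")
    # Imported lazily so the name checks work without the library installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config, split=split)


# Example usage (requires network access, so it is left commented out):
# train_es = load_split("es", "neoplasm_train")
# print(train_es[0]["tokens"], train_es[0]["labels"])
```

Validating the names up front gives a clearer error than a failed download when a configuration or split is misspelled.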

License

cc-by-nc-sa-4.0

Task categories

token-classification

Languages

English (en)
Spanish (es)
French (fr)
Italian (it)
