
HiTZ/multilingual-abstrct

Source: Hugging Face. Updated 2024-04-12; indexed 2024-06-11.
Download link:
https://hf-mirror.com/datasets/HiTZ/multilingual-abstrct
Resource description:
---
dataset_info:
- config_name: en
  features:
  - name: id
    dtype: int64
  - name: tokens
    sequence: string
  - name: labels_txt
    sequence: string
  - name: labels
    sequence: int64
  splits:
  - name: neoplasm_train
    num_bytes: 3140715
    num_examples: 4404
  - name: neoplasm_dev
    num_bytes: 476131
    num_examples: 679
  - name: neoplasm_test
    num_bytes: 893795
    num_examples: 1251
  - name: glaucoma_test
    num_bytes: 821598
    num_examples: 1247
  - name: mixed_test
    num_bytes: 847284
    num_examples: 1147
  download_size: 787800
  dataset_size: 6179523
- config_name: es
  features:
  - name: id
    dtype: int64
  - name: tokens
    sequence: string
  - name: labels_txt
    sequence: string
  - name: labels
    sequence: int64
  splits:
  - name: neoplasm_train
    num_bytes: 3409630
    num_examples: 4404
  - name: neoplasm_dev
    num_bytes: 508674
    num_examples: 679
  - name: neoplasm_test
    num_bytes: 959509
    num_examples: 1251
  - name: glaucoma_test
    num_bytes: 884585
    num_examples: 1247
  - name: mixed_test
    num_bytes: 906728
    num_examples: 1147
  download_size: 910927
  dataset_size: 6669126
- config_name: fr
  features:
  - name: id
    dtype: int64
  - name: tokens
    sequence: string
  - name: labels_txt
    sequence: string
  - name: labels
    sequence: int64
  splits:
  - name: neoplasm_train
    num_bytes: 3555470
    num_examples: 4404
  - name: neoplasm_dev
    num_bytes: 537948
    num_examples: 679
  - name: neoplasm_test
    num_bytes: 1011572
    num_examples: 1251
  - name: glaucoma_test
    num_bytes: 912823
    num_examples: 1247
  - name: mixed_test
    num_bytes: 946807
    num_examples: 1147
  download_size: 929512
  dataset_size: 6964620
- config_name: it
  features:
  - name: id
    dtype: int64
  - name: tokens
    sequence: string
  - name: labels_txt
    sequence: string
  - name: labels
    sequence: int64
  splits:
  - name: neoplasm_train
    num_bytes: 3279617
    num_examples: 4405
  - name: neoplasm_dev
    num_bytes: 495956
    num_examples: 679
  - name: neoplasm_test
    num_bytes: 934068
    num_examples: 1251
  - name: glaucoma_test
    num_bytes: 862835
    num_examples: 1247
  - name: mixed_test
    num_bytes: 877966
    num_examples: 1147
  download_size: 897597
  dataset_size: 6450442
configs:
- config_name: en
  data_files:
  - split: neoplasm_train
    path: en/neoplasm_train-*
  - split: neoplasm_dev
    path: en/neoplasm_dev-*
  - split: neoplasm_test
    path: en/neoplasm_test-*
  - split: glaucoma_test
    path: en/glaucoma_test-*
  - split: mixed_test
    path: en/mixed_test-*
- config_name: es
  data_files:
  - split: neoplasm_train
    path: es/neoplasm_train-*
  - split: neoplasm_dev
    path: es/neoplasm_dev-*
  - split: neoplasm_test
    path: es/neoplasm_test-*
  - split: glaucoma_test
    path: es/glaucoma_test-*
  - split: mixed_test
    path: es/mixed_test-*
- config_name: fr
  data_files:
  - split: neoplasm_train
    path: fr/neoplasm_train-*
  - split: neoplasm_dev
    path: fr/neoplasm_dev-*
  - split: neoplasm_test
    path: fr/neoplasm_test-*
  - split: glaucoma_test
    path: fr/glaucoma_test-*
  - split: mixed_test
    path: fr/mixed_test-*
- config_name: it
  data_files:
  - split: neoplasm_train
    path: it/neoplasm_train-*
  - split: neoplasm_dev
    path: it/neoplasm_dev-*
  - split: neoplasm_test
    path: it/neoplasm_test-*
  - split: glaucoma_test
    path: it/glaucoma_test-*
  - split: mixed_test
    path: it/mixed_test-*
license: cc-by-nc-sa-4.0
task_categories:
- token-classification
language:
- en
- es
- fr
- it
tags:
- biology
- medical
pretty_name: Multilingual AbstRCT
---

<p align="center">
    <br>
    <img src="http://www.ixa.eus/sites/default/files/anitdote.png" style="width: 30%;">
    <h2 align="center">Multilingual AbstRCT</h2>
    <br>

We translated the [AbstRCT English Argument Mining Dataset](https://gitlab.com/tomaye/abstrct) into French, Italian and Spanish with the NLLB200 3B parameter model to produce parallel versions, and projected the labels onto the translations using word alignment tools. The projections have been manually corrected.

For more information about the original English AbstRCT dataset, [read the original paper](https://hal.archives-ouvertes.fr/hal-03264761/file/2020Journal_AI_in_Medicine_ArgMiningClinicalTrials_forhal.pdf).

For the translation and projection data, see [https://github.com/ragerri/abstrct-projections/tree/final](https://github.com/ragerri/abstrct-projections/tree/final).

- 📖 Paper: [Medical mT5: An Open-Source Multilingual Text-to-Text LLM for The Medical Domain]()
- 🌐 Project Website: [https://univ-cotedazur.eu/antidote](https://univ-cotedazur.eu/antidote)
- Code: [https://github.com/ragerri/abstrct-projections/tree/final](https://github.com/ragerri/abstrct-projections/tree/final)
- Original Dataset: [https://gitlab.com/tomaye/abstrct](https://gitlab.com/tomaye/abstrct)
- Funding: CHIST-ERA XAI 2019 call. Antidote (PCI2020-120717-2) funded by MCIN/AEI /10.13039/501100011033 and by European Union NextGenerationEU/PRTR

## Labels

```python
{
    "O": 0,
    "B-Claim": 1,
    "I-Claim": 2,
    "B-Premise": 3,
    "I-Premise": 4,
}
```

A `claim` is a concluding statement made by the author about the outcome of the study. In the medical domain it may be an assertion of a diagnosis or a treatment.

A `premise` corresponds to an observation or measurement in the study (ground truth) that supports or attacks another argument component, usually a claim. Premises must be observed facts, and are therefore credible without further evidence.

## Citation

If you use the **original English AbstRCT**, please cite the following paper:

```bibtex
@article{mayer2021enhancing,
  title={Enhancing evidence-based medicine with natural language argumentative analysis of clinical trials},
  author={Mayer, Tobias and Marro, Santiago and Cabrio, Elena and Villata, Serena},
  journal={Artificial Intelligence in Medicine},
  volume={118},
  pages={102098},
  year={2021},
  publisher={Elsevier}
}
```

If you use the **French, Italian and Spanish** versions, then add the following reference:

```bibtex
@misc{garcíaferrero2024medical,
  title={Medical mT5: An Open-Source Multilingual Text-to-Text LLM for The Medical Domain},
  author={Iker García-Ferrero and Rodrigo Agerri and Aitziber Atutxa Salazar and Elena Cabrio and Iker de la Iglesia and Alberto Lavelli and Bernardo Magnini and Benjamin Molinet and Johana Ramirez-Romero and German Rigau and Jose Maria Villa-Gonzalez and Serena Villata and Andrea Zaninello},
  year={2024},
  eprint={2404.07613},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```
Provider:
HiTZ
Summary of original information

Dataset overview

Dataset configurations

  • English (en)

    • Features:
      • id: data type int64
      • tokens: sequence of string
      • labels_txt: sequence of string
      • labels: sequence of int64
    • Splits:
      • neoplasm_train: 3140715 bytes, 4404 examples
      • neoplasm_dev: 476131 bytes, 679 examples
      • neoplasm_test: 893795 bytes, 1251 examples
      • glaucoma_test: 821598 bytes, 1247 examples
      • mixed_test: 847284 bytes, 1147 examples
    • Download size: 787800 bytes
    • Dataset size: 6179523 bytes
  • Spanish (es)

    • Features:
      • id: data type int64
      • tokens: sequence of string
      • labels_txt: sequence of string
      • labels: sequence of int64
    • Splits:
      • neoplasm_train: 3409630 bytes, 4404 examples
      • neoplasm_dev: 508674 bytes, 679 examples
      • neoplasm_test: 959509 bytes, 1251 examples
      • glaucoma_test: 884585 bytes, 1247 examples
      • mixed_test: 906728 bytes, 1147 examples
    • Download size: 910927 bytes
    • Dataset size: 6669126 bytes
  • French (fr)

    • Features:
      • id: data type int64
      • tokens: sequence of string
      • labels_txt: sequence of string
      • labels: sequence of int64
    • Splits:
      • neoplasm_train: 3555470 bytes, 4404 examples
      • neoplasm_dev: 537948 bytes, 679 examples
      • neoplasm_test: 1011572 bytes, 1251 examples
      • glaucoma_test: 912823 bytes, 1247 examples
      • mixed_test: 946807 bytes, 1147 examples
    • Download size: 929512 bytes
    • Dataset size: 6964620 bytes
  • Italian (it)

    • Features:
      • id: data type int64
      • tokens: sequence of string
      • labels_txt: sequence of string
      • labels: sequence of int64
    • Splits:
      • neoplasm_train: 3279617 bytes, 4405 examples
      • neoplasm_dev: 495956 bytes, 679 examples
      • neoplasm_test: 934068 bytes, 1251 examples
      • glaucoma_test: 862835 bytes, 1247 examples
      • mixed_test: 877966 bytes, 1147 examples
    • Download size: 897597 bytes
    • Dataset size: 6450442 bytes
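
As a quick sanity check on the figures listed for each configuration, the per-split byte counts should add up to the reported dataset size. The sketch below does this for the `en` configuration only, using the numbers given in this listing; the names `EN_SPLITS`, `total_bytes` and `example_shares` are illustrative and not part of the dataset.

```python
# Split statistics for the "en" configuration, copied from this listing.
EN_SPLITS = {
    "neoplasm_train": {"num_bytes": 3140715, "num_examples": 4404},
    "neoplasm_dev": {"num_bytes": 476131, "num_examples": 679},
    "neoplasm_test": {"num_bytes": 893795, "num_examples": 1251},
    "glaucoma_test": {"num_bytes": 821598, "num_examples": 1247},
    "mixed_test": {"num_bytes": 847284, "num_examples": 1147},
}
EN_DATASET_SIZE = 6179523  # reported dataset size in bytes


def total_bytes(splits):
    """Sum the byte counts of all splits in one configuration."""
    return sum(split["num_bytes"] for split in splits.values())


def example_shares(splits):
    """Return each split's fraction of the configuration's examples."""
    total = sum(split["num_examples"] for split in splits.values())
    return {name: split["num_examples"] / total for name, split in splits.items()}


# The split sizes add up to the reported dataset size.
assert total_bytes(EN_SPLITS) == EN_DATASET_SIZE

for name, share in example_shares(EN_SPLITS).items():
    print(f"{name}: {share:.1%} of examples")
```

The same check can be repeated for `es`, `fr` and `it` by swapping in their figures. Note that only the training split is drawn from the neoplasm domain; `glaucoma_test` and `mixed_test` exist only as test splits.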

Data file paths

  • English (en)

    • neoplasm_train: en/neoplasm_train-*
    • neoplasm_dev: en/neoplasm_dev-*
    • neoplasm_test: en/neoplasm_test-*
    • glaucoma_test: en/glaucoma_test-*
    • mixed_test: en/mixed_test-*
  • Spanish (es)

    • neoplasm_train: es/neoplasm_train-*
    • neoplasm_dev: es/neoplasm_dev-*
    • neoplasm_test: es/neoplasm_test-*
    • glaucoma_test: es/glaucoma_test-*
    • mixed_test: es/mixed_test-*
  • French (fr)

    • neoplasm_train: fr/neoplasm_train-*
    • neoplasm_dev: fr/neoplasm_dev-*
    • neoplasm_test: fr/neoplasm_test-*
    • glaucoma_test: fr/glaucoma_test-*
    • mixed_test: fr/mixed_test-*
  • Italian (it)

    • neoplasm_train: it/neoplasm_train-*
    • neoplasm_dev: it/neoplasm_dev-*
    • neoplasm_test: it/neoplasm_test-*
    • glaucoma_test: it/glaucoma_test-*
    • mixed_test: it/mixed_test-*
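
Every path follows the same `<language>/<split>-*` glob pattern, so a file can be mapped back to its configuration and split by pattern matching. The sketch below illustrates this with Python's standard `fnmatch` module as a stand-in for whatever resolution logic the hosting platform actually uses. The helper names and the shard file name in the example are hypothetical.

```python
from fnmatch import fnmatchcase

LANGS = ["en", "es", "fr", "it"]
SPLITS = [
    "neoplasm_train",
    "neoplasm_dev",
    "neoplasm_test",
    "glaucoma_test",
    "mixed_test",
]


def pattern_for(lang, split):
    """Build the glob pattern listed for a configuration and split."""
    return f"{lang}/{split}-*"


def resolve(filename):
    """Return the (lang, split) whose pattern matches filename, or None."""
    for lang in LANGS:
        for split in SPLITS:
            if fnmatchcase(filename, pattern_for(lang, split)):
                return lang, split
    return None


# Hypothetical shard name, for illustration only.
print(resolve("fr/glaucoma_test-00000-of-00001.parquet"))
```

With the Hugging Face `datasets` library, the same configuration and split names would typically be passed as `load_dataset("HiTZ/multilingual-abstrct", "fr", split="glaucoma_test")`. That call needs network access and is not exercised here.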

License

  • cc-by-nc-sa-4.0

Task categories

  • token-classification

Languages

  • English (en)
  • Spanish (es)
  • French (fr)
  • Italian (it)

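Labels

{ "O": 0, "B-Claim": 1, "I-Claim": 2, "B-Premise": 3, "I-Premise": 4 }

  • claim: a concluding statement made by the author about the outcome of the study. In the medical domain it may be an assertion of a diagnosis or a treatment.
  • premise: an observation or measurement in the study (ground truth) that supports or attacks another argument component, usually a claim. Premises are observed facts and are therefore credible without further evidence.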
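
The label ids follow a BIO scheme: `B-` marks the first token of a claim or premise, `I-` a continuation, and `O` a token outside any argument component. The sketch below shows one way to turn a `tokens`/`labels` pair into component spans. The function `decode_spans` is illustrative, and the example sentence is invented for the demonstration, not taken from the dataset.

```python
LABEL2ID = {"O": 0, "B-Claim": 1, "I-Claim": 2, "B-Premise": 3, "I-Premise": 4}
ID2LABEL = {label_id: name for name, label_id in LABEL2ID.items()}


def decode_spans(tokens, label_ids):
    """Group BIO-tagged tokens into (component_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, label_id in zip(tokens, label_ids):
        prefix, _, kind = ID2LABEL[label_id].partition("-")
        if prefix == "I" and kind == current_type:
            # Continuation of the span that is currently open.
            current_tokens.append(token)
            continue
        if current_type is not None:
            # Any other label closes the open span.
            spans.append((current_type, " ".join(current_tokens)))
        if prefix == "O":
            current_type, current_tokens = None, []
        else:
            # "B-" (or a stray "I-" with no matching open span) starts a new span.
            current_type, current_tokens = kind, [token]
    if current_type is not None:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


# Invented example sentence, not taken from the dataset.
tokens = ["Survival", "was", "longer", "with", "drug", "A", ".",
          "Drug", "A", "is", "effective", "."]
labels = [3, 4, 4, 4, 4, 4, 0, 1, 2, 2, 2, 0]

print(decode_spans(tokens, labels))
# [('Premise', 'Survival was longer with drug A'), ('Claim', 'Drug A is effective')]
```

Treating a stray `I-` tag as the start of a new span is a lenient design choice. A stricter decoder could raise an error instead, which may be preferable when validating projected labels.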