BSC-LT/AbSanitas

Name: BSC-LT/AbSanitas
Creator: BSC-LT
Published: 2026-03-29 10:32:39
License: CC BY-NC-ND 4.0

Hugging Face | Updated 2026-03-29 | Indexed 2026-04-05

Download link:

https://hf-mirror.com/datasets/BSC-LT/AbSanitas

Resource description:

---
pretty_name: AbSanitas
size_categories:
- 10K<n<100K
dataset_info:
- config_name: corpus
  features:
  - name: _id
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_examples: 12596
- config_name: queries
  features:
  - name: _id
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_examples: 25192
- config_name: qrels
  features:
  - name: query-id
    dtype: string
  - name: corpus-id
    dtype: string
  - name: score
    dtype: int32
  splits:
  - name: train
    num_examples: 25192
license: cc-by-nc-nd-4.0
task_categories:
- text-retrieval
- text-ranking
language:
- es
---

# Dataset Card for AbSanitas

## Dataset summary

AbSanitas is a Spanish biomedical information retrieval dataset built from biomedical texts collected from official academic repositories and open-access sources. It is designed to support the training and evaluation of encoder models on biomedical retrieval and semantic matching tasks in Spanish.

- **Curated by:** Barcelona Supercomputing Center (BSC)
- **Funded by:** [ALIA](https://alia.gob.es/)
- **Language(s) (NLP):** Spanish (`es`)
- **License:** CC BY-NC-ND 4.0

## Dataset Details

### Dataset Description

AbSanitas was constructed from biomedical abstracts and documents obtained from open academic repositories and scientific publication platforms. The dataset follows a **query–document relevance** structure in which two distinct queries are associated with each document passage. Queries were **synthetically generated** to reflect realistic biomedical information needs and were subsequently validated to ensure semantic alignment with the associated documents.

AbSanitas focuses on preserving domain-specific terminology and biomedical language, making it suitable for evaluating encoder models in applied health and biomedical NLP settings.

## Dataset Structure

AbSanitas follows a standard information retrieval (IR) format composed of three components: corpus, queries, and relevance judgments (qrels).

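As a minimal sketch of how the three components fit together, the snippet below models them as in-memory Python lists and resolves each query to its relevant document through the qrels. The field names (`_id`, `text`, `query-id`, `corpus-id`, `score`) come from the metadata above; the record values are shortened placeholders, and the helper name `relevant_docs` is hypothetical rather than part of any dataset tooling.

```python
from collections import defaultdict

# Placeholder records shaped like the three configs (values are illustrative only).
corpus = [
    {"_id": "doc-0001", "text": "Texto del documento 1"},
    {"_id": "doc-0002", "text": "Texto del documento 2"},
]
queries = [
    {"_id": "doc-0001_q1", "text": "Primera pregunta sobre el documento 1"},
    {"_id": "doc-0001_q2", "text": "Segunda pregunta sobre el documento 1"},
    {"_id": "doc-0002_q1", "text": "Primera pregunta sobre el documento 2"},
    {"_id": "doc-0002_q2", "text": "Segunda pregunta sobre el documento 2"},
]
qrels = [
    {"query-id": "doc-0001_q1", "corpus-id": "doc-0001", "score": 1},
    {"query-id": "doc-0001_q2", "corpus-id": "doc-0001", "score": 1},
    {"query-id": "doc-0002_q1", "corpus-id": "doc-0002", "score": 1},
    {"query-id": "doc-0002_q2", "corpus-id": "doc-0002", "score": 1},
    {"query-id": "doc-0002_q1", "corpus-id": "doc-0001", "score": 0},
]


def relevant_docs(qrel_rows):
    """Map each query ID to the set of corpus IDs judged relevant (score > 0)."""
    mapping = defaultdict(set)
    for row in qrel_rows:
        if row["score"] > 0:
            mapping[row["query-id"]].add(row["corpus-id"])
    return dict(mapping)


doc_text = {doc["_id"]: doc["text"] for doc in corpus}
rel = relevant_docs(qrels)

# Print each query next to the text of its relevant document.
for query in queries:
    for doc_id in sorted(rel.get(query["_id"], ())):
        print(f'{query["_id"]} -> {doc_id}: {doc_text[doc_id]}')
```

Keeping the qrels as a separate table, rather than storing a document ID on each query, lets the same layout express non-relevant pairs (score `0`) alongside relevant ones.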
### Corpus

The corpus contains biomedical documents and abstracts written in Spanish. Each entry pairs a document with a unique ID for traceability.

**Example:**

```json
{
  "_id": "s-UPC_April2025_d-01_r-fc48157b_c-0001",
  "text": "Método: Se evaluaron 138 pacientes de distintas edades que cumplían los criterios establecidos en el estudio. Estos fueron divididos en..."
}
```

### Queries

Each document is associated with two distinct queries, which refer to different pieces of information in that document. Each entry consists of a unique ID, built from the document ID and the query number (`q1`/`q2`), and the text of the query.

**Example:**

```json
[
  {
    "_id": "s-UPC_April2025_d-01_r-fc48157b_c-0001_q1",
    "text": "¿Qué grupos de edad mostraron mejores movimientos sacádicos y de seguimiento en comparación con los niños y los grandes según las pruebas NSUCO, Groffman y DEM/ADEM?"
  },
  {
    "_id": "s-UPC_April2025_d-01_r-fc48157b_c-0001_q2",
    "text": "¿Qué aspecto de los movimientos oculares no se ve afectado por la edad según las conclusiones del estudio?"
  }
]
```

### Qrels (Relevance Judgments)

The qrels define the relevance relationships between queries and corpus documents. Each pair of queries is associated with one document judged as relevant.

**Example:**

| query-id | corpus-id | score |
|----------|-----------|-------|
| s-UPC_April2025_d-01_r-fc48157b_c-0001_q1 | s-UPC_April2025_d-01_r-fc48157b_c-0001 | 1 |
| s-UPC_April2025_d-01_r-fc48157b_c-0001_q2 | s-UPC_April2025_d-01_r-fc48157b_c-0001 | 1 |
| s-UPC_April2025_d-01_r-f9ecdf1c_c-0001_q1 | s-UPC_April2025_d-01_r-fc48157b_c-0001 | 0 |

### Dataset Sources

The texts in AbSanitas were collected from open-access biomedical and scientific repositories and institutional research collections derived from the RECOLECTA BSC internal corpus. All documents were manually reviewed to verify licensing conditions.
Licenses were checked document by document, and the most restrictive license identified across the collection was applied to the dataset as a whole, resulting in the adoption of the **CC BY-NC-ND 4.0** license.

## Uses

### Direct Use

AbSanitas is intended for research and development in biomedical natural language processing, particularly for **information retrieval and semantic search** tasks in Spanish. Typical use cases include:

- Training and evaluating retrieval and ranking models in the biomedical domain
- Benchmarking Spanish biomedical information retrieval systems
- Studying domain adaptation and representation learning in biomedical NLP
- Developing retrieval-based biomedical question answering systems in research contexts

### Out-of-Scope Use

AbSanitas is **not intended** for:

- Clinical decision-making or medical advice
- Use in healthcare or diagnostic systems without expert validation
- Applications that attempt to recover or infer sensitive personal health information
- Any use that violates applicable ethical guidelines or data protection regulations

## Funding

This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the project Desarrollo de Modelos [ALIA](https://alia.gob.es/).

## Acknowledgements

This dataset is released in conjunction with the work presented in Tamayo Mela et al., *MrBERT: Modern Multilingual Encoders via Vocabulary, Domain, and Dimensional Adaptation*. The dataset forms part of the evaluation framework used to assess domain-adapted encoder models, specifically supporting the evaluation of biomedical retrieval capabilities in Spanish.

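One way the retrieval evaluation mentioned above could be scored against the qrels is with recall@k and mean reciprocal rank (MRR). The sketch below assumes a typical IR setup and is not the evaluation protocol of the cited paper: `rankings` stands in for the ranked output of any retrieval model, the function names are hypothetical, and all IDs are placeholders.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


# Relevant documents per query, as derived from qrels rows with score > 0.
relevant = {
    "doc-0001_q1": {"doc-0001"},
    "doc-0002_q1": {"doc-0002"},
}

# Hypothetical model output: corpus IDs ranked from best to worst per query.
rankings = {
    "doc-0001_q1": ["doc-0001", "doc-0002"],
    "doc-0002_q1": ["doc-0001", "doc-0002"],
}

n_queries = len(relevant)
mean_recall_at_1 = sum(recall_at_k(rankings[q], relevant[q], 1) for q in relevant) / n_queries
mrr = sum(reciprocal_rank(rankings[q], relevant[q]) for q in relevant) / n_queries

print(f"recall@1={mean_recall_at_1:.2f} MRR={mrr:.2f}")  # recall@1=0.50 MRR=0.75
```

Because each query in this layout has a single relevant document, recall@k reduces to a hit rate: it is 1 when that document is in the top k and 0 otherwise.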
## Citation

```bibtex
@article{tamayo2026mrbert,
  title={MrBERT: Modern Multilingual Encoders via Vocabulary, Domain, and Dimensional Adaptation},
  author={Tamayo, Daniel and Lacunza, I{\~n}aki and Rivera-Hidalgo, Paula and Da Dalt, Severino and Aula-Blasco, Javier and Gonzalez-Agirre, Aitor and Villegas, Marta},
  journal={arXiv preprint arXiv:2602.21379},
  year={2026}
}
```

## Contact point

Language Technologies Lab (langtech@bsc.es) at the Barcelona Supercomputing Center (BSC).
