BSC-LT/AbSanitas
Source: Hugging Face. Updated 2026-03-29; indexed 2026-04-05.
Download link: https://hf-mirror.com/datasets/BSC-LT/AbSanitas
Resource description:
---
pretty_name: AbSanitas
size_categories:
- 10K<n<100K
dataset_info:
- config_name: corpus
  features:
  - name: _id
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_examples: 12596
- config_name: queries
  features:
  - name: _id
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_examples: 25192
- config_name: qrels
  features:
  - name: query-id
    dtype: string
  - name: corpus-id
    dtype: string
  - name: score
    dtype: int32
  splits:
  - name: train
    num_examples: 25192
license: cc-by-nc-nd-4.0
task_categories:
- text-retrieval
- text-ranking
language:
- es
---
# Dataset Card for AbSanitas
## Dataset summary
AbSanitas is a Spanish biomedical information retrieval dataset built from biomedical texts collected from official academic repositories and open-access sources.
This dataset is designed to support the training and evaluation of encoder models on biomedical retrieval and semantic matching tasks in Spanish.
- **Curated by:** Barcelona Supercomputing Center (BSC)
- **Funded by:** [ALIA](https://alia.gob.es/)
- **Language(s) (NLP):** Spanish (`es`)
- **License:** CC BY-NC-ND 4.0
## Dataset Details
### Dataset Description
AbSanitas was constructed from biomedical abstracts and documents obtained from open academic repositories and scientific publication platforms. The dataset follows a **query–document relevance** structure, in which two distinct queries are associated with each document passage.
Queries were **synthetically generated** to reflect realistic biomedical information needs and were subsequently validated to ensure semantic alignment with the associated documents.
AbSanitas focuses on preserving domain-specific terminology and biomedical language, making it suitable for evaluating encoder models in applied health and biomedical NLP settings.
## Dataset Structure
AbSanitas follows a standard information retrieval (IR) format composed of three components: corpus, queries, and relevance judgments (qrels).
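As a minimal sketch of how the three components fit together, the Python snippet below resolves each qrels row to its query text and document text. The rows are small in-memory stand-ins with hypothetical IDs and texts, not real dataset entries; only the field names (`_id`, `text`, `query-id`, `corpus-id`, `score`) come from the card's metadata, and the helper name `join_qrels` is illustrative.

```python
from typing import Any, Dict, Iterable, List, Tuple

Row = Dict[str, Any]


def join_qrels(
    corpus: Iterable[Row], queries: Iterable[Row], qrels: Iterable[Row]
) -> List[Tuple[str, str, int]]:
    """Resolve each qrels row to (query text, document text, score)."""
    # Index both collections by their `_id` field for direct lookups.
    doc_text = {row["_id"]: row["text"] for row in corpus}
    query_text = {row["_id"]: row["text"] for row in queries}

    joined = []
    for rel in qrels:
        # A KeyError here means a qrels row points at a missing ID.
        joined.append(
            (query_text[rel["query-id"]], doc_text[rel["corpus-id"]], int(rel["score"]))
        )
    return joined


# Toy rows (hypothetical IDs and texts, for illustration only).
corpus = [{"_id": "doc-0001", "text": "Texto del documento."}]
queries = [
    {"_id": "doc-0001_q1", "text": "Primera pregunta"},
    {"_id": "doc-0001_q2", "text": "Segunda pregunta"},
]
qrels = [
    {"query-id": "doc-0001_q1", "corpus-id": "doc-0001", "score": 1},
    {"query-id": "doc-0001_q2", "corpus-id": "doc-0001", "score": 1},
]

triples = join_qrels(corpus, queries, qrels)
```

The same join should apply to the real `corpus`, `queries`, and `qrels` configurations once they are loaded as lists of rows.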
### Corpus
The corpus contains biomedical documents and abstracts written in Spanish. Each entry pairs a document's text with a unique ID for traceability.
**Example:**
```json
{
"_id": "s-UPC_April2025_d-01_r-fc48157b_c-0001",
"text": "Método: Se evaluaron 138 pacientes de distintas edades que cumplían los criterios establecidos en el estudio. Estos fueron divididos en..."
}
```
### Queries
Each document is associated with two distinct queries, which refer to different pieces of information in that document. Each entry consists of a unique ID, built from the document ID and the query number (`q1`/`q2`), and the text of the query.
**Example:**
```json
[
  {
    "_id": "s-UPC_April2025_d-01_r-fc48157b_c-0001_q1",
    "text": "¿Qué grupos de edad mostraron mejores movimientos sacádicos y de seguimiento en comparación con los niños y los grandes según las pruebas NSUCO, Groffman y DEM/ADEM?"
  },
  {
    "_id": "s-UPC_April2025_d-01_r-fc48157b_c-0001_q2",
    "text": "¿Qué aspecto de los movimientos oculares no se ve afectado por la edad según las conclusiones del estudio?"
  }
]
```
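Because a query ID is the document ID plus a `_q1` or `_q2` suffix, the source document can be recovered from the ID alone. The following is a minimal sketch in Python; the function name `split_query_id` is hypothetical, and rejecting any suffix other than `_q1`/`_q2` is an assumption based on the description above.

```python
import re
from typing import Tuple

# Assumption: query IDs always end in `_q1` or `_q2`, as in the examples.
_QUERY_ID = re.compile(r"^(?P<doc_id>.+)_q(?P<number>[12])$")


def split_query_id(query_id: str) -> Tuple[str, int]:
    """Return (document ID, query number) for an ID such as `<doc>_q1`."""
    match = _QUERY_ID.match(query_id)
    if match is None:
        raise ValueError(f"unexpected query ID format: {query_id!r}")
    return match.group("doc_id"), int(match.group("number"))


doc_id, number = split_query_id("s-UPC_April2025_d-01_r-fc48157b_c-0001_q1")
# doc_id is the corpus `_id` of the example document; number is 1.
```

This can serve as a quick consistency check against the qrels, where each query ID is expected to map to the document it was generated from.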
### Qrels (Relevance Judgments)
The qrels define the relevance relationships between queries and corpus documents. Each pair of queries (`q1` and `q2`) is associated with the one document judged relevant to both; in the example, these matching pairs carry a score of 1.
**Example:**
| query-id | corpus-id | score |
|-----------|------------|---------|
| s-UPC_April2025_d-01_r-fc48157b_c-0001_q1 | s-UPC_April2025_d-01_r-fc48157b_c-0001 | 1 |
| s-UPC_April2025_d-01_r-fc48157b_c-0001_q2 | s-UPC_April2025_d-01_r-fc48157b_c-0001 | 1 |
| s-UPC_April2025_d-01_r-f9ecdf1c_c-0001_q1 | s-UPC_April2025_d-01_r-fc48157b_c-0001 | 0 |
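Qrels in this shape can be used directly to score a retriever. The sketch below, in Python, computes mean reciprocal rank (MRR), a common IR metric chosen here for illustration; the card does not state which metrics the authors used. The IDs, the `rankings` dictionary (standing in for the output of some retrieval model), and the function names are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def relevant_docs(qrels: Iterable[Tuple[str, str, int]]) -> Dict[str, Set[str]]:
    """Map each query ID to the corpus IDs judged relevant (score > 0)."""
    relevant: Dict[str, Set[str]] = defaultdict(set)
    for query_id, corpus_id, score in qrels:
        if score > 0:
            relevant[query_id].add(corpus_id)
    return dict(relevant)


def mean_reciprocal_rank(
    rankings: Dict[str, List[str]], relevant: Dict[str, Set[str]]
) -> float:
    """Average 1/rank of the first relevant document per query (0 if none found)."""
    total = 0.0
    for query_id, gold in relevant.items():
        for rank, corpus_id in enumerate(rankings.get(query_id, []), start=1):
            if corpus_id in gold:
                total += 1.0 / rank
                break
    return total / len(relevant) if relevant else 0.0


# Toy qrels rows as (query-id, corpus-id, score); IDs are illustrative.
qrels = [("docA_q1", "docA", 1), ("docA_q2", "docA", 1), ("docB_q1", "docA", 0)]
# Hypothetical retriever output: ranked corpus IDs per query.
rankings = {
    "docA_q1": ["docA", "docB"],  # relevant document at rank 1
    "docA_q2": ["docB", "docA"],  # relevant document at rank 2
}

gold = relevant_docs(qrels)
mrr = mean_reciprocal_rank(rankings, gold)  # (1/1 + 1/2) / 2 = 0.75
```

Rows with score 0 are ignored when building the gold set, so they neither help nor penalise a ranking in this sketch.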
### Dataset Sources
The texts in AbSanitas were collected from open-access biomedical and scientific repositories and institutional research collections derived from the RECOLECTA BSC internal corpus. All documents were manually reviewed to verify licensing conditions.
Licenses were checked document by document, and the most restrictive license identified across the collection was applied to the dataset as a whole, resulting in the adoption of the **CC BY-NC-ND 4.0** license.
## Uses
### Direct Use
AbSanitas is intended for research and development in biomedical natural language processing, particularly for **information retrieval and semantic search** tasks in Spanish. Typical use cases include:
- Training and evaluating retrieval and ranking models in the biomedical domain
- Benchmarking Spanish biomedical information retrieval systems
- Studying domain adaptation and representation learning in biomedical NLP
- Developing retrieval-based biomedical question answering systems in research contexts
### Out-of-Scope Use
AbSanitas is **not intended** for:
- Clinical decision-making or medical advice
- Use in healthcare or diagnostic systems without expert validation
- Applications that attempt to recover or infer sensitive personal health information
- Any use that violates applicable ethical guidelines or data protection regulations
## Funding
This work is funded by the Ministerio para la Transformación Digital y de la Función Pública and by the EU through NextGenerationEU, within the framework of the project Desarrollo de Modelos [ALIA](https://alia.gob.es/).
## Acknowledgements
This dataset is released in conjunction with the work presented in Tamayo Mela et al., *MrBERT: Modern Multilingual Encoders via Vocabulary, Domain, and Dimensional Adaptation*. The dataset forms part of the evaluation framework used to assess domain-adapted encoder models, specifically supporting the evaluation of biomedical retrieval capabilities in Spanish.
## Citation
```bibtex
@article{tamayo2026mrbert,
title={MrBERT: Modern Multilingual Encoders via Vocabulary, Domain, and Dimensional Adaptation},
author={Tamayo, Daniel and Lacunza, I{\~n}aki and Rivera-Hidalgo, Paula and Da Dalt, Severino and Aula-Blasco, Javier and Gonzalez-Agirre, Aitor and Villegas, Marta},
journal={arXiv preprint arXiv:2602.21379},
year={2026}
}
```
## Contact point
Language Technologies Lab (langtech@bsc.es) at the Barcelona Supercomputing Center (BSC).
Provider: BSC-LT



