BSC-LT/AbScientia
Source: Hugging Face. Updated 2026-03-02; indexed 2026-04-05.
Download link:
https://hf-mirror.com/datasets/BSC-LT/AbScientia
Resource description:
---
task_categories:
- text-classification
language:
- es
pretty_name: AbScientia
size_categories:
- 10K<n<100K
dataset_info:
- config_name: LexBOE
  features:
  - name: id
    dtype: string
  - name: sentence
    dtype: string
  - name: label
    dtype: string
  splits:
  - name: train
    num_examples: 46787
  - name: dev
    num_examples: 5848
  - name: test
    num_examples: 5849
license: cc-by-nc-nd-4.0
---
# Dataset Card for AbScientia
## Dataset summary
AbScientia is a Spanish **STEM scientific text classification dataset** built from scientific abstracts collected from official academic repositories and open-access sources. The dataset focuses on **Science, Technology, Engineering, and Mathematics (STEM)** disciplines and reflects domain-specific scientific language in Spanish.
This dataset is designed to support the training and evaluation of encoder models on STEM scientific domain classification tasks in Spanish.
- **Curated by:** Barcelona Supercomputing Center (BSC)
- **Funded by:** [ALIA](https://alia.gob.es/)
- **Language(s) (NLP):** Spanish (`es`)
- **License:** CC BY-NC-ND 4.0
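
The card metadata lists three splits (train, dev, test) with their example counts. As a quick sanity check, the short sketch below computes the total size and each split's share from those counts; the result is roughly an 80/10/10 partition. The helper name `split_fractions` is hypothetical and used only for illustration.

```python
# Split sizes copied from the card metadata (dataset_info.splits).
SPLIT_SIZES = {"train": 46787, "dev": 5848, "test": 5849}


def split_fractions(sizes):
    """Return each split's share of the total number of examples."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


total = sum(SPLIT_SIZES.values())
fractions = split_fractions(SPLIT_SIZES)

print(f"total examples: {total}")
for name, share in fractions.items():
    print(f"{name}: {share:.1%}")
```

The total falls inside the `10K<n<100K` size category declared in the metadata.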
## Dataset Details
### Dataset Description
AbScientia was constructed from scientific abstracts obtained from open academic repositories and scientific publication platforms.
The original repository metadata was analyzed and consolidated into a set of **24 STEM scientific categories**. These categories are as follows:
| Label | Description |
|------|-------------|
| Ingeniería | Engineering disciplines research |
| Química | Chemical sciences and molecular research |
| Telecomunicaciones | Telecommunications and communication technologies |
| Matemáticas | Mathematical theory and applied mathematics |
| Bioquímica | Biochemical processes and molecular biology |
| Enfermería | Nursing science and clinical care practices |
| Geología | Earth sciences and geological processes |
| Medicina | Medical research and clinical sciences |
| Informática | Computer science |
| Arquitectura | Architecture and built environment engineering |
| Fisiología | Physiological processes and biological functions |
| Biología | Biological sciences research |
| Estadística | Statistical theory and data analysis methods |
| Fisioterapia | Physical therapy and rehabilitation sciences |
| Aeronáutica | Aeronautical engineering and aerospace technologies |
| Nutrición | Nutrition science and dietary research |
| Tecnología | Applied technological research |
| Geografía | Physical geography and environmental analysis |
| Genética | Genetics and hereditary biological processes |
| Ciencias Del Deporte | Sports science |
| Farmacología | Pharmacology and drug research |
| Física | Physical sciences and theoretical physics |
| Anatomía | Anatomical structure and morphological studies |
| Psicología | Psychology and behavioral research |
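
Classification models usually work with integer class ids rather than label strings. The sketch below builds a `label2id` / `id2label` mapping from the 24 labels in the table above. The id order simply follows the table and is an assumption; the card does not define an official ordering, and the label strings should be checked against the actual `label` column before training.

```python
# The 24 domain labels, in the order they appear in the table above.
# Assumption: this ordering is arbitrary; the card does not define label ids.
LABELS = [
    "Ingeniería", "Química", "Telecomunicaciones", "Matemáticas",
    "Bioquímica", "Enfermería", "Geología", "Medicina",
    "Informática", "Arquitectura", "Fisiología", "Biología",
    "Estadística", "Fisioterapia", "Aeronáutica", "Nutrición",
    "Tecnología", "Geografía", "Genética", "Ciencias Del Deporte",
    "Farmacología", "Física", "Anatomía", "Psicología",
]

# Map each label string to an integer id and back.
label2id = {label: idx for idx, label in enumerate(LABELS)}
id2label = {idx: label for label, idx in label2id.items()}

print(len(LABELS), "labels")
print("Biología ->", label2id["Biología"])
```

Mappings of this shape are what most encoder fine-tuning setups expect when configuring a classification head.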
## Dataset Structure
Each instance in AbScientia consists of a scientific abstract paired with one of the 24 scientific domain labels listed in the table above.
### Instance example
Each entry contains:
- `id`: Unique document identifier
- `text`: Scientific abstract
- `label`: Scientific domain category
**Example:**
```json
{
  "id": "s-e_Buah_April2025_d-01_r-9b356672_c-0001",
  "text": "La salud de un suelo se puede definir como la capacidad del mismo para funcionar como un sistema vivo dentro de un ecosistema, sustentar la productividad biológica...",
  "label": "Biología"
}
```
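
A minimal sketch of reading such records follows. The card metadata names the abstract field `sentence` while the instance example uses `text`, so the hypothetical helper `get_abstract` accepts either. `load_abscientia` is also a hypothetical wrapper: it assumes the `datasets` library is installed, that the repository id is `BSC-LT/AbScientia` (taken from the download link), and that no configuration name has to be passed, which the card does not state.

```python
def get_abstract(record):
    """Return the abstract from a record, whichever field name it uses."""
    for field in ("text", "sentence"):
        if field in record:
            return record[field]
    raise KeyError("record has neither a 'text' nor a 'sentence' field")


def load_abscientia(split="train"):
    """Download one split from the Hugging Face Hub (needs network access).

    Assumption: the default configuration can be loaded without naming it.
    """
    from datasets import load_dataset  # pip install datasets

    return load_dataset("BSC-LT/AbScientia", split=split)


# Offline demonstration on a shortened copy of the card's example record.
example = {
    "id": "s-e_Buah_April2025_d-01_r-9b356672_c-0001",
    "text": "La salud de un suelo se puede definir como ...",
    "label": "Biología",
}
print(example["label"], "|", get_abstract(example))
```

With network access, `load_abscientia("dev")` or `load_abscientia("test")` would fetch the other splits named in the metadata.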
## Dataset Sources
The texts in AbScientia were collected from open-access scientific repositories and institutional research collections derived from the RECOLECTA BSC internal corpus. Licensing conditions were manually verified document by document, and the most restrictive license found across the collection, CC BY-NC-ND 4.0, was applied to the dataset as a whole.
## Uses
### Direct Use
AbScientia is intended for research and development in scientific natural language processing, particularly for **STEM text classification** tasks in Spanish. Typical use cases include:
- Training and evaluating encoder-based models on STEM scientific text classification
- Benchmarking Spanish scientific language understanding
- Studying domain adaptation and representation learning in scientific NLP
- Developing downstream scientific NLP applications in research contexts
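
For the evaluation and benchmarking use cases above, a classifier's predictions have to be scored against the gold labels. The card does not prescribe a metric; the sketch below assumes accuracy and macro-averaged F1, a common pairing for multi-class classification. The function names and the toy label lists are illustrative only and are not results from the dataset.

```python
def accuracy(gold, predicted):
    """Fraction of examples whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def macro_f1(gold, predicted):
    """Unweighted mean of per-label F1 over labels seen in gold or predicted."""
    labels = set(gold) | set(predicted)
    scores = []
    for label in labels:
        pairs = list(zip(gold, predicted))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data for illustration only.
gold = ["Biología", "Física", "Física", "Medicina"]
pred = ["Biología", "Física", "Medicina", "Medicina"]

print(f"accuracy: {accuracy(gold, pred):.3f}")
print(f"macro-F1: {macro_f1(gold, pred):.3f}")
```

Macro averaging weights every label equally, which matters if some of the 24 categories turn out to have far fewer examples than others.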
### Out-of-Scope Use
AbScientia is **not intended** for:
- Scientific or medical decision-making without expert validation
- Use in production systems without additional validation and domain-specific safeguards
- Applications that attempt to recover or infer sensitive personal information from the texts
- Any use that violates applicable ethical guidelines or data protection regulations
## Funding
This work is funded by the Ministerio para la Transformación Digital y de la Función Pública and by the EU (NextGenerationEU), within the framework of the project Desarrollo de Modelos [ALIA](https://alia.gob.es/).
## Contact point
Language Technologies Lab (langtech@bsc.es) at the Barcelona Supercomputing Center (BSC).
Provider:
BSC-LT



