
BSC-LT/AbScientia

Source: Hugging Face. Updated 2026-03-02. Indexed 2026-04-05.
Download link:
https://hf-mirror.com/datasets/BSC-LT/AbScientia
Resource description:
---
task_categories:
- text-classification
language:
- es
pretty_name: AbScientia
size_categories:
- 10K<n<100K
dataset_info:
- config_name: LexBOE
  features:
  - name: id
    dtype: string
  - name: sentence
    dtype: string
  - name: label
    dtype: string
  splits:
  - name: train
    num_examples: 46787
  - name: dev
    num_examples: 5848
  - name: test
    num_examples: 5849
license: cc-by-nc-nd-4.0
---

# Dataset Card for AbScientia

## Dataset summary

AbScientia is a Spanish **STEM scientific text classification dataset** built from scientific abstracts collected from official academic repositories and open-access sources. The dataset focuses on **Science, Technology, Engineering, and Mathematics (STEM)** disciplines and reflects domain-specific scientific language in Spanish.

This dataset is designed to support the training and evaluation of encoder models on STEM scientific domain classification tasks in Spanish.

- **Curated by:** Barcelona Supercomputing Center (BSC)
- **Funded by:** [ALIA](https://alia.gob.es/)
- **Language(s) (NLP):** Spanish (`es`)
- **License:** CC BY-NC-ND 4.0

## Dataset Details

### Dataset Description

AbScientia was constructed from scientific abstracts obtained from open academic repositories and scientific publication platforms. The original repository metadata was analyzed and consolidated into a set of **24 STEM scientific categories**.
These categories are as follows:

| Label | Description |
|------|-------------|
| Ingeniería | Engineering disciplines research |
| Química | Chemical sciences and molecular research |
| Telecomunicaciones | Telecommunications and communication technologies |
| Matemáticas | Mathematical theory and applied mathematics |
| Bioquímica | Biochemical processes and molecular biology |
| Enfermería | Nursing science and clinical care practices |
| Geología | Earth sciences and geological processes |
| Medicina | Medical research and clinical sciences |
| Informática | Computer science |
| Arquitectura | Architecture and built environment engineering |
| Fisiología | Physiological processes and biological functions |
| Biología | Biological sciences research |
| Estadística | Statistical theory and data analysis methods |
| Fisioterapia | Physical therapy and rehabilitation sciences |
| Aeronáutica | Aeronautical engineering and aerospace technologies |
| Nutrición | Nutrition science and dietary research |
| Tecnología | Applied technological research |
| Geografía | Physical geography and environmental analysis |
| Genética | Genetics and hereditary biological processes |
| Ciencias Del Deporte | Sports science |
| Farmacología | Pharmacology and drug research |
| Física | Physical sciences and theoretical physics |
| Anatomía | Anatomical structure and morphological studies |
| Psicología | Psychology and behavioral research |

## Dataset Structure

Each instance in AbScientia consists of a scientific abstract linked to a scientific domain label from the previous list.

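
For classification work, the 24 string labels listed above usually need to be mapped to integer ids. The following is a minimal sketch of one way to do that. The id order (table order) is an assumption, since the card does not define a canonical ordering, and `encode_label` is a hypothetical helper name.

```python
# The 24 category labels, copied from the table in the dataset card.
LABELS = [
    "Ingeniería", "Química", "Telecomunicaciones", "Matemáticas",
    "Bioquímica", "Enfermería", "Geología", "Medicina",
    "Informática", "Arquitectura", "Fisiología", "Biología",
    "Estadística", "Fisioterapia", "Aeronáutica", "Nutrición",
    "Tecnología", "Geografía", "Genética", "Ciencias Del Deporte",
    "Farmacología", "Física", "Anatomía", "Psicología",
]

# Assumption: ids follow table order; the card does not specify an id order.
label2id = {label: idx for idx, label in enumerate(LABELS)}
id2label = {idx: label for label, idx in label2id.items()}


def encode_label(label: str) -> int:
    """Return the integer id for a label, raising on unknown labels."""
    try:
        return label2id[label]
    except KeyError:
        raise ValueError(f"Unknown label: {label!r}") from None
```

Raising on unknown labels, rather than silently assigning a default id, makes spelling or accent mismatches (for example `Fisica` instead of `Física`) visible early.
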
### Instance example

Each entry contains:

- `id`: Unique document identifier
- `text`: Scientific abstract
- `label`: Scientific domain category

**Example:**

```json
{
  "id": "s-e_Buah_April2025_d-01_r-9b356672_c-0001",
  "text": "La salud de un suelo se puede definir como la capacidad del mismo para funcionar como un sistema vivo dentro de un ecosistema, sustentar la productividad biológica...",
  "label": "Biología"
}
```

## Dataset Sources

The texts in AbScientia were collected from open-access scientific repositories and institutional research collections derived from the RECOLECTA BSC internal corpus.

All documents were manually reviewed to verify licensing conditions. Licenses were checked document by document, and the most restrictive license identified across the collection was applied to the dataset as a whole, resulting in the adoption of the CC BY-NC-ND 4.0 license.

## Uses

### Direct Use

AbScientia is intended for research and development in scientific natural language processing, particularly for **STEM text classification** tasks in Spanish.

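
As a rough sketch of preparing records for such a classification task, the snippet below parses a JSON record shaped like the instance example. One caveat: the card metadata names the abstract field `sentence`, while the instance example uses `text`, so this hypothetical `parse_record` helper accepts either name. The sample record is made up for illustration and is not taken from the dataset.

```python
import json

# Assumption: the abstract may be stored under "text" (instance example)
# or "sentence" (card metadata); accept either name.
TEXT_KEYS = ("text", "sentence")


def parse_record(raw: str) -> dict:
    """Parse one JSON record and return a dict with id, text and label."""
    record = json.loads(raw)
    text_key = next((k for k in TEXT_KEYS if k in record), None)
    if text_key is None:
        raise KeyError(f"record has none of the text fields {TEXT_KEYS}")
    for key in ("id", "label"):
        if key not in record:
            raise KeyError(f"record is missing required field {key!r}")
    return {"id": record["id"], "text": record[text_key], "label": record["label"]}


# Made-up record for illustration only; not taken from the dataset.
sample = '{"id": "demo-0001", "sentence": "Texto de ejemplo.", "label": "Biología"}'
parsed = parse_record(sample)
```

Normalizing the field name at parse time keeps the rest of a pipeline independent of which naming the released files actually use.
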
Typical use cases include:

- Training and evaluating encoder-based models on STEM scientific text classification
- Benchmarking Spanish scientific language understanding
- Studying domain adaptation and representation learning in scientific NLP
- Developing downstream scientific NLP applications in research contexts

### Out-of-Scope Use

AbScientia is **not intended** for:

- Scientific or medical decision-making without expert validation
- Use in production systems without additional validation and domain-specific safeguards
- Applications that attempt to recover or infer sensitive personal information from the texts
- Any use that violates applicable ethical guidelines or data protection regulations

## Funding

This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the project Desarrollo de Modelos [ALIA](https://alia.gob.es/).

## Contact point

Language Technologies Lab (langtech@bsc.es) at the Barcelona Supercomputing Center (BSC).
Provider:
BSC-LT