BSC-LT/LexBOE
Source: Hugging Face. Updated 2026-03-29; indexed 2026-04-05.
Download link:
https://hf-mirror.com/datasets/BSC-LT/LexBOE
Resource summary:
---
task_categories:
- text-classification
language:
- es
pretty_name: LexBOE
size_categories:
- 10K<n<100K
dataset_info:
- config_name: LexBOE
  features:
  - name: id
    dtype: string
  - name: sentence
    dtype: string
  - name: label
    dtype: string
  splits:
  - name: train
    num_examples: 46762
  - name: dev
    num_examples: 5845
  - name: test
    num_examples: 5846
license: cc-by-4.0
---
# Dataset Card for LexBOE
## Dataset summary
LexBOE is a Spanish legal text classification dataset built from articles extracted from the *Boletín Oficial del Estado* (BOE), the official source of legislation and administrative acts in Spain. The articles included in the dataset were published between 2022 and 2024.
LexBOE reflects contemporary legal-administrative language and is intended for the training and evaluation of language models on legal text classification tasks.
- **Curated by:** Barcelona Supercomputing Center (BSC)
- **Funded by:** [ALIA](https://alia.gob.es/)
- **Language(s) (NLP):** Spanish (`es`)
- **License:** CC BY 4.0
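As a starting point, the sketch below loads the dataset and checks the split sizes against those declared in the card metadata (46,762 / 5,845 / 5,846, roughly an 80/10/10 split of 58,453 examples). It assumes the Hugging Face `datasets` library is installed and that the repository `BSC-LT/LexBOE` (taken from the download link) loads with its default configuration; neither is confirmed by the card. The helper names are illustrative.

```python
# Split sizes as declared in the card metadata.
DECLARED_SPLITS = {"train": 46762, "dev": 5845, "test": 5846}


def split_shares(sizes):
    """Return each split's share of the total number of examples."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


def load_lexboe(repo_id="BSC-LT/LexBOE"):
    """Download the dataset from the Hugging Face Hub (needs network access).

    Assumption: the repository loads with its default configuration.
    """
    from datasets import load_dataset  # imported lazily: optional dependency

    return load_dataset(repo_id)


def check_split_sizes(dataset, declared=DECLARED_SPLITS):
    """Map each mismatching split to (actual size or None, declared size)."""
    return {
        name: (len(dataset[name]) if name in dataset else None, expected)
        for name, expected in declared.items()
        if name not in dataset or len(dataset[name]) != expected
    }


if __name__ == "__main__":
    for name, share in split_shares(DECLARED_SPLITS).items():
        print(f"{name}: {DECLARED_SPLITS[name]} examples ({share:.1%})")
```

Calling `check_split_sizes(load_lexboe())` should return an empty dictionary if the downloaded splits match the declared sizes.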
## Dataset Details
### Dataset Description
LexBOE was constructed by systematically extracting articles and metadata through the official BOE API. The original metadata was analyzed and consolidated into a set of **14 legal categories**. The categories listed in this card are:
| Label | Description |
|------|-------------|
| Funcionarios y Personal | Public employment and civil service |
| Normativas | Laws and regulations |
| Administración | Administrative organization and procedures |
| Educación | Education systems, institutions, and policies |
| Economía | Economic and financial matters |
| Energía | Energy policy and infrastructure |
| Judicial | Courts, legal proceedings, and judicial bodies |
| Cultura | Cultural institutions and activities |
| Salud | Public health and healthcare |
| Transporte | Transport systems and infrastructure |
| Fuerzas | Security forces and defense-related matters |
| Vivienda | Housing and urban development |
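Because the `label` field is stored as a string, most classifiers need an integer mapping. The sketch below builds one from the labels in the table above. It is illustrative only: the card mentions 14 categories while the table lists 12, so in practice the label set is probably better derived from the training split itself. The function names are hypothetical.

```python
# Labels copied from the table in this card (12 entries).
LABELS = [
    "Funcionarios y Personal", "Normativas", "Administración", "Educación",
    "Economía", "Energía", "Judicial", "Cultura",
    "Salud", "Transporte", "Fuerzas", "Vivienda",
]


def build_label_maps(labels):
    """Return (label2id, id2label) with ids assigned in sorted label order."""
    unique = sorted(set(labels))
    label2id = {label: index for index, label in enumerate(unique)}
    id2label = {index: label for label, index in label2id.items()}
    return label2id, id2label


def encode_labels(labels, label2id):
    """Convert string labels to integer ids, failing loudly on unknown labels."""
    unknown = sorted(set(labels) - set(label2id))
    if unknown:
        raise KeyError(f"labels missing from the mapping: {unknown}")
    return [label2id[label] for label in labels]


if __name__ == "__main__":
    label2id, id2label = build_label_maps(LABELS)
    print(encode_labels(["Educación", "Salud"], label2id))
```

Sorting the labels before assigning ids keeps the mapping stable across runs, which matters when a saved model is later reloaded with the same label set.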
LexBOE also applies a pseudo-anonymization process in which sensitive personal information is replaced with formally and semantically equivalent values, preserving the linguistic structure of the original texts.
## Dataset Structure
An example instance looks as follows:
```json
{
  "id": "BOE-A-2024-23238",
  "sentence": "Advertido error en la Resolución de 11 de octubre de 2024, de la Universidad Pablo de Olavide, de Sevilla, por la que se convoca Concurso de Acceso a plazas de Cuerpos Docentes Universitarios, publicada en el «Boletín Oficial del Estado» el 23 de octubre de 2024, se transcribe a continuación la oportuna rectificación:\nEn la página 135662, en el apartado 1. Legislación, donde dice:\n«Los concursos se regirán por lo dispuesto en el artículo 71.1».\nDebe decir:\n«Los concursos se regirán por lo dispuesto en el artículo 71.2».\nSevilla, 31 de octubre de 2024.–El Rector, Francisco Oliva Blázquez.",
  "label": "Educación"
}
```
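A record like the one above can be checked against the three string fields declared in the metadata (`id`, `sentence`, `label`). The sketch below also pulls the year out of the identifier, assuming that all ids follow the `BOE-A-<year>-<number>` pattern seen in this single example; the card does not document the id format, so treat that pattern as a guess. The sample record is abridged.

```python
import json
import re

# Assumption: ids look like the example, e.g. "BOE-A-2024-23238".
ID_PATTERN = re.compile(r"^BOE-([A-Z])-(\d{4})-(\d+)$")
REQUIRED_FIELDS = ("id", "sentence", "label")


def parse_instance(raw):
    """Parse a JSON record and check that every required field is a string."""
    record = json.loads(raw)
    for field in REQUIRED_FIELDS:
        if not isinstance(record.get(field), str):
            raise ValueError(f"field {field!r} is missing or not a string")
    return record


def year_from_id(record_id):
    """Return the year embedded in the id, or None if the pattern does not match."""
    match = ID_PATTERN.match(record_id)
    return int(match.group(2)) if match else None


if __name__ == "__main__":
    # Abridged version of the example instance from this card.
    raw = (
        '{"id": "BOE-A-2024-23238", '
        '"sentence": "Advertido error en la Resolución de 11 de octubre de 2024", '
        '"label": "Educación"}'
    )
    record = parse_instance(raw)
    print(record["label"], year_from_id(record["id"]))
```

Returning `None` for ids that do not match, rather than raising, makes it easy to count how many records deviate from the assumed pattern.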
### Dataset Sources
The texts in LexBOE were extracted from the *Boletín Oficial del Estado* (BOE) using the official [BOE public API](https://www.boe.es/datosabiertos/api/api.php).
## Uses
### Direct Use
LexBOE is intended for research and development in legal natural language processing, particularly for **text classification** tasks in Spanish. Typical use cases include:
- Training and evaluating encoder-based models on legal text classification
- Benchmarking Spanish legal language understanding
- Studying domain adaptation and representation learning in the legal domain
- Developing downstream legal NLP applications in a research context
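For the training and benchmarking use cases above, predictions on the `dev` or `test` split need a score. The card does not name an official metric, so the choice of accuracy and macro-averaged F1 below is an assumption; macro-F1 is commonly used when classes are imbalanced because every label counts equally. This is a dependency-free sketch, not the evaluation code used by the authors.

```python
from collections import Counter


def accuracy(gold, pred):
    """Fraction of predictions that exactly match the gold labels."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and of equal length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 over labels seen in gold or pred."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and of equal length")
    true_pos, false_pos, false_neg = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            true_pos[g] += 1
        else:
            false_pos[p] += 1
            false_neg[g] += 1
    labels = sorted(set(gold) | set(pred))
    # F1 = 2TP / (2TP + FP + FN); the denominator is never zero for these labels.
    scores = [
        2 * true_pos[label]
        / (2 * true_pos[label] + false_pos[label] + false_neg[label])
        for label in labels
    ]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    gold = ["Educación", "Salud", "Salud", "Judicial"]
    pred = ["Educación", "Salud", "Judicial", "Judicial"]
    print(f"accuracy={accuracy(gold, pred):.3f} macro_f1={macro_f1(gold, pred):.3f}")
```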
### Out-of-Scope Use
LexBOE is **not intended** for:
- Legal advice, decision-making, or interpretation of legal obligations
- Use in production systems without additional validation and domain-specific safeguards
- Applications that attempt to recover or infer real personal or institutional identities from the texts
- Any use that violates applicable data protection regulations or ethical guidelines
## Funding
This work is funded by the Ministerio para la Transformación Digital y de la Función Pública and by the EU (NextGenerationEU), within the framework of the project Desarrollo de Modelos [ALIA](https://alia.gob.es/).
## Acknowledgements
This dataset is released in conjunction with the work presented in Tamayo Mela et al., *MrBERT: Modern Multilingual Encoders via Vocabulary, Domain, and Dimensional Adaptation*, as part of the evaluation of domain-adapted encoder models in the legal domain.
## Citation
```bibtex
@article{tamayo2026mrbert,
title={MrBERT: Modern Multilingual Encoders via Vocabulary, Domain, and Dimensional Adaptation},
author={Tamayo, Daniel and Lacunza, I{\~n}aki and Rivera-Hidalgo, Paula and Da Dalt, Severino and Aula-Blasco, Javier and Gonzalez-Agirre, Aitor and Villegas, Marta},
journal={arXiv preprint arXiv:2602.21379},
year={2026}
}
```
## Contact point
Language Technologies Lab (langtech@bsc.es) at the Barcelona Supercomputing Center (BSC).
Provider:
BSC-LT



