BSC-LT/LexBOE

Source: Hugging Face. Updated 2026-03-29; indexed 2026-04-05.
Download link:
https://hf-mirror.com/datasets/BSC-LT/LexBOE
Description:
---
task_categories:
- text-classification
language:
- es
pretty_name: LexBOE
size_categories:
- 10K<n<100K
dataset_info:
- config_name: LexBOE
  features:
  - name: id
    dtype: string
  - name: sentence
    dtype: string
  - name: label
    dtype: string
  splits:
  - name: train
    num_examples: 46762
  - name: dev
    num_examples: 5845
  - name: test
    num_examples: 5846
license: cc-by-4.0
---

# Dataset Card for LexBOE

## Dataset summary

LexBOE is a Spanish legal text classification dataset built from articles extracted from the *Boletín Oficial del Estado* (BOE), the official source of legislation and administrative acts in Spain. The articles included in the dataset were published between 2022 and 2024. LexBOE reflects contemporary legal-administrative language and is intended for the training and evaluation of language models on legal text classification tasks.

- **Curated by:** Barcelona Supercomputing Center (BSC)
- **Funded by:** [ALIA](https://alia.gob.es/)
- **Language(s) (NLP):** Spanish (`es`)
- **License:** CC BY 4.0

## Dataset Details

### Dataset Description

LexBOE was constructed through systematic extraction of articles and metadata using the official BOE API. The original metadata was analyzed and then consolidated into a set of **14 legal categories**.
These categories are as follows:

| Label | Description |
|------|-------------|
| Funcionarios y Personal | Public employment and civil service |
| Normativas | Laws and regulations |
| Administración | Administrative organization and procedures |
| Educación | Education systems, institutions, and policies |
| Economía | Economic and financial matters |
| Energía | Energy policy and infrastructure |
| Judicial | Courts, legal proceedings, and judicial bodies |
| Cultura | Cultural institutions and activities |
| Salud | Public health and healthcare |
| Transporte | Transport systems and infrastructure |
| Fuerzas | Security forces and defense-related matters |
| Vivienda | Housing and urban development |

LexBOE also applies a pseudo-anonymization process in which sensitive personal information is replaced with formally and semantically equivalent values, preserving the linguistic structure of the original texts.

## Dataset Structure

An example instance looks as follows:

```json
{
  "id": "BOE-A-2024-23238",
  "sentence": "Advertido error en la Resolución de 11 de octubre de 2024, de la Universidad Pablo de Olavide, de Sevilla, por la que se convoca Concurso de Acceso a plazas de Cuerpos Docentes Universitarios, publicada en el «Boletín Oficial del Estado» el 23 de octubre de 2024, se transcribe a continuación la oportuna rectificación:\nEn la página 135662, en el apartado 1. Legislación, donde dice:\n«Los concursos se regirán por lo dispuesto en el artículo 71.1».\nDebe decir:\n«Los concursos se regirán por lo dispuesto en el artículo 71.2».\nSevilla, 31 de octubre de 2024.–El Rector, Francisco Oliva Blázquez.",
  "label": "Educación"
}
```

### Dataset Sources

The texts in LexBOE were extracted from the *Boletín Oficial del Estado* (BOE) using the official [BOE public API](https://www.boe.es/datosabiertos/api/api.php).

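
The instance layout shown under Dataset Structure (three string fields: `id`, `sentence`, `label`) can be sanity-checked with a short script. The following is a minimal sketch, not part of the dataset tooling: the function name `check_instance`, the id pattern (inferred from the single example id `BOE-A-2024-23238`), and the reading of the id's four-digit component as the publication year are all assumptions. The label set is copied from the category table in this card, which lists 12 labels while the description mentions 14, so a label missing from this set is not necessarily invalid. The example sentence is shortened for brevity.

```python
import json
import re

# Labels copied from the category table in this card. The card mentions 14
# categories but the table lists 12, so this set may be incomplete.
LISTED_LABELS = {
    "Funcionarios y Personal", "Normativas", "Administración", "Educación",
    "Economía", "Energía", "Judicial", "Cultura", "Salud", "Transporte",
    "Fuerzas", "Vivienda",
}

# Assumption: ids follow the shape of the one example shown in the card,
# and the four-digit group is the publication year.
ID_PATTERN = re.compile(r"^BOE-[A-Z]-(\d{4})-\d+$")

REQUIRED_FIELDS = ("id", "sentence", "label")


def check_instance(record):
    """Return a list of human-readable problems found in one record."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty string field: {field}")
    if problems:
        return problems

    match = ID_PATTERN.match(record["id"])
    if match is None:
        problems.append(f"unexpected id format: {record['id']}")
    elif not 2022 <= int(match.group(1)) <= 2024:
        # The card states that articles were published between 2022 and 2024.
        problems.append(f"id year outside 2022-2024: {match.group(1)}")

    if record["label"] not in LISTED_LABELS:
        problems.append(f"label not in the listed table: {record['label']}")
    return problems


# Shortened version of the example instance from the card.
example = json.loads("""
{
  "id": "BOE-A-2024-23238",
  "sentence": "Advertido error en la Resolución de 11 de octubre de 2024",
  "label": "Educación"
}
""")
print(check_instance(example))  # prints []
```

Returning a list of problems rather than raising on the first one makes it easier to scan a whole split and tally which checks fail most often.
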
## Uses

### Direct Use

LexBOE is intended for research and development in legal natural language processing, particularly for **text classification** tasks in Spanish. Typical use cases include:

- Training and evaluating encoder-based models on legal text classification
- Benchmarking Spanish legal language understanding
- Studying domain adaptation and representation learning in the legal domain
- Developing downstream legal NLP applications in a research context

### Out-of-Scope Use

LexBOE is **not intended** for:

- Legal advice, decision-making, or interpretation of legal obligations
- Use in production systems without additional validation and domain-specific safeguards
- Applications that attempt to recover or infer real personal or institutional identities from the texts
- Any use that violates applicable data protection regulations or ethical guidelines

## Funding

This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the project Desarrollo de Modelos [ALIA](https://alia.gob.es/).

## Acknowledgements

This dataset is released in conjunction with the work presented in Tamayo Mela et al., *MrBERT: Modern Multilingual Encoders via Vocabulary, Domain, and Dimensional Adaptation*, as part of the evaluation of domain-adapted encoder models in the legal domain.

## Citation

```bibtex
@article{tamayo2026mrbert,
  title={MrBERT: Modern Multilingual Encoders via Vocabulary, Domain, and Dimensional Adaptation},
  author={Tamayo, Daniel and Lacunza, I{\~n}aki and Rivera-Hidalgo, Paula and Da Dalt, Severino and Aula-Blasco, Javier and Gonzalez-Agirre, Aitor and Villegas, Marta},
  journal={arXiv preprint arXiv:2602.21379},
  year={2026}
}
```

## Contact point

Language Technologies Lab (langtech@bsc.es) at the Barcelona Supercomputing Center (BSC).

Provided by:
BSC-LT