SINAI/ALIA-legal-administrative-synthetic-instructions

Name: SINAI/ALIA-legal-administrative-synthetic-instructions
Creator: SINAI
Published: 2025-12-01 07:02:15
License: CC BY-SA 4.0

Source: Hugging Face. Updated: 2025-12-01. Indexed: 2025-12-20.

Download link:

https://hf-mirror.com/datasets/SINAI/ALIA-legal-administrative-synthetic-instructions


Description:

---
license: cc-by-sa-4.0
task_categories:
- text-generation
- question-answering
- text-classification
- fill-mask
language:
- en
tags:
- synthetic-data
- legal
- magpie
- spanish
- llm-training
size_categories:
- 8B+
configs:
- config_name: default
  data_files:
  - split: instrucciones_con_contexto
    path: "datos_sinteticos-legal-instrucciones-con_contexto.jsonl"
  - split: preguntas_con_contexto
    path: "datos_sinteticos-legal-preguntas-con_contexto.jsonl"
  - split: instrucciones_sin_contexto
    path: "datos_sinteticos-legal-instrucciones-sin_contexto.jsonl"
  - split: preguntas_sin_contexto
    path: "datos_sinteticos-legal-preguntas-sin_contexto.jsonl"
  - split: vf_con_contexto_sin_justificacion
    path: "datos_sinteticos-legal-v_f-con_contexto-sin_justificacion.jsonl"
  - split: multirespuesta_sin_contexto_con_justificacion
    path: "datos_sinteticos-legal-multirespuesta-sin_contexto-con_justificacion.jsonl"
  - split: vf_sin_contexto_con_justificacion
    path: "datos_sinteticos-legal-v_f-sin_contexto-con_justificacion.jsonl"
  - split: multirespuesta_sin_contexto_sin_justificacion
    path: "datos_sinteticos-legal-multirespuesta-sin_contexto-sin_justificacion.jsonl"
  - split: vf_con_contexto_sin_aparicion_sin_justificacion
    path: "datos_sinteticos-legal-v_f-con_contexto_sin_aparicion-sin_justificacion.jsonl"
  - split: multirespuesta_con_contexto_sin_justificacion
    path: "datos_sinteticos-legal-multirespuesta-con_contexto-sin_justificacion.jsonl"
  - split: vf_con_contexto_con_justificacion
    path: "datos_sinteticos-legal-v_f-con_contexto-con_justificacion.jsonl"
  - split: multirespuesta_con_contexto_con_justificacion
    path: "datos_sinteticos-legal-multirespuesta-con_contexto-con_justificacion.jsonl"
  - split: multirespuesta_con_contexto_sin_aparicion_sin_justificacion
    path: "datos_sinteticos-legal-multirespuesta-con_contexto_sin_aparicion-sin_justificacion.jsonl"
  - split: vf_sin_contexto_sin_justificacion
    path: "datos_sinteticos-legal-v_f-sin_contexto-sin_justificacion.jsonl"
  - split: vf_con_contexto_sin_aparicion_con_justificacion
    path: "datos_sinteticos-legal-v_f-con_contexto_sin_aparicion-con_justificacion.jsonl"
  - split: multirespuesta_con_contexto_sin_aparicion_con_justificacion
    path: "datos_sinteticos-legal-multirespuesta-con_contexto_sin_aparicion-con_justificacion.jsonl"
---

# Dataset Card for ALIA legal-administrative-synthetic-instructions Corpus

This dataset contains the **legal administrative synthetic instructions corpus** generated using the **Magpie** methodology, adapted to the Spanish legal domain, in order to train and evaluate language models within the framework of the **ALIA Project**.

It includes more than **7.4 million instruction-answer pairs**, covering the following modalities:

- General questions and instructions
- Context-based questions and instructions
- Test questions (multiple-choice and T/F) in general format
- Context-based test questions
- Test questions with and without justification

The data was generated using the **Phi-4** model and went through an exhaustive process of **cleaning, semantic validation, and duplicate removal**.

# Dataset Details

## Dataset Description

This synthetic corpus was generated following the **Magpie methodology**, adapted to the Spanish legal domain. Its goal is to provide a massive and controlled resource for training Spanish language models that need to understand, respond to, and reason about legal information.

It includes questions, instructions, and answers generated through an *instruct* model (Phi-4), both in general modality and conditioned on documentary context. It also includes specialized formats such as multiple-choice (A/B/C/D) and T/F, with and without justification.

- **Curated by:** SINAI Research Group – Universidad de Jaén (CEATIC)
- **Funded by:** Ministerio para la Transformación Digital y de la Función Pública (EU NextGenerationEU), within the project *Desarrollo de Modelos ALIA*
- **Language (NLP):** Spanish
- **License:** CC BY-SA 4.0

## Dataset Sources

- **Project repository:** https://github.com/sinai-uja/ALIA-UJA
- **MAGPIE methodology:** https://arxiv.org/abs/2406.08464

## Uses

This dataset is designed to support:

- Training of legal LLMs in Spanish
- Legal QA systems
- Controlled generation or legal reasoning models
- Model evaluation on legal tasks
- RAG systems based on specialized synthetic questions

---

# Dataset Structure

## Data Instances

Each generated instance has the following structure:

```json
{
  "system_prompt": "<SYSTEM_PROMPT>",
  "question": "<GENERATED_INSTRUCTION>",
  "response": "<GENERATED_RESPONSE>"
}
```

## Data Fields

- **system_prompt**: Source prompt used by Phi-4 to generate the query.
- **question**: Generated question or instruction.
- **response**: Generated response.

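As a minimal sketch of how the three documented fields might be checked when reading the JSONL files line by line, the snippet below parses one line and verifies its structure. The helper name `parse_record` and the sample record are hypothetical and written only for illustration; the sample text is not taken from the corpus.

```python
import json

# Field names taken from the "Data Fields" list of this card.
EXPECTED_FIELDS = ("system_prompt", "question", "response")


def parse_record(line: str) -> dict:
    """Parse one JSONL line and check it carries the three documented string fields."""
    record = json.loads(line)
    missing = [name for name in EXPECTED_FIELDS if name not in record]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    for name in EXPECTED_FIELDS:
        if not isinstance(record[name], str):
            raise TypeError(f"field {name!r} is not a string")
    return record


# Hypothetical record for illustration only (not real corpus content).
sample_line = json.dumps(
    {
        "system_prompt": "<SYSTEM_PROMPT>",
        "question": "<GENERATED_INSTRUCTION>",
        "response": "<GENERATED_RESPONSE>",
    }
)

record = parse_record(sample_line)
print(record["question"])
```

A check like this is cheap enough to run while streaming, which can help catch truncated or malformed lines before they reach a training pipeline.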
## Data Splits

| File | Tokens | Lines |
|--------|--------|--------|
| datos_sinteticos-legal-instrucciones-con_contexto.jsonl | 2,193,000,545 | 1,304,300 |
| datos_sinteticos-legal-preguntas-con_contexto.jsonl | 2,140,299,681 | 1,309,738 |
| datos_sinteticos-legal-instrucciones-sin_contexto.jsonl | 1,489,117,414 | 1,575,096 |
| datos_sinteticos-legal-preguntas-sin_contexto.jsonl | 1,327,174,522 | 1,715,133 |
| datos_sinteticos-legal-v_f-con_contexto-sin_justificacion.jsonl | 158,277,017 | 132,036 |
| datos_sinteticos-legal-multirespuesta-sin_contexto-con_justificacion.jsonl | 130,851,646 | 200,860 |
| datos_sinteticos-legal-v_f-sin_contexto-con_justificacion.jsonl | 103,113,730 | 200,344 |
| datos_sinteticos-legal-multirespuesta-sin_contexto-sin_justificacion.jsonl | 82,433,715 | 200,544 |
| datos_sinteticos-legal-v_f-con_contexto_sin_aparicion-sin_justificacion.jsonl | 66,069,001 | 225,557 |
| datos_sinteticos-legal-multirespuesta-con_contexto-sin_justificacion.jsonl | 62,672,377 | 43,706 |
| datos_sinteticos-legal-v_f-con_contexto-con_justificacion.jsonl | 56,683,073 | 34,574 |
| datos_sinteticos-legal-multirespuesta-con_contexto-con_justificacion.jsonl | 56,240,005 | 34,908 |
| datos_sinteticos-legal-multirespuesta-con_contexto_sin_aparicion-sin_justificacion.jsonl | 56,079,917 | 109,936 |
| datos_sinteticos-legal-v_f-sin_contexto-sin_justificacion.jsonl | 53,454,775 | 203,498 |
| datos_sinteticos-legal-v_f-con_contexto_sin_aparicion-con_justificacion.jsonl | 43,947,955 | 61,554 |
| datos_sinteticos-legal-multirespuesta-con_contexto_sin_aparicion-con_justificacion.jsonl | 42,632,875 | 60,025 |

**Total:** **7,411,809 instances**, **8,061,047,248 tokens**

Note: When a split name contains "contexto_sin_aparicion", the user's question was originally generated from a given context, but that context was later removed from the user turn (~~Dado el siguiente contexto: {contexto}~~), leaving only the question, for which the context remains an implicit reference.

## Example Usage

To load the dataset:

```python
from datasets import load_dataset

# Load the complete dataset (all splits)
dataset = load_dataset("SINAI/ALIA-legal-administrative-synthetic-instructions")
```

Load specific splits

```python
# Instructions and questions without context
no_context_instructions = load_dataset(
    "SINAI/ALIA-legal-administrative-synthetic-instructions",
    split="instrucciones_sin_contexto",
)
no_context_questions = load_dataset(
    "SINAI/ALIA-legal-administrative-synthetic-instructions",
    split="preguntas_sin_contexto",
)

# Instructions and questions with context
context_instructions = load_dataset(
    "SINAI/ALIA-legal-administrative-synthetic-instructions",
    split="instrucciones_con_contexto",
)
context_questions = load_dataset(
    "SINAI/ALIA-legal-administrative-synthetic-instructions",
    split="preguntas_con_contexto",
)
```

Load multiple-choice and true/false splits

```python
# Multiple-choice (MCQ) formats
mc_no_context_no_just = load_dataset(
    "SINAI/ALIA-legal-administrative-synthetic-instructions",
    split="multirespuesta_sin_contexto_sin_justificacion",
)
mc_context_with_just = load_dataset(
    "SINAI/ALIA-legal-administrative-synthetic-instructions",
    split="multirespuesta_con_contexto_con_justificacion",
)
# Context used at generation time but removed from the user turn
mc_hidden_context_no_just = load_dataset(
    "SINAI/ALIA-legal-administrative-synthetic-instructions",
    split="multirespuesta_con_contexto_sin_aparicion_sin_justificacion",
)

# True/False formats
tf_context_no_just = load_dataset(
    "SINAI/ALIA-legal-administrative-synthetic-instructions",
    split="vf_con_contexto_sin_justificacion",
)
tf_no_context_with_just = load_dataset(
    "SINAI/ALIA-legal-administrative-synthetic-instructions",
    split="vf_sin_contexto_con_justificacion",
)
# Context used at generation time but removed from the user turn
tf_hidden_context_with_just = load_dataset(
    "SINAI/ALIA-legal-administrative-synthetic-instructions",
    split="vf_con_contexto_sin_aparicion_con_justificacion",
)
```

Streaming (recommended for large splits)

```python
streaming_dataset = load_dataset(
    "SINAI/ALIA-legal-administrative-synthetic-instructions",
    split="preguntas_sin_contexto",
    streaming=True,
)

# Print the first five question/response pairs
for i, example in enumerate(streaming_dataset):
    print(f"[{i}] Q: {example['question']}")
    print(f"A: {example['response']}\n")
    if i >= 4:
        break
```

---

# Dataset Creation

## Curation Rationale

This corpus was created to address the lack of legal linguistic resources in Spanish and to train robust models within the ALIA project.

## Source Data

The contexts used come from Spanish public legal and administrative documents.

## Data Collection and Processing

### MAGPIE Methodology

- The model completes the *user* part of the message by generating the question or instruction.
- Then it responds to its own generation.
- For context-based modes, context is explicitly added.

### Generated modalities

- Instructions
- Questions
- Context-based
- Multiple-choice test (with/without justification)
- T/F (with/without justification)

### Cleaning process

1. **Duplicate removal**
2. **Invalid response removal**
3. **Complete discard of data generated by Llama due to inconsistencies**
4. **Complete reconstruction with Phi-4**
5. **Semantic filter** with jina-embeddings-v3 (cosine ≥ 0.50)

### Quantitative impact

**Questions/Instructions without context:** The table below tracks the questions and instructions generated without a prior context through each cleaning stage.

| Stage | No. | Reduction |
|-------|----------|-----------|
| Initial | 3,300,000 | – |
| Basic cleaning | 2,911,366 | –11.8% |
| Semantic filter | 2,910,210 | –11.8% cumulative |

Total removed: **389,790 examples**

## Annotations

No manual annotations exist.

## Personal and Sensitive Information

The data comes from public documents. Additional sensitive or identifiable information was filtered.

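To illustrate the semantic filter listed under the cleaning process, here is a minimal sketch of a cosine-similarity threshold at 0.50. The card does not say which two texts are compared or how the embeddings are batched, so this sketch assumes one precomputed embedding per text (for example question and response) and uses toy two-dimensional vectors in place of real jina-embeddings-v3 outputs. The names `cosine_similarity` and `keep_pair` are hypothetical.

```python
import math

COSINE_THRESHOLD = 0.50  # threshold stated in the cleaning process


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def keep_pair(emb_a: list[float], emb_b: list[float],
              threshold: float = COSINE_THRESHOLD) -> bool:
    """Keep an example only if its two embeddings are similar enough."""
    return cosine_similarity(emb_a, emb_b) >= threshold


# Toy vectors standing in for real embeddings (assumption, for illustration only).
related = keep_pair([1.0, 0.0], [1.0, 1.0])    # cosine is about 0.71
unrelated = keep_pair([1.0, 0.0], [0.0, 1.0])  # cosine is 0.0
print(related, unrelated)
```

In the table above this stage removes only 1,156 further examples after basic cleaning (2,911,366 to 2,910,210), which fits a fairly permissive threshold.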
## Citation

```
@misc{alia_synthetic_legal_magpie,
  title={ALIA Legal Administrative Synthetic Instructions Corpus},
  author={SINAI Research Group, Universidad de Jaén},
  year={2025},
  url={https://github.com/sinai-uja/ALIA-UJA}
}
```

---

# Considerations for Using the Data

## Social Impact of Dataset

Facilitates the development of legal LLMs in Spanish and improves citizen access to legal information.

## Discussion of Biases

Reflects:

- biases from administrative language,
- inherent limitations of the generating model,
- structure of the Spanish legal system.

## Other Known Limitations

- Loss of structure in some complex documents
- More homogeneous style than real legal language
- Does not replace an authentic legal corpus

---

**Acknowledgments:** This dataset has been generated thanks to SCAYLE (Centro de Supercomputación de Castilla y León) – https://www.scayle.es/ which provided the needed computational resources on its CALENDULA supercomputing cluster.

---

**Contact:**

- ALIA Project – https://www.alia.gob.es
- SINAI Research Group – https://sinai.ujaen.es
- Universidad de Jaén – https://www.ujaen.es
