SINAI/ALIA-legal-administrative-synthetic-instructions
收藏Hugging Face2025-12-01 更新2025-12-20 收录
下载链接:
https://hf-mirror.com/datasets/SINAI/ALIA-legal-administrative-synthetic-instructions
下载链接
链接失效反馈官方服务:
资源简介:
---
license: cc-by-sa-4.0
task_categories:
- text-generation
- question-answering
- text-classification
- fill-mask
language:
- en
tags:
- synthetic-data
- legal
- magpie
- spanish
- llm-training
size_categories:
- 8B+
configs:
- config_name: default
data_files:
- split: instrucciones_con_contexto
path: "datos_sinteticos-legal-instrucciones-con_contexto.jsonl"
- split: preguntas_con_contexto
path: "datos_sinteticos-legal-preguntas-con_contexto.jsonl"
- split: instrucciones_sin_contexto
path: "datos_sinteticos-legal-instrucciones-sin_contexto.jsonl"
- split: preguntas_sin_contexto
path: "datos_sinteticos-legal-preguntas-sin_contexto.jsonl"
- split: vf_con_contexto_sin_justificacion
path: "datos_sinteticos-legal-v_f-con_contexto-sin_justificacion.jsonl"
- split: multirespuesta_sin_contexto_con_justificacion
path: "datos_sinteticos-legal-multirespuesta-sin_contexto-con_justificacion.jsonl"
- split: vf_sin_contexto_con_justificacion
path: "datos_sinteticos-legal-v_f-sin_contexto-con_justificacion.jsonl"
- split: multirespuesta_sin_contexto_sin_justificacion
path: "datos_sinteticos-legal-multirespuesta-sin_contexto-sin_justificacion.jsonl"
- split: vf_con_contexto_sin_aparicion_sin_justificacion
path: "datos_sinteticos-legal-v_f-con_contexto_sin_aparicion-sin_justificacion.jsonl"
- split: multirespuesta_con_contexto_sin_justificacion
path: "datos_sinteticos-legal-multirespuesta-con_contexto-sin_justificacion.jsonl"
- split: vf_con_contexto_con_justificacion
path: "datos_sinteticos-legal-v_f-con_contexto-con_justificacion.jsonl"
- split: multirespuesta_con_contexto_con_justificacion
path: "datos_sinteticos-legal-multirespuesta-con_contexto-con_justificacion.jsonl"
- split: multirespuesta_con_contexto_sin_aparicion_sin_justificacion
path: "datos_sinteticos-legal-multirespuesta-con_contexto_sin_aparicion-sin_justificacion.jsonl"
- split: vf_sin_contexto_sin_justificacion
path: "datos_sinteticos-legal-v_f-sin_contexto-sin_justificacion.jsonl"
- split: vf_con_contexto_sin_aparicion_con_justificacion
path: "datos_sinteticos-legal-v_f-con_contexto_sin_aparicion-con_justificacion.jsonl"
- split: multirespuesta_con_contexto_sin_aparicion_con_justificacion
path: "datos_sinteticos-legal-multirespuesta-con_contexto_sin_aparicion-con_justificacion.jsonl"
---
# Dataset Card for ALIA legal-administrative-synthetic-instructions Corpus
This dataset contains the **legal administrative synthetic instructions corpus** generated using the **Magpie** methodology, adapted to the Spanish legal domain, in order to train and evaluate language models within the framework of the **ALIA Project**.
It includes more than **7.4 million instruction-answer pairs**, covering modalities of:
- General questions and instructions,
- Context-based questions and instructions,
- Multiple-choice test questions (multiple-choice and T/F) in general format,
- Context-based test questions,
- Versions with and without justification in test questions.
The data was generated using the **Phi-4** model and includes an exhaustive process of **cleaning, semantic validation, and duplicate removal**.
# Dataset Details
## Dataset Description
This synthetic corpus was generated following the **Magpie methodology**, adapted to the Spanish legal domain. Its goal is to provide a massive and controlled resource for training linguistic models in Spanish that need to understand, respond to, and reason about legal information.
It includes questions, instructions, and answers generated through an *instruct* model (Phi-4), both in general modality and conditioned by documentary context. It also includes specialized formats such as multiple-choice (A/B/C/D) and T/F, with and without justification.
- **Curated by:** SINAI Research Group – Universidad de Jaén (CEATIC)
- **Funded by:** Ministerio para la Transformación Digital y de la Función Pública — EU NextGenerationEU, within the project *Desarrollo de Modelos ALIA*
- **Language (NLP):** Spanish
- **License:** CC BY-SA 4.0
## Dataset Sources
- **Project repository:** https://github.com/sinai-uja/ALIA-UJA
- **MAGPIE methodology:** https://arxiv.org/abs/2406.08464
## Uses
This dataset is designed to support:
- Training of legal LLMs in Spanish
- Legal QA systems
- Controlled generation models or legal reasoning
- Model evaluation on legal tasks
- RAG systems based on specialized synthetic questions
---
# Dataset Structure
## Data Instances
Each generated instance follows the following structure:
```json
{
"system_prompt": "<SYSTEM_PROMPT>",
"question": "<GENERATED_INSTRUCTION>",
"response": "<GENERATED_RESPONSE>"
}
```
## Data Fields
- **system_prompt**: Source prompt used by Phi-4 to generate the query.
- **question**: Generated question or instruction.
- **response**: Generated response.
## Data Splits
| File | Tokens | Líneas |
|--------|--------|--------|
| datos_sinteticos-legal-instrucciones-con_contexto.jsonl | 2,193,000,545 | 1,304,300 |
| datos_sinteticos-legal-preguntas-con_contexto.jsonl | 2,140,299,681 | 1,309,738 |
| datos_sinteticos-legal-instrucciones-sin_contexto.jsonl | 1,489,117,414 | 1,575,096 |
| datos_sinteticos-legal-preguntas-sin_contexto.jsonl | 1,327,174,522 | 1,715,133 |
| datos_sinteticos-legal-v_f-con_contexto-sin_justificacion.jsonl | 158,277,017 | 132,036 |
| datos_sinteticos-legal-multirespuesta-sin_contexto-con_justificacion.jsonl | 130,851,646 | 200,860 |
| datos_sinteticos-legal-v_f-sin_contexto-con_justificacion.jsonl | 103,113,730 | 200,344 |
| datos_sinteticos-legal-multirespuesta-sin_contexto-sin_justificacion.jsonl | 82,433,715 | 200,544 |
| datos_sinteticos-legal-v_f-con_contexto_sin_aparicion-sin_justificacion.jsonl | 66,069,001 | 225,557 |
| datos_sinteticos-legal-multirespuesta-con_contexto-sin_justificacion.jsonl | 62,672,377 | 43,706 |
| datos_sinteticos-legal-v_f-con_contexto-con_justificacion.jsonl | 56,683,073 | 34,574 |
| datos_sinteticos-legal-multirespuesta-con_contexto-con_justificacion.jsonl | 56,240,005 | 34,908 |
| datos_sinteticos-legal-multirespuesta-con_contexto_sin_apari cion-sin_justificacion.jsonl | 56,079,917 | 109,936 |
| datos_sinteticos-legal-v_f-sin_contexto-sin_justificacion.jsonl | 53,454,775 | 203,498 |
| datos_sinteticos-legal-v_f-con_contexto_sin_aparicion-con_justificacion.jsonl | 43,947,955 | 61,554 |
| datos_sinteticos-legal-multirespuesta-con_contexto_sin_aparicion-con_justificacion.jsonl | 42,632,875 | 60,025 |
**Total:**
**7,411,809 instances**
**8,061,047,248 tokens**
Aclaración: Cuando el nombre de los splits aparece "contexto_sin_aparicion", se refiere a casos en los que, aunque la pregunta del usuario se generó originalmente a partir de un contexto determinado, dicho contexto fue eliminado posteriormente de la parte correspondiente al usuario ~~Dado el siguiente contexto: {contexto}~~, quedando solamente la pregunta basada en ese contexto como referencia implícita.
## Example Usage
To load the dataset:
```python
from datasets import load_dataset
# Load the complete dataset (all splits)
dataset = load_dataset("sinai-uja/ALIA-legal-synthetic-instructions")
```
Load specific splits
```python
# Instructions and questions without context
no_context_instructions = load_dataset(
"sinai-uja/ALIA-legal-synthetic-instructions",
split="instrucciones_sin_contexto",
)
no_context_questions = load_dataset(
"sinai-uja/ALIA-legal-synthetic-instructions",
split="preguntas_sin_contexto",
)
# Instructions and questions with context
context_instructions = load_dataset(
"sinai-uja/ALIA-legal-synthetic-instructions",
split="instrucciones_con_contexto",
)
context_questions = load_dataset(
"sinai-uja/ALIA-legal-synthetic-instructions",
split="preguntas_con_contexto",
)
```
Load multiple-choice and true/false splits
```python
# Multiple-choice (MCQ) formats
mc_no_context_no_just = load_dataset(
"sinai-uja/ALIA-legal-synthetic-instructions",
split="multirespuesta_sin_contexto_sin_justificacion",
)
mc_context_with_just = load_dataset(
"sinai-uja/ALIA-legal-synthetic-instructions",
split="multirespuesta_con_contexto_con_justificacion",
)
mc_context_with_just = load_dataset(
"sinai-uja/ALIA-legal-synthetic-instructions",
split="multirespuesta_con_contexto_sin_aparicion_sin_justificacion",
)
# True/False formats
tf_context_no_just = load_dataset(
"sinai-uja/ALIA-legal-synthetic-instructions",
split="vf_con_contexto_sin_justificacion",
)
tf_no_context_with_just = load_dataset(
"sinai-uja/ALIA-legal-synthetic-instructions",
split="vf_sin_contexto_con_justificacion",
)
tf_no_context_with_just = load_dataset(
"sinai-uja/ALIA-legal-synthetic-instructions",
split="vf_con_contexto_sin_apariricion_con_justificacion",
)
```
Streaming (recommended for large splits)
```python
streaming_dataset = load_dataset(
"sinai-uja/ALIA-legal-synthetic-instructions",
split="preguntas_sin_contexto",
streaming=True,
)
for i, example in enumerate(streaming_dataset):
print(f"[{i}] Q: {example['question']}")
print(f"A: {example['response']}\n")
if i >= 4:
break
```
---
# Dataset Creation
## Curation Rationale
This corpus was created to address the lack of legal linguistic resources in Spanish and to train robust models within the ALIA project.
## Source Data
The contexts used come from Spanish public legal and administrative documents.
## Data Collection and Processing
### MAGPIE Methodology
- The model completes the *user* part of the message by generating the question or instruction.
- Then it responds to its own generation.
- For context-based modes, context is explicitly added.
### Generated modalities
- Instructions
- Questions
- Context-based
- Multiple-choice test (with/without justification)
- T/F (with/without justification)
### Cleaning process
1. **Duplicate removal**
2. **Invalid response removal**
3. **Complete discard of data generated by Llama due to inconsistencies**
4. **Complete reconstruction with Phi-4**
5. **Semantic filter** with jina-embeddings-v3 (cosine ≥ 0.50)
### Quantitative impact
**Questions/Instructions without context:**
Questions and instructions that were generated without a prior context were removed.
| Stage | No. | Reduction |
|-------|----------|-----------|
| Initial | 3,300,000 | – |
| Basic cleaning | 2,911,366 | –11.8% |
| Semantic filter | 2,910,210 | –11.8% cumulative |
Total removed: **389,790 examples**
## Annotations
No manual annotations exist.
## Personal and Sensitive Information
The data comes from public documents.
Additional sensitive or identifiable information was filtered.
## Citation
```
@misc{alia_synthetic_legal_magpie,
title={ALIA Legal Administrative Synthetic Instructions Corpus},
author={SINAI Research Group, Universidad de Jaén},
year={2025},
url={https://github.com/sinai-uja/ALIA-UJA}
}
```
---
# Considerations for Using the Data
## Social Impact of Dataset
Facilitates the development of legal LLMs in Spanish and improves citizen access to legal information.
## Discussion of Biases
Reflects:
- biases from administrative language,
- inherent limitations of the generating model,
- structure of the Spanish legal system.
## Other Known Limitations
- Loss of structure in some complex documents
- More homogeneous style than real legal language
- Does not replace an authentic legal corpus
---
**Acknowledgments:**
This dataset has been generated thanks to SCAYLE (Centro de Supercomputación de Castilla y León) – https://www.scayle.es/ which provided the needed computational resources on its CALENDULA supercomputing cluster.
---
**Contact:**
ALIA Project – https://www.alia.gob.es
SINAI Research Group – https://sinai.ujaen.es
Universidad de Jaén – https://www.ujaen.es
提供机构:
SINAI



