fabriciocarraro/tulu-3-sft-personas-instruction-following-es
Hugging Face, updated 2026-04-02, indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/fabriciocarraro/tulu-3-sft-personas-instruction-following-es
---
pretty_name: Tulu 3 SFT Personas Instruction Following (Spanish Translation)
language:
- es
license: odc-by
task_categories:
- text-generation
multilinguality: monolingual
size_categories:
- 10K<n<100K
annotations_creators:
- machine-generated
source_datasets:
- allenai/tulu-3-sft-personas-instruction-following
tags:
- instruction-following
- chat
- spanish
- synthetic
- translated
---
# Tulu 3 SFT Personas Instruction Following (Spanish Translation)
This dataset is a Spanish translation of [`allenai/tulu-3-sft-personas-instruction-following`](https://huggingface.co/datasets/allenai/tulu-3-sft-personas-instruction-following), a supervised fine-tuning dataset of roughly 30k examples designed to improve instruction following and constraint satisfaction in chat models.
The translation was created to make this style of instruction-following data more useful for Spanish-language model development while keeping the original task structure intact.
## Dataset Summary
- Total rows: `29,980`
- Language: Spanish
- Source dataset language: English
- Chat format: 2-message conversations (`user`, `assistant`)
- Translation method: OpenAI `gpt-5.4-mini`
- Structurally verified translations: `29,765`
- Rows present but not structurally verified: `215`
The dataset keeps the original prompt/response pairing and the original constraint labels, and it adds translation metadata so users can filter down to a stricter, structurally verified subset.
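Filtering to the stricter subset could look like the minimal sketch below. It works on plain Python dictionaries so it runs without network access; the helper name `verified_only` and the toy rows are hypothetical and not part of the dataset tooling.

```python
def verified_only(rows):
    """Keep rows whose translation_verified flag is exactly True."""
    return [row for row in rows if row.get("translation_verified") is True]


# Toy rows standing in for real dataset records (hypothetical values).
rows = [
    {"id": "a", "translation_verified": True},
    {"id": "b", "translation_verified": False},
    {"id": "c", "translation_verified": True},
]
strict = verified_only(rows)

# Share of structurally verified rows, using the counts from the summary.
verified_share = 29_765 / 29_980
print(f"{len(strict)} of {len(rows)} toy rows kept; "
      f"verified share in the full dataset: {verified_share:.2%}")
```

With the Hugging Face `datasets` library, the same idea would typically be written as `dataset.filter(lambda example: example["translation_verified"])` on a loaded split.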
## What Is In The Dataset
Each example contains:
- `id`: original example identifier from the source dataset
- `prompt`: Spanish user prompt
- `messages`: translated chat messages as a list of `role`/`content` objects
- `constraints`: original constraint tags from the source dataset
- `source`: provenance tag for the translated dataset
- `translation_verified`: boolean flag produced by local structural checks
- `translation_checks`: per-message structural verification details
- `translation_model`: translation model used for the example
The `constraints` field is preserved from the original dataset and remains useful for filtering or analysis by instruction type.
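As an illustration of filtering or analysis by instruction type, here is a small sketch that counts tags and selects rows by tag prefix. The helper names and row ids are hypothetical; the two tags are the ones shown in the schema example in the Data Format section, and the assumption that a prefix such as `format:` groups related tags is inferred from those two tags only.

```python
from collections import Counter


def count_constraints(rows):
    """Count how often each constraint tag appears across rows."""
    counts = Counter()
    for row in rows:
        counts.update(row.get("constraints", []))
    return counts


def with_constraint_prefix(rows, prefix):
    """Keep rows that have at least one constraint tag starting with prefix."""
    return [
        row for row in rows
        if any(tag.startswith(prefix) for tag in row.get("constraints", []))
    ]


# Toy rows (hypothetical ids; tags taken from the schema example).
rows = [
    {"id": "a", "constraints": ["format:title", "length constraints:number of words"]},
    {"id": "b", "constraints": ["format:title"]},
    {"id": "c", "constraints": []},
]

tag_counts = count_constraints(rows)
format_rows = with_constraint_prefix(rows, "format:")
```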
## Translation Methodology
The translation pipeline was designed to preserve instruction-following behavior, not just literal meaning. In particular, it aimed to:
- Translate all natural-language content into Spanish
- Preserve the number of messages and the role sequence
- Preserve formatting-sensitive instructions such as JSON output, bullet counts, headings, paragraph counts, quoting conventions, and title wrappers
- Preserve placeholders and structural markers
- Adapt explicit language constraints when needed so that requests like "answer in English" become equivalent Spanish-language constraints
Translations were generated in batches through the OpenAI Responses API using structured outputs.
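The requirement to preserve the number of messages and the role sequence can be expressed as a simple invariant. The sketch below is an illustration only, not the pipeline's actual code, and the function name `same_conversation_shape` is hypothetical.

```python
def same_conversation_shape(source_messages, translated_messages):
    """Return True when both conversations have the same roles in the same order.

    Comparing the role lists also covers the message count, because lists of
    different lengths never compare equal.
    """
    source_roles = [message["role"] for message in source_messages]
    translated_roles = [message["role"] for message in translated_messages]
    return source_roles == translated_roles


# Toy conversations (hypothetical content).
source = [
    {"role": "user", "content": "Write a short title."},
    {"role": "assistant", "content": "<<A Title>>"},
]
translated = [
    {"role": "user", "content": "Escribe un título corto."},
    {"role": "assistant", "content": "<<Un título>>"},
]
shape_ok = same_conversation_shape(source, translated)
```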
## Verification
Each translated example was checked with lightweight structural heuristics. These checks include preservation of:
- placeholder counts
- heading counts
- bullet/list structure
- divider markers
- JSON validity when the original target was JSON
- some casing and punctuation-sensitive constraints
Examples that pass all recorded checks are marked with `translation_verified = true`.
This flag is intentionally conservative:
- `true` means the example passed the implemented structural checks
- `false` does **not** necessarily mean the translation is unusable; it means at least one structural check did not pass
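To make the idea of a structural check concrete, the sketch below compares a few counts between a source text and its translation. The card does not publish the actual heuristics, so the regular expressions, the square-bracket placeholder convention, and all function names here are assumptions. Divider, casing, and punctuation checks are left out.

```python
import json
import re

# Assumed conventions: [placeholder], Markdown "#" headings, "-" or "*" bullets.
PLACEHOLDER = re.compile(r"\[[^\[\]\n]+\]")
HEADING = re.compile(r"^#{1,6}[ \t]", re.MULTILINE)
BULLET = re.compile(r"^[ \t]*[-*][ \t]+\S", re.MULTILINE)


def structure_profile(text):
    """Count the structural markers that a translation should preserve."""
    return {
        "placeholders": len(PLACEHOLDER.findall(text)),
        "headings": len(HEADING.findall(text)),
        "bullets": len(BULLET.findall(text)),
    }


def is_json(text):
    """Return True when text parses as JSON."""
    try:
        json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return True


def structurally_equivalent(source, translated):
    """Pass only if marker counts match and JSON sources stay valid JSON."""
    if structure_profile(source) != structure_profile(translated):
        return False
    if is_json(source) and not is_json(translated):
        return False
    return True


source_text = "# Title\n- one [name]\n- two"
translated_text = "# Título\n- uno [nombre]\n- dos"
passes = structurally_equivalent(source_text, translated_text)
```

A check of this kind is conservative in the same way as the flag described above: a translation that merges two bullets into one would fail even if its meaning were intact.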
## Data Format
Example schema:
```python
{
    "id": "personas_IF_...",
    "prompt": "Escribe ...",
    "messages": [
        {"role": "user", "content": "..."},
        {"role": "assistant", "content": "..."}
    ],
    "constraints": ["format:title", "length constraints:number of words"],
    "source": "tulu_personas_if_es_openai",
    "translation_verified": True,
    "translation_checks": [...],
    "translation_model": "gpt-5.4-mini"
}
```
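When consuming records in this shape, a light sanity check before training can be useful. The following sketch assumes the field names listed in this card; the helper names `missing_fields` and `to_pair` are hypothetical.

```python
REQUIRED_FIELDS = {
    "id", "prompt", "messages", "constraints", "source",
    "translation_verified", "translation_checks", "translation_model",
}


def missing_fields(example):
    """Return the sorted names of expected fields absent from example."""
    return sorted(REQUIRED_FIELDS - set(example))


def to_pair(example):
    """Return (user_text, assistant_text) from a 2-message conversation.

    Raises ValueError if there are not exactly two messages or the roles
    are not user followed by assistant.
    """
    user, assistant = example["messages"]
    if (user["role"], assistant["role"]) != ("user", "assistant"):
        raise ValueError("expected roles: user, assistant")
    return user["content"], assistant["content"]


# Record mirroring the schema example, with the elided list left empty.
example = {
    "id": "personas_IF_...",
    "prompt": "Escribe ...",
    "messages": [
        {"role": "user", "content": "Escribe ..."},
        {"role": "assistant", "content": "..."},
    ],
    "constraints": ["format:title", "length constraints:number of words"],
    "source": "tulu_personas_if_es_openai",
    "translation_verified": True,
    "translation_checks": [],
    "translation_model": "gpt-5.4-mini",
}
user_text, assistant_text = to_pair(example)
```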
## Licensing And Provenance
This dataset is derived from:
- [`allenai/tulu-3-sft-personas-instruction-following`](https://huggingface.co/datasets/allenai/tulu-3-sft-personas-instruction-following)
The source dataset card lists the license as `odc-by`. This translated release should be treated as a derivative work of the original dataset, with attribution to the original creators.
If you reuse this dataset, please cite or acknowledge both:
- the original AllenAI dataset
- this Spanish translated derivative
## Acknowledgements
Thanks to the AllenAI team for releasing the original Tulu 3 instruction-following data, and to the open-source Hugging Face ecosystem for making derivative dataset publication straightforward.
## Citation
If you use this translated dataset, please cite the original dataset and mention this Spanish translation release in your implementation details or dataset appendix.
Provider:
fabriciocarraro



