
fabriciocarraro/tulu-3-sft-personas-instruction-following-es

Hugging Face. Updated 2026-04-02; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/fabriciocarraro/tulu-3-sft-personas-instruction-following-es
Description:
---
pretty_name: Tulu 3 SFT Personas Instruction Following (Spanish Translation)
language:
- es
license: odc-by
task_categories:
- text-generation
multilinguality: monolingual
size_categories:
- 10K<n<100K
annotations_creators:
- machine-generated
source_datasets:
- allenai/tulu-3-sft-personas-instruction-following
tags:
- instruction-following
- chat
- spanish
- synthetic
- translated
---

# Tulu 3 SFT Personas Instruction Following (Spanish Translation)

This dataset is a Spanish translation of [`allenai/tulu-3-sft-personas-instruction-following`](https://huggingface.co/datasets/allenai/tulu-3-sft-personas-instruction-following), a 30k-example supervised fine-tuning dataset designed to improve instruction following and constraint satisfaction in chat models.

The translation was created to make this style of instruction-following data more useful for Spanish-language model development while keeping the original task structure intact.

## Dataset Summary

- Total rows: `29,980`
- Language: Spanish
- Source dataset language: English
- Chat format: 2-message conversations (`user`, `assistant`)
- Translation method: OpenAI `gpt-5.4-mini`
- Structurally verified translations: `29,765`
- Rows present but not structurally verified: `215`

The dataset keeps the original prompt/response pairing style, the original constraint labels, and adds translation metadata so users can filter to a stricter subset if they want.
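As an illustration of filtering to the stricter subset, here is a minimal sketch in plain Python. The sample rows, their `id` values, and the helper name `verified_only` are hypothetical; only the field names and the row counts come from this card.

```python
from typing import Any

# Hypothetical rows that follow the card's field names; real rows would be
# read from the dataset files.
rows: list[dict[str, Any]] = [
    {"id": "example-1", "translation_verified": True,
     "constraints": ["format:title"]},
    {"id": "example-2", "translation_verified": False,
     "constraints": ["format:title"]},
    {"id": "example-3", "translation_verified": True,
     "constraints": ["length constraints:number of words"]},
]


def verified_only(records: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep only rows whose recorded structural checks all passed."""
    return [r for r in records if r.get("translation_verified") is True]


strict_subset = verified_only(rows)

# Share of verified rows, using the counts reported in the summary above.
TOTAL_ROWS = 29_980
VERIFIED_ROWS = 29_765
verified_share = VERIFIED_ROWS / TOTAL_ROWS

print(f"{len(strict_subset)} of {len(rows)} sample rows kept")
print(f"Reported verified share: {verified_share:.2%}")
```

When the data is loaded with the Hugging Face `datasets` library, the same predicate could be passed to `Dataset.filter` instead of a list comprehension.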

## What Is In The Dataset

Each example contains:

- `id`: original example identifier from the source dataset
- `prompt`: Spanish user prompt
- `messages`: translated chat messages in ShareGPT-style message format
- `constraints`: original constraint tags from the source dataset
- `source`: provenance tag for the translated dataset
- `translation_verified`: boolean flag produced by local structural checks
- `translation_checks`: per-message structural verification details
- `translation_model`: translation model used for the example

The `constraints` field is preserved from the original dataset and remains useful for filtering or analysis by instruction type.

## Translation Methodology

The translation pipeline was designed to preserve instruction-following behavior, not just literal meaning. In particular, it aimed to:

- Translate all natural-language content into Spanish
- Preserve the number of messages and the role sequence
- Preserve formatting-sensitive instructions such as JSON output, bullet counts, headings, paragraph counts, quoting conventions, and title wrappers
- Preserve placeholders and structural markers
- Adapt explicit language constraints when needed so that requests like "answer in English" become equivalent Spanish-language constraints

Translations were generated in batches through the OpenAI Responses API using structured outputs.

## Verification

Each translated example was checked with lightweight structural heuristics. These checks include preservation of:

- placeholder counts
- heading counts
- bullet/list structure
- divider markers
- JSON validity when the original target was JSON
- some casing and punctuation-sensitive constraints

Examples that pass all recorded checks are marked with `translation_verified = true`.
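
To make the idea of structural heuristics concrete, the following is a rough sketch of how a few of the listed checks could be written. It is not the pipeline used to build this dataset: the regular expressions, the assumed `[name]` placeholder style, and the function names are all hypothetical.

```python
import json
import re

# Assumed conventions (not taken from the dataset's actual pipeline):
PLACEHOLDER_RE = re.compile(r"\[[^\[\]\n]+\]")           # e.g. [name]
HEADING_RE = re.compile(r"^#{1,6}\s", re.MULTILINE)      # Markdown headings
BULLET_RE = re.compile(r"^[ \t]*[-*]\s", re.MULTILINE)   # "- item" or "* item"


def is_json(text: str) -> bool:
    """Return True if the text parses as a JSON object or array."""
    try:
        return isinstance(json.loads(text), (dict, list))
    except json.JSONDecodeError:
        return False


def count(pattern: re.Pattern, text: str) -> int:
    """Count non-overlapping matches of a pattern in the text."""
    return len(pattern.findall(text))


def structural_checks(original: str, translated: str) -> dict[str, bool]:
    """Compare simple structural counts between source and translation."""
    checks = {
        "placeholders": count(PLACEHOLDER_RE, original) == count(PLACEHOLDER_RE, translated),
        "headings": count(HEADING_RE, original) == count(HEADING_RE, translated),
        "bullets": count(BULLET_RE, original) == count(BULLET_RE, translated),
    }
    # JSON validity only matters when the original message was JSON.
    if is_json(original):
        checks["json_valid"] = is_json(translated)
    return checks


def is_verified(checks: dict[str, bool]) -> bool:
    """An example counts as verified only if every check passed."""
    return all(checks.values())


original = "Write a letter to [name].\n\n# Title\n- point one\n- point two"
translated_ok = "Escribe una carta a [name].\n\n# Título\n- punto uno\n- punto dos"
translated_bad = "Escribe una carta a [name].\n\n# Título\n- punto uno"

print(structural_checks(original, translated_ok))
print(structural_checks(original, translated_bad))
```

In this sketch a single failed check is enough to leave an example unverified, even when the translation itself reads well.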

The `translation_verified` flag is intentionally conservative:

- `true` means the example passed the implemented structural checks
- `false` does **not** necessarily mean the translation is unusable; it means at least one structural check did not pass

## Data Format

Example schema:

```python
{
    "id": "personas_IF_...",
    "prompt": "Escribe ...",
    "messages": [
        {"role": "user", "content": "..."},
        {"role": "assistant", "content": "..."}
    ],
    "constraints": ["format:title", "length constraints:number of words"],
    "source": "tulu_personas_if_es_openai",
    "translation_verified": True,
    "translation_checks": [...],
    "translation_model": "gpt-5.4-mini"
}
```

## Licensing And Provenance

This dataset is derived from:

- [`allenai/tulu-3-sft-personas-instruction-following`](https://huggingface.co/datasets/allenai/tulu-3-sft-personas-instruction-following)

The source dataset card lists the license as `odc-by`. This translated release should be treated as a derivative work of the original dataset, with attribution to the original creators.

If you reuse this dataset, please cite or acknowledge both:

- the original AllenAI dataset
- this Spanish translated derivative

## Acknowledgements

Thanks to the AllenAI team for releasing the original Tulu 3 instruction-following data, and to the open-source Hugging Face ecosystem for making derivative dataset publication straightforward.

## Citation

If you use this translated dataset, please cite the original dataset and mention this Spanish translation release in your implementation details or dataset appendix.
Provider:
fabriciocarraro