
proxectonos/oasst2_gl

Hugging Face | Updated 2026-04-21 | Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/proxectonos/oasst2_gl
Resource description:
---
language:
- gl
pretty_name: OASST2 Galician Subset
task_categories:
- text-generation
task_ids:
- dialogue-generation
tags:
- galician
- instruction-tuning
- chat
- conversation
- openassistant
- translation
license: apache-2.0
size_categories:
- 1K<n<10K
---

# OASST2 Galician Subset

## Dataset description

This dataset is a Galician translation/adaptation of a subset of the [OASST2](OpenAssistant/oasst2) conversational dataset. It is intended for instruction tuning, dialogue modeling, and related experiments in Galician.

This release contains 1,786 instances in JSONL format. It does not include the full original OASST2 dataset.

The data preserves the original conversation-oriented structure, where messages are linked through tree and parent identifiers.

## Dataset structure

The dataset is distributed in JSONL format. Each line contains one message node with the following fields:

- `message_tree_id`: identifier of the conversation tree
- `message_id`: identifier of the current message
- `parent_id`: identifier of the parent message; empty for root messages
- `lang`: language code
- `role`: speaker role, typically `prompter` or `assistant`
- `text`: message content in Galician

### Example

```json
{
  "message_tree_id": "c55f670b-f384-48b0-ba71-e5a2b2c9137e",
  "message_id": "c55f670b-f384-48b0-ba71-e5a2b2c9137e",
  "parent_id": "",
  "lang": "gl",
  "role": "prompter",
  "text": "Crea un bloque de estatísticas para un poderoso monstro de tipo morto vivente en Dragóns e Alxubes quinta edición."
}
```

## Data source and creation

This dataset is based on a subset of the original [OASST2](OpenAssistant/oasst2) dataset and was translated/adapted into Galician. It preserves the message-level conversational structure of the source data, including tree-level and parent-child relationships between turns.

The main purpose of this version is to provide conversational and instruction-following data in Galician for experimentation, fine-tuning, and evaluation in low-resource settings.
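
Because each line holds a single message node, the parent-child links have to be followed to recover whole conversations. The following is a minimal sketch of one way to do that, not part of the original release. The helper names `load_jsonl`, `build_threads`, and `to_chat` are hypothetical. Two choices are assumptions: a message whose parent is absent from the subset is treated as a root, and `prompter` is mapped to `user` in the chat-style output.

```python
import json
from collections import defaultdict


def load_jsonl(path):
    """Read one JSON object per line, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def build_threads(messages):
    """Return every root-to-leaf path as a list of message dicts."""
    ids = {m["message_id"] for m in messages}
    children = defaultdict(list)  # parent message_id -> list of replies
    roots = []
    for msg in messages:
        parent = msg.get("parent_id")
        if parent and parent in ids:
            children[parent].append(msg)
        else:
            # Assumption: an empty parent_id, or a parent that is missing
            # from this subset, marks the start of a thread.
            roots.append(msg)

    threads = []
    stack = [[root] for root in roots]
    while stack:  # iterative depth-first walk over each tree
        path = stack.pop()
        replies = children.get(path[-1]["message_id"], [])
        if not replies:
            threads.append(path)  # reached a leaf: the path is complete
        for reply in replies:
            stack.append(path + [reply])
    return threads


def to_chat(thread):
    """Convert one thread into a list of role/content turns."""
    # Assumption: "prompter" corresponds to the "user" role used by
    # common chat formats; unknown roles are passed through unchanged.
    role_map = {"prompter": "user", "assistant": "assistant"}
    return [
        {"role": role_map.get(m["role"], m["role"]), "content": m["text"]}
        for m in thread
    ]
```

Each returned thread is one root-to-leaf path, so a prompt with several alternative replies yields several threads that share a common prefix. Whether to keep all of them or select one per tree depends on the training setup.
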
## Intended uses

This dataset can be used for:

- conversational fine-tuning in Galician
- dialogue generation
- instruction tuning for chat-oriented models
- multilingual or cross-lingual experiments
- low-resource NLP research

## Limitations

- This dataset is only a subset of the original OASST2 data.
- Since this is a translated/adapted version, some examples may reflect translation choices, stylistic variation, or localized phrasing relative to the source dataset.
- The dataset is structured at the message level, so conversation trees may need to be reconstructed programmatically for some use cases.
- The quality of the data depends on the translation/adaptation process used.

## Licensing

This dataset follows the same license as the original OASST2 dataset: Apache License 2.0.

## Usage

Example with `datasets`:

```python
from datasets import load_dataset

ds = load_dataset("json", data_files="oasst2_gl_subset.jsonl")
print(ds["train"][0])
```

As a first step toward reconstructing conversations, group the messages by tree:

```python
from collections import defaultdict
from datasets import load_dataset

ds = load_dataset("json", data_files="oasst2_gl_subset.jsonl")["train"]
trees = defaultdict(list)  # message_tree_id -> list of message rows
for row in ds:
    trees[row["message_tree_id"]].append(row)
```

## Acknowledgements

This work is funded by the Ministerio para la Transformación Digital y de la Función Pública and by the EU (NextGenerationEU), within the framework of the project Desarrollo de Modelos ALIA.

(Esta publicación del proyecto Desarrollo de Modelos ALIA está financiada por el Ministerio para la Transformación Digital y de la Función Pública y por el Plan de Recuperación, Transformación y Resiliencia – Financiado por la Unión Europea – NextGenerationEU)
Provider:
proxectonos