proxectonos/oasst2_gl
收藏Hugging Face2026-04-21 更新2026-04-26 收录
下载链接:
https://hf-mirror.com/datasets/proxectonos/oasst2_gl
下载链接
链接失效反馈官方服务:
资源简介:
---
language:
- gl
pretty_name: OASST2 Galician Subset
task_categories:
- text-generation
task_ids:
- dialogue-generation
tags:
- galician
- instruction-tuning
- chat
- conversation
- openassistant
- translation
license: apache-2.0
size_categories:
- 1K<n<10K
---
# OASST2 Galician Subset
## Dataset description
This dataset is a Galician translation/adaptation of a subset of the [OASST2](OpenAssistant/oasst2) conversational dataset. It is intended for instruction tuning, dialogue modeling, and related experiments in Galician.
This release contains 1,786 instances in JSONL format. It does not include the full original OASST2 dataset. The data preserves the original conversation-oriented structure, where messages are linked through tree and parent identifiers.
## Dataset structure
The dataset is distributed in JSONL format. Each line contains one message node with the following fields:
- `message_tree_id`: identifier of the conversation tree
- `message_id`: identifier of the current message
- `parent_id`: identifier of the parent message; empty for root messages
- `lang`: language code
- `role`: speaker role, typically `prompter` or `assistant`
- `text`: message content in Galician
### Example
```json
{
"message_tree_id": "c55f670b-f384-48b0-ba71-e5a2b2c9137e",
"message_id": "c55f670b-f384-48b0-ba71-e5a2b2c9137e",
"parent_id": "",
"lang": "gl",
"role": "prompter",
"text": "Crea un bloque de estatísticas para un poderoso monstro de tipo morto vivente en Dragóns e Alxubes quinta edición."
}
```
## Data source and creation
This dataset is based on a subset of the original [OASST2](OpenAssistant/oasst2) dataset and was translated/adapted into Galician. It preserves the message-level conversational structure of the source data, including tree-level and parent-child relationships between turns.
The main purpose of this version is to provide conversational and instruction-following data in Galician for experimentation, fine-tuning, and evaluation in low-resource settings.
## Intended uses
This dataset can be used for:
- conversational fine-tuning in Galician
- dialogue generation
- instruction tuning for chat-oriented models
- multilingual or cross-lingual experiments
- low-resource NLP research
## Limitations
- This dataset is only a subset of the original OASST2 data.
- Since this is a translated/adapted version, some examples may reflect translation choices, stylistic variation, or localized phrasing relative to the source dataset.
- The dataset is structured at the message level, so conversation trees may need to be reconstructed programmatically for some use cases.
- The quality of the data depends on the translation/adaptation process used.
## Licensing
This dataset follows the same license as the original OASST2 dataset: Apache License 2.0.
## Usage
Example with `datasets`:
```python
from datasets import load_dataset
ds = load_dataset("json", data_files="oasst2_gl_subset.jsonl")
print(ds["train"][0])
```
If you want to reconstruct conversations by tree:
```python
from datasets import load_dataset
ds = load_dataset("json", data_files="oasst2_gl_subset.jsonl")["train"]
print(ds[0]["message_tree_id"], ds[0]["role"], ds[0]["text"])
```
## Acknowledgements
This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the project Desarrollo de Modelos ALIA. (Esta publicación del proyecto Desarrollo de Modelos ALIA está financiada por el Ministerio para la Transformación Digital y de la Función Pública y por el Plan de Recuperación, Transformación y Resiliencia – Financiado por la Unión Europea – NextGenerationEU)
提供机构:
proxectonos



