proxectonos/wikipedia_multiple_choice_qa
Source: Hugging Face. Updated 2026-04-21; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/proxectonos/wikipedia_multiple_choice_qa
---
pretty_name: Galician and Portuguese Multiple-Choice QA Instruction Subsets
language:
- gl
- pt
license: cc-by-4.0
task_categories:
- text-generation
task_ids:
- text2text-generation
tags:
- galician
- portuguese
- instruction-tuning
- multiple-choice-qa
- llm-generated
- synthetic-data
- low-resource-nlp
size_categories:
- 1K<n<10K
configs:
- config_name: gl_wikipedia_multiple_choice_qa
  data_files:
  - split: train
    path: gl_wikipedia_multiple-choice_qa.jsonl
- config_name: pt_wikipedia_multiple_choice_qa
  data_files:
  - split: train
    path: pt_wikipedia_multiple-choice_qa.jsonl
---
# Galician and Portuguese Multiple-Choice QA Instruction Subsets
## Dataset description
This dataset contains two instruction-tuning subsets for multiple-choice question answering in Galician and Portuguese:
- `gl_wikipedia_multiple_choice_qa` (1,486 instances)
- `pt_wikipedia_multiple_choice_qa` (547 instances)
Both subsets are reformatted versions of QA data originally included in the `cpt_instruction_datasets` collection, adapted here as standalone instruction-style datasets.
Each example contains a context, an instruction prompt, a question, a list of candidate answers, the correct answer, the corresponding answer index, and the number of words in the context.
## Data source and creation
These datasets were created from Wikipedia paragraphs in Galician and Portuguese. For each paragraph, a large language model was prompted to generate a multiple-choice question together with several candidate answers, where only one answer should be correct.
The resulting examples were then reformatted as instruction-style datasets for multiple-choice question answering. Automatic filtering was applied in an attempt to improve overall quality and remove problematic generations. However, because the data was generated with LLM assistance, factual inconsistencies, hallucinations, ambiguous formulations, or incorrect answer options may still be present.
The Galician subset contains 1,486 instances and the Portuguese subset contains 547 instances.
## Dataset structure
The dataset is distributed in JSONL format and exposed as two subsets/configurations.
Each instance contains the following fields:
- `context`: source passage or contextual text
- `prompt`: instruction shown to the model
- `question`: multiple-choice question
- `answers`: list of answer options
- `correct_answer`: correct option text
- `answer_index`: zero-based index of the correct answer in the `answers` list
- `num_words`: number of words in the context
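Because the files are JSONL, each line is one JSON object carrying the fields listed above. The following is a minimal sketch of reading such a stream with the standard library only. The two records are hypothetical, shortened stand-ins written for illustration (they are not rows from the dataset), and `read_jsonl` is an illustrative helper name, not part of any library.

```python
import io
import json

# Two hypothetical, shortened records in the card's schema (not real dataset rows).
SAMPLE_JSONL = (
    '{"context": "Texto de exemplo.", "prompt": "Cal é a opción correcta?", '
    '"question": "Pregunta?", "answers": ["A", "B", "C"], '
    '"correct_answer": "B", "answer_index": 1, "num_words": 3}\n'
    '{"context": "Outro texto breve.", "prompt": "Cal é a opción correcta?", '
    '"question": "Outra pregunta?", "answers": ["X", "Y"], '
    '"correct_answer": "X", "answer_index": 0, "num_words": 3}\n'
)

# Field names as listed in this card.
EXPECTED_FIELDS = {
    "context", "prompt", "question", "answers",
    "correct_answer", "answer_index", "num_words",
}


def read_jsonl(handle):
    """Yield one dict per non-empty line of a JSONL stream."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


records = list(read_jsonl(io.StringIO(SAMPLE_JSONL)))
for record in records:
    # Report any field from the card's schema that the record lacks.
    missing = EXPECTED_FIELDS - set(record)
    print(record["question"], "| missing fields:", sorted(missing))
```

To read one of the downloaded files instead, the same helper could be given an open file handle, for example `open(path, encoding="utf-8")`.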
### Example
```json
{
  "context": "A conservación destes fósiles está considerada como un enigma científico fascinante pois ao seren organismos de corpo brando, normalmente non fosilizarían. A diferenza doutras formas de vida de corpo brando posteriores (como as dos xistos de Burgess ou as calcarias de Solnhofen), os organismos ediacáricos non estaban localizados en ambientes restrinxidos suxeitos a condicións locais infrecuentes, senón que eran un fenómeno global. Por tanto, os procesos que interviñeron na fosilización deberon ser sistemáticos e presentes en todo o mundo. Deberon existir condicións moi diferentes ás actuais durante o período ediacárico que permitiron que estas delicadas criaturas se conservasen. A hipótese máis estendida é que os fósiles se preservaron grazas a que foron rapidamente cubertos por cinzas ou area, atrapándoos xunto á lama ou aos tapetes de microbios nos que vivían. As capas de cinza ofrecen máis detalles fósiles e poden datarse con precisión cunha marxe de erro dun millón de anos ou menos por medio da datación radiométrica.",
  "prompt": "De acordo ao seguinte texto, cal é a opción correcta?:",
  "question": "Que proceso puido ter lugar para permitir a preservación sistemática dos fósiles de organismos ediacáricos?",
  "answers": [
    "Os fósiles quedaron cubertos por cinzas",
    "Os fósiles quedaron cubertos por cinzas",
    "Os fósiles quedaron cubertos por area",
    "As capas de cinza non ofrecen información sobre a súa preservación",
    "Non existe ningunha teoría sobre este tema"
  ],
  "correct_answer": "Os fósiles quedaron cubertos por cinzas",
  "answer_index": 0,
  "num_words": 230
}
```
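The example above lists the same option twice, which is the kind of issue the card warns may survive automatic filtering. A minimal sketch of a per-instance sanity check follows. It assumes `answer_index` is zero-based (consistent with the example), and `check_instance` is a hypothetical helper written for illustration; it is not the filtering procedure used to build the dataset.

```python
def check_instance(instance):
    """Return a list of human-readable problems found in one instance."""
    problems = []
    answers = instance["answers"]
    index = instance["answer_index"]

    # The index must address an existing option and agree with correct_answer.
    if not 0 <= index < len(answers):
        problems.append("answer_index out of range")
    elif answers[index] != instance["correct_answer"]:
        problems.append("answer_index does not point at correct_answer")

    # Repeated option texts make the question ambiguous.
    if len(set(answers)) != len(answers):
        problems.append("duplicate answer options")
    return problems


# Answer fields copied from the example above; other fields omitted for brevity.
example = {
    "answers": [
        "Os fósiles quedaron cubertos por cinzas",
        "Os fósiles quedaron cubertos por cinzas",
        "Os fósiles quedaron cubertos por area",
        "As capas de cinza non ofrecen información sobre a súa preservación",
        "Non existe ningunha teoría sobre este tema",
    ],
    "correct_answer": "Os fósiles quedaron cubertos por cinzas",
    "answer_index": 0,
}

print(check_instance(example))  # ['duplicate answer options']
```

Instances for which the returned list is non-empty could be dropped or reviewed manually before training or evaluation.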
## Intended uses
These datasets can be used for:
- instruction tuning of LLMs for multiple-choice QA
- multilingual or cross-lingual QA experiments
- low-resource NLP research
- evaluation of instruction-following behavior in Galician and Portuguese
- experiments on synthetic QA data
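For instruction tuning, each instance has to be rendered as an input text and a target text. The sketch below shows one possible rendering with lettered options. The template, the `format_instance` helper, and the toy record are all assumptions made for illustration; the card does not prescribe a prompt format.

```python
import string


def format_instance(instance):
    """Render one instance as an (input_text, target_text) pair."""
    letters = string.ascii_uppercase
    # Label the options A, B, C, ... in their original order.
    options = "\n".join(
        f"{letters[i]}. {answer}" for i, answer in enumerate(instance["answers"])
    )
    input_text = (
        f"{instance['context']}\n\n"
        f"{instance['prompt']}\n"
        f"{instance['question']}\n"
        f"{options}"
    )
    # The target repeats the letter and text of the correct option.
    target_letter = letters[instance["answer_index"]]
    target_text = f"{target_letter}. {instance['correct_answer']}"
    return input_text, target_text


# Hypothetical toy record in the card's schema (not a real dataset row).
toy = {
    "context": "Santiago de Compostela é a capital de Galicia.",
    "prompt": "De acordo ao seguinte texto, cal é a opción correcta?:",
    "question": "Cal é a capital de Galicia?",
    "answers": ["Vigo", "Santiago de Compostela", "Lugo"],
    "correct_answer": "Santiago de Compostela",
    "answer_index": 1,
    "num_words": 8,
}

input_text, target_text = format_instance(toy)
print(input_text)
print("->", target_text)
```

Keeping the option order unchanged preserves the meaning of `answer_index`; if options are shuffled, the index has to be recomputed from `correct_answer`.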
## Limitations
- These datasets were generated with LLM assistance and may contain factual inaccuracies or inconsistencies.
- Although automatic filtering was applied, the data should not be assumed to be fully error-free.
- The examples are reformatted synthetic instruction data and should not be treated as gold-standard human-annotated benchmarks.
- The two subsets differ substantially in size, which may affect multilingual comparisons.
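One way to reduce the effect of the size difference in multilingual comparisons is to downsample the larger subset to the size of the smaller one. The sketch below only computes row indices from the instance counts stated in this card. The `balanced_indices` helper and the fixed seed are illustrative assumptions, and downsampling is just one option among several (weighting or reporting per-language results separately are others).

```python
import random

GL_SIZE = 1486  # Galician instances, as stated in this card
PT_SIZE = 547   # Portuguese instances, as stated in this card


def balanced_indices(size_a, size_b, seed=0):
    """Pick equally many row indices from two subsets by downsampling the larger."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    n = min(size_a, size_b)
    idx_a = sorted(rng.sample(range(size_a), n))
    idx_b = sorted(rng.sample(range(size_b), n))
    return idx_a, idx_b


gl_idx, pt_idx = balanced_indices(GL_SIZE, PT_SIZE)
print(len(gl_idx), len(pt_idx))  # 547 547
```

With the `datasets` library, such index lists could then be passed to `Dataset.select` to obtain the balanced subsets.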
## Licensing
This dataset is released under the CC BY 4.0 license.
## Usage
Example with `datasets`:
```python
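from datasets import load_dataset

# Load the Galician subset (config names match `configs` in the card metadata).
ds_gl = load_dataset(
    "proxectonos/wikipedia_multiple_choice_qa",
    "gl_wikipedia_multiple_choice_qa",
    split="train",
)

# Load the Portuguese subset.
ds_pt = load_dataset(
    "proxectonos/wikipedia_multiple_choice_qa",
    "pt_wikipedia_multiple_choice_qa",
    split="train",
)

# Print the first instance of each subset.
print(ds_gl[0])
print(ds_pt[0])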
```
## Citation
If you use this dataset, please cite the following paper:
```bibtex
@inproceedings{rodriguez-etal-2025-continued,
title = "Continued Pretraining and Interpretability-Based Evaluation for Low-Resource Languages: A {G}alician Case Study",
author = "Rodr{\'i}guez, Pablo and
Su{\'a}rez, Silvia Paniagua and
Gamallo, Pablo and
Docio, Susana Sotelo",
editor = "Che, Wanxiang and
Nabende, Joyce and
Shutova, Ekaterina and
Pilehvar, Mohammad Taher",
booktitle = "Findings of the Association for Computational Linguistics: ACL 2025",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.findings-acl.240/",
doi = "10.18653/v1/2025.findings-acl.240",
pages = "4622--4637",
ISBN = "979-8-89176-256-5",
abstract = "Recent advances in Large Language Models (LLMs) have led to remarkable improvements in language understanding and text generation. However, challenges remain in enhancing their performance for underrepresented languages, ensuring continual learning without catastrophic forgetting, and developing robust evaluation methodologies. This work addresses these issues by investigating the impact of Continued Pretraining (CPT) on multilingual models and proposing a comprehensive evaluation framework for LLMs, focusing on the case of Galician language. Our first contribution explores CPT strategies for languages with limited representation in multilingual models. We analyze how CPT with Galician corpora improves text generation while assessing the trade-offs between linguistic enrichment and task-solving capabilities. Our findings show that CPT with small, high-quality corpora and diverse instructions enhances both task performance and linguistic quality. Our second contribution is a structured evaluation framework based on distinguishing task-based and language-based assessments, leveraging existing and newly developed benchmarks for Galician. Additionally, we contribute new Galician LLMs, datasets for evaluation and instructions, and an evaluation framework."
}
```
## Acknowledgements
These datasets were developed and compiled within the Nós Project, funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the [project ILENIA](https://proyectoilenia.es/) with reference 2022/TL22/00215336.
Provided by: proxectonos



