proxectonos/calame-gl
Source: Hugging Face. Updated 2026-04-21. Indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/proxectonos/calame-gl
Resource description:
---
language:
- gl
pretty_name: calame-gl
task_categories:
- text-generation
task_ids:
- language-modeling
tags:
- galician
- evaluation
- benchmark
- language-modeling
- text-completion
- calame
license: mit
size_categories:
- 1K<n<10K
---
# CALAME Galician
## Dataset description
CALAME-gl is a Galician translation/adaptation of the Portuguese [CALAME-PT](https://huggingface.co/datasets/NOVA-vision-language/calame-pt) benchmark.
The dataset is composed of short texts or contexts and their respective last words. These contexts are designed to contain enough information for a human or a model to infer the final word, while avoiding contexts that are excessively specific or overly ambiguous.
This release contains 930 instances in JSON format and is intended primarily for evaluation.
## Dataset structure
The dataset is distributed in JSON format as a list of examples. Each instance contains the following fields:
- `id`: example identifier
- `sentence`: context in Galician
- `last_word`: final word associated with the context
### Example
```json
{
  "id": 0,
  "sentence": "Os fans de GTA están ansiosos polo lanzamento do próximo xogo da serie, cuxo lanzamento pódese demorar algúns anos máis. Os rumores apuntan a que o GTA VI será unha versión moderna de Vice City e contará cun mapa que muda co paso do tempo. Alén diso, existe a posibilidade dunha protagonista feminina, o que trae máis expectativas ao xogo. Mentres agardamos, queda imaxinar o que esa nova aventura nos",
  "last_word": "depara"
}
```
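A minimal sketch of a structural check on the records, assuming the file has already been parsed into a list of dictionaries. The helper name `validate_records` and the expectation that `sentence` and `last_word` are non-empty strings are illustrative assumptions inferred from the example above, not part of the dataset's own tooling. The sample records are shortened or made-up placeholders.

```python
import json

REQUIRED_FIELDS = ("id", "sentence", "last_word")


def validate_records(records):
    """Return (index, problem) pairs; an empty list means every record passed."""
    problems = []
    for i, rec in enumerate(records):
        # Every record should carry the three documented fields.
        for field in REQUIRED_FIELDS:
            if field not in rec:
                problems.append((i, f"missing field: {field}"))
        # Assumption: the text fields are non-empty strings, as in the example.
        for field in ("sentence", "last_word"):
            value = rec.get(field)
            if field in rec and (not isinstance(value, str) or not value.strip()):
                problems.append((i, f"empty or non-string field: {field}"))
    return problems


# Tiny in-memory sample standing in for the parsed file contents.
# Record 0 is a shortened form of the example above; 1 and 2 are deliberately broken.
sample = json.loads("""
[
  {"id": 0, "sentence": "Mentres agardamos, queda imaxinar o que esa nova aventura nos", "last_word": "depara"},
  {"id": 1, "sentence": "", "last_word": "depara"},
  {"id": 2, "sentence": "Un contexto sen palabra final"}
]
""")

problems = validate_records(sample)
for index, problem in problems:
    print(index, problem)
```

With the real file, the same function could be applied to the result of `json.load` on `calame-gl.json` opened with `encoding="utf-8"`.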
## Data source and creation
This dataset is based on the Portuguese benchmark [CALAME-PT](https://huggingface.co/datasets/NOVA-vision-language/calame-pt) and was translated/adapted into Galician. The Galician version preserves the same evaluation-oriented structure as the original dataset: each example contains a context and its corresponding final word.
The goal of this version is to provide a Galician benchmark for evaluating a model's ability to infer or predict the final word of a context.
## Intended uses
This dataset can be used for:
- evaluation of language models in Galician
- text completion evaluation
- last-word prediction tasks
- low-resource NLP research
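As a rough illustration of the last-word prediction use listed above, the sketch below scores a completion function by exact match between the first generated word and `last_word`. The metric (case-insensitive exact match after stripping surrounding punctuation), the helper names, and the `dummy_complete` predictor are all assumptions made for this example. The card does not specify an official scoring procedure, and the original CALAME-PT evaluation may differ.

```python
import string

# Assumption: also strip a few punctuation marks common in Galician/Spanish text.
_PUNCT = string.punctuation + "«»¿¡"


def normalize(word):
    """Lowercase a word and strip surrounding whitespace and punctuation."""
    return word.strip().strip(_PUNCT).lower()


def first_word(text):
    """Return the first whitespace-delimited token of a completion, normalized."""
    tokens = text.split()
    return normalize(tokens[0]) if tokens else ""


def last_word_accuracy(examples, complete):
    """Fraction of examples whose first predicted word equals `last_word`.

    `complete` is any callable that maps a context string to generated text.
    """
    if not examples:
        return 0.0
    hits = sum(
        first_word(complete(ex["sentence"])) == normalize(ex["last_word"])
        for ex in examples
    )
    return hits / len(examples)


def dummy_complete(context):
    """Hypothetical stand-in for a real language model."""
    return " depara." if context.endswith("nos") else "algo"


# Placeholder records in the dataset's schema (not taken from the file).
examples = [
    {"id": 0, "sentence": "queda imaxinar o que esa nova aventura nos", "last_word": "depara"},
    {"id": 1, "sentence": "contexto de proba", "last_word": "final"},
]
accuracy = last_word_accuracy(examples, dummy_complete)
print(f"exact-match accuracy: {accuracy:.2f}")
```

In practice `dummy_complete` would be replaced by a call into the model under evaluation. Taking only the first generated word keeps the comparison independent of how much text the model goes on to produce.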
## Limitations
- This dataset is a translated/adapted version of the original Portuguese CALAME-PT benchmark.
- It contains 930 examples, so it is intended primarily for evaluation rather than large-scale training.
- Since this is a translated/adapted version, some examples may reflect translation choices or stylistic variation relative to the source dataset.
## Licensing
This dataset follows the same license as the original CALAME-PT dataset: MIT.
## Usage
Example with `datasets`:
```python
from datasets import load_dataset

# Load the local JSON file; the "json" builder puts it in a "train" split by default.
ds = load_dataset("json", data_files="calame-gl.json")
print(ds["train"][0])
```
Example of accessing the context and final word:
```python
from datasets import load_dataset

ds = load_dataset("json", data_files="calame-gl.json")["train"]
print(ds[0]["sentence"])   # context
print(ds[0]["last_word"])  # expected final word
```
## Acknowledgements
This dataset was compiled within the Nós Project, funded by the Ministerio para la Transformación Digital y de la Función Pública and by the EU (NextGenerationEU) within the framework of the ILENIA project, reference 2022/TL22/00215336.
Provider:
proxectonos



