
proxectonos/calame-gl

Source: Hugging Face. Updated 2026-04-21. Indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/proxectonos/calame-gl
Resource description:
---
language:
- gl
pretty_name: calame-gl
task_categories:
- text-generation
task_ids:
- language-modeling
tags:
- galician
- evaluation
- benchmark
- language-modeling
- text-completion
- calame
license: mit
size_categories:
- 1K<n<10K
---

# CALAME Galician

## Dataset description

CALAME-gl is a Galician translation/adaptation of the Portuguese [CALAME-PT](https://huggingface.co/datasets/NOVA-vision-language/calame-pt) benchmark. The dataset is composed of short texts or contexts and their respective last words. These contexts are designed to contain enough information for a human or a model to infer the final word, while avoiding contexts that are excessively specific or overly ambiguous.

This release contains 930 instances in JSON format and is intended primarily for evaluation.

## Dataset structure

The dataset is distributed in JSON format as a list of examples. Each instance contains the following fields:

- `id`: example identifier
- `sentence`: context in Galician
- `last_word`: final word associated with the context

### Example

```json
{
  "id": 0,
  "sentence": "Os fans de GTA están ansiosos polo lanzamento do próximo xogo da serie, cuxo lanzamento pódese demorar algúns anos máis. Os rumores apuntan a que o GTA VI será unha versión moderna de Vice City e contará cun mapa que muda co paso do tempo. Alén diso, existe a posibilidade dunha protagonista feminina, o que trae máis expectativas ao xogo. Mentres agardamos, queda imaxinar o que esa nova aventura nos",
  "last_word": "depara"
}
```

## Data source and creation

This dataset is based on the Portuguese benchmark [CALAME-PT](https://huggingface.co/datasets/NOVA-vision-language/calame-pt) and was translated/adapted into Galician. The Galician version preserves the same evaluation-oriented structure as the original dataset: each example contains a context and its corresponding final word.

The goal of this version is to provide a Galician benchmark for evaluating a model's ability to infer or predict the final word of a context.

## Intended uses

This dataset can be used for:

- evaluation of language models in Galician
- text completion evaluation
- last-word prediction tasks
- low-resource NLP research

## Limitations

- This dataset is a translated/adapted version of the original Portuguese CALAME-PT benchmark.
- It contains 930 examples, so it is intended primarily for evaluation rather than large-scale training.
- Since this is a translated/adapted version, some examples may reflect translation choices or stylistic variation relative to the source dataset.

## Licensing

This dataset follows the same license as the original CALAME-PT dataset: MIT.

## Usage

Example with `datasets`:

```python
from datasets import load_dataset

ds = load_dataset("json", data_files="calame-gl.json")
print(ds["train"][0])
```

Example of accessing the context and final word:

```python
from datasets import load_dataset

ds = load_dataset("json", data_files="calame-gl.json")["train"]
print(ds[0]["sentence"])
print(ds[0]["last_word"])
```

## Acknowledgements

This dataset was compiled within the Nós Project, funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the project ILENIA with reference 2022/TL22/00215336.
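As a rough illustration of the last-word prediction use mentioned under Intended uses, the sketch below scores a predictor by exact-match accuracy against the `last_word` field. The card does not prescribe a metric or a normalization scheme, so the function names (`normalize_word`, `last_word_accuracy`), the lowercase-and-strip-punctuation normalization, and the toy records are all assumptions made for illustration. Only the field names `sentence` and `last_word` come from the dataset description.

```python
import string
from typing import Callable, Iterable, Mapping


def normalize_word(text: str) -> str:
    """Keep the first whitespace-separated token, strip ASCII punctuation, lowercase it.

    This normalization is an assumption, not something the dataset card specifies.
    """
    tokens = text.strip().split()
    if not tokens:
        return ""
    return tokens[0].strip(string.punctuation).lower()


def last_word_accuracy(
    examples: Iterable[Mapping[str, object]],
    predict_fn: Callable[[str], str],
) -> float:
    """Fraction of examples whose predicted word equals `last_word` after normalization."""
    total = 0
    correct = 0
    for example in examples:
        prediction = normalize_word(predict_fn(str(example["sentence"])))
        target = normalize_word(str(example["last_word"]))
        correct += int(prediction == target)
        total += 1
    return correct / total if total else 0.0


# Toy records in the same shape as the dataset (NOT real CALAME-gl entries).
toy_examples = [
    {"id": 0, "sentence": "O ceo é", "last_word": "azul"},
    {"id": 1, "sentence": "A herba é", "last_word": "verde"},
]


def always_azul(sentence: str) -> str:
    """Placeholder predictor; a real one would query a language model."""
    return " Azul. "


score = last_word_accuracy(toy_examples, always_azul)
print(score)  # 0.5: one of the two toy targets matches
```

To run this on the actual data, the records returned by `load_dataset("json", data_files="calame-gl.json")["train"]` could be passed in place of `toy_examples`, with `predict_fn` wrapping whatever model is being evaluated.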
Provided by:
proxectonos