proxectonos/galcola
Source: Hugging Face. Updated 2026-04-24; indexed 2024-06-29.
Download link:
https://hf-mirror.com/datasets/proxectonos/galcola
---
language:
- gl
pretty_name: GalCoLA
license: cc-by-4.0
task_categories:
- text-classification
task_ids:
- acceptability-classification
tags:
- galician
- grammar
- acceptability
- cola
- syntax
- evaluation
- nlp
size_categories:
- 10K<n<100K
configs:
- config_name: default
  data_files:
  - split: train
    path: "train.tsv"
  - split: validation
    path: "dev.tsv"
  - split: test
    path: "test.tsv"
---
# GalCoLA
## Dataset Summary
GalCoLA is a Galician grammatical acceptability dataset in TSV format. It is designed for sentence-level binary classification, where each sentence is labeled as grammatically correct or grammatically incorrect.
The dataset brings together CoLA-style adaptations of Galician materials from two previous research settings:
- targeted syntactic evaluation datasets from **PROPOR 2022**
- control dependency datasets from **ACL 2023**
GalCoLA contains **17,088 sentences** in total.
## Dataset Structure
The dataset is distributed in **TSV format** and includes three splits:
- `train`
- `validation`
- `test`
Each row contains the following columns:
- `paper_id`: identifier of the source publication or experimental setting
- `source_type`: identifier of the source subset or linguistic phenomenon
- `source_id`: identifier of the original example
- `condition`: sentence condition
  - `a` = grammatically correct
  - `b` = grammatically incorrect
- `sentence`: sentence in Galician
- `label`: binary acceptability label
  - `1` = grammatically correct
  - `0` = grammatically incorrect
### Example
| paper_id | source_type | source_id | condition | sentence | label |
|----------|-------------|-----------|-----------|----------|------:|
| PROPOR2022 | PER_1 | 1 | a | Cociñei o peixe para o comeres tu. | 1 |
| PROPOR2022 | NUM_7545 | 3773 | a | Os nenos que xogaban onte alí coa outra cativa presentan mal o acto. | 1 |
| PROPOR2022 | NUM_4719 | 2360 | a | As rapazas que xogaban onte alí teñen fame. | 1 |
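The column layout above can be checked with a few lines of standard-library Python. The sketch below parses an in-memory TSV sample built from two rows of the example table and verifies that `condition` and `label` agree. It assumes the TSV files carry a header row with the column names listed above, which this card does not state explicitly; the helper names `read_rows` and `inconsistent_rows` are illustrative only.

```python
import csv
import io

# In-memory stand-in for a GalCoLA TSV file.
# Assumption: the real files start with a header row naming the columns.
SAMPLE_TSV = (
    "paper_id\tsource_type\tsource_id\tcondition\tsentence\tlabel\n"
    "PROPOR2022\tPER_1\t1\ta\tCociñei o peixe para o comeres tu.\t1\n"
    "PROPOR2022\tNUM_4719\t2360\ta\tAs rapazas que xogaban onte alí teñen fame.\t1\n"
)

# Mapping documented in the card: condition a -> label 1, condition b -> label 0.
CONDITION_TO_LABEL = {"a": 1, "b": 0}


def read_rows(text):
    """Parse TSV text into a list of dicts, converting `label` to int."""
    reader = csv.DictReader(
        io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    rows = []
    for row in reader:
        row["label"] = int(row["label"])
        rows.append(row)
    return rows


def inconsistent_rows(rows):
    """Return the rows whose label does not match their condition."""
    return [
        r for r in rows
        if CONDITION_TO_LABEL.get(r["condition"]) != r["label"]
    ]


rows = read_rows(SAMPLE_TSV)
print(len(rows), "rows,", len(inconsistent_rows(rows)), "inconsistent")
```

To run the same check on a downloaded split, the file contents can be read as UTF-8 text and passed to `read_rows` in place of `SAMPLE_TSV`.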
## Dataset Creation
GalCoLA combines materials adapted to a CoLA-style acceptability format from two main sources.
### PROPOR 2022 (A Targeted Assessment of the Syntactic Abilities of Transformer Models for Galician-Portuguese)
This part of the dataset comes from targeted syntactic evaluation materials for Galician. It includes controlled examples focused on:
- gender agreement
- number agreement
- person agreement
The original evaluation items were converted into sentence-level acceptability pairs, where one sentence is grammatically correct and the other is incorrect.
Total number of rows from this source: **15,552**.
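The conversion described above can be pictured as follows. This is a minimal sketch, not the authors' actual script: the function name `pair_to_rows` and its parameters are hypothetical, and the ungrammatical sentence in the demo is an illustrative number-agreement violation constructed for this example rather than a row taken from the dataset.

```python
def pair_to_rows(grammatical, ungrammatical, **metadata):
    """Turn one minimal pair into two sentence-level acceptability rows.

    Condition `a` / label 1 marks the grammatical sentence,
    condition `b` / label 0 marks the ungrammatical one.
    Extra keyword arguments (e.g. paper_id) are copied into both rows.
    """
    return [
        {**metadata, "condition": "a", "sentence": grammatical, "label": 1},
        {**metadata, "condition": "b", "sentence": ungrammatical, "label": 0},
    ]


# Illustrative pair: the second sentence uses a singular verb ("ten")
# with a plural subject. It is constructed here, not copied from GalCoLA.
pair = pair_to_rows(
    "As rapazas que xogaban onte alí teñen fame.",
    "As rapazas que xogaban onte alí ten fame.",
    paper_id="PROPOR2022",
)
for row in pair:
    print(row["condition"], row["label"], row["sentence"])
```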
### ACL 2023 (Dependency resolution at the syntax-semantics interface: psycholinguistic and computational insights on control dependencies)
This part of the dataset comes from experiments on control dependencies in Galician. It includes acceptability materials based on:
- proper names
- pronouns
These materials were also adapted into a CoLA-style binary classification format.
Total number of rows from this source: **1,536**.
## Dataset Statistics
- PROPOR 2022 adaptation: **15,552**
- ACL 2023 adaptation: **1,536**
- **Total**: **17,088**
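As a quick arithmetic check, the two source counts above add up to the stated total, and the PROPOR 2022 material accounts for roughly 91% of the rows. The snippet below only restates the figures from this card; the variable names are illustrative.

```python
# Row counts per source, as reported in this card.
SOURCE_ROWS = {"PROPOR 2022": 15_552, "ACL 2023": 1_536}

total = sum(SOURCE_ROWS.values())
shares = {name: n / total for name, n in SOURCE_ROWS.items()}

for name, n in SOURCE_ROWS.items():
    print(f"{name}: {n} rows ({shares[name]:.1%})")
print(f"Total: {total}")
```

Because the sources are this unbalanced, results reported per `paper_id` may be more informative than a single pooled score.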
## Labels
| Label | Meaning |
|------:|---------|
| 0 | grammatically incorrect |
| 1 | grammatically correct |
## Intended Uses
GalCoLA can be used for:
- grammatical acceptability classification in Galician
- evaluation of syntactic agreement phenomena
- probing morphosyntactic abilities of language models
- low-resource NLP research for Galician
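For the classification use case, a scorer along the following lines may be useful. This card does not prescribe a metric; the Matthews correlation coefficient (MCC) is used here as an assumption, because it is the metric commonly reported for the English CoLA task, alongside plain accuracy. The gold labels and predictions in the demo are made up purely to exercise the functions.

```python
import math


def accuracy(gold, pred):
    """Fraction of predictions that match the gold labels."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def matthews_corrcoef(gold, pred):
    """MCC for binary labels (1 = acceptable, 0 = unacceptable).

    Returns 0.0 when the denominator is zero, e.g. when every
    prediction falls in a single class.
    """
    tp = sum(g == 1 and p == 1 for g, p in zip(gold, pred))
    tn = sum(g == 0 and p == 0 for g, p in zip(gold, pred))
    fp = sum(g == 0 and p == 1 for g, p in zip(gold, pred))
    fn = sum(g == 1 and p == 0 for g, p in zip(gold, pred))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Made-up labels and predictions, for illustration only.
gold = [1, 1, 0, 0, 1, 0]
pred = [1, 0, 0, 0, 1, 1]
print(f"accuracy = {accuracy(gold, pred):.3f}")
print(f"MCC      = {matthews_corrcoef(gold, pred):.3f}")
```

MCC stays near zero for a classifier that always predicts one class, which makes it a stricter complement to accuracy.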
## Limitations
- The dataset is focused on targeted grammatical phenomena and does not cover all types of acceptability judgments in Galician.
- Many examples are controlled or semi-synthetic adaptations derived from experimental materials.
- The dataset is mainly intended for evaluation and analysis rather than broad-coverage training.
## License
GalCoLA is released under the **Creative Commons Attribution 4.0 International (CC BY 4.0)** license.
Users are free to share and adapt the material, provided that appropriate credit is given to the original source.
## Usage
Example with `datasets`:
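```python
from datasets import load_dataset

# Load all three splits (train, validation, test) from the Hugging Face Hub.
ds = load_dataset("proxectonos/galcola")

# Print the first row of each split.
print(ds["train"][0])
print(ds["validation"][0])
print(ds["test"][0])
```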
## Acknowledgements
This dataset was compiled within the Nós Project, funded by the Ministerio para la Transformación Digital y de la Función Pública and by the EU (NextGenerationEU), within the framework of the ILENIA project, reference 2022/TL22/00215336.
## Citation
If you use this dataset, please cite the original sources.
```bibtex
@inproceedings{garcia-crespo2022-targeted,
title = {A Targeted Assessment of the Syntactic Abilities of Transformer Models for Galician-Portuguese},
author = {Garcia, Marcos and Crespo-Otero, Alfredo},
booktitle = {Proceedings of the International Conference on the Computational Processing of Portuguese (PROPOR 2022)},
year = {2022},
publisher = {Springer Nature},
series = {Lecture Notes in Artificial Intelligence}
}
@inproceedings{de-dios-flores-etal-2023-control,
author = {de-Dios-Flores, Iria and García Amboage, Juan Pablo and Garcia, Marcos},
title = {Dependency resolution at the syntax-semantics interface: psycholinguistic and computational insights on control dependencies},
booktitle = {Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics},
year = {2023}
}
```
Provider:
proxectonos



