proxectonos/galcola

Name: proxectonos/galcola
Creator: proxectonos
Published: 2026-04-24 12:39:07
License: CC BY 4.0

Source: Hugging Face. Updated 2026-04-24. Indexed 2024-06-29.

Download link:

https://hf-mirror.com/datasets/proxectonos/galcola


Description:

---
language:
- gl
pretty_name: GalCoLA
license: cc-by-4.0
task_categories:
- text-classification
task_ids:
- acceptability-classification
tags:
- galician
- grammar
- acceptability
- cola
- syntax
- evaluation
- nlp
size_categories:
- 10K<n<100K
configs:
- config_name: default
  data_files:
  - split: train
    path: "train.tsv"
  - split: validation
    path: "dev.tsv"
  - split: test
    path: "test.tsv"
---

# GalCoLA

## Dataset Summary

GalCoLA is a Galician grammatical acceptability dataset in TSV format. It is designed for sentence-level binary classification, where each sentence is labeled as grammatically correct or grammatically incorrect.

The dataset brings together CoLA-style adaptations of Galician materials from two previous research settings:

- targeted syntactic evaluation datasets from **PROPOR 2022**
- control dependency datasets from **ACL 2023**

GalCoLA contains **17,088 sentences** in total.

## Dataset Structure

The dataset is distributed in **TSV format** and includes three splits:

- `train`
- `validation`
- `test`

Each row contains the following columns:

- `paper_id`: identifier of the source publication or experimental setting
- `source_type`: identifier of the source subset or linguistic phenomenon
- `source_id`: identifier of the original example
- `condition`: sentence condition
  - `a` = grammatically correct
  - `b` = grammatically incorrect
- `sentence`: sentence in Galician
- `label`: binary acceptability label
  - `1` = grammatically correct
  - `0` = grammatically incorrect

### Example

| paper_id | source_type | source_id | condition | sentence | label |
|----------|-------------|-----------|-----------|----------|------:|
| PROPOR2022 | PER_1 | 1 | a | Cociñei o peixe para o comeres tu. | 1 |
| PROPOR2022 | NUM_7545 | 3773 | a | Os nenos que xogaban onte alí coa outra cativa presentan mal o acto. | 1 |
| PROPOR2022 | NUM_4719 | 2360 | a | As rapazas que xogaban onte alí teñen fame. | 1 |

## Dataset Creation

GalCoLA combines materials adapted to a CoLA-style acceptability format from two main sources.

### PROPOR 2022 (A Targeted Assessment of the Syntactic Abilities of Transformer Models for Galician-Portuguese)

This part of the dataset comes from targeted syntactic evaluation materials for Galician. It includes controlled examples focused on:

- gender agreement
- number agreement
- person agreement

The original evaluation items were converted into sentence-level acceptability pairs, where one sentence is grammatically correct and the other is incorrect.

Total number of rows from this source: **15,552**.

### ACL 2023 (Dependency resolution at the syntax-semantics interface: psycholinguistic and computational insights on control dependencies)

This part of the dataset comes from experiments on control dependencies in Galician. It includes acceptability materials based on:

- proper names
- pronouns

These materials were also adapted into a CoLA-style binary classification format.

Total number of rows from this source: **1,536**.

## Dataset Statistics

- PROPOR 2022 adaptation: **15,552**
- ACL 2023 adaptation: **1,536**
- **Total**: **17,088**

## Labels

| Label | Meaning |
|------:|---------|
| 0 | grammatically incorrect |
| 1 | grammatically correct |

## Intended Uses

GalCoLA can be used for:

- grammatical acceptability classification in Galician
- evaluation of syntactic agreement phenomena
- probing morphosyntactic abilities of language models
- low-resource NLP research for Galician

## Limitations

- The dataset is focused on targeted grammatical phenomena and does not cover all types of acceptability judgments in Galician.
- Many examples are controlled or semi-synthetic adaptations derived from experimental materials.
- The dataset is mainly intended for evaluation and analysis rather than broad-coverage training.

## License

GalCoLA is released under the **Creative Commons Attribution 4.0 International (CC BY 4.0)** license. Users are free to share and adapt the material, provided that appropriate credit is given to the original source.

## Usage

Example with `datasets`:

```python
from datasets import load_dataset

# Loads the default config with its train/validation/test splits.
ds = load_dataset("proxectonos/galcola")

print(ds["train"][0])
print(ds["validation"][0])
print(ds["test"][0])
```

## Acknowledgements

This dataset was compiled within the Nós Project, funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the project ILENIA with reference 2022/TL22/00215336.

## Citation

If you use this dataset, please cite the original sources.

```bibtex
@inproceedings{garcia-crespo2022-targeted,
  title = {A Targeted Assessment of the Syntactic Abilities of Transformer Models for Galician-Portuguese},
  author = {Garcia, Marcos and Crespo-Otero, Alfredo},
  booktitle = {Proceedings of the International Conference on the Computational Processing of Portuguese (PROPOR 2022)},
  year = {2022},
  publisher = {Springer Nature},
  series = {Lecture Notes in Artificial Intelligence}
}

@inproceedings{de-dios-flores-etal-2023-control,
  author = {de-Dios-Flores, Iria and García Amboage, Juan Pablo and Garcia, Marcos},
  title = {Dependency resolution at the syntax-semantics interface: psycholinguistic and computational insights on control dependencies},
  booktitle = {Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics},
  year = {2023}
}
```
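
The column layout and label convention described under Dataset Structure (`condition` `a` paired with `label` `1`, `condition` `b` paired with `label` `0`) can be sanity-checked with a short script. The following is a minimal sketch, not part of the dataset's own tooling. It assumes the TSV files carry a header row with the documented column names, which the card does not state explicitly. The helper names (`read_rows`, `inconsistent_rows`, `accuracy_by_paper`) are hypothetical, and the second sample row is an invented placeholder rather than real data.

```python
import csv
import io
from collections import defaultdict

# Tiny in-memory sample in the documented column layout.
# Row 1 is copied from the dataset card; row 2 is a made-up placeholder
# standing in for an ungrammatical ("b") counterpart.
SAMPLE_TSV = (
    "paper_id\tsource_type\tsource_id\tcondition\tsentence\tlabel\n"
    "PROPOR2022\tPER_1\t1\ta\tCociñei o peixe para o comeres tu.\t1\n"
    "PROPOR2022\tPER_1\t1\tb\t<placeholder for an ungrammatical variant>\t0\n"
)


def read_rows(tsv_text):
    """Parse TSV text (assumed to start with a header row) into dicts."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return list(reader)


def inconsistent_rows(rows):
    """Return rows whose condition (a/b) disagrees with the label (1/0)."""
    expected = {"a": "1", "b": "0"}
    return [r for r in rows if expected.get(r["condition"]) != r["label"]]


def accuracy_by_paper(rows, predictions):
    """Accuracy of integer predictions against gold labels, per paper_id."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for row, pred in zip(rows, predictions):
        totals[row["paper_id"]] += 1
        hits[row["paper_id"]] += int(int(row["label"]) == pred)
    return {paper: hits[paper] / totals[paper] for paper in totals}


rows = read_rows(SAMPLE_TSV)
print(inconsistent_rows(rows))          # rows violating the a/1, b/0 convention
print(accuracy_by_paper(rows, [1, 1]))  # trivial "always acceptable" baseline
```

To run the same checks on the real data, the contents of `train.tsv`, `dev.tsv`, or `test.tsv` can be read into a string and passed to `read_rows`. Grouping accuracy by `paper_id` keeps the much larger PROPOR 2022 portion from masking results on the ACL 2023 portion.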

Provider:

proxectonos
