mbruton/galician_srl

Name: mbruton/galician_srl
Creator: mbruton
Published: 2024-01-03 14:08:08
License: Apache-2.0

Hugging Face · Updated 2024-01-03 · Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/mbruton/galician_srl


Description:

---
dataset_info:
  features:
  - name: tokens
    sequence: string
  - name: tags
    sequence:
      class_label:
        names:
          '0': O
          '1': r0:arg0
          '2': r0:arg1
          '3': r0:arg2
          '4': r0:root
          '5': r10:arg0
          '6': r10:arg1
          '7': r10:root
          '8': r11:arg0
          '9': r11:root
          '10': r12:arg1
          '11': r12:root
          '12': r13:arg1
          '13': r13:root
          '14': r1:arg0
          '15': r1:arg1
          '16': r1:arg2
          '17': r1:root
          '18': r2:arg0
          '19': r2:arg1
          '20': r2:arg2
          '21': r2:root
          '22': r3:arg0
          '23': r3:arg1
          '24': r3:arg2
          '25': r3:root
          '26': r4:arg0
          '27': r4:arg1
          '28': r4:arg2
          '29': r4:root
          '30': r5:arg0
          '31': r5:arg1
          '32': r5:arg2
          '33': r5:root
          '34': r6:arg0
          '35': r6:arg1
          '36': r6:arg2
          '37': r6:root
          '38': r7:arg0
          '39': r7:arg1
          '40': r7:arg2
          '41': r7:root
          '42': r8:arg0
          '43': r8:arg1
          '44': r8:arg2
          '45': r8:root
          '46': r9:arg0
          '47': r9:arg1
          '48': r9:arg2
          '49': r9:root
  - name: ids
    dtype: int64
  splits:
  - name: train
    num_bytes: 2241310
    num_examples: 3986
  - name: test
    num_bytes: 555760
    num_examples: 997
  download_size: 675236
  dataset_size: 2797070
license: apache-2.0
task_categories:
- token-classification
language:
- gl
pretty_name: GalicianSRL
size_categories:
- 1K<n<10K
---

# Dataset Card for GalicianSRL

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Limitations](#limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Citation Information](#citation-information)

## Dataset Description

- **Repository:** [GalicianSRL Project Hub](https://github.com/mbruton0426/GalicianSRL)
- **Paper:** To be updated
- **Point of Contact:** [Micaella Bruton](mailto:micaellabruton@gmail.com)

### Dataset Summary

The GalicianSRL dataset is a Galician-language dataset of tokenized sentences in which every token is labeled with its semantic role. Semantic roles are limited to verbal roots, argument 0, argument 1, and argument 2. The dataset was created to support semantic role labeling in Galician because, to the contributor's knowledge, no publicly available dataset existed as of the date of publication.

### Languages

The text in the dataset is in Galician.

## Dataset Structure

### Data Instances

A typical data point comprises a tokenized sentence, a tag for each token, and a sentence id number. An example from the GalicianSRL dataset looks as follows:

```
{'tokens': ['O', 'Pleno', 'poderá', ',', 'con', 'todo', ',', 'avocar', 'en', 'calquera', 'momento', 'o', 'debate', 'e', 'votación', 'de', 'calquera', 'proxecto', 'ou', 'proposición', 'de', 'lei', 'que', 'xa', 'fora', 'obxecto', 'de', 'esta', 'delegación', '.'],
 'tags': [0, 1, 4, 0, 0, 0, 0, 17, 0, 0, 16, 0, 15, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'ids': 504}
```

Each tag is an id number equal to the index of its label in the list returned by:

```python
>>> dataset['train'].features['tags'].feature.names
```

### Data Fields

- `tokens`: a list of strings
- `tags`: a list of integers
- `ids`: a sentence id, as an integer

### Data Splits

The data is split into a training set and a test set. The final structure and split sizes are as follows:

```
DatasetDict({
    train: Dataset({
        features: ['tokens', 'tags', 'ids'],
        num_rows: 1005
    })
    test: Dataset({
        features: ['tokens', 'tags', 'ids'],
        num_rows: 252
    })
})
```

## Dataset Creation

### Curation Rationale

GalicianSRL was built to provide a dataset for semantic role labeling in Galician and to expand the NLP resources available for the Galician language.

### Source Data

#### Initial Data Collection and Normalization

Data was collected from both the [CTG UD annotated corpus](https://github.com/UniversalDependencies/UD_Galician-CTG) and the [TreeGal UD annotated corpus](https://github.com/UniversalDependencies/UD_Galician-TreeGal), and combined to gather the requisite information for this task. For more information, please refer to the publication listed in the citation.

## Considerations for Using the Data

### Limitations

The purpose of this dataset is to help develop a working semantic role labeling system for Galician, as SRL systems have been shown to improve a variety of NLP tasks. Galician is currently considered a low-resource language, however, and as a result the dataset has an extremely limited scope. The dataset would benefit from manual validation by a native speaker of Galician, the inclusion of additional sentences, and an extension of the arguments past arg0, arg1, and arg2.

## Additional Information

### Dataset Curators

The dataset was created by Micaella Bruton as part of her Master's thesis.

### Citation Information

```
@inproceedings{bruton-beloucif-2023-bertie,
    title = "{BERT}ie Bott{'}s Every Flavor Labels: A Tasty Introduction to Semantic Role Labeling for {G}alician",
    author = "Bruton, Micaella and Beloucif, Meriem",
    editor = "Bouamor, Houda and Pino, Juan and Bali, Kalika",
    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.emnlp-main.671",
    doi = "10.18653/v1/2023.emnlp-main.671",
    pages = "10892--10902",
    abstract = "In this paper, we leverage existing corpora, WordNet, and dependency parsing to build the first Galician dataset for training semantic role labeling systems in an effort to expand available NLP resources. Additionally, we introduce verb indexing, a new pre-processing method, which helps increase the performance when semantically parsing highly-complex sentences. We use transfer-learning to test both the resource and the verb indexing method. Our results show that the effects of verb indexing were amplified in scenarios where the model was both pre-trained and fine-tuned on datasets utilizing the method, but improvements are also noticeable when only used during fine-tuning. The best-performing Galician SRL model achieved an f1 score of 0.74, introducing a baseline for future Galician SRL systems. We also tested our method on Spanish where we achieved an f1 score of 0.83, outperforming the baseline set by the 2009 CoNLL Shared Task by 0.025 showing the merits of our verb indexing method for pre-processing.",
}
```
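
The label names in this card all follow the pattern `r<N>:<role>`. A minimal sketch of how such labels might be regrouped into one frame per verb is shown below. It assumes that the `r<N>` prefix is the index of the verb a token belongs to, in line with the verb indexing described in the abstract; the card does not spell this out. The helper names `split_label` and `group_by_verb` are hypothetical and not part of the dataset or any library.

```python
from collections import defaultdict


def split_label(label):
    """Split a label such as 'r1:arg2' into (1, 'arg2').

    Returns None for the outside tag 'O'.
    Assumption: the number after 'r' is the index of the verb in the sentence.
    """
    if label == "O":
        return None
    verb, role = label.split(":", 1)
    return int(verb[1:]), role


def group_by_verb(tokens, labels):
    """Collect the labeled tokens of a sentence into one dict per verb index."""
    frames = defaultdict(dict)
    for token, label in zip(tokens, labels):
        parsed = split_label(label)
        if parsed is None:
            continue  # token carries no semantic role
        verb_index, role = parsed
        frames[verb_index].setdefault(role, []).append(token)
    return dict(frames)


# First 13 tokens of the example instance from this card, with their tag ids
# already mapped to label names (0 -> O, 1 -> r0:arg0, 4 -> r0:root,
# 15 -> r1:arg1, 16 -> r1:arg2, 17 -> r1:root).
tokens = ["O", "Pleno", "poderá", ",", "con", "todo", ",",
          "avocar", "en", "calquera", "momento", "o", "debate"]
labels = ["O", "r0:arg0", "r0:root", "O", "O", "O", "O",
          "r1:root", "O", "O", "r1:arg2", "O", "r1:arg1"]

frames = group_by_verb(tokens, labels)
for verb_index, roles in sorted(frames.items()):
    print(verb_index, roles)
```

Under that assumption, the example sentence yields two frames: one rooted at `poderá` with `Pleno` as its arg0, and one rooted at `avocar` with `debate` as arg1 and `momento` as arg2.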

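Provider:

mbruton

Summary of original information

Dataset overview

Dataset name

GalicianSRL

Dataset language

Galician

Task category

Token-Classification

Dataset size

1K<n<10K

Dataset features

tokens: sequence of strings
tags: sequence of integers, each the class-label id of a label such as O, r0:arg0, or r0:arg1
ids: integer sentence ID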
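
The integer ids in `tags` can be turned back into label names by indexing into the label list from the card metadata. The sketch below hard-codes that list so it runs without network access; the helper name `decode_tags` is hypothetical. It assumes the hard-coded list matches what the Hub returns, which should be the case as long as the metadata above is current.

```python
# Label names in id order, copied from the dataset card metadata.
# Assumption: with the `datasets` library installed, the same list should be
# available from the Hub via
#   from datasets import load_dataset
#   load_dataset("mbruton/galician_srl")["train"].features["tags"].feature.names
LABELS = [
    "O",
    "r0:arg0", "r0:arg1", "r0:arg2", "r0:root",
    "r10:arg0", "r10:arg1", "r10:root",
    "r11:arg0", "r11:root",
    "r12:arg1", "r12:root",
    "r13:arg1", "r13:root",
    "r1:arg0", "r1:arg1", "r1:arg2", "r1:root",
    "r2:arg0", "r2:arg1", "r2:arg2", "r2:root",
    "r3:arg0", "r3:arg1", "r3:arg2", "r3:root",
    "r4:arg0", "r4:arg1", "r4:arg2", "r4:root",
    "r5:arg0", "r5:arg1", "r5:arg2", "r5:root",
    "r6:arg0", "r6:arg1", "r6:arg2", "r6:root",
    "r7:arg0", "r7:arg1", "r7:arg2", "r7:root",
    "r8:arg0", "r8:arg1", "r8:arg2", "r8:root",
    "r9:arg0", "r9:arg1", "r9:arg2", "r9:root",
]


def decode_tags(tokens, tags):
    """Return (token, label) pairs for every token whose label is not 'O'."""
    return [
        (token, LABELS[tag])
        for token, tag in zip(tokens, tags)
        if LABELS[tag] != "O"
    ]


# First 13 tokens and tag ids of the example instance shown in the card.
example_tokens = ["O", "Pleno", "poderá", ",", "con", "todo", ",",
                  "avocar", "en", "calquera", "momento", "o", "debate"]
example_tags = [0, 1, 4, 0, 0, 0, 0, 17, 0, 0, 16, 0, 15]

for token, label in decode_tags(example_tokens, example_tags):
    print(f"{token}\t{label}")
```

The label order is not numeric: `r10` through `r13` come before `r1`, so ids should always be looked up in the list rather than computed from the verb number.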

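Dataset splits

train: 3986 examples, 2241310 bytes in total
test: 997 examples, 555760 bytes in total

License

Apache-2.0

Curation rationale

To provide a semantic role labeling dataset for Galician and expand the NLP resources available for the language.

Data sources

The data comes from the CTG UD annotated corpus and the TreeGal UD annotated corpus.

Limitations

Because Galician is a low-resource language, the dataset's scope is limited. Manual validation by a native speaker of Galician, additional sentences, and an extended set of semantic roles are recommended.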
