GETALP/FLUE_VSD

Name: GETALP/FLUE_VSD
Creator: GETALP
Published: 2023-04-19 15:11:11
License: No description available

Hugging Face · Updated 2023-04-19 · Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/GETALP/FLUE_VSD

Resource summary:

---
license: gpl-3.0
multilinguality:
- monolingual
language:
- fr
task_categories:
- other
task_ids:
- word-sense-disambiguation
dataset_info:
  features:
  - name: document_id
    dtype: string
  - name: sentence_id
    dtype: string
  - name: surface_forms
    sequence: string
  - name: fine_pos
    sequence: string
  - name: lemmas
    sequence: string
  - name: pos
    sequence: string
  - name: instance_surface_forms
    sequence: string
  - name: instance_fine_pos
    sequence: string
  - name: instance_lemmas
    sequence: string
  - name: instance_pos
    sequence: string
  splits:
  - name: FSE
    num_bytes: 2781427
    num_examples: 3121
  - name: wiki_FSE
    num_bytes: 43227879
    num_examples: 58508
  download_size: 0
  dataset_size: 46009306
---

# FrenchSemEval

## Dataset Description

- **Homepage:**
- **Repository:**
- **Paper:** https://aclanthology.org/W19-0422.pdf
- **Leaderboard:**
- **Point of Contact:** vincent.segonne@univ-grenoble-alpes.fr

### Dataset Summary

This dataset corresponds to FrenchSemEval, in which verb occurrences were manually annotated with Wiktionary senses.

### Supported Tasks and Leaderboards

Verb Sense Disambiguation for French verbs.

### Language

French

## Dataset Structure

### Data Instances

Each instance of the dataset has the following fields and field types.

```json
{
  "document_id": "d001",
  "sentence_id": "d001.s001",
  "surface_forms": ["Il", "rend", "hommage", "au", "roi", "de", "France", "et", "des", "négociations", "au", "traité", "du", "Goulet", ",", "formalisant", "la", "paix", "entre", "les", "deux", "pays", "."],
  "fine_pos": ["CLS", "V", "NC", "P+D", "NC", "P", "NPP", "CC", "DET", "NC", "P+D", "NC", "P+D", "NPP", "PONCT", "VPR", "DET", "NC", "P", "DET", "ADJ", "NC", "PONCT"],
  "lemmas": ["il", "rendre", "hommage", "à", "roi", "de", "France", "et", "un", "négociation", "à", "traité", "de", "Goulet", ",", "formaliser", "le", "paix", "entre", "le", "deux", "pays", "."],
  "pos": ["CL", "V", "N", "P+D", "N", "P", "N", "C", "D", "N", "P+D", "N", "P+D", "N", "PONCT", "V", "D", "N", "P", "D", "A", "N", "PONCT"],
  "instance_surface_forms": ["aboutissent"],
  "instance_fine_pos": ["V"],
  "instance_lemmas": ["aboutir"],
  "instance_pos": ["V"]
}
```

### Data Fields

Each sentence has the following fields: **document_id**, **sentence_id**, **surface_forms**, **fine_pos**, **lemmas**, **pos**, **instance_surface_forms**, **instance_fine_pos**, **instance_lemmas**, **instance_pos**.

### Data Splits

No train/validation/test splits are provided. The dataset metadata lists two subsets: `FSE` (3121 examples) and `wiki_FSE` (58508 examples).

## Dataset Creation

### Source Data

#### Initial Data Collection and Normalization

To build the FrenchSemEval dataset, the authors focused on annotating moderately frequent and moderately ambiguous verbs by selecting verbs appearing between 50 and 1000 times in the French Wikipedia (2016-12-12 fr dump).
For those verbs, the authors extracted 50 occurrences, along with other annotations, thanks to the French TreeBank [Abeillé and Barrier, 2004](http://ftb.linguist.univ-paris-diderot.fr/index.php?langue=en) and the Sequoia Treebank [Candito and Seddah, 2012](https://www.rocq.inria.fr/alpage-wiki/tiki-index.php?page=CorpusSequoia).

### Annotations

#### Annotation process

To annotate FrenchSemEval, the annotators used [WebAnno](https://webanno.github.io/webanno/), an open-source adaptable annotation tool.
Sentences were pre-processed into CoNLL format and then annotated in WebAnno.
The annotators were asked to annotate only the marked occurrences, using the sense inventory from Wiktionary.

#### Who are the annotators?

The annotation was performed by 3 French students with no prior experience in dataset annotation.

### Dataset statistics

| Type | # |
|---|---|
| Number of sentences | 3121 |
| Number of annotated verb tokens | 3199 |
| Number of annotated verb types | 66 |
| Mean number of annotations per verb type | 48.47 |
| Mean number of senses per verb type | 3.83 |

### Licensing Information

```
GNU Lesser General Public License
```

### Citation Information

```bibtex
@inproceedings{segonne-etal-2019-using,
    title = "Using {W}iktionary as a resource for {WSD} : the case of {F}rench verbs",
    author = "Segonne, Vincent and
      Candito, Marie and
      Crabb{\'e}, Beno{\^\i}t",
    booktitle = "Proceedings of the 13th International Conference on Computational Semantics - Long Papers",
    month = may,
    year = "2019",
    address = "Gothenburg, Sweden",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/W19-0422",
    doi = "10.18653/v1/W19-0422",
    pages = "259--270",
    abstract = "As opposed to word sense induction, word sense disambiguation (WSD) has the advantage of using interpretable senses, but requires annotated data, which are quite rare for most languages except English (Miller et al. 1993; Fellbaum, 1998). In this paper, we investigate which strategy to adopt to achieve WSD for languages lacking data that was annotated specifically for the task, focusing on the particular case of verb disambiguation in French. We first study the usability of Eurosense (Bovi et al. 2017) , a multilingual corpus extracted from Europarl (Kohen, 2005) and automatically annotated with BabelNet (Navigli and Ponzetto, 2010) senses. Such a resource opened up the way to supervised and semi-supervised WSD for resourceless languages like French. While this perspective looked promising, our evaluation on French verbs was inconclusive and showed the annotated senses{'} quality was not sufficient for supervised WSD on French verbs. Instead, we propose to use Wiktionary, a collaboratively edited, multilingual online dictionary, as a resource for WSD. Wiktionary provides both sense inventory and manually sense tagged examples which can be used to train supervised and semi-supervised WSD systems. Yet, because senses{'} distribution differ in lexicographic examples found in Wiktionary with respect to natural text, we then focus on studying the impact on WSD of the training data size and senses{'} distribution. Using state-of-the art semi-supervised systems, we report experiments of Wiktionary-based WSD for French verbs, evaluated on FrenchSemEval (FSE), a new dataset of French verbs manually annotated with wiktionary senses.",
}
```

### Contributions

* vincent.segonne@univ-grenoble-alpes.fr
* marie.candito@linguist.univ-paris-diderot.fr
* benoit.crabbe@linguist.univ-paris-diderot.fr

Provider:

GETALP

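Summary of Original Information

Dataset Overview

Dataset Name

FrenchSemEval

Dataset Description

This dataset corresponds to FrenchSemEval, in which verb occurrences were manually annotated with Wiktionary senses.

Language

French

Supported Tasks

Verb sense disambiguation for French verbs

Dataset Structure

Data Instances
- Each instance contains the following fields: document_id, sentence_id, surface_forms, fine_pos, lemmas, pos, instance_surface_forms, instance_fine_pos, instance_lemmas, instance_pos.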

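Data Fields

document_id (string)
sentence_id (string)
surface_forms (sequence of strings)
fine_pos (sequence of strings)
lemmas (sequence of strings)
pos (sequence of strings)
instance_surface_forms (sequence of strings)
instance_fine_pos (sequence of strings)
instance_lemmas (sequence of strings)
instance_pos (sequence of strings)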
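
As a minimal sketch of how these fields fit together, the Python snippet below models one instance as a `TypedDict` and checks that the sequences are aligned. The names `VSDInstance` and `check_alignment` are illustrative and not part of the dataset. The assumption that the four token-level sequences share one length (and likewise the four `instance_*` sequences) is inferred from the example instance in the dataset card, not stated by the authors.

```python
from typing import List, TypedDict


class VSDInstance(TypedDict):
    """One sentence of the dataset (hypothetical type name)."""
    document_id: str
    sentence_id: str
    surface_forms: List[str]
    fine_pos: List[str]
    lemmas: List[str]
    pos: List[str]
    instance_surface_forms: List[str]
    instance_fine_pos: List[str]
    instance_lemmas: List[str]
    instance_pos: List[str]


TOKEN_FIELDS = ("surface_forms", "fine_pos", "lemmas", "pos")
INSTANCE_FIELDS = (
    "instance_surface_forms",
    "instance_fine_pos",
    "instance_lemmas",
    "instance_pos",
)


def check_alignment(instance: VSDInstance) -> bool:
    """Return True if each group of parallel sequences has a single common length."""
    token_lengths = {len(instance[name]) for name in TOKEN_FIELDS}
    instance_lengths = {len(instance[name]) for name in INSTANCE_FIELDS}
    return len(token_lengths) == 1 and len(instance_lengths) == 1


# Shortened to the first three tokens of the card's example, for illustration only.
example: VSDInstance = {
    "document_id": "d001",
    "sentence_id": "d001.s001",
    "surface_forms": ["Il", "rend", "hommage"],
    "fine_pos": ["CLS", "V", "NC"],
    "lemmas": ["il", "rendre", "hommage"],
    "pos": ["CL", "V", "N"],
    "instance_surface_forms": ["aboutissent"],
    "instance_fine_pos": ["V"],
    "instance_lemmas": ["aboutir"],
    "instance_pos": ["V"],
}

print(check_alignment(example))
```

A check like this is a cheap guard before zipping the token-level sequences together, since a length mismatch would otherwise be silently truncated by `zip`.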

Data Splits

FSE
- Bytes: 2781427
- Examples: 3121
wiki_FSE
- Bytes: 43227879
- Examples: 58508
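
The split figures can be cross-checked with a few lines of arithmetic. The sketch below sums the two splits and compares the byte total with the `dataset_size` value (46009306) given in the dataset card's metadata; the dictionary layout and variable names are illustrative only.

```python
# Split sizes as listed above.
splits = {
    "FSE": {"num_bytes": 2781427, "num_examples": 3121},
    "wiki_FSE": {"num_bytes": 43227879, "num_examples": 58508},
}

total_bytes = sum(s["num_bytes"] for s in splits.values())
total_examples = sum(s["num_examples"] for s in splits.values())

# Average size of one example in each split, in bytes.
avg_bytes = {name: s["num_bytes"] / s["num_examples"] for name, s in splits.items()}

print(total_bytes == 46009306)  # True: matches dataset_size in the card metadata
print(total_examples)
for name, value in avg_bytes.items():
    print(f"{name}: {value:.1f} bytes per example")
```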

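Dataset Creation

Source data collection and normalization
- Dataset construction focused on annotating moderately frequent and moderately ambiguous verbs, selecting verbs that appear between 50 and 1000 times in the French Wikipedia (2016-12-12 fr dump).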
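
The frequency criterion can be illustrated with a small sketch. It assumes a stream of verb lemmas has already been extracted from the corpus, treats the 50 and 1000 bounds as inclusive, and ignores the ambiguity criterion entirely; the function name `select_verbs` and the toy counts are hypothetical, and this is not the authors' actual selection code.

```python
from collections import Counter
from typing import Iterable, List


def select_verbs(
    verb_lemmas: Iterable[str], min_count: int = 50, max_count: int = 1000
) -> List[str]:
    """Keep lemmas whose corpus frequency lies in [min_count, max_count]."""
    counts = Counter(verb_lemmas)
    return sorted(
        lemma for lemma, count in counts.items() if min_count <= count <= max_count
    )


# Toy corpus: one verb inside the band, one too frequent, one too rare.
toy_lemmas = ["aboutir"] * 60 + ["être"] * 2000 + ["rendre"] * 10
print(select_verbs(toy_lemmas))
```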
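Annotation process
- WebAnno was used as the annotation tool; sentences were pre-processed into CoNLL format and then annotated in WebAnno.
- The annotators were 3 French students with no prior experience in dataset annotation.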
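
To make the CoNLL pre-processing step concrete, here is a rough sketch that turns one instance into tab-separated token lines. The column layout (index, form, lemma, coarse POS, fine POS) is an assumption loosely modelled on CoNLL-style files; the source does not say which CoNLL variant or columns the authors used, and `to_conll_lines` is a hypothetical helper.

```python
from typing import Dict, List


def to_conll_lines(instance: Dict[str, List[str]]) -> List[str]:
    """Render one sentence as 'ID<TAB>FORM<TAB>LEMMA<TAB>POS<TAB>FINE_POS' lines (assumed layout)."""
    rows = zip(
        instance["surface_forms"],
        instance["lemmas"],
        instance["pos"],
        instance["fine_pos"],
    )
    return [
        f"{index}\t{form}\t{lemma}\t{pos}\t{fine_pos}"
        for index, (form, lemma, pos, fine_pos) in enumerate(rows, start=1)
    ]


# First three tokens of the dataset card's example sentence.
sentence = {
    "surface_forms": ["Il", "rend", "hommage"],
    "lemmas": ["il", "rendre", "hommage"],
    "pos": ["CL", "V", "N"],
    "fine_pos": ["CLS", "V", "NC"],
}

conll_lines = to_conll_lines(sentence)
print("\n".join(conll_lines))
```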

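Dataset Statistics

Number of sentences: 3121
Number of annotated verb tokens: 3199
Number of annotated verb types: 66
Mean number of annotations per verb type: 48.47
Mean number of senses per verb type: 3.83

License Information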
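
The reported mean of 48.47 annotations per verb type follows directly from the token and type counts, as this short check shows. The per-sentence ratio is a derived figure added for illustration and is not reported in the source.

```python
sentences = 3121
annotated_tokens = 3199
verb_types = 66

# Mean number of annotations per verb type: 3199 / 66.
mean_annotations = annotated_tokens / verb_types
print(round(mean_annotations, 2))  # 48.47

# Derived, not reported in the source: annotated verb tokens per sentence.
tokens_per_sentence = annotated_tokens / sentences
print(round(tokens_per_sentence, 3))
```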

GNU Lesser General Public License


Contributors

vincent.segonne@univ-grenoble-alpes.fr
marie.candito@linguist.univ-paris-diderot.fr
benoit.crabbe@linguist.univ-paris-diderot.fr
