GETALP/FLUE_VSD
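Dataset overview
Dataset name
- FrenchSemEval
Dataset description
- This dataset corresponds to FrenchSemEval, in which verb occurrences were manually annotated with Wiktionary senses.
Language
- French
Supported tasks
- Word sense disambiguation of French verbs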
Dataset structure
- Data instances
  - Each instance contains the following fields: document_id, sentence_id, surface_forms, fine_pos, lemmas, pos, instance_surface_forms, instance_fine_pos, instance_lemmas, instance_pos.
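Data fields
- document_id (string)
- sentence_id (string)
- surface_forms (sequence of strings)
- fine_pos (sequence of strings)
- lemmas (sequence of strings)
- pos (sequence of strings)
- instance_surface_forms (sequence of strings)
- instance_fine_pos (sequence of strings)
- instance_lemmas (sequence of strings)
- instance_pos (sequence of strings)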
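The field list above can be mirrored as a typed record. What follows is a minimal sketch, not the dataset's official loader: the `FSERecord` type, the `check_record` helper, and the sample sentence with its tag values are hypothetical. It also assumes that the four sentence-level sequences are aligned with each other (equal length), and likewise the four `instance_*` sequences, which is inferred from the field names and not stated in the card.

```python
from typing import List, TypedDict


class FSERecord(TypedDict):
    """One instance, with the ten fields listed in the card."""
    document_id: str
    sentence_id: str
    surface_forms: List[str]
    fine_pos: List[str]
    lemmas: List[str]
    pos: List[str]
    instance_surface_forms: List[str]
    instance_fine_pos: List[str]
    instance_lemmas: List[str]
    instance_pos: List[str]


SENTENCE_FIELDS = ("surface_forms", "fine_pos", "lemmas", "pos")
INSTANCE_FIELDS = (
    "instance_surface_forms",
    "instance_fine_pos",
    "instance_lemmas",
    "instance_pos",
)


def check_record(record: FSERecord) -> bool:
    """Return True if every sequence within a group has the same length.

    Assumption (not stated in the card): sequences in a group are aligned.
    """
    for group in (SENTENCE_FIELDS, INSTANCE_FIELDS):
        lengths = {len(record[name]) for name in group}
        if len(lengths) > 1:
            return False
    return True


# Hypothetical record: the IDs, sentence, and tag values are invented.
example: FSERecord = {
    "document_id": "d000",
    "sentence_id": "d000.s000",
    "surface_forms": ["Il", "abandonne", "la", "course"],
    "fine_pos": ["CLS", "V", "DET", "NC"],
    "lemmas": ["il", "abandonner", "le", "course"],
    "pos": ["CL", "V", "D", "N"],
    "instance_surface_forms": ["abandonne"],
    "instance_fine_pos": ["V"],
    "instance_lemmas": ["abandonner"],
    "instance_pos": ["V"],
}

print(check_record(example))
```

A check like this is mainly useful as a sanity test after loading, before any code indexes the sequences in parallel.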
Data splits
- FSE
  - Size in bytes: 2781427
  - Number of examples: 3121
- wiki_FSE
  - Size in bytes: 43227879
  - Number of examples: 58508
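Dataset creation
- Source data collection and normalization
  - Dataset construction focused on annotating verbs of medium frequency and medium ambiguity. The selected verbs occur between 50 and 1,000 times in the French Wikipedia (2016-12-12 fr dump).
- Annotation process
  - Sentences were preprocessed into CoNLL format and then annotated with WebAnno.
  - The annotators were three French students with no prior experience in dataset annotation.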
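The frequency criterion described above (verbs occurring between 50 and 1,000 times in the French Wikipedia dump) can be sketched as a simple filter over lemma counts. This is an illustration only, not the authors' selection script: the function name `select_candidate_verbs` and the toy counts are hypothetical, the bounds are assumed to be inclusive, and the "medium ambiguity" criterion is not modeled because the card does not define it.

```python
from collections import Counter
from typing import Iterable, List

MIN_FREQ = 50    # lower bound given in the card
MAX_FREQ = 1000  # upper bound given in the card


def select_candidate_verbs(
    verb_lemmas: Iterable[str],
    min_freq: int = MIN_FREQ,
    max_freq: int = MAX_FREQ,
) -> List[str]:
    """Return the verb lemmas whose corpus frequency lies in [min_freq, max_freq].

    Assumption: both bounds are inclusive; the card does not say.
    """
    counts = Counter(verb_lemmas)
    return sorted(
        lemma for lemma, n in counts.items() if min_freq <= n <= max_freq
    )


# Toy corpus with invented counts: one verb too frequent, one too rare.
toy_corpus = ["être"] * 1500 + ["aboutir"] * 120 + ["zigzaguer"] * 3
selected = select_candidate_verbs(toy_corpus)
print(selected)  # ['aboutir']
```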
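Dataset statistics
- Number of sentences: 3121
- Number of annotated verb tokens: 3199
- Number of annotated verb types: 66
- Mean number of annotations per verb type: 48.47
- Mean number of senses per verb type: 3.83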
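The reported mean number of annotations per verb type follows directly from the token and type counts above. A short check, using only the figures listed in the card (the variable names are illustrative):

```python
# Figures taken from the statistics listed in the card.
ANNOTATED_VERB_TOKENS = 3199
ANNOTATED_VERB_TYPES = 66

# Mean annotations per verb type = annotated tokens / annotated types.
avg_annotations_per_type = ANNOTATED_VERB_TOKENS / ANNOTATED_VERB_TYPES

print(f"{avg_annotations_per_type:.2f}")  # 48.47
```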
License information
- GNU Lesser General Public License
Citation information
@inproceedings{segonne-etal-2019-using,
    title = "Using {W}iktionary as a resource for {WSD} : the case of {F}rench verbs",
    author = "Segonne, Vincent and Candito, Marie and Crabb{\'e}, Beno{\^\i}t",
    booktitle = "Proceedings of the 13th International Conference on Computational Semantics - Long Papers",
    month = may,
    year = "2019",
    address = "Gothenburg, Sweden",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/W19-0422",
    doi = "10.18653/v1/W19-0422",
    pages = "259--270",
    abstract = "As opposed to word sense induction, word sense disambiguation (WSD) has the advantage of using interpretable senses, but requires annotated data, which are quite rare for most languages except English (Miller et al. 1993; Fellbaum, 1998). In this paper, we investigate which strategy to adopt to achieve WSD for languages lacking data that was annotated specifically for the task, focusing on the particular case of verb disambiguation in French. We first study the usability of Eurosense (Bovi et al. 2017) , a multilingual corpus extracted from Europarl (Kohen, 2005) and automatically annotated with BabelNet (Navigli and Ponzetto, 2010) senses. Such a resource opened up the way to supervised and semi-supervised WSD for resourceless languages like French. While this perspective looked promising, our evaluation on French verbs was inconclusive and showed the annotated senses{'} quality was not sufficient for supervised WSD on French verbs. Instead, we propose to use Wiktionary, a collaboratively edited, multilingual online dictionary, as a resource for WSD. Wiktionary provides both sense inventory and manually sense tagged examples which can be used to train supervised and semi-supervised WSD systems. Yet, because senses{'} distribution differ in lexicographic examples found in Wiktionary with respect to natural text, we then focus on studying the impact on WSD of the training data size and senses{'} distribution. Using state-of-the art semi-supervised systems, we report experiments of Wiktionary-based WSD for French verbs, evaluated on FrenchSemEval (FSE), a new dataset of French verbs manually annotated with wiktionary senses.",
}
Contributors
- vincent.segonne@univ-grenoble-alpes.fr
- marie.candito@linguist.univ-paris-diderot.fr
- benoit.crabbe@linguist.univ-paris-diderot.fr



