The Pilot Corpus of the English Semantic Sketches

Name: The Pilot Corpus of the English Semantic Sketches
Creator: ABBYY Moscow, Russia
Published: 2025-05-23 18:53:00
License: No description available

arXiv: updated 2025-05-23, indexed 2025-05-28

Download link:

http://arxiv.org/abs/2505.17733v1

Resource overview:

This dataset, the Pilot Corpus of the English Semantic Sketches, was created by ABBYY Moscow. It contains 100 manually checked English sketches, each with a corresponding Russian sketch, for a total of 113 English-Russian sketch pairs. The sketches are built from an English text corpus containing 14 million syntactic verb links. The corpus is intended to show what kinds of contrastive studies the sketches support and to address word sense disambiguation. Construction involves semantic parsing and classification, followed by manual selection of sketches to ensure their quality. The resource can be used to analyse differences between semantically similar sketches across languages, as well as verb polysemy and asymmetric compatibility.

Provider:

ABBYY Moscow, Russia

Created:

2025-05-23

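Collected summary

Dataset introduction

Construction method

The dataset is built on a large corpus of English texts that covers several genres, including technical documentation, news, and fiction, and contains 14 million verb-related syntactic links in total. The Compreno semantic parser annotates the texts with deep semantic mark-up and extracts each verb's semantic roles together with their typical fillers, which form the semantic sketch. Construction pays particular attention to polysemous verbs: each verb must reach a threshold of at least 200 semantic links so that the data is representative. Finally, 100 high-quality English semantic sketches were selected by hand and aligned with their Russian counterparts to form a bilingual parallel corpus.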
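The construction step described above can be illustrated with a minimal sketch. It assumes the parsed links are available as (verb, semantic role, filler) triples, which is a hypothetical input format and not Compreno's actual output; the function name `build_sketches` and the `top_k` parameter are likewise illustrative. Only the threshold of 200 links comes from the description.

```python
from collections import Counter, defaultdict

MIN_LINKS = 200  # threshold stated in the dataset description


def build_sketches(links, min_links=MIN_LINKS, top_k=3):
    """Group (verb, role, filler) triples into per-verb sketches.

    `links` is a hypothetical iterable of triples; the real parser
    output format is not specified in the source.
    """
    per_verb = defaultdict(lambda: defaultdict(Counter))
    totals = Counter()
    for verb, role, filler in links:
        per_verb[verb][role][filler] += 1
        totals[verb] += 1

    sketches = {}
    for verb, roles in per_verb.items():
        if totals[verb] < min_links:
            continue  # too few links to be representative
        # Keep only the most frequent fillers for each semantic role.
        sketches[verb] = {
            role: fillers.most_common(top_k) for role, fillers in roles.items()
        }
    return sketches


# Toy data, invented for illustration: "open" passes a lowered
# threshold of 5 links, "ajar" does not.
toy_links = (
    [("open", "Object", "door")] * 3
    + [("open", "Object", "window")] * 2
    + [("open", "Agent", "man")] * 2
    + [("ajar", "Object", "gate")]
)
toy_sketches = build_sketches(toy_links, min_links=5, top_k=2)
print(toy_sketches)
```

With the default threshold of 200, the toy data yields no sketches at all, which mirrors how under-represented verbs would be dropped.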

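Features

The core feature of the dataset is its semantic sketch representation, which uses semantic role frames to present a verb's typical collocation patterns systematically. Each sketch shows the high-frequency semantic dependencies of a verb under its different senses, which helps distinguish the uses of polysemous words. The bilingual structure offers a cross-linguistic perspective and allows a direct comparison of collocational differences between semantically similar English and Russian verbs. The dataset pays particular attention to polysemous verbs and contains 113 English-Russian sketch pairs, in which 84 Russian sketches correspond to multiple English variants, illustrating the asymmetric correspondence between the two languages.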
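The asymmetric correspondence mentioned above can be made concrete with a short grouping sketch. The pair list and the identifiers in it are invented for illustration; the source does not describe how sketch pairs are identified or stored.

```python
from collections import defaultdict


def group_by_russian(pairs):
    """Map each Russian sketch id to the English sketch ids paired with it."""
    grouped = defaultdict(list)
    for english_id, russian_id in pairs:
        grouped[russian_id].append(english_id)
    return dict(grouped)


def one_to_many(grouped):
    """Keep only Russian sketches paired with more than one English sketch."""
    return {ru: en for ru, en in grouped.items() if len(en) > 1}


# Hypothetical identifiers, invented for illustration only.
toy_pairs = [
    ("en:break_1", "ru:lomat"),
    ("en:smash_1", "ru:lomat"),
    ("en:read_1", "ru:chitat"),
]
grouped = group_by_russian(toy_pairs)
asymmetric = one_to_many(grouped)
print(asymmetric)
```

Counting the keys of `asymmetric` on the real pair list would give the number of Russian sketches that map to several English variants.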

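Usage

The dataset is suited to natural language processing tasks such as word sense disambiguation, cross-linguistic contrastive studies, and lexicography. By analysing the typical fillers in the semantic role frames, researchers can build distributional semantic models of verbs. The bilingual structure supports two modes of use: vertical analysis of verb collocations within a single language, or horizontal comparison of semantic role differences between English and Russian verbs. The JSON annotations provided with the dataset include each verb's semantic class, role frame, and typical filler frequencies, and can be used directly to train machine learning models. For reproducibility, it is recommended to use the dataset together with the documentation of the Compreno semantic annotation scheme.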
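As a rough illustration of working with such JSON annotations, the sketch below parses one record and ranks the fillers of each role. The schema is an assumption: the field names `verb`, `semantic_class`, and `roles`, the class label `TO_OPEN`, and all counts are placeholders, since the source does not document the actual file layout.

```python
import json

# Hypothetical record: every field name and value here is an assumption.
raw = """
{
  "verb": "open",
  "semantic_class": "TO_OPEN",
  "roles": {
    "Object": {"door": 120, "window": 45, "account": 30},
    "Agent": {"man": 60, "user": 25}
  }
}
"""


def top_fillers(record, k=2):
    """Return the k most frequent fillers for each semantic role."""
    return {
        role: sorted(fillers.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
        for role, fillers in record["roles"].items()
    }


def relative_frequencies(fillers):
    """Normalise raw filler counts so that they sum to 1."""
    total = sum(fillers.values())
    return {filler: count / total for filler, count in fillers.items()}


record = json.loads(raw)
print(top_fillers(record))
print(relative_frequencies(record["roles"]["Agent"]))
```

Normalised filler frequencies of this kind could serve as simple features for a distributional model of the verb, though the real files may need a different loader.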

Background and challenges

Background overview

The Pilot Corpus of the English Semantic Sketches (SemSketches) was introduced in 2025 by researchers from ABBYY Moscow, HSE, and RSUH, including Maria Petrova, Maria Ponomareva, and Alexandra Ivoylova. This dataset focuses on creating semantic sketches for English verbs, which serve as lexicographical portraits built from large text corpora. These sketches capture the most frequent semantic dependencies of verbs, addressing the challenge of polysemy and cross-linguistic differences by pairing English sketches with their Russian counterparts. The corpus, initially comprising 100 manually verified English-Russian sketch pairs, aims to facilitate contrastive studies and improve natural language processing (NLP) tasks such as word sense disambiguation (WSD) and semantic role labeling. The work builds on earlier efforts in Russian lexicography and leverages the Compreno semantic parser, which provides detailed semantic mark-up for verb dependencies.

Current challenges

The SemSketches corpus faces several challenges. Firstly, in addressing polysemy, the dataset must accurately differentiate between multiple meanings of verbs, a task complicated by syntactic homonymy and varying semantic roles across languages. Secondly, the construction process encounters technical hurdles, such as 'empty' sketches due to insufficient textual links or narrow semantic role fillers. Additionally, parser errors, including incorrect semantic role assignments or filler selections, impact sketch accuracy. Cross-linguistic discrepancies further complicate the task, as equivalent verbs in English and Russian often exhibit divergent semantic roles or fillers, reflecting deeper structural differences in how languages encode meaning. These challenges highlight the need for larger, more representative corpora and refined parsing techniques to enhance the reliability and utility of semantic sketches.
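One way to surface the cross-linguistic discrepancies described above is to compare the role sets of a paired English and Russian sketch. The following is a minimal sketch under stated assumptions: each sketch is represented as a dictionary from role name to fillers, and the role names and fillers are invented placeholders, not Compreno labels or corpus data.

```python
def compare_roles(english_sketch, russian_sketch):
    """Split the semantic roles of a sketch pair into shared and language-specific sets."""
    en_roles = set(english_sketch)
    ru_roles = set(russian_sketch)
    shared = en_roles & ru_roles
    union = en_roles | ru_roles
    return {
        "shared": sorted(shared),
        "english_only": sorted(en_roles - ru_roles),
        "russian_only": sorted(ru_roles - en_roles),
        # Jaccard overlap of the two role sets; 1.0 if both are empty.
        "jaccard": len(shared) / len(union) if union else 1.0,
    }


# Invented toy pair; the role names are placeholders.
en = {"Agent": ["man"], "Object": ["door"], "Instrument": ["key"]}
ru = {"Agent": ["chelovek"], "Object": ["dver"], "Locative": ["dom"]}
report = compare_roles(en, ru)
print(report)
```

A low overlap score would only flag a pair for manual inspection; it cannot tell a genuine cross-linguistic difference from a parser error.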

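Common scenarios

Classic use cases

In natural language processing, the Pilot Corpus of the English Semantic Sketches is a resource for research on the semantic representation of verbs. By building semantic sketches for English verbs, the dataset shows the typical dependencies of verbs in different contexts, with particular attention to distinguishing the senses of polysemous words. Researchers can use the sketches to analyse a verb's semantic roles and their typical fillers and to study how the verb is used in different contexts. The English-Russian bilingual design offers a perspective for cross-linguistic semantic comparison.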

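Academic problems addressed

The dataset addresses core linguistic problems such as word sense disambiguation and the representation of polysemous words. Through semantic role annotation and statistics on typical dependencies, it provides an interpretable representation of verb semantics, which compensates for the limited interpretability of traditional word embedding models. The dataset pays particular attention to cross-linguistic semantic differences: it shows how collocational preferences differ between semantically similar English and Russian verbs and gives contrastive linguistics a quantitative basis. In computational lexicography, the semantic sketch approach suggests a new route to automated dictionary compilation.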

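Derived related work

The dataset has given rise to a series of research efforts, including the development of sketch-based word sense disambiguation systems and the construction of frameworks for cross-linguistic contrastive analysis of verb collocations. Its methodology has informed subsequent work on semantic role labeling, such as extending semantic sketches to other parts of speech. In terms of resource building, it has encouraged the expansion of multilingual semantic sketch corpora, such as the subsequently developed Russian semantic sketch resources. The dataset's annotation scheme also serves as a reference for other semantic parsing tasks.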

The above content was collected and summarized by 遇见数据集.
