Dataset for Intrinsic Evaluation of Corpus-Based Term Set Expansion Algorithms

Name: Dataset for Intrinsic Evaluation of Corpus-Based Term Set Expansion Algorithms
Creator: Intel AI Lab
Published: 2019-04-10 16:51:49
License: No description available

arXiv | Updated 2019-04-10 | Indexed 2024-06-21

Download link:

http://nlp_architect.nervanasys.com/

Description:

Created by Intel AI Lab, this dataset is intended for evaluating corpus-based term set expansion algorithms. It contains 28 manually validated word lists drawn from "List of" pages on English Wikipedia, covering semantic categories that range from specific to general. To build it, both the word lists and text data were extracted from Wikipedia; the text is used to train multi-context word embeddings without supervision. The dataset targets computational semantics tasks such as term set expansion, with the aim of improving the accuracy and efficiency of set expansion algorithms.
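
As a rough illustration of how a manually validated word list could serve as a gold standard, the sketch below scores a ranked expansion with average precision. The choice of metric, the function name `average_precision`, and the toy fruit list are all assumptions made for this example; the description above does not specify the dataset's file format or an official evaluation protocol.

```python
from typing import Iterable, Sequence


def average_precision(
    ranked: Sequence[str],
    gold: Iterable[str],
    seeds: Iterable[str] = (),
) -> float:
    """Average precision of a ranked expansion against a gold word list.

    Seed terms are excluded from both the gold set and the ranking,
    since the algorithm is given them as input. Matching is
    case-insensitive (an assumption, not a documented convention).
    """
    seed_set = {s.lower() for s in seeds}
    gold_set = {g.lower() for g in gold} - seed_set
    if not gold_set:
        return 0.0

    hits = 0
    rank = 0
    precision_sum = 0.0
    seen = set()
    for term in ranked:
        t = term.lower()
        if t in seed_set or t in seen:
            continue  # skip seeds and duplicate candidates
        seen.add(t)
        rank += 1
        if t in gold_set:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / len(gold_set)


# Toy data, invented for illustration only.
gold_list = ["apple", "banana", "cherry", "mango"]
seed_terms = ["apple"]
expansion = ["banana", "car", "cherry", "apple", "mango"]

ap = average_precision(expansion, gold_list, seed_terms)
print(f"average precision = {ap:.4f}")
```

Averaging this score over all 28 lists would give a mean average precision for one algorithm, which is one plausible way to compare systems on such a dataset.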

Provider:

Intel AI Lab

Created:

2019-04-04