SWORDS

Name: SWORDS
Creator: Stanford University
Published: 2021-06-13 02:42:40
License: no description provided

arXiv; updated 2021-06-13; indexed 2024-07-25

Download link:

https://github.com/p-lambda/swords

Resource overview:

The SWORDS dataset is an English lexical substitution benchmark created by Stanford University, aiming to improve coverage and quality. This dataset contains 1,132 context-target word pairs, totaling 68,683 substitute words, with each substitute assigned a human-rated appropriateness score. To expand the coverage of acceptable substitutes, SWORDS combines candidate words from the COINCO benchmark and context-free thesauruses, and determines the appropriateness of each substitute using binary labels from multiple human annotators. This dataset is primarily used to evaluate the performance of pre-trained language models on lexical substitution tasks, especially their ability to provide useful substitute words in writing assistance scenarios.

Provided by:

Stanford University

Created:

2021-06-08

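Summary of original information

Swords ⚔️: Stanford Word Substitution Benchmark

Dataset overview

Swords ⚔️ is a benchmark for lexical substitution, the task of finding appropriate substitutes for a target word in context. For example: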

Context: My favorite thing about her is her straightforward honesty.

Target word: straightforward

Substitutes: sincere, genuine, frank, candid, direct, forthright, ...

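Dataset download

The development and test sets of Swords ⚔️ can be downloaded from the following links:

Swords ⚔️ is licensed under CC-BY-3.0-US. The benchmark includes content from the CoInCo benchmark and the MASC corpus, both of which are distributed under the same license.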

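Data format

The benchmark data is distributed in a simple JSON format, with all contexts, targets, and substitutes stored as keys. Each context, target, and substitute is associated with a unique ID, which is a SHA1 hash of its contents. The contents are as follows:

Each context contains:
- context: the context text
- extra: additional information about this context

Each target contains:
- context_id: the ID of the context this target comes from
- target: the target word text
- offset: the character-level integer offset of the target word within its context
- pos: the part of speech of the target word
- extra: additional information about this target

Each substitute contains:
- target_id: the ID of the target this substitute corresponds to
- substitute: the substitute text
- extra: additional information about this substitute

The labels for each substitute ID are stored under the substitute_labels key.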
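The cross-references in this layout can be checked mechanically. The following is a minimal sketch, not part of the original repository: the function name check_benchmark is hypothetical, the toy IDs ("c1", "t1", "s1"), the "ADJ" tag, and the label values are invented placeholders, and it assumes that offset is the start index of the target string inside the context text.

```python
def check_benchmark(swords):
    """Return a list of problems found in a Swords-style dictionary."""
    problems = []
    for tid, target in swords["targets"].items():
        context = swords["contexts"].get(target["context_id"])
        if context is None:
            problems.append(f"target {tid}: unknown context_id")
            continue
        # Assumption: offset is the start index of the target in the context.
        start = target["offset"]
        end = start + len(target["target"])
        if context["context"][start:end] != target["target"]:
            problems.append(f"target {tid}: offset does not match target text")
    for sid, substitute in swords["substitutes"].items():
        if substitute["target_id"] not in swords["targets"]:
            problems.append(f"substitute {sid}: unknown target_id")
        if sid not in swords["substitute_labels"]:
            problems.append(f"substitute {sid}: missing labels")
    return problems


# Toy data in the documented layout. Real IDs are SHA1 hashes; the short IDs,
# the "ADJ" tag, and the label values here are made up for illustration.
sentence = "My favorite thing about her is her straightforward honesty."
toy = {
    "contexts": {"c1": {"context": sentence, "extra": {}}},
    "targets": {
        "t1": {
            "context_id": "c1",
            "target": "straightforward",
            "offset": sentence.index("straightforward"),
            "pos": "ADJ",
            "extra": {},
        }
    },
    "substitutes": {
        "s1": {"target_id": "t1", "substitute": "candid", "extra": {}}
    },
    "substitute_labels": {"s1": ["TRUE", "TRUE", "FALSE"]},
}

# An empty list means no problems were found.
print(check_benchmark(toy))
```

Running the same check on a downloaded file before evaluation is a cheap way to catch a truncated or mismatched download.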

Example code

The following example Python code reads this format:

```python
from collections import defaultdict
import gzip
import json

# Load benchmark
with gzip.open('swords-v1.1_dev.json.gz', 'r') as f:
    swords = json.load(f)

# Gather substitutes by target
tid_to_sids = defaultdict(list)
for sid, substitute in swords['substitutes'].items():
    tid_to_sids[substitute['target_id']].append(sid)

# Iterate through targets
for tid, target in swords['targets'].items():
    context = swords['contexts'][target['context_id']]
    substitutes = [swords['substitutes'][sid] for sid in tid_to_sids[tid]]
    labels = [swords['substitute_labels'][sid] for sid in tid_to_sids[tid]]
    # Score = fraction of annotator labels equal to 'TRUE'
    scores = [label_list.count('TRUE') / len(label_list) for label_list in labels]
    print('-' * 80)
    print(context['context'])
    print('-' * 20)
    print('{} ({})'.format(target['target'], target['pos']))
    # Substitutes sorted by descending score
    print(', '.join(
        '{} ({}%)'.format(substitute['substitute'], round(score * 100))
        for substitute, score in sorted(
            zip(substitutes, scores), key=lambda x: -x[1])))
    break  # Only show the first target
```

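Historical benchmark data

For convenience, we have packaged earlier lexical substitution benchmarks in the same JSON format designed for Swords ⚔️. The CoInCo benchmark data can be downloaded here:

Other benchmarks such as SemEval07 and TWSI can be created by running the following script:

Evaluating new methods

Below is example code for evaluating a new lexical substitution method:

Generative setting

In the generative setting, a lexical substitution method must output a ranked list of substitute candidates for a given target: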

```python
import gzip
import json
import random
import warnings

with gzip.open('swords-v1.1_dev.json.gz', 'r') as f:
    swords = json.load(f)


def generate(context, target, target_offset, target_pos=None):
    # Placeholder: returns ten common verbs with random scores.
    # Replace the body with your own method.
    substitutes = ['be', 'have', 'do', 'say', 'get', 'make', 'go', 'know', 'take', 'see']
    scores = [random.random() for _ in substitutes]
    return list(zip(substitutes, scores))


result = {'substitutes_lemmatized': True, 'substitutes': {}}
errors = 0
for tid, target in swords['targets'].items():
    context = swords['contexts'][target['context_id']]
    try:
        result['substitutes'][tid] = generate(
            context['context'],
            target['target'],
            target['offset'],
            target_pos=target.get('pos'))
    except Exception:
        # Skip targets the method fails on, but count them
        errors += 1
        continue

if errors > 0:
    warnings.warn(f'{errors} targets were not evaluated due to errors')

with open('swords-v1.1_dev_mygenerator.lsr.json', 'w') as f:
    f.write(json.dumps(result))
```

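Ranking setting

In the ranking setting, a lexical substitution method must rank the list of candidates from the benchmark by contextual relevance: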

```python
import gzip
import json
import random
import warnings

with gzip.open('swords-v1.1_dev.json.gz', 'r') as f:
    swords = json.load(f)


def score(context, target, target_offset, substitute, target_pos=None):
    # Placeholder: returns a random score. Replace with your own method.
    return random.random()


result = {'substitutes_lemmatized': True, 'substitutes': {}}
errors = 0
for sid, substitute in swords['substitutes'].items():
    tid = substitute['target_id']
    target = swords['targets'][tid]
    context = swords['contexts'][target['context_id']]
    if tid not in result['substitutes']:
        result['substitutes'][tid] = []
    try:
        substitute_score = score(
            context['context'],
            target['target'],
            target['offset'],
            substitute['substitute'],
            target_pos=target.get('pos'))
        result['substitutes'][tid].append(
            (substitute['substitute'], substitute_score))
    except Exception:
        # Skip substitutes the method fails on, but count them
        errors += 1
        continue

if errors > 0:
    warnings.warn(f'{errors} substitutes were not evaluated due to errors')

with open('swords-v1.1_dev_myranker.lsr.json', 'w') as f:
    f.write(json.dumps(result))
```

Running the evaluation

To evaluate the generator above, copy swords-v1.1_dev_mygenerator.lsr.json into the notebooks directory and run:

```sh
./cli.sh eval swords-v1.1_dev --result_json_fp notebooks/swords-v1.1_dev_mygenerator.lsr.json --output_metrics_json_fp notebooks/mygenerator.metrics.json
```

To evaluate the ranker above, copy swords-v1.1_dev_myranker.lsr.json into the notebooks directory and run:

```sh
./cli.sh eval swords-v1.1_dev --result_json_fp notebooks/swords-v1.1_dev_myranker.lsr.json --output_metrics_json_fp notebooks/myranker.metrics.json --metrics gap_rat
```
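The ranking setting is scored with generalized average precision (GAP). For intuition, here is a minimal sketch of GAP for a single target, following the usual definition from the lexical substitution literature. It is not the repository's implementation: the function name is hypothetical, and the exact variant selected by gap_rat is not specified here.

```python
def generalized_average_precision(ranked_weights, gold_weights):
    """Compute GAP for one target.

    ranked_weights: the gold weight of each candidate, in the order the
        system ranked them (0 for candidates that are not gold substitutes).
    gold_weights: the gold weights of the acceptable substitutes, any order.
    """
    ideal = sorted((w for w in gold_weights if w > 0), reverse=True)
    if not ideal:
        return 0.0

    def cumulative_average_sum(weights):
        # Sum, over positions holding a positively weighted item, of the
        # average weight of all items ranked up to and including that position.
        total, running = 0.0, 0.0
        for position, weight in enumerate(weights, start=1):
            running += weight
            if weight > 0:
                total += running / position
        return total

    # Normalize by the value obtained by the ideal (descending) ranking.
    return cumulative_average_sum(ranked_weights) / cumulative_average_sum(ideal)


# A ranking that places a non-gold candidate between two gold ones
imperfect = generalized_average_precision([3, 0, 1], gold_weights=[3, 1])
# The ideal ranking of the same candidates
perfect = generalized_average_precision([3, 1, 0], gold_weights=[3, 1])
print(imperfect, perfect)
```

Because each term is an average of weights rather than a count of hits, GAP rewards placing highly rated substitutes early, not merely placing any acceptable substitute early.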

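Collected summary

Dataset introduction

Construction

In lexical substitution research, traditional benchmarks rely on humans recalling substitutes from memory, which limits coverage and yields uneven quality. SWORDS instead recasts lexical substitution as a classification problem. It first selects context and target word pairs from the COINCO benchmark, then generates a candidate pool from a context-free thesaurus and merges in human-annotated substitutes to raise coverage. A two-stage annotation process follows: crowd annotators first make a binary appropriateness judgment on each candidate, and candidates that receive positive feedback are then scored more finely. The result is a set of triples, each carrying a contextual appropriateness score.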
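The two-stage process can be illustrated with a small sketch. Everything specific here is an assumption made for illustration: the function name, the number of annotators per stage, the rule that a single positive first-stage label is enough to advance, and the toy labels. The source only states that candidates with positive feedback receive further, finer-grained scoring.

```python
def two_stage_scores(first_stage, second_stage, min_positive=1):
    """Combine binary labels from two annotation stages into scores.

    first_stage: dict mapping candidate -> list of bool labels.
    second_stage: dict mapping candidate -> list of extra bool labels,
        collected only for candidates that passed the first stage.
    min_positive: hypothetical number of positive first-stage labels
        a candidate needs in order to advance.
    """
    scores = {}
    for candidate, labels in first_stage.items():
        all_labels = list(labels)
        if sum(all_labels) >= min_positive:
            # Candidate advanced: add the second-stage labels.
            all_labels += second_stage.get(candidate, [])
        # Score = fraction of all collected labels that are positive.
        scores[candidate] = sum(all_labels) / len(all_labels)
    return scores


# Invented labels for two candidates of the target "straightforward"
first = {"candid": [True, True, False], "simple": [False, False, False]}
second = {"candid": [True, True, True, False]}
scores = two_stage_scores(first, second)
print(scores)
```

Filtering before the second stage concentrates annotation effort on plausible candidates, which is how a large thesaurus-derived pool can be labeled at reasonable cost.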

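Usage

The dataset supports two evaluation paradigms, generative and ranking. In generative evaluation, the system must produce and rank its own list of substitutes, and is measured by precision, recall, and their harmonic mean under a fixed number of recommendations. Ranking evaluation requires the system to rank all candidates in the dataset by appropriateness, and uses generalized average precision to assess ranking quality. The dataset ships with standardized evaluation scripts and baseline model implementations, and supports comparison with earlier benchmarks such as SEMEVAL and COINCO. Its fine-grained scores let researchers adjust the acceptance threshold to suit their application, which gives several perspectives on how lexical substitution systems perform in scenarios such as writing assistance.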
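To make the generative metrics concrete, the sketch below computes precision, recall, and their harmonic mean for one target at a fixed list length, with an adjustable acceptance threshold. The function name, the default k and threshold, and the toy scores are illustrative assumptions, not the benchmark's official settings.

```python
def precision_recall_f(predicted, gold_scores, k=10, threshold=0.5):
    """Score a ranked list of predicted substitutes for one target.

    predicted: substitutes ordered from best to worst (assumed unique).
    gold_scores: dict mapping substitute -> appropriateness score in [0, 1].
    k: number of top predictions to evaluate (illustrative default).
    threshold: a gold substitute counts as acceptable if its score
        exceeds this value (illustrative default).
    """
    top_k = predicted[:k]
    acceptable = {w for w, s in gold_scores.items() if s > threshold}
    if not top_k or not acceptable:
        return 0.0, 0.0, 0.0
    hits = sum(1 for w in top_k if w in acceptable)
    precision = hits / len(top_k)
    recall = hits / len(acceptable)
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Invented gold scores and system output for illustration
gold = {"frank": 0.9, "candid": 0.8, "direct": 0.6, "simple": 0.2}
system_output = ["frank", "blunt", "direct", "simple"]
p, r, f = precision_recall_f(system_output, gold, k=3, threshold=0.5)
print(p, r, f)
```

Raising the threshold keeps only substitutes that most annotators accepted, which trades recall for stricter quality, while lowering it does the opposite.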

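Background and challenges

Background overview

SWORDS (the Stanford Word Substitution Benchmark) was released in 2021 by a research team at Stanford University to improve how lexical substitution is evaluated. The dataset targets the lexical substitution problem in natural language processing: finding suitable replacements for a given target word in context, with writing assistance as the motivating application. Compared with earlier benchmarks such as SEMEVAL and COINCO, SWORDS restructures data collection. It formalizes the task as a classification problem, generates candidates from a context-free thesaurus, and relies on human annotators to judge whether each candidate fits the context, which raises both the coverage and the quality of the data. Its main contribution is removing the dependence on human recall during data collection, giving lexical substitution systems a more complete and reliable evaluation framework.

Current challenges

The challenges facing SWORDS fall into two groups. First, the task itself is complex and subjective: a system must distinguish subtle differences between near-synonyms, resolve the ambiguity of polysemous words, and ensure that a substitute preserves the original meaning while fitting the grammar and style of the sentence. Second, building the data required overcoming the low coverage and uneven quality of traditional methods. The research team designed a new annotation process, combining a thesaurus with human judgment, to balance coverage against quality, and had to handle subjective disagreement between annotators to keep the data consistent and reliable. These challenges led SWORDS to adopt a fine-grained scoring mechanism, which gives a more precise basis for model evaluation.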

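Common scenarios

Classic use case

In natural language processing, lexical substitution aims to find semantically appropriate replacements for a target word in a given context. By providing a high-quality, high-coverage annotated resource, SWORDS serves as a benchmark for evaluating lexical substitution systems. Its design is motivated by the practical needs of writing assistance, where a system should recommend suitable words that a person would be unlikely to think of unaided, improving the variety and precision of the text.

Academic problems addressed

SWORDS addresses the limited coverage and annotation quality of traditional lexical substitution benchmarks. Earlier benchmarks relied on human recall, so rare but appropriate substitutes were missing, and annotators often supplied low-quality or contextually unsuitable suggestions. By recasting the task as a classification problem, generating candidates from thesaurus resources, and having humans judge their contextual appropriateness, the dataset improves the diversity and accuracy of substitutes and provides a more reliable evaluation basis for research on lexical semantics and word sense disambiguation.

Practical applications

The dataset is relevant to practical scenarios such as intelligent writing assistance, text polishing, and machine translation. In a document editor, for example, a system evaluated on SWORDS can recommend context-aware word substitutions that help users refine their wording. It also shows potential for language learning applications in education, where it can give learners near-synonym examples that fit the context.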

Recent research on the dataset