
dcml0714/Heros

Source: Hugging Face. Updated: 2023-06-08. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/dcml0714/Heros
Resource description:
---
license: apache-2.0
size_categories:
- n<1K
---

HEROS is a dataset for comparing the cosine similarity of sentences that have high lexical overlap but differ in semantics. See the paper "Revealing the Blind Spot of Sentence Encoder Evaluation by HEROS" for details on how the dataset is constructed and for a comparison of different sentence encoders.

The dataset file `heros.tsv` has 6 columns: `Original`, `Synonym`, `Antonym`, `Negation`, `Random`, and `Typo`. The first column, `Original`, contains sentences from the GoEmotions dataset. The sentences in the other columns are constructed by replacing some words in the original sentences according to different rules, and each column makes up one subset of HEROS.

Different subsets capture different aspects of semantics. Comparing the average cosine similarity between minimal pairs in Synonym and in Antonym shows whether replacing a word with an antonym moves a sentence further from the original semantics than replacing it with a synonym. The average cosine similarity between minimal pairs in Negation shows how negation affects sentence embedding similarity. Typos are realistic and happen every day; humans can usually infer the intended word and recover the meaning of the sentence, so the Typo subset shows how typos affect a sentence's similarity to the original. The Random MLM subset shows how similar sentence embeddings can be when two sentences are semantically different but have high lexical overlap. Comparing the performance of different sentence encoders (SEs) on the different subsets of HEROS reveals the traits of each SE.
Provided by:
dcml0714
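Summary of the original information

HEROS dataset overview

Dataset description

The HEROS dataset is used to compare the cosine similarity between sentences that have high lexical overlap but different semantics. It is divided into subsets that each target a different aspect of semantics, so that different sentence encoders can be evaluated and compared.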

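Dataset composition

  • File name: heros.tsv
  • Number of columns: 6
  • Column contents:
    • Original: sentences from the GoEmotions dataset
    • Synonym: sentences formed by replacing a word in the original sentence with a synonym
    • Antonym: sentences formed by replacing a word in the original sentence with an antonym
    • Negation: sentences formed by replacing some words in the original sentence, used to measure the effect of negation
    • Random: sentences formed by replacing some words in the original sentence (the Random MLM subset)
    • Typo: sentences formed by replacing some words in the original sentence with typos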
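
As a minimal sketch of how the six columns might be read, the snippet below parses tab-separated text with Python's standard `csv` module. It assumes that `heros.tsv` starts with a header row carrying exactly the column names listed above; the function name `read_heros` and the sample row are hypothetical and are not taken from the dataset.

```python
import csv
import io

# Column names as listed in the dataset description.
EXPECTED_COLUMNS = ["Original", "Synonym", "Antonym", "Negation", "Random", "Typo"]


def read_heros(handle):
    """Parse tab-separated HEROS text into a list of row dicts.

    Assumption: the first line is a header containing EXPECTED_COLUMNS.
    """
    # QUOTE_NONE keeps quotation marks inside sentences as literal text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return [{c: row[c] for c in EXPECTED_COLUMNS} for row in reader]


# Made-up row for illustration only; it does not come from heros.tsv.
sample = io.StringIO(
    "\t".join(EXPECTED_COLUMNS) + "\n"
    + "\t".join([
        "the movie was good",
        "the movie was great",
        "the movie was bad",
        "the movie was not good",
        "the movie was long",
        "the movie was godo",
    ]) + "\n"
)
rows = read_heros(sample)
```

To read the real file, the same function can be given an open file handle, for example `open("heros.tsv", newline="", encoding="utf-8")`, assuming the file has been downloaded to the working directory.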

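Dataset uses

  • Comparing the average cosine similarity of minimal pairs in the Synonym and Antonym subsets shows whether replacing a word with an antonym makes a sentence less similar to the original semantics than replacing it with a synonym.
  • The average cosine similarity of minimal pairs in the Negation subset shows how negation affects sentence embedding similarity.
  • The Typo subset can be used to study how typos affect a sentence's similarity to the original sentence.
  • The Random MLM subset shows how similar the embeddings of two sentences can be when they differ semantically but have high lexical overlap.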
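
The comparisons above can be sketched as a per-subset average of cosine similarities between each original sentence and its variant. In the snippet below, `toy_encode` is a hypothetical bag-of-words stand-in, not a real sentence encoder, and the example row is made up; in practice `encode` would be replaced by the sentence encoder under evaluation.

```python
import math
from collections import Counter

SUBSETS = ["Synonym", "Antonym", "Negation", "Random", "Typo"]


def toy_encode(sentence):
    """Hypothetical stand-in encoder: sparse bag-of-words counts."""
    return Counter(sentence.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def average_similarity(rows, encode=toy_encode):
    """Average cosine similarity between Original and each subset column."""
    if not rows:
        raise ValueError("rows must not be empty")
    totals = {name: 0.0 for name in SUBSETS}
    for row in rows:
        original = encode(row["Original"])
        for name in SUBSETS:
            totals[name] += cosine(original, encode(row[name]))
    return {name: total / len(rows) for name, total in totals.items()}


# Made-up minimal pairs for illustration only; not taken from heros.tsv.
example_rows = [{
    "Original": "the movie was good",
    "Synonym": "the movie was great",
    "Antonym": "the movie was bad",
    "Negation": "the movie was not good",
    "Random": "the movie was long",
    "Typo": "the movie was godo",
}]
scores = average_similarity(example_rows)
```

Because the stand-in encoder only sees word overlap, it gives the Synonym and Antonym pairs in this example the same score, which is the kind of behavior the subsets are meant to expose in real sentence encoders.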

Dataset size

  • Size category: n<1K

License

  • License: Apache-2.0