Joanne/Metaphors_and_Analogies
Source: Hugging Face. Updated 2023-05-30; indexed 2024-03-04.
Download link: https://hf-mirror.com/datasets/Joanne/Metaphors_and_Analogies
---
task_categories:
- question-answering
- token-classification
language:
- en
---
# Metaphors and analogies datasets
These datasets contain word pairs and quadruples forming analogies, metaphoric mappings, or semantically unacceptable compositions.
- Pair instances are pairs of nouns A and B in a sentence of the form "A is a B".
- Quadruple instances are of the form <(A,B),(C,D)>.

There is an analogy when A is to B what C is to D.
The analogy is also a metaphor when (A,B) and (C,D) form a metaphoric mapping, usually when they come from different domains.
## Dataset Description
Language: English
### Datasets and paper links
| Name | Size | Labels | Description |
| ---------: | :----- |:-------- | :-------------------------------------------------------------------------- |
| `Cardillo` | 260*2 | 1, 2 | Pairs of "A is-a B" sentences composed of one metaphoric and one literal sentence. The two sentences of a given pair share the same B term. |
| `Jankowiak`| 120*3 | 0, 1, 2 | Triples of "A is-a/is-like-a B" sentences with exactly one literal, one semantically anomalous and one metaphoric sentence. |
| `Green` | 40*3 | 0, 1, 2 | Triples of proportional analogies, made of 4 terms <A, B, Ci, Di> each. One stem <A,B> is composed with 3 different <Ci,Di> pairs to form exactly one near analogy, one far analogy and one non-analogic quadruple. |
| `Kmiecik` | 720 | 0, 1, 2 | Quadruples <A,B,C,D> labelled as analogy: True/False and far_analogy: True/False. |
| `SAT-met` | 160?*5 | 0, 1, 2, 12 | One stem pair <A,B> is combined with 5 different pairs <Ci,Di> in an attempt to form proportional analogies. Only one <Ci,Di> forms an analogy with <A,B>. We additionally labelled the analogies as **metaphoric**: True/False. |
| Name | Paper Citation | Paper link | Dataset link |
| ---------: | :------- | :------------------------------ |-----------------------------------------: |
| `Cardillo` | | [Cardillo (2010)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2952404/) [Cardillo (2017)](https://link.springer.com/article/10.3758/s13428-016-0717-1) | |
| `Jankowiak`| | [Jankowiak (2020)](https://link-springer-com.abc.cardiff.ac.uk/article/10.1007/s10936-020-09695-7) | |
| `Green` | Green, A. E., Kraemer, D. J. M., Fugelsang, J., Gray, J. R., & Dunbar, K. (2010). Connecting Long Distance: Semantic Distance in Analogical Reasoning Modulates Frontopolar Cortex Activity. Cerebral Cortex, 20, 70-76. | [Green (20)]() | |
| `Kmiecik` | Kmiecik, M. J., Brisson, R. J., & Morrison, R. G. (2019). The time course of semantic and relational processing during verbal analogical reasoning. Brain and Cognition, 129, 25-34. | [Kmiecik (20)]() | |
| `SAT-met` | | [Turney (2005)](https://arxiv.org/pdf/cs/0508053.pdf) | |
### Labels :
- Pairs :
  - **0** : anomaly
  - **1** : literal
  - **2** : metaphor
- Quadruples :
  - **0** : not an analogy
  - **1** : an analogy but not a metaphor
  - **2** : an analogy and a metaphor or a far analogy
  - **12** : maybe a metaphor, somewhere between 1 and 2
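
The label scheme above can be written down as a small lookup, which is handy when collapsing the labels for a binary experiment. This is only a sketch: the dictionary and function names are illustrative, and the binary mapping (label 2 as the positive class, label 12 dropped unless requested) is an assumption, since the card does not say how the binary task is derived.

```python
from typing import Optional

# Label names copied from the lists above; the names of these objects are illustrative.
PAIR_LABELS = {0: "anomaly", 1: "literal", 2: "metaphor"}
QUADRUPLE_LABELS = {
    0: "not an analogy",
    1: "analogy, not a metaphor",
    2: "analogy and metaphor, or far analogy",
    12: "maybe a metaphor (between 1 and 2)",
}


def to_binary(label: int, ambiguous_as_metaphor: bool = False) -> Optional[int]:
    """Map a label to 1 (metaphoric) or 0 (not metaphoric).

    Assumption: label 2 is the positive class and labels 0 and 1 are negative.
    Label 12 is ambiguous, so it returns None (drop the instance) unless the
    caller opts in to counting it as metaphoric.
    """
    if label == 12:
        return 1 if ambiguous_as_metaphor else None
    if label not in (0, 1, 2):
        raise ValueError(f"unknown label: {label}")
    return 1 if label == 2 else 0
```

Returning `None` for label 12 keeps the decision about ambiguous SAT-met items explicit instead of silently assigning them to a class.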
### Dataset Splits
- Both lexical and random splits are available for classification experiments.
- Size of the splits :
  - **train** : 50 %
  - **validation** : 10 %
  - **test** : 40 %
- Additionally, for all datasets, the `5-folds` field gives frozen splits for a five-fold cross-validation experiment with train/val/test = 70/10/20 % of each set.
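
One way to use the `5-folds` field is to group the instances by the partition they are assigned to in a given fold. The sketch below assumes that `5-folds` holds five strings, one per fold, each naming a partition; that encoding is a guess based on the `list(str)` type, and the partition strings and records shown are invented for illustration, so the actual values should be checked against the data.

```python
from collections import defaultdict


def split_by_fold(records, fold):
    """Group records by the partition named in their `5-folds` entry for `fold`."""
    partitions = defaultdict(list)
    for record in records:
        partitions[record["5-folds"][fold]].append(record)
    return dict(partitions)


# Hypothetical records: the ids and the partition strings are invented for illustration.
records = [
    {"id": "a", "5-folds": ["train", "test", "train", "validation", "train"]},
    {"id": "b", "5-folds": ["test", "train", "train", "train", "validation"]},
    {"id": "c", "5-folds": ["train", "train", "test", "train", "train"]},
]

fold_0 = split_by_fold(records, 0)
print({name: [r["id"] for r in part] for name, part in fold_0.items()})
```

Because the function groups by whatever strings it finds, it does not need to be changed if the partitions turn out to be named differently.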
# Datasets for Classification
- Task : binary or 3-class classification of pairs or quadruples. Each pair or quadruple is to be classified as anomalous, non-metaphoric or metaphoric.
## Pairs
### Datasets names & splits :
| Original set | Dataset name | Split |
|-------------:| :------------ | :------ |
| Cardillo | Pairs\_Cardillo\_random_split | random |
| | Pairs\_Cardillo\_lexical_split | lexical |
| Jankowiak | Pairs\_Jankowiac\_random_split | random |
| | Pairs\_Jankowiac\_lexical_split | lexical |
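
As a rough sketch of how one of these sets might be loaded, the snippet below passes a dataset name from the table as the configuration name to `load_dataset` from the Hugging Face `datasets` library. That the table names are usable as configuration names is an assumption, and `DATASET_ID`, `PAIR_CLASSIFICATION_CONFIGS` and `load_config` are illustrative names, not part of the dataset.

```python
# Repository id taken from the download link of this card.
DATASET_ID = "Joanne/Metaphors_and_Analogies"

# Dataset names copied from the table above (spelling kept exactly as listed).
PAIR_CLASSIFICATION_CONFIGS = [
    "Pairs_Cardillo_random_split",
    "Pairs_Cardillo_lexical_split",
    "Pairs_Jankowiac_random_split",
    "Pairs_Jankowiac_lexical_split",
]


def load_config(config_name: str):
    """Load one configuration. Needs the `datasets` package and network access.

    Assumption: each dataset name in the table is a configuration of DATASET_ID.
    """
    # Imported lazily so the rest of the sketch runs without the library installed.
    from datasets import load_dataset

    return load_dataset(DATASET_ID, config_name)
```

A call such as `load_config("Pairs_Cardillo_random_split")` would then return the available splits for that configuration, if the assumption holds.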
### Data fields :
| Field | Description | Type |
| -------------:| :------------ | ---- |
| corpus | name of the original dataset | str |
| id | instance id | str |
| set_id | id of the set containing the given instance in the multiple choice task | int |
| label | 0, 1, 2 | int |
| sentence | A is-a B sentence. | str |
| A | A expression in the sentence | str |
| B | B expression in the sentence | str |
| A\_position | position of A in the sentence | list(int) |
| B\_position | position of B in the sentence | list(int) |
| 5-folds | frozen splits for cross validation | list(str) |
### Examples :
| Name | Example | Label|
| -------: | :------------------------------------- | :-------- |
|Cardillo | | |
|Jankowiak | | |
## Quadruples
### Datasets names & splits
| Original set | dataset name | Split |
| -------: | :------------------------------------- | :-------- |
|Green | Quadruples\_Green\_random_split | random |
| | Quadruples\_Green\_lexical_split | lexical |
|Kmiecik | Quadruples\_Kmiecik\_random_split | random |
| | Quadruples\_Kmiecik\_lexical\_split\_on\_AB | lexical AB |
| | Quadruples\_Kmiecik\_lexical_split\_on\_CD | lexical CD |
|SAT | Quadruples\_SAT\_random\_split | random |
| | Quadruples\_SAT\_lexical\_split | lexical |
### Data fields :
| Field| Description | Type |
| -------------: | :------------ | :------------ |
| corpus | Name of the original dataset | str |
| id | Element id | str |
| set\_id | Id of the set containing the given instance in the multiple-choice task datasets | int |
| label | 0, 1, 2, 12 | int |
| AB | pair of terms | list(str) |
| CD | pair of terms | list(str) |
| 5-folds | frozen splits for cross validation | list(str) |
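
Since a quadruple is stored as two term pairs (`AB` and `CD`), a text classifier needs some verbalization of it. The sketch below uses the "A is to B what C is to D" wording from the introduction; the function name is illustrative and the example terms are invented, not taken from the datasets.

```python
def verbalize(ab, cd):
    """Turn the `AB` and `CD` pairs of a quadruple into one analogy sentence."""
    a, b = ab
    c, d = cd
    return f"{a} is to {b} what {c} is to {d}"


# Hypothetical instance: the terms are invented for illustration.
instance = {"AB": ["bird", "nest"], "CD": ["bee", "hive"], "label": 1}
print(verbalize(instance["AB"], instance["CD"]))
```

Other templates would work equally well; the point is only that `AB` and `CD` each unpack into exactly two terms.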
### Examples :
| Name | Example | Label|
|-------: | :------------------------------------- | :-------- |
|Green | | |
|Kmiecik | | |
| SAT | | |
# Datasets for multiple choice questions or permutation
- Task : one stem and multiple choices. The stem is combined with each of its candidate choices to form a sentence, and each resulting sentence has a label <0,1,2>.
## Pairs
### Datasets names & splits :
| Original set | dataset name | Split |
| -----------|------| :---- |
| Cardillo | Pairs\_Cardillo\_set | test only |
| Jankowiak | Pairs\_Jankowiac\_set | test only |
### Data fields :
| Field | Description | Type |
| -------------: | :------------ | :------------ |
| corpus | Name of the original dataset | str |
| id | Element id | str |
| pair_ids | Ids of each pair as appearing in the classification datasets. | list(str) |
| labels | 0, 1, 2 | list(int) |
| sentences | List of the sentences composing the set | list(str) |
| A\_positions | Positions of the A's in each sentence | list(list(int)) |
| B\_positions | Positions of the B's in each sentence | list(list(int)) |
| answer | Index of the metaphor | int |
| stem | Term shared between the sentences of the set. | str |
| 5-folds | frozen splits for cross validation | list(str) |
### Examples :
| Name | Stem | Sentences |Label|
|-------: |-------: | :------------------------------------- | :-------- |
|Cardillo | comet | The astronomer's obsession was a comet. | 1 |
| | | The politician's career was a comet. | 2 |
| Jankowiak | harbour | This banana is like a harbour | 0 |
| | | A house is a harbour | 2|
| | | This area is a harbour | 1 |
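
The multiple-choice task can be scored by ranking the sentences of a set and checking whether the top-ranked one is the metaphor. The sketch below uses the `harbour` set from the table; it assumes `answer` is a 0-based index into `sentences`, and `toy_score` is a hypothetical stand-in for a real model's metaphoricity score.

```python
def pick_metaphor(sentences, score):
    """Return the index of the sentence with the highest score."""
    scores = [score(sentence) for sentence in sentences]
    return max(range(len(scores)), key=scores.__getitem__)


def set_accuracy(sets, score):
    """Fraction of sets whose top-scored sentence is the annotated metaphor."""
    correct = sum(pick_metaphor(s["sentences"], score) == s["answer"] for s in sets)
    return correct / len(sets)


example = {
    "stem": "harbour",
    "sentences": [
        "This banana is like a harbour",
        "A house is a harbour",
        "This area is a harbour",
    ],
    "labels": [0, 2, 1],
    "answer": 1,  # assumed to be the 0-based index of the sentence labelled 2
}


def toy_score(sentence):
    # Hypothetical scorer that simply favours one sentence; replace with a model.
    return 1.0 if sentence.startswith("A house") else 0.0


print(set_accuracy([example], toy_score))
```

If `answer` turns out to be 1-based in the data, it can be recomputed as `labels.index(2)` before scoring.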
## Quadruples
### Datasets names & splits :
| Original set | dataset name | Split |
| ----------: | :------| :---- |
| Green | Quadruples\_Green\_set | test only |
| SAT | Quadruples\_SAT\_met_set | test only |
### Data fields :
| Field | Description | Type |
|-------------: | :------------ | :------------ |
| corpus | name of the original dataset | str |
| id | Element id | str |
| pair\_ids | Ids of the instances as appearing in the classification datasets | list(str) |
| labels | 0, 1, 2, 12 | list(int) |
| answer | temp | int |
| stem | Word pair to compose with all the other pairs of the set | list(str) |
| pairs | List of word pairs | list(list(str)) |
| 5-folds | Frozen splits for cross validation | list(str) |
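
A set record can be unrolled into individual quadruples by composing the `stem` with each entry of `pairs`. The sketch below assumes that `labels[i]` is the label of the quadruple formed with `pairs[i]`, which the card does not state explicitly; the function name and the record are invented for illustration.

```python
def expand_set(record):
    """Compose the stem with every candidate pair, yielding one quadruple each."""
    return [
        {"AB": list(record["stem"]), "CD": list(pair), "label": label}
        for pair, label in zip(record["pairs"], record["labels"])
    ]


# Hypothetical record: terms and labels are invented for illustration.
record = {
    "stem": ["bird", "nest"],
    "pairs": [["bee", "hive"], ["thought", "mind"], ["car", "cloud"]],
    "labels": [1, 2, 0],
}

for quadruple in expand_set(record):
    print(quadruple)
```

The output uses the same `AB`, `CD` and `label` keys as the quadruple classification datasets, so both views can share downstream code.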
### Examples :
| Name | Example | Label|
|-------: | :------------------------------------- | :-------- |
|Green | | |
| | | |
| SAT | | |
Provider: Joanne



