---
license: mit
task_categories:
- question-answering
language:
- hi
- id
- su
- jv
- kn
- sw
- yo
size_categories:
- 1K<n<10K
---
# Dataset Card for multi-figqa
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
## Dataset Description
- **Homepage:** [Needs More Information]
- **Repository:** [Multi-FigQA](https://github.com/simran-khanuja/Multilingual-Fig-QA)
- **Paper:** [Multi-lingual and Multi-cultural Figurative Language Understanding](https://arxiv.org/abs/2305.16171)
- **Leaderboard:** [Needs More Information]
- **Point of Contact:** [Emmy Liu](mailto:emmy@cmu.edu)
### Dataset Summary
A multilingual dataset of human-written, creative figurative expressions (mostly metaphors and similes) in seven languages. The English version, which uses the same format, is available [here](https://huggingface.co/datasets/nightingal3/fig-qa).
### Languages
Languages included are Hindi, Indonesian, Javanese, Kannada, Sundanese, Swahili, and Yoruba. The language codes are respectively `hi`, `id`, `jv`, `kn`, `su`, `sw`, and `yo`.
## Dataset Structure
### Data Instances
```
{
'startphrase': the phrase,
'ending1': one possible answer,
'ending2': another possible answer,
'labels': 0 if ending1 is correct else 1
}
```
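The record layout above can be read as follows. This is a minimal sketch, not code from the dataset repository: the helper name `correct_ending` and the placeholder records are hypothetical and only illustrate how `labels` selects between `ending1` and `ending2`.

```python
from typing import Dict

# One record with the four fields described above.
Instance = Dict[str, object]


def correct_ending(instance: Instance) -> str:
    """Return the ending marked correct: labels 0 -> ending1, labels 1 -> ending2."""
    label = instance["labels"]
    if label not in (0, 1):
        raise ValueError(f"labels must be 0 or 1, got {label!r}")
    key = "ending1" if label == 0 else "ending2"
    return str(instance[key])


# Placeholder records (hypothetical, not taken from the dataset).
first = {
    "startphrase": "<figurative phrase A>",
    "ending1": "<interpretation 1>",
    "ending2": "<interpretation 2>",
    "labels": 0,
}
second = dict(first, startphrase="<figurative phrase B>", labels=1)

answer_first = correct_ending(first)
answer_second = correct_ending(second)
```

Because `labels` is an index rather than the answer text, a model's output only needs to be mapped to 0 or 1 before comparison.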
### Data Splits
All data in each language is originally intended to be used as a test set for that language.
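Since each language is treated as its own test set, results are naturally reported per language. The sketch below shows one way to compute per-language accuracy, assuming records are available as `(language_code, instance)` pairs; the function `accuracy_by_language`, the `always_first` baseline, and the placeholder records are hypothetical and not part of the dataset or its repository.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Tuple

Instance = Dict[str, object]


def accuracy_by_language(
    records: Iterable[Tuple[str, Instance]],
    predict: Callable[[Instance], int],
) -> Dict[str, float]:
    """Compute accuracy separately for each language code."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for lang, instance in records:
        totals[lang] += 1
        # A prediction is correct when it equals the gold index in `labels`.
        if predict(instance) == instance["labels"]:
            hits[lang] += 1
    return {lang: hits[lang] / totals[lang] for lang in totals}


def always_first(instance: Instance) -> int:
    """Trivial baseline (hypothetical): always choose ending1."""
    return 0


# Placeholder records (hypothetical, not taken from the dataset).
records = [
    ("hi", {"startphrase": "<p1>", "ending1": "<a>", "ending2": "<b>", "labels": 0}),
    ("hi", {"startphrase": "<p2>", "ending1": "<a>", "ending2": "<b>", "labels": 1}),
    ("sw", {"startphrase": "<p3>", "ending1": "<a>", "ending2": "<b>", "labels": 0}),
]

scores = accuracy_by_language(records, always_first)
```

Replacing `always_first` with a function that wraps a real model would give that model's per-language accuracy under the same assumptions.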
## Dataset Creation
### Curation Rationale
Figurative language permeates human communication, yet it remains relatively understudied in NLP. Datasets have been created in English to accelerate progress towards measuring and improving figurative language processing in language models (LMs). However, figurative language expresses cultural and societal experiences, so phrases from one language or culture are difficult to apply universally. We created this dataset as part of an effort to introduce more culturally relevant training data for different languages and cultures.
### Source Data
#### Who are the source language producers?
The language producers were hired to write creative sentences in their native languages.
## Additional Information
### Citation Information
If you find this dataset helpful, please cite:
```
@misc{kabra2023multilingual,
  title={Multi-lingual and Multi-cultural Figurative Language Understanding},
  author={Anubha Kabra and Emmy Liu and Simran Khanuja and Alham Fikri Aji and Genta Indra Winata and Samuel Cahyawijaya and Anuoluwapo Aremu and Perez Ogayo and Graham Neubig},
  year={2023},
  eprint={2305.16171},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```