msalnikov/mintaka
Hugging Face · Updated 2024-04-18 · Indexed 2024-06-11
Download link:
https://hf-mirror.com/datasets/msalnikov/mintaka
Resource description:
---
annotations_creators:
- expert-generated
language_creators:
- found
language:
- en
license:
- cc-by-4.0
multilinguality:
- ar
- de
- ja
- hi
- pt
- en
- es
- it
- fr
size_categories:
- 100K<n<1M
source_datasets:
- https://huggingface.co/datasets/AmazonScience/mintaka
task_categories:
- question-answering
task_ids:
- open-domain-qa
paperswithcode_id: mintaka
pretty_name: Mintaka
language_bcp47:
- ar-SA
- de-DE
- ja-JP
- hi-HI
- pt-PT
- en-EN
- es-ES
- it-IT
- fr-FR
dataset_info:
- config_name: T5-Large-SSM-answers
  features:
  - name: id
    dtype: string
  - name: lang
    dtype: string
  - name: question
    dtype: string
  - name: answerText
    dtype: string
  - name: category
    dtype: string
  - name: complexityType
    dtype: string
  - name: questionEntity
    list:
    - name: entityType
      dtype: string
    - name: label
      dtype: string
    - name: mention
      dtype: string
    - name: name
      dtype: string
    - name: span
      sequence: int32
  - name: answerEntity
    list:
    - name: label
      dtype: string
    - name: name
      dtype: string
  - name: generated_answers
    sequence: string
  splits:
  - name: train
    num_bytes: 17007018
    num_examples: 14000
  - name: validation
    num_bytes: 3009249
    num_examples: 2000
  - name: test
    num_bytes: 4805840
    num_examples: 4000
  download_size: 9102570
  dataset_size: 24822107
- config_name: T5-Large-SSM-answers-linked
  features:
  - name: id
    dtype: string
  - name: lang
    dtype: string
  - name: question
    dtype: string
  - name: answerText
    dtype: string
  - name: category
    dtype: string
  - name: complexityType
    dtype: string
  - name: questionEntity
    list:
    - name: entityType
      dtype: string
    - name: label
      dtype: string
    - name: mention
      dtype: string
    - name: name
      dtype: string
    - name: span
      sequence: int32
  - name: answerEntity
    list:
    - name: label
      dtype: string
    - name: name
      dtype: string
  - name: generated_answers
    sequence: string
  - name: linked_generated_answers
    sequence:
      sequence: string
  splits:
  - name: test
    num_bytes: 7736223
    num_examples: 4000
  download_size: 2940268
  dataset_size: 7736223
- config_name: T5-XL-SSM-answers
  features:
  - name: id
    dtype: string
  - name: lang
    dtype: string
  - name: question
    dtype: string
  - name: answerText
    dtype: string
  - name: category
    dtype: string
  - name: complexityType
    dtype: string
  - name: questionEntity
    list:
    - name: entityType
      dtype: string
    - name: label
      dtype: string
    - name: mention
      dtype: string
    - name: name
      dtype: string
    - name: span
      sequence: int32
  - name: answerEntity
    list:
    - name: label
      dtype: string
    - name: name
      dtype: string
  - name: generated_answers
    sequence: string
  splits:
  - name: train
    num_bytes: 35216238
    num_examples: 28000
  - name: validation
    num_bytes: 2593667
    num_examples: 2000
  - name: test
    num_bytes: 5095967
    num_examples: 4000
  download_size: 14752335
  dataset_size: 42905872
- config_name: default
  features:
  - name: id
    dtype: string
  - name: lang
    dtype: string
  - name: question
    dtype: string
  - name: answerText
    dtype: string
  - name: category
    dtype: string
  - name: complexityType
    dtype: string
  - name: questionEntity
    list:
    - name: name
      dtype: string
    - name: entityType
      dtype: string
    - name: label
      dtype: string
    - name: mention
      dtype: string
    - name: span
      list: int32
  - name: answerEntity
    list:
    - name: name
      dtype: string
    - name: label
      dtype: string
  - name: relevant_triplets
    sequence:
      sequence: string
  - name: verbalized_relevant_triplets
    sequence: string
  - name: relevant_triplets_g2t
    dtype: string
  splits:
  - name: train
    num_bytes: 19478013
    num_examples: 14000
  - name: validation
    num_bytes: 2791664
    num_examples: 2000
  - name: test
    num_bytes: 5572329
    num_examples: 4000
  download_size: 8236687
  dataset_size: 27842006
configs:
- config_name: T5-Large-SSM-answers
  data_files:
  - split: train
    path: T5-Large-SSM-answers/train-*
  - split: validation
    path: T5-Large-SSM-answers/validation-*
  - split: test
    path: T5-Large-SSM-answers/test-*
- config_name: T5-Large-SSM-answers-linked
  data_files:
  - split: test
    path: T5-Large-SSM-answers-linked/test-*
- config_name: T5-XL-SSM-answers
  data_files:
  - split: train
    path: T5-XL-SSM-answers/train-*
  - split: validation
    path: T5-XL-SSM-answers/validation-*
  - split: test
    path: T5-XL-SSM-answers/test-*
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
---
# Mintaka: A Complex, Natural, and Multilingual Dataset for End-to-End Question Answering
An extended version of the original Mintaka dataset, with relevant triplets extracted for the question entities (using the method from [KAPING](https://arxiv.org/abs/2306.04136v1)).
The relevant triplets are converted to text by a [T5 model tuned on the WebNLG dataset](https://huggingface.co/s-nlp/g2t-t5-xl-webnlg).
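To illustrate how the extracted triplets might be used, the sketch below prepends them to a question in the spirit of KAPING-style prompting. It assumes that each entry of `relevant_triplets` is a list of strings such as `[subject, relation, object]`; the card does not state the inner layout. The prompt wording, the `build_prompt` and `verbalize` helpers, and the example record are all hypothetical, not part of the dataset or the KAPING paper.

```python
from typing import Mapping, Sequence


def verbalize(triplet: Sequence[str]) -> str:
    """Render one triplet as a plain "(a, b, c)" string (assumed layout)."""
    return "(" + ", ".join(triplet) + ")"


def build_prompt(record: Mapping, max_facts: int = 5) -> str:
    """Prepend up to max_facts verbalized triplets to the record's question."""
    triplets = record.get("relevant_triplets") or []
    facts = [verbalize(t) for t in triplets[:max_facts]]
    parts = []
    if facts:
        # Hypothetical header wording, chosen only for this illustration.
        parts.append("Facts that may be relevant to the question:")
        parts.extend(facts)
    parts.append(f"Question: {record['question']}")
    parts.append("Answer:")
    return "\n".join(parts)


# Made-up record that mimics the `default` config's field names.
example = {
    "question": "Who directed Inception?",
    "relevant_triplets": [
        ["Inception", "director", "Christopher Nolan"],
        ["Inception", "publication date", "2010"],
    ],
}
prompt = build_prompt(example)
print(prompt)
```

The same function could read `verbalized_relevant_triplets` instead, if the pre-verbalized strings are preferred over the raw triplets.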
In addition, the configs with the corresponding names provide generated answers from the [T5 Large SSM](https://huggingface.co/msalnikov/kgqa-mintaka-t5-large-ssm) and [T5 XL SSM](https://huggingface.co/msalnikov/kgqa-mintaka-t5-xl-ssm-nq) models, both tuned on Mintaka.
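As a rough sketch of how the `generated_answers` field could be consumed, the code below scores records by checking whether `answerText` appears among the first `k` generated answers. Two things are assumptions rather than facts from the card: that `generated_answers` is ordered from best to worst candidate, and that a normalized string match against `answerText` is a suitable metric. The `hits_at_k` and `load_split` helpers and the toy records are hypothetical; `load_split` relies on the `datasets` library and network access, so it is defined but not called here.

```python
from typing import Iterable, Mapping


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(text.lower().split())


def hits_at_k(records: Iterable[Mapping], k: int = 1) -> float:
    """Fraction of records whose answerText is among the first k candidates."""
    records = list(records)
    if not records:
        return 0.0
    hits = 0
    for rec in records:
        gold = normalize(rec["answerText"] or "")
        candidates = [normalize(a) for a in rec["generated_answers"][:k]]
        hits += gold in candidates
    return hits / len(records)


def load_split(config: str = "T5-Large-SSM-answers", split: str = "test"):
    """Load one config and split from the Hub (requires `datasets` and network)."""
    from datasets import load_dataset  # imported lazily on purpose

    return load_dataset("msalnikov/mintaka", config, split=split)


# Toy records that only mimic the two fields used above.
toy = [
    {"answerText": "Paris", "generated_answers": ["paris", "Lyon"]},
    {"answerText": "7", "generated_answers": ["8", "7"]},
]
print(hits_at_k(toy, k=1), hits_at_k(toy, k=2))
```

With real data, `hits_at_k(load_split(), k=1)` would apply the same check to the test split of the chosen config.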
Provider:
msalnikov
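Summary of original information
Dataset overview
Dataset name: Mintaka
Dataset tasks:
- Task category: question-answering
- Task ID: open-domain-qa
Dataset languages:
- Supported languages: ar, de, ja, hi, pt, en, es, it, fr
- BCP 47 language codes: ar-SA, de-DE, ja-JP, hi-HI, pt-PT, en-EN, es-ES, it-IT, fr-FR
Dataset license: cc-by-4.0
Dataset size:
- Size category: 100K<n<1M
- Download size: 9102570 bytes (T5-Large-SSM-answers config)
- Total dataset size: 24822107 bytes (T5-Large-SSM-answers config)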
Dataset configurations
Config name: T5-Large-SSM-answers
- Features:
  - id: string
  - lang: string
  - question: string
  - answerText: string
  - category: string
  - complexityType: string
  - questionEntity:
    - entityType: string
    - label: string
    - mention: string
    - name: string
    - span: sequence of int32
  - answerEntity:
    - label: string
    - name: string
  - generated_answers: sequence of string
- Splits:
  - train: 17007018 bytes, 14000 examples
  - validation: 3009249 bytes, 2000 examples
  - test: 4805840 bytes, 4000 examples
Config name: T5-Large-SSM-answers-linked
- Features:
  - id: string
  - lang: string
  - question: string
  - answerText: string
  - category: string
  - complexityType: string
  - questionEntity:
    - entityType: string
    - label: string
    - mention: string
    - name: string
    - span: sequence of int32
  - answerEntity:
    - label: string
    - name: string
  - generated_answers: sequence of string
  - linked_generated_answers: sequence of sequence of string
- Splits:
  - test: 7736223 bytes, 4000 examples
Config name: T5-XL-SSM-answers
- Features:
  - id: string
  - lang: string
  - question: string
  - answerText: string
  - category: string
  - complexityType: string
  - questionEntity:
    - entityType: string
    - label: string
    - mention: string
    - name: string
    - span: sequence of int32
  - answerEntity:
    - label: string
    - name: string
  - generated_answers: sequence of string
- Splits:
  - train: 35216238 bytes, 28000 examples
  - validation: 2593667 bytes, 2000 examples
  - test: 5095967 bytes, 4000 examples
Config name: default
- Features:
  - id: string
  - lang: string
  - question: string
  - answerText: string
  - category: string
  - complexityType: string
  - questionEntity:
    - name: string
    - entityType: string
    - label: string
    - mention: string
    - span: list of int32
  - answerEntity:
    - name: string
    - label: string
  - relevant_triplets: sequence of sequence of string
  - verbalized_relevant_triplets: sequence of string
  - relevant_triplets_g2t: string
- Splits:
  - train: 19478013 bytes, 14000 examples
  - validation: 2791664 bytes, 2000 examples
  - test: 5572329 bytes, 4000 examples
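The per-split byte counts listed for each config add up to that config's `dataset_size` in the card metadata. The short check below makes that arithmetic explicit; the numbers are copied from the card, while the `CARD_SIZES` dictionary and the `check_sizes` helper are names introduced only for this illustration.

```python
# Split sizes (bytes) and dataset_size values copied from the card metadata.
CARD_SIZES = {
    "T5-Large-SSM-answers": {
        "splits": {"train": 17007018, "validation": 3009249, "test": 4805840},
        "dataset_size": 24822107,
    },
    "T5-Large-SSM-answers-linked": {
        "splits": {"test": 7736223},
        "dataset_size": 7736223,
    },
    "T5-XL-SSM-answers": {
        "splits": {"train": 35216238, "validation": 2593667, "test": 5095967},
        "dataset_size": 42905872,
    },
    "default": {
        "splits": {"train": 19478013, "validation": 2791664, "test": 5572329},
        "dataset_size": 27842006,
    },
}


def check_sizes(card: dict) -> dict:
    """Map each config name to whether its split sizes sum to dataset_size."""
    return {
        name: sum(info["splits"].values()) == info["dataset_size"]
        for name, info in card.items()
    }


results = check_sizes(CARD_SIZES)
for name, ok in results.items():
    print(f"{name}: {'consistent' if ok else 'mismatch'}")
```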



