TheMrguiller/BilbaoQA
Source: Hugging Face; updated 2023-08-24; indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/TheMrguiller/BilbaoQA
Resource description:
---
dataset_info:
  features:
  - name: caption
    dtype: string
  - name: image
    dtype: image
  - name: question
    dtype: string
  - name: choices
    dtype: string
  - name: answer
    dtype: string
  - name: solution
    dtype: string
  - name: CTH
    dtype: bool
  splits:
  - name: train
    num_bytes: 1368875715
    num_examples: 3960
  - name: test
    num_bytes: 346986615
    num_examples: 990
  download_size: 1709263149
  dataset_size: 1715862330
task_categories:
- question-answering
- visual-question-answering
language:
- en
tags:
- code
size_categories:
- 1K<n<10K
---
# Dataset Card for "BilbaoQA"
## Dataset Description
- **Homepage:** https://github.com/TheMrguiller/MUCSI_Modal
- **Repository:** https://github.com/TheMrguiller/MUCSI_Modal
- **Paper:** This dataset is a follow-up to the Flamingo model paper.
- **Leaderboard:**
- **Point of Contact:** https://github.com/TheMrguiller/MUCSI_Modal
### Dataset Summary
This dataset was collected for a project in the master's degree in Computation and Intelligent Systems at the University of Deusto. It was built by students from images gathered on web pages that are well known in the Basque Country: Deia and Getimages. The questions and answers were created with a set of models that generate this information from a text description.
### Supported Tasks and Leaderboards
The dataset is prepared for visual question answering.
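A minimal loading sketch, assuming the Hugging Face `datasets` library is installed and that the repository id `TheMrguiller/BilbaoQA` from the download link resolves on the Hub. Streaming is used here only to avoid downloading the full archive (about 1.7 GB according to the metadata) up front; the helper name `load_bilbaoqa` is illustrative and not part of any library.

```python
# Column names as listed in the dataset_info metadata of this card.
EXPECTED_COLUMNS = ["caption", "image", "question", "choices", "answer", "solution", "CTH"]


def load_bilbaoqa(split="train", streaming=True):
    """Return the requested split of BilbaoQA (hypothetical helper).

    Assumes `pip install datasets` and network access to the Hub.
    """
    # Imported lazily so this module can be imported without the library.
    from datasets import load_dataset

    return load_dataset("TheMrguiller/BilbaoQA", split=split, streaming=streaming)
```

With network access, `next(iter(load_bilbaoqa("train")))` should return one row as a dictionary keyed by the column names above.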
### Languages
The dataset is in English.
## Dataset Structure
### Data Fields
- `image`: The image, which is the context given to the model.
- `question`: The question that the model has to answer from the image context.
- `choices`: The multiple-choice options.
- `answer`: The answer selected from the choices.
- `solution`: The chain-of-thought reasoning behind the selected answer.
- `CTH`: A flag that indicates whether the row lacks a chain of thought.
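To illustrate how these fields might be combined for multiple-choice visual question answering, here is a small sketch. Because `choices` is stored as a single string and the card does not document its encoding, the parser below assumes a Python-style list literal and falls back to the raw string otherwise. `parse_choices` and `build_prompt` are hypothetical helpers, and the example row is made up.

```python
import ast


def parse_choices(raw):
    """Turn the `choices` string into a list of options (format is assumed)."""
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return [raw]  # Not a literal: keep the raw string as a single option.
    if isinstance(parsed, (list, tuple)):
        return [str(option) for option in parsed]
    return [str(parsed)]


def build_prompt(row):
    """Format the question and lettered options as text; the image is passed separately."""
    options = parse_choices(row["choices"])
    lines = [f"Question: {row['question']}"]
    for index, option in enumerate(options):
        lines.append(f"{chr(ord('A') + index)}. {option}")
    lines.append("Answer:")
    return "\n".join(lines)


# Made-up row for illustration only; it is not taken from the dataset.
example = {
    "question": "What is shown in the picture?",
    "choices": "['a bridge', 'a stadium', 'a museum']",
    "answer": "a museum",
}
print(build_prompt(example))
```

If the real `choices` strings use another encoding (for example JSON or a delimiter-separated list), only `parse_choices` would need to change.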
### Data Splits
The dataset is split into 80% train (3,960 examples) and 20% test (990 examples).
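The 80/20 ratio can be checked against the example counts given in the metadata (3,960 train and 990 test):

```python
# Example counts from the dataset_info block of this card.
num_examples = {"train": 3960, "test": 990}

total = sum(num_examples.values())
fractions = {name: count / total for name, count in num_examples.items()}

print(total)      # 4950
print(fractions)  # {'train': 0.8, 'test': 0.2}
```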
## Considerations for Using the Data
The dataset has some flaws related to the descriptions. Some descriptions are too specific for a captioning task, while others are too generic. Football (soccer) match data is over-represented, so the dataset is not well balanced. Be aware that some answers are repeated because of the poor quality of the descriptions.
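One way to gauge the answer repetition described here is to count how often each answer string occurs. The sketch below works on any iterable of row dictionaries; the `answer_repetition` helper and the sample rows are illustrative assumptions, not dataset contents.

```python
from collections import Counter


def answer_repetition(rows, top_k=5):
    """Return (share of rows whose answer occurs more than once, top_k most common answers)."""
    # Normalise lightly so case and whitespace differences are not counted as distinct.
    counts = Counter(row["answer"].strip().lower() for row in rows)
    total = sum(counts.values())
    repeated = sum(count for count in counts.values() if count > 1)
    share = repeated / total if total else 0.0
    return share, counts.most_common(top_k)


# Made-up rows for illustration only.
sample_rows = [
    {"answer": "Football"},
    {"answer": "football "},
    {"answer": "a museum"},
    {"answer": "a bridge"},
]
share, most_common = answer_repetition(sample_rows)
print(share)           # 0.5
print(most_common[0])  # ('football', 2)
```

The same counting approach could be applied to `question` or `caption` to look for topic imbalance, although that would need some keyword or category heuristic that the card does not define.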
## Additional Information
### Dataset Curators
The curators of this dataset were the students of the master's degree in Computation and Intelligent Systems at the University of Deusto.
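Provider:
TheMrguiller
Summary of original information
Dataset overview
Dataset name
- Name: BilbaoQA
Dataset description
- Purpose: Visual question answering, with support for multiple-choice questions.
- Source: Data collected from Deia and Getimages, web pages that are well known in the Basque Country.
- Language: English
Dataset structure
- Features:
  - caption: string
  - image: image
  - question: string
  - choices: string
  - answer: string
  - solution: string
  - CTH: bool
- Splits:
  - train: 3,960 examples, 1,368,875,715 bytes in total
  - test: 990 examples, 346,986,615 bytes in total
Usage considerations
- Some descriptions are too specific or too generic, some answers are repeated, and the data is unbalanced (for example, football match data is over-represented).
Dataset size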
- Size range: 1K<n<10K
Task categories
- Question answering
- Visual question answering



