TheMrguiller/BilbaoQA

Name: TheMrguiller/BilbaoQA
Creator: TheMrguiller
Published: 2023-08-24 11:48:31
License: No description available

Hugging Face · Updated 2023-08-24 · Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/TheMrguiller/BilbaoQA


Resource description:

---
dataset_info:
  features:
  - name: caption
    dtype: string
  - name: image
    dtype: image
  - name: question
    dtype: string
  - name: choices
    dtype: string
  - name: answer
    dtype: string
  - name: solution
    dtype: string
  - name: CTH
    dtype: bool
  splits:
  - name: train
    num_bytes: 1368875715
    num_examples: 3960
  - name: test
    num_bytes: 346986615
    num_examples: 990
  download_size: 1709263149
  dataset_size: 1715862330
task_categories:
- question-answering
- visual-question-answering
language:
- en
tags:
- code
size_categories:
- 100B<n<1T
---

# Dataset Card for "BilbaoQA"

## Dataset Description

- **Homepage:** https://github.com/TheMrguiller/MUCSI_Modal
- **Repository:** https://github.com/TheMrguiller/MUCSI_Modal
- **Paper:** A follow-up to the Flamingo model paper
- **Leaderboard:**
- **Point of Contact:** https://github.com/TheMrguiller/MUCSI_Modal

### Dataset Summary

This dataset was collected for a project in the master's degree in Computation and Intelligent Systems at the University of Deusto. It was built by students from web pages well known in the Basque Country: Deia and Getimages. The questions and answers were created with a set of models that can generate this information from a text description.

### Supported Tasks and Leaderboards

The dataset is prepared for visual question answering.

### Languages

The dataset is in English.

## Dataset Structure

### Data Fields

- `image`: The image, which is the context given to the model.
- `question`: The question the model has to answer from the image context.
- `choices`: The multiple-choice options.
- `answer`: The answer from the multiple-choice options.
- `solution`: The chain-of-thought reasoning behind the selected answer.
- `CTH`: A flag that indicates whether the row has no chain of thought.

### Data Splits

The dataset is split into 80% train and 20% test.

## Considerations for Using the Data

The dataset has some flaws in its descriptions. Some descriptions are too specific for a captioning task, and others are too generic. There is also too much football match data, so the dataset is not well balanced. Be aware that some answers are repeated because of the poor quality of the descriptions.

## Additional Information

### Dataset Curators

The curators of this dataset were the students of the master's degree in Computation and Intelligent Systems at the University of Deusto.
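As a rough illustration of how the fields described in the card might be used, the sketch below turns one row into a plain-text multiple-choice prompt. The helper names `format_prompt` and `load_bilbaoqa` and the sample row are hypothetical, not part of the dataset or its repository. The card does not document the format of `choices`, so the sketch treats it as an opaque string. Loading assumes the Hugging Face `datasets` library is installed and the repository is reachable.

```python
def format_prompt(row):
    """Build a plain-text multiple-choice prompt from one row (hypothetical helper)."""
    return (
        f"Question: {row['question']}\n"
        f"Choices: {row['choices']}\n"
        "Answer:"
    )


def load_bilbaoqa(split="train"):
    """Load one split from the Hub (hypothetical wrapper around load_dataset)."""
    # Imported lazily so format_prompt works without the library installed.
    from datasets import load_dataset

    return load_dataset("TheMrguiller/BilbaoQA", split=split)


# Made-up row for illustration only; real rows also carry `image`, `caption`,
# `solution`, and `CTH`, and their `choices` format may differ.
sample_row = {
    "question": "What is shown in the image?",
    "choices": "A) a bridge B) a stadium",
    "answer": "A",
}

prompt = format_prompt(sample_row)
print(prompt)
```

With network access, a real row could be passed instead, for example `format_prompt(load_bilbaoqa("test")[0])`; the image itself would still need to be supplied to the model separately.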

Provider:

TheMrguiller

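Summary of original information

Dataset overview

Dataset name

Name: BilbaoQA

Dataset description

Purpose: Visual question answering, with multiple-choice questions.
Source: Collected from Deia and Getimages, well-known websites in the Basque Country.
Language: English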

Dataset structure

Features:
- caption: string
- image: image
- question: string
- choices: string
- answer: string
- solution: string
- CTH: boolean
Data splits:
- train: 3960 examples, 1368875715 bytes in total
- test: 990 examples, 346986615 bytes in total
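The split figures can be checked against the 80%/20% split stated in the dataset card with a few lines of arithmetic. This is only a sketch: the numbers are copied from the listing, while the names `SPLITS`, `split_fractions`, and `total_bytes` are illustrative.

```python
# Figures copied from the listing; the names below are illustrative.
SPLITS = {
    "train": {"num_examples": 3960, "num_bytes": 1368875715},
    "test": {"num_examples": 990, "num_bytes": 346986615},
}


def split_fractions(splits):
    """Return each split's share of the total number of examples."""
    total = sum(s["num_examples"] for s in splits.values())
    return {name: s["num_examples"] / total for name, s in splits.items()}


def total_bytes(splits):
    """Return the combined size of all splits in bytes."""
    return sum(s["num_bytes"] for s in splits.values())


fractions = split_fractions(SPLITS)
for name, share in fractions.items():
    print(f"{name}: {share:.0%}")
print(f"total: {total_bytes(SPLITS)} bytes")
```

The shares come out to 80% and 20% of 4950 examples, and the byte counts sum to the `dataset_size` value of 1715862330 reported in the metadata.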

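Usage considerations

The descriptions are sometimes too specific or too generic, some answers are repeated, and the data is unbalanced (for example, there is too much football match data).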

Dataset size

Size range: 100B<n<1T

Task categories

Question answering
Visual question answering
