dinhanhx/VQAv2-vi

Name: dinhanhx/VQAv2-vi
Creator: dinhanhx
Published: 2023-09-21 10:25:06
License: No description available

Hugging Face | Updated 2023-09-21 | Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/dinhanhx/VQAv2-vi

Resource description:

---
language:
- en
- vi
pretty_name: VQAv2 in Vietnamese
source-datasets:
- VQAv2
tags:
- VQAv2-vi
- VQA
license: unknown
task_categories:
- visual-question-answering
task_ids:
- visual-question-answering
---

# VQAv2 in Vietnamese

This is a Google-translated version of [VQAv2](https://visualqa.org/) in Vietnamese. The Vietnamese version was built as follows:

- In the `en/` folder:
  - Download `v2_OpenEnded_mscoco_train2014_questions.json` and `v2_mscoco_train2014_annotations.json` from [VQAv2](https://visualqa.org/).
  - Remove the `answers` key from each entry under the `annotations` key in `v2_mscoco_train2014_annotations.json`. I use only the `multiple_choice_answer` key of each annotation. Call the new file `v2_OpenEnded_mscoco_train2014_answers.json`.
  - Using the [set data structure](https://docs.python.org/3/tutorial/datastructures.html#sets), I generate `question_list.txt` and `answer_list.txt` of unique text. There are 152050 unique questions and 22531 unique answers from 443757 image-question-answer triplets.
- In the `vi/` folder:
  - By translating the two `.txt` files in `en/`, I generate `answer_list.jsonl` and `question_list.jsonl`. In each entry of each file, the key is the original English text and the value is the translated Vietnamese text.

To load the Vietnamese version in your code, you need the original English version. Then use the English text as the key to retrieve the Vietnamese value from `answer_list.jsonl` and `question_list.jsonl`. I provide both the English and Vietnamese versions. Please refer to [this code](https://github.com/dinhanhx/velvet/blob/main/scripts/apply_translate_vqav2.py) to apply the translation.

Provided by:

dinhanhx

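Summary of original information

Dataset overview

Dataset name

VQAv2 in Vietnamese

Languages

English (en)
Vietnamese (vi)

Source dataset

VQAv2

License

Unknown

Task categories

Visual question answering (visual-question-answering)

Task IDs

visual-question-answering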

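Dataset construction process

English part (en/ folder)
- Download v2_OpenEnded_mscoco_train2014_questions.json and v2_mscoco_train2014_annotations.json.
- Remove the answers key from each entry under the annotations key in v2_mscoco_train2014_annotations.json. Only multiple_choice_answer under the annotations key is used. The new file is named v2_OpenEnded_mscoco_train2014_answers.json.
- Use the set data structure to generate question_list.txt and answer_list.txt, which contain 152050 unique questions and 22531 unique answers drawn from 443757 image-question-answer triplets.
Vietnamese part (vi/ folder)
- Translate the two .txt files in en/ to generate answer_list.jsonl and question_list.jsonl. In each entry of each file, the key is the original English text and the value is the translated Vietnamese text.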
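
The English-side steps above can be sketched roughly as follows. This is a minimal illustration, not the author's script: it assumes the two downloaded files follow the usual VQAv2 layout (a top-level `questions` list with a `question` field, and a top-level `annotations` list whose entries carry `multiple_choice_answer` and `answers`). The helper names `strip_answers` and `unique_texts` are hypothetical, the sample records are made up, and the sorting is added here only to make the output deterministic; the source does not say how the lists were ordered.

```python
def strip_answers(annotations_doc):
    """Return a copy of the annotations document without each entry's `answers` list."""
    slim = dict(annotations_doc)
    slim["annotations"] = [
        {key: value for key, value in ann.items() if key != "answers"}
        for ann in annotations_doc["annotations"]
    ]
    return slim


def unique_texts(records, key):
    """Collect the distinct values of `key` with a set (sorted only for stable output)."""
    return sorted({record[key] for record in records})


# Tiny in-memory stand-ins for the two downloaded JSON files (illustrative data).
questions_doc = {"questions": [
    {"question_id": 1, "image_id": 10, "question": "What color is the cat?"},
    {"question_id": 2, "image_id": 11, "question": "What color is the cat?"},
    {"question_id": 3, "image_id": 11, "question": "Is it raining?"},
]}
annotations_doc = {"annotations": [
    {"question_id": 1, "image_id": 10, "multiple_choice_answer": "black",
     "answers": [{"answer": "black"}]},
    {"question_id": 2, "image_id": 11, "multiple_choice_answer": "black",
     "answers": [{"answer": "black"}]},
    {"question_id": 3, "image_id": 11, "multiple_choice_answer": "no",
     "answers": [{"answer": "no"}]},
]}

answers_doc = strip_answers(annotations_doc)
question_list = unique_texts(questions_doc["questions"], "question")
answer_list = unique_texts(answers_doc["annotations"], "multiple_choice_answer")
```

With the real files, the two documents would be read with `json.load` and each list written out one text per line.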

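Usage guide

Loading the Vietnamese version in code requires the original English version. Use the English text as the key to retrieve the Vietnamese value from answer_list.jsonl and question_list.jsonl.
To apply the translation, refer to this script: https://github.com/dinhanhx/velvet/blob/main/scripts/apply_translate_vqav2.py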
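
The lookup described in this usage guide can be sketched as below. It is a hedged illustration rather than the referenced script: it assumes every line of the `.jsonl` files is a JSON object mapping English text to Vietnamese text, the helper names `load_translation_table` and `translate` are hypothetical, the Vietnamese strings are illustrative samples, and falling back to the English text for a missing key is a choice made here, not something the source specifies.

```python
import io
import json


def load_translation_table(lines):
    """Merge JSONL lines into one dict mapping English text to Vietnamese text."""
    table = {}
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            table.update(json.loads(line))
    return table


def translate(text, table):
    """Return the Vietnamese text for `text`, or `text` itself if no entry exists."""
    return table.get(text, text)


# In-memory stand-ins for vi/question_list.jsonl and vi/answer_list.jsonl.
question_jsonl = io.StringIO(
    '{"Is it raining?": "Trời có mưa không?"}\n'
    '{"What color is the cat?": "Con mèo màu gì?"}\n'
)
answer_jsonl = io.StringIO('{"black": "đen"}\n{"no": "không"}\n')

question_table = load_translation_table(question_jsonl)
answer_table = load_translation_table(answer_jsonl)

# One English record, translated by using its text as the lookup key.
sample = {"question": "What color is the cat?", "multiple_choice_answer": "black"}
sample_vi = {
    "question": translate(sample["question"], question_table),
    "multiple_choice_answer": translate(sample["multiple_choice_answer"], answer_table),
}
```

With the real files, the `io.StringIO` objects would be replaced by files opened with `open(path, encoding="utf-8")`, since iterating over a file also yields one line at a time.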
