floschne/multilingual-llava-bench-in-the-wild

Name: floschne/multilingual-llava-bench-in-the-wild
Creator: floschne
Published: 2024-05-16 13:33:36
License: cc-by-4.0

Hugging Face | Updated 2024-05-16 | Indexed 2024-06-12

Download link:

https://hf-mirror.com/datasets/floschne/multilingual-llava-bench-in-the-wild

Resource description:

---
language:
- ar
- bn
- zh
- en
- fr
- ru
- es
- ur
- hi
- ja
license: cc-by-4.0
size_categories:
- n<1K
pretty_name: Multilingual LLaVA Bench in the Wild
dataset_info:
  features:
  - name: image_id
    dtype: string
  - name: image
    struct:
    - name: bytes
      dtype: binary
    - name: path
      dtype: 'null'
  - name: image_caption
    dtype: string
  - name: question_id
    dtype: int64
  - name: question
    dtype: string
  - name: question_category
    dtype: string
  - name: gpt4_answer
    dtype: string
  - name: gpt4_model_id
    dtype: string
  splits:
  - name: english
    num_bytes: 131853762
    num_examples: 60
  - name: russian
    num_bytes: 131895540
    num_examples: 60
  - name: hindi
    num_bytes: 131932797
    num_examples: 60
  - name: bengali
    num_bytes: 131926779
    num_examples: 60
  - name: chinese
    num_bytes: 131847250
    num_examples: 60
  - name: spanish
    num_bytes: 131858886
    num_examples: 60
  - name: japanese
    num_bytes: 131867258
    num_examples: 60
  - name: arabic
    num_bytes: 131880090
    num_examples: 60
  - name: french
    num_bytes: 131860194
    num_examples: 60
  - name: urdu
    num_bytes: 131888639
    num_examples: 60
  download_size: 515733256
  dataset_size: 1318811195
configs:
- config_name: default
  data_files:
  - split: english
    path: data/english-*
  - split: russian
    path: data/russian-*
  - split: hindi
    path: data/hindi-*
  - split: bengali
    path: data/bengali-*
  - split: chinese
    path: data/chinese-*
  - split: spanish
    path: data/spanish-*
  - split: japanese
    path: data/japanese-*
  - split: arabic
    path: data/arabic-*
  - split: french
    path: data/french-*
  - split: urdu
    path: data/urdu-*
---

# Multilingual LLaVA Bench in the Wild

### Note that this is a copy from https://huggingface.co/datasets/MBZUAI/multilingual-llava-bench-in-the-wild

It was created due to issues in the original repo. It also includes the image features and has a uniform and joined structure.
If you use this dataset, please cite the original authors:

```bibtex
@article{PALO2024,
  title={Palo: A Large Multilingual Multimodal Language Model},
  author={Maaz, Muhammad and Rasheed, Hanoona and Shaker, Abdelrahman and Khan, Salman and Cholakal, Hisham and Anwer, Rao M. and Baldwin, Tim and Felsberg, Michael and Khan, Fahad S.},
  journal={arXiv 2402.14818},
  year={2024},
  url={https://arxiv.org/abs/2402.14818}
}
```

### How to load the image features

Due to a [bug](https://github.com/huggingface/datasets/issues/4796), the images could not be stored as `PIL.Image.Image` objects directly and had to be converted to `datasets.Image` features. Hence, this additional step is required to load them:

```python
from datasets import Image, load_dataset

ds = load_dataset("floschne/multilingual-llava-bench-in-the-wild", split="english")

# Decode the stored {"bytes", "path"} struct into an image, then replace the raw column with it.
ds = ds.map(
    lambda sample: {"image_t": Image().decode_example(sample["image"])},
    remove_columns=["image"],
).rename_column("image_t", "image")
```
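The same decoding step can be applied to every language split, not only `english`. The following is a minimal sketch under that assumption: the helper names `load_decoded_split` and `load_all_decoded` are hypothetical and not part of the dataset card, while the split names and repository ID come from the card's metadata. Running it needs network access, the `datasets` library, and Pillow.

```python
# Split names as listed in the dataset card's metadata.
SPLITS = [
    "english", "russian", "hindi", "bengali", "chinese",
    "spanish", "japanese", "arabic", "french", "urdu",
]
REPO_ID = "floschne/multilingual-llava-bench-in-the-wild"


def load_decoded_split(split):
    """Load one language split and decode its image column (hypothetical helper)."""
    # Imported lazily so the constants above stay usable without `datasets` installed.
    from datasets import Image, load_dataset

    ds = load_dataset(REPO_ID, split=split)
    return ds.map(
        lambda sample: {"image_t": Image().decode_example(sample["image"])},
        remove_columns=["image"],
    ).rename_column("image_t", "image")


def load_all_decoded():
    """Return a dict mapping each split name to its decoded dataset."""
    return {split: load_decoded_split(split) for split in SPLITS}
```

Per the card's metadata, each split repeats the image bytes, so loading all ten splits means handling roughly ten times the data of a single split.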

Provider:

floschne

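Summary of original information

Dataset overview

Basic information

Name: Multilingual LLaVA Bench in the Wild
Languages: Arabic (ar), Bengali (bn), Chinese (zh), English (en), French (fr), Russian (ru), Spanish (es), Urdu (ur), Hindi (hi), Japanese (ja)
License: cc-by-4.0
Size category: n<1K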
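The metadata lists languages by two-letter code, while the splits are named by the English language name. A small lookup table bridges the two; this is a sketch, and the names `LANGUAGE_TO_SPLIT` and `split_for` are illustrative rather than anything the dataset provides.

```python
# Two-letter language codes from the metadata, mapped to the split names.
LANGUAGE_TO_SPLIT = {
    "ar": "arabic",
    "bn": "bengali",
    "zh": "chinese",
    "en": "english",
    "fr": "french",
    "ru": "russian",
    "es": "spanish",
    "ur": "urdu",
    "hi": "hindi",
    "ja": "japanese",
}


def split_for(code):
    """Return the split name for a language code; raises KeyError if unknown."""
    return LANGUAGE_TO_SPLIT[code.lower()]
```

For example, `split_for("zh")` yields the string to pass as the `split` argument when loading the Chinese portion.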

Dataset features

image_id: string
image: struct containing bytes (binary) and path (null)
image_caption: string
question_id: integer (int64)
question: string
question_category: string
gpt4_answer: string
gpt4_model_id: string
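The feature list above can be written down as a typed record, which is handy for checking rows before they reach evaluation code. This is a sketch only: `ImageStruct`, `BenchRow`, `EXPECTED_TYPES`, and `check_row` are hypothetical names, and the check is applied to the raw (undecoded) form of a row as the feature list describes it.

```python
from typing import Optional, TypedDict


class ImageStruct(TypedDict):
    bytes: bytes          # raw encoded image data
    path: Optional[str]   # declared with a null dtype, so expected to be None


class BenchRow(TypedDict):
    image_id: str
    image: ImageStruct
    image_caption: str
    question_id: int
    question: str
    question_category: str
    gpt4_answer: str
    gpt4_model_id: str


# Runtime types to check for each top-level field of a raw row.
EXPECTED_TYPES = {
    "image_id": str,
    "image": dict,
    "image_caption": str,
    "question_id": int,
    "question": str,
    "question_category": str,
    "gpt4_answer": str,
    "gpt4_model_id": str,
}


def check_row(row):
    """Return the names of fields that are missing or have an unexpected type."""
    return [
        name
        for name, expected in EXPECTED_TYPES.items()
        if not isinstance(row.get(name), expected)
    ]
```

An empty list from `check_row` means every listed field is present with the expected type.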

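Dataset splits

english: 60 examples, 131853762 bytes in total
russian: 60 examples, 131895540 bytes in total
hindi: 60 examples, 131932797 bytes in total
bengali: 60 examples, 131926779 bytes in total
chinese: 60 examples, 131847250 bytes in total
spanish: 60 examples, 131858886 bytes in total
japanese: 60 examples, 131867258 bytes in total
arabic: 60 examples, 131880090 bytes in total
french: 60 examples, 131860194 bytes in total
urdu: 60 examples, 131888639 bytes in total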

Dataset size

Download size: 515733256 bytes
Dataset size: 1318811195 bytes
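The reported figures can be cross-checked with a little arithmetic: the per-split byte counts should add up to the dataset size, and ten splits of 60 examples give the total example count. The sketch below uses only numbers stated in the metadata; the variable names are illustrative.

```python
# Per-split byte counts as reported in the metadata.
SPLIT_BYTES = {
    "english": 131853762,
    "russian": 131895540,
    "hindi": 131932797,
    "bengali": 131926779,
    "chinese": 131847250,
    "spanish": 131858886,
    "japanese": 131867258,
    "arabic": 131880090,
    "french": 131860194,
    "urdu": 131888639,
}
EXAMPLES_PER_SPLIT = 60
DATASET_SIZE = 1318811195
DOWNLOAD_SIZE = 515733256

total_bytes = sum(SPLIT_BYTES.values())
total_examples = EXAMPLES_PER_SPLIT * len(SPLIT_BYTES)

# The split sizes account for the full reported dataset size.
assert total_bytes == DATASET_SIZE

print(total_examples)                          # 600
print(round(DOWNLOAD_SIZE / DATASET_SIZE, 2))  # 0.39
```

The download is about 39% of the in-memory size, which fits the fact that every split carries the same images and differs only in its text fields.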

Configuration

config_name: default
data_files:
- split: the data split for each language
- path: path pattern for that language's data, e.g. data/english-*
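Every split in the default config follows the same `data/<split>-*` path pattern, so the mapping can be generated rather than typed out. The sketch below builds it and matches a file path back to its split with the standard-library `fnmatch`; this only approximates how the Hub resolves patterns, and the file name in the usage note is a made-up example, not a file confirmed to exist in the repository.

```python
from fnmatch import fnmatch

SPLITS = [
    "english", "russian", "hindi", "bengali", "chinese",
    "spanish", "japanese", "arabic", "french", "urdu",
]

# One glob pattern per split, mirroring the config's data_files entries.
DATA_FILES = {split: f"data/{split}-*" for split in SPLITS}


def split_of(path):
    """Return the split whose pattern matches `path`, or None if none does."""
    for split, pattern in DATA_FILES.items():
        if fnmatch(path, pattern):
            return split
    return None
```

With a hypothetical shard name, `split_of("data/urdu-00000-of-00001.parquet")` returns `"urdu"`, and a path outside the ten patterns returns `None`.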
