mair-lab/lave-human-feedback

Name: mair-lab/lave-human-feedback
Creator: mair-lab
Published: 2024-04-16 04:25:58
License: cc-by-4.0

Hugging Face; updated 2024-04-16; indexed 2024-06-11

Download link:

https://hf-mirror.com/datasets/mair-lab/lave-human-feedback

Description:

---
dataset_info:
  features:
  - name: dataset
    dtype: string
  - name: model
    dtype: string
  - name: qid
    dtype: int64
  - name: question
    dtype: string
  - name: references
    sequence: string
  - name: prediction
    dtype: string
  - name: human_score
    dtype: float64
  splits:
  - name: dev
    num_bytes: 930874
    num_examples: 7000
  - name: test
    num_bytes: 3774759
    num_examples: 22050
  download_size: 1623960
  dataset_size: 4705633
configs:
- config_name: default
  data_files:
  - split: dev
    path: data/dev-*
  - split: test
    path: data/test-*
license: cc-by-4.0
tags:
- human-feedback
---

# LAVE human judgments

This repository contains the human judgment data for [Improving Automatic VQA Evaluation Using Large Language Models](https://arxiv.org/abs/2310.02567). The data collection process and the crowdworker population are described in our paper, specifically in Section 5.2 and Appendix A.1.

Fields:

* **dataset:** VQA dataset of origin for this example (`vqav2`, `vgqa`, `okvqa`).
* **model:** VQA model that generated the predicted answer (`blip2`, `promptcap`, `blip_vqa`, `blip_vg`).
* **qid:** question ID from the original dataset.
* **question:** question copied from the original dataset for convenience.
* **references:** reference answers copied from the original dataset for convenience.
* **prediction:** candidate answer generated by the VQA model.
* **human_score:** human judgment score, where `0` means an incorrect answer, `0.5` an ambiguous or incomplete answer, and `1` a correct answer.

## Usage

```python
from datasets import load_dataset

# Load the dev split
dataset = load_dataset("mair-lab/lave-human-feedback", split="dev")

# Keep only VQAv2 examples whose answers were generated by BLIP-2
dataset = dataset.filter(
    lambda example: example["dataset"] == "vqav2" and example["model"] == "blip2"
)
```
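Because `human_score` uses a three-level scale, a natural first look at the data is the average score and the label distribution per model. The sketch below is illustrative and not part of the original card: it runs on a few hypothetical rows with made-up scores (real rows would come from `load_dataset` as shown above and carry all seven fields), and `SCORE_LABELS`, `mean_score_by` and `label_counts` are hypothetical helper names.

```python
from collections import Counter, defaultdict
from statistics import mean

# Labels for the three documented human_score values.
SCORE_LABELS = {0.0: "incorrect", 0.5: "ambiguous/incomplete", 1.0: "correct"}

# Hypothetical rows with made-up scores, trimmed to the fields used here.
rows = [
    {"dataset": "vqav2", "model": "blip2", "human_score": 1.0},
    {"dataset": "vqav2", "model": "blip2", "human_score": 0.5},
    {"dataset": "okvqa", "model": "promptcap", "human_score": 0.0},
    {"dataset": "okvqa", "model": "promptcap", "human_score": 1.0},
]


def mean_score_by(rows, key):
    """Average human_score for each distinct value of `key` (e.g. "model")."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row["human_score"])
    return {value: mean(scores) for value, scores in groups.items()}


def label_counts(rows):
    """Count how many rows fall under each human_score label."""
    return Counter(SCORE_LABELS[row["human_score"]] for row in rows)


print(mean_score_by(rows, "model"))  # {'blip2': 0.75, 'promptcap': 0.5}
print(label_counts(rows))
```

With a real split, the same helpers should work on `list(dataset)`, since iterating a `datasets.Dataset` yields one dict per example; for larger analyses, `dataset.to_pandas()` and a `groupby` are a reasonable alternative.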

Provided by:

mair-lab

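Summary of original information

Dataset overview

Features

dataset: string; the source VQA dataset (vqav2, vgqa, okvqa).
model: string; the VQA model that generated the predicted answer (blip2, promptcap, blip_vqa, blip_vg).
qid: integer; the question ID in the original dataset.
question: string; the question, copied from the original dataset.
references: sequence of strings; the reference answers, copied from the original dataset.
prediction: string; the candidate answer generated by the VQA model.
human_score: float; the human judgment score, where 0 means the answer is incorrect, 0.5 that it is ambiguous or incomplete, and 1 that it is correct.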
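The feature list above can double as a lightweight sanity check when reading records. The sketch below is a hypothetical helper, not something shipped with the dataset: it assumes rows arrive as Python dicts (as they do when iterating a `datasets.Dataset`), with `int64` mapped to `int`, `float64` to `float`, and the string sequence to a `list`, and it treats any `human_score` outside the three documented values as unexpected.

```python
# Expected Python types per field (assumed mapping from the declared dtypes).
EXPECTED_TYPES = {
    "dataset": str,
    "model": str,
    "qid": int,
    "question": str,
    "references": list,
    "prediction": str,
    "human_score": float,
}
# The three documented human_score values.
VALID_SCORES = {0.0, 0.5, 1.0}


def validate_record(record):
    """Return a list of problems found in one record (empty if it looks valid)."""
    problems = []
    for field, expected in EXPECTED_TYPES.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            problems.append(
                f"{field}: expected {expected.__name__}, got {type(record[field]).__name__}"
            )
    refs = record.get("references")
    if isinstance(refs, list) and not all(isinstance(r, str) for r in refs):
        problems.append("references: all items must be str")
    score = record.get("human_score")
    if isinstance(score, float) and score not in VALID_SCORES:
        problems.append(f"human_score: unexpected value {score}")
    return problems
```

Returning a list of problems rather than raising on the first one makes it easy to tally issues across a whole split.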

Splits

dev: 7000 examples, 930874 bytes in total.
test: 22050 examples, 3774759 bytes in total.
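The split sizes above are easier to compare as averages and proportions. This is a short worked calculation in plain Python whose only inputs are the numbers listed above: the two splits hold 29050 examples in total, roughly a quarter of them in dev, and a test example takes somewhat more space on average than a dev example. `split_summary` is a hypothetical helper name.

```python
# Split statistics as listed above.
SPLITS = {
    "dev": {"num_bytes": 930874, "num_examples": 7000},
    "test": {"num_bytes": 3774759, "num_examples": 22050},
}


def split_summary(splits):
    """Per-split average bytes per example and share of all examples."""
    total = sum(s["num_examples"] for s in splits.values())
    return {
        name: {
            "avg_bytes": s["num_bytes"] / s["num_examples"],
            "share": s["num_examples"] / total,
        }
        for name, s in splits.items()
    }


for name, stats in split_summary(SPLITS).items():
    print(f"{name}: {stats['avg_bytes']:.1f} bytes/example, {stats['share']:.1%} of examples")
# dev: 133.0 bytes/example, 24.1% of examples
# test: 171.2 bytes/example, 75.9% of examples
```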

Dataset size

Download size: 1623960 bytes.
Total dataset size: 4705633 bytes.
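The two size figures are consistent with the split sizes: the total dataset size equals the sum of the dev and test sizes, and it is about 2.9 times the download size. The check below only reproduces that arithmetic; attributing the gap to the downloaded files being stored more compactly than the prepared dataset is an assumption, since the card does not state the file format.

```python
DEV_BYTES = 930874
TEST_BYTES = 3774759
DOWNLOAD_SIZE = 1623960
DATASET_SIZE = 4705633

# The total dataset size is the sum of the per-split sizes.
assert DEV_BYTES + TEST_BYTES == DATASET_SIZE

# How much larger the prepared dataset is than the downloaded files.
ratio = DATASET_SIZE / DOWNLOAD_SIZE
print(f"dataset_size / download_size = {ratio:.2f}")  # 2.90
```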

Data file configuration

config_name: default
data_files:
- split: dev, path: data/dev-*
- split: test, path: data/test-*
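Each split in the default config is bound to a glob pattern, so any file under `data/` whose name starts with `dev-` or `test-` belongs to that split. The sketch below illustrates the matching with Python's standard `fnmatch` module as a stand-in. The shard file names are hypothetical, and the `datasets` library resolves patterns with its own logic, which may differ in edge cases (for instance, `fnmatch`'s `*` also matches `/`).

```python
from fnmatch import fnmatch

# Glob patterns from the default config above.
DATA_FILES = {"dev": "data/dev-*", "test": "data/test-*"}


def split_for(path):
    """Return the split whose pattern matches `path`, or None if none does."""
    for split, pattern in DATA_FILES.items():
        if fnmatch(path, pattern):
            return split
    return None


# Hypothetical file names; the real shard names are not listed in the card.
for path in ["data/dev-00000-of-00001.parquet", "data/test-00000-of-00001.parquet", "README.md"]:
    print(path, "->", split_for(path))
```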

License

cc-by-4.0