mair-lab/lave-human-feedback
Source: Hugging Face. Updated 2024-04-16; indexed 2024-06-11.
Download link:
https://hf-mirror.com/datasets/mair-lab/lave-human-feedback
Resource description:
---
dataset_info:
  features:
  - name: dataset
    dtype: string
  - name: model
    dtype: string
  - name: qid
    dtype: int64
  - name: question
    dtype: string
  - name: references
    sequence: string
  - name: prediction
    dtype: string
  - name: human_score
    dtype: float64
  splits:
  - name: dev
    num_bytes: 930874
    num_examples: 7000
  - name: test
    num_bytes: 3774759
    num_examples: 22050
  download_size: 1623960
  dataset_size: 4705633
configs:
- config_name: default
  data_files:
  - split: dev
    path: data/dev-*
  - split: test
    path: data/test-*
license: cc-by-4.0
tags:
- human-feedback
---
# LAVE human judgments
This repository contains the human judgment data for [Improving Automatic VQA Evaluation Using Large Language Models](https://arxiv.org/abs/2310.02567). Details about the data collection process and crowdworker population can be found in our paper, specifically in section 5.2 and appendix A.1.
Fields:
* **dataset:** VQA dataset of origin for this example (`vqav2`, `vgqa`, `okvqa`).
* **model:** VQA model that generated the predicted answer (`blip2`, `promptcap`, `blip_vqa`, `blip_vg`).
* **qid:** question ID coming from the original dataset.
* **question:** question copied from the original dataset for convenience.
* **references:** reference answers copied from the original dataset for convenience.
* **prediction:** candidate answer generated by the VQA model.
* **human_score:** human judgment score, with `0` meaning incorrect answer, `0.5` ambiguous or incomplete answer, and `1` correct answer.
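As an illustration of how these fields fit together, the sketch below averages `human_score` per (`dataset`, `model`) pair. The helper name `mean_score_by_group` and the sample records are hypothetical and only mimic the schema above; they are not rows from the actual data.

```python
from collections import defaultdict


def mean_score_by_group(rows):
    """Average human_score for each (dataset, model) pair."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        key = (row["dataset"], row["model"])
        totals[key] += row["human_score"]
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}


# Hypothetical records that follow the schema; not taken from the dataset.
sample_rows = [
    {"dataset": "vqav2", "model": "blip2", "qid": 1,
     "question": "What color is the bus?", "references": ["red", "dark red"],
     "prediction": "red", "human_score": 1.0},
    {"dataset": "vqav2", "model": "blip2", "qid": 2,
     "question": "What is the man holding?", "references": ["umbrella"],
     "prediction": "something", "human_score": 0.5},
    {"dataset": "okvqa", "model": "promptcap", "qid": 3,
     "question": "What animal is this?", "references": ["dog"],
     "prediction": "cat", "human_score": 0.0},
]

print(mean_score_by_group(sample_rows))
```

Because the function only needs an iterable of dictionaries with these keys, a split loaded with the `datasets` library should work as input too, since iterating over it yields one dictionary per example.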
## Usage
```python
from datasets import load_dataset
# Load the dev split
dataset = load_dataset("mair-lab/lave-human-feedback", split="dev")
# Filter examples by dataset and model
dataset = dataset.filter(lambda example: example["dataset"] == "vqav2" and example["model"] == "blip2")
```
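One possible use of the human judgments is to check how far an automatic metric deviates from them. The sketch below uses a naive exact-match score as a stand-in metric; it is not the LAVE metric from the paper, nor the official VQA accuracy. The function names and sample records are assumptions made for illustration.

```python
def exact_match(prediction, references):
    """Return 1.0 if the lowercased, stripped prediction equals any reference, else 0.0."""
    normalized = prediction.strip().lower()
    return 1.0 if any(normalized == ref.strip().lower() for ref in references) else 0.0


def mean_abs_gap(rows):
    """Mean absolute difference between the exact-match score and human_score."""
    gaps = [
        abs(exact_match(row["prediction"], row["references"]) - row["human_score"])
        for row in rows
    ]
    return sum(gaps) / len(gaps)


# Hypothetical records that follow the dataset schema; not real data.
sample_rows = [
    {"prediction": "Red", "references": ["red", "dark red"], "human_score": 1.0},
    {"prediction": "a red bus", "references": ["bus"], "human_score": 0.5},
    {"prediction": "cat", "references": ["dog"], "human_score": 0.0},
]

print(mean_abs_gap(sample_rows))
```

The second record shows the kind of case where exact match and annotators disagree: the prediction does not match the reference string, yet the human score marks it as partially acceptable.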
Provider:
mair-lab
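Summary of original information
Dataset overview
Dataset features
- dataset: string; the VQA dataset the example comes from (`vqav2`, `vgqa`, `okvqa`).
- model: string; the VQA model that generated the predicted answer (`blip2`, `promptcap`, `blip_vqa`, `blip_vg`).
- qid: integer; question ID in the original dataset.
- question: string; question copied from the original dataset.
- references: sequence of strings; reference answers copied from the original dataset.
- prediction: string; candidate answer generated by the VQA model.
- human_score: float; human judgment score, where `0` means an incorrect answer, `0.5` an ambiguous or incomplete answer, and `1` a correct answer.
Dataset splits
- dev: 7000 examples, 930874 bytes in total.
- test: 22050 examples, 3774759 bytes in total.
Dataset size
- Download size: 1623960 bytes.
- Total dataset size: 4705633 bytes.
Data file configuration
- config_name: default
- data_files:
  - split: dev, path: data/dev-*
  - split: test, path: data/test-*
License
- cc-by-4.0
Tags
- human-feedback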



