Visual Contexts Clarify Ambiguous Expressions: A Benchmark Dataset
Authors: Heejeong Nam*, Jinwoo Ahn*
Year: 2024
arXiv link: https://arxiv.org/abs/2411.14137
Dataset Download
Download link: Dataset Link
Instructions:
- Place the image data in the `data/annotated_images/` directory.
- Place the benchmark folders (`vague_benchmark` and `vague_fewshot`) in the appropriate directory.
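The placement steps above can be verified before running inference. The following is a minimal sketch, not part of the repository: `missing_paths`, `repo_root`, and `benchmark_root` are hypothetical names, and because the instructions only say the benchmark folders go in "the appropriate directory", their parent directory is left as a parameter.

```python
from pathlib import Path


def missing_paths(repo_root, benchmark_root):
    """Return the expected directories that do not exist yet.

    repo_root:      checkout of the repository (assumed to contain data/).
    benchmark_root: wherever vague_benchmark and vague_fewshot were placed;
                    the README does not name this location, so it is an input.
    """
    repo_root, benchmark_root = Path(repo_root), Path(benchmark_root)
    expected = [
        repo_root / "data" / "annotated_images",
        benchmark_root / "vague_benchmark",
        benchmark_root / "vague_fewshot",
    ]
    return [path for path in expected if not path.is_dir()]
```

An empty list means all three directories were found; otherwise the returned paths show what is still missing.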
Inference
Environment
Compatible models and the Python, PyTorch, and Transformers versions they require
| Compatible models | Python version | PyTorch version | Transformers version |
|---|---|---|---|
| internvl2, instructblip, llavanextmodels, llava1.5 | 3.10 | 2.2.0 | 4.46.2 |
| sharegpt4v | 3.10 | 2.0.2 | 4.31.0 |
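Since the two rows pin different PyTorch and Transformers versions, it can help to check the active environment before launching a model. The sketch below is an illustration only: the profile labels `default` and `sharegpt4v` and both function names are assumptions made here, and it relies solely on the standard-library `importlib.metadata` to read installed package versions.

```python
from importlib import metadata

# Pinned versions copied from the compatibility table.
# Keys are pip distribution names; the profile labels are chosen for this
# sketch and are not names used by the repository.
REQUIRED = {
    "default": {"torch": "2.2.0", "transformers": "4.46.2"},
    "sharegpt4v": {"torch": "2.0.2", "transformers": "4.31.0"},
}


def version_matches(installed: str, required: str) -> bool:
    """Compare versions while ignoring local build tags such as '+cu121'."""
    return installed.split("+")[0] == required


def check_environment(profile: str, installed=None) -> list[str]:
    """Return human-readable mismatches for a profile (empty list if none).

    `installed` may be a dict of package -> version for testing; when it is
    None, the versions are read from the current environment.
    """
    problems = []
    for package, required in REQUIRED[profile].items():
        if installed is not None:
            found = installed.get(package)
        else:
            try:
                found = metadata.version(package)
            except metadata.PackageNotFoundError:
                found = None
        if found is None:
            problems.append(f"{package} is not installed (need {required})")
        elif not version_matches(found, required):
            problems.append(f"{package} {found} found, need {required}")
    return problems


if __name__ == "__main__":
    for line in check_environment("default") or ["environment matches the table"]:
        print(line)
```

Keeping one virtual environment per table row is one way to avoid switching pinned versions back and forth.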
Prerequisites
When using ShareGPT (`sharegpt4v`), install `internLM_XComposer` under the `src` directory.
Example inference code
```bash
model=instructblip_7b
for format in plain; do
    # Run both answer types (mcq and da) for each prompt format
    python inference.py --answer_type mcq --format $format --model ${model} --device cuda:2
    python inference.py --answer_type da --format $format --model ${model} --device cuda:2
done
```
Dataset Example
JSON format
```json
{
  "image_name": "-Fmvuy-4U_A@9",
  "direct": "Hey, person2, please stop waving your arms over the table.",
  "indirect": "Hey person2, looks like youre conducting an invisible orchestra here, arent you?",
  "solution": "(person2, stop, arms)",
  "meta": {
    "caption": "Two people are seated at a round dining table covered with a white tablecloth. The table is set with plates of fruit and a vase of yellow flowers. Person 2 is waving their arms animatedly while person 1 looks on.",
    "ram_entity": [
      "table",
      "dinning table",
      "plate",
      "flower",
      "food",
      "fruit",
      "platter",
      "round table",
      "tablecloth"
    ],
    "fake_caption": "In the sunlit park, person2 stands with eyes closed, their hands moving gracefully through the air as if drawing the crescendo of a symphony. \"Hey person2, looks like youre conducting an invisible orchestra here, arent you?\" someone muses, noticing the captivating harmony in the gentle sway of the surrounding trees.",
    "img_size": {
      "width": 1280,
      "height": 720
    },
    "person_bbox": [
      [219.88351440429688, 194.55238342285156, 487.8915100097656, 500.213623046875],
      [589.6340942382812, 166.94029235839844, 994.0988159179688, 471.9652099609375]
    ]
  },
  "mcq": {
    "1_correct": "The speaker wants person2 to stop waving their arms around.",
    "2_incorrect_fake_scene": "The speaker wants person2 to capture the trees movement as part of their performance.",
    "3_incorrect_surface_understanding": "The speaker wants person2 to provide some music to accompany their arm movements, as it seems like they are conducting an orchestra.",
    "4_incorrect_entity": "The speaker wants person2 to use the baton to conduct more effectively.",
    "ordering": ["C", "A", "B", "D"]
  }
}
```
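To illustrate how an entry of this shape might be consumed, here is a short sketch that parses a trimmed copy of the example, scales the person boxes by the image size, and separates the correct statement from the distractors using the key names. Two points are assumptions rather than documented behavior: the boxes are read as `[x_min, y_min, x_max, y_max]` in pixels (reading the second box as `[x, y, width, height]` would extend past the 1280-pixel image width), and the meaning of `ordering` is not specified here, so the sketch leaves it untouched. The function names are hypothetical.

```python
import json

# Trimmed copy of the example entry; caption and utterance fields are omitted.
ENTRY_JSON = """
{
  "image_name": "-Fmvuy-4U_A@9",
  "solution": "(person2, stop, arms)",
  "meta": {
    "img_size": {"width": 1280, "height": 720},
    "person_bbox": [
      [219.88351440429688, 194.55238342285156, 487.8915100097656, 500.213623046875],
      [589.6340942382812, 166.94029235839844, 994.0988159179688, 471.9652099609375]
    ]
  },
  "mcq": {
    "1_correct": "The speaker wants person2 to stop waving their arms around.",
    "2_incorrect_fake_scene": "The speaker wants person2 to capture the trees movement as part of their performance.",
    "3_incorrect_surface_understanding": "The speaker wants person2 to provide some music to accompany their arm movements, as it seems like they are conducting an orchestra.",
    "4_incorrect_entity": "The speaker wants person2 to use the baton to conduct more effectively.",
    "ordering": ["C", "A", "B", "D"]
  }
}
"""


def normalized_boxes(entry):
    """Scale each person box to [0, 1].

    Assumes each box is [x_min, y_min, x_max, y_max] in pixels.
    """
    width = entry["meta"]["img_size"]["width"]
    height = entry["meta"]["img_size"]["height"]
    return [
        [x1 / width, y1 / height, x2 / width, y2 / height]
        for x1, y1, x2, y2 in entry["meta"]["person_bbox"]
    ]


def split_choices(entry):
    """Return (correct_text, {distractor_label: text}) based on the key names."""
    correct, distractors = None, {}
    for key, text in entry["mcq"].items():
        if key == "ordering":  # semantics not documented; skipped in this sketch
            continue
        _, label = key.split("_", 1)  # "2_incorrect_entity" -> "incorrect_entity"
        if label == "correct":
            correct = text
        else:
            distractors[label] = text
    return correct, distractors


entry = json.loads(ENTRY_JSON)
boxes = normalized_boxes(entry)
correct, distractors = split_choices(entry)
```

The distractor labels (`incorrect_fake_scene`, `incorrect_surface_understanding`, `incorrect_entity`) come straight from the keys, which makes it easy to tally which kind of wrong answer a model picks.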
Citation
If this dataset is helpful for your research or use, please cite it as follows:
```
@misc{nam2024visualcontextsclarifyambiguous,
  title={Visual Contexts Clarify Ambiguous Expressions: A Benchmark Dataset},
  author={Heejeong Nam and Jinwoo Ahn},
  year={2024},
  eprint={2411.14137},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2411.14137},
}
```




