Sony/SCaR-Eval

Name: Sony/SCaR-Eval
Creator: Sony
Published: 2026-02-19 05:00:03
License: cc-by-nc-4.0

Hugging Face · Updated 2026-02-19 · Indexed 2026-04-05

Download link:

https://hf-mirror.com/datasets/Sony/SCaR-Eval


Description:

---
license: cc-by-nc-4.0
datasets:
- Sony/SCaR-Train
language:
- en
metrics:
- accuracy
library_name: transformers
tags:
- Embedding
- Retrieval
- Interactive-image-to-text-retrieval
---

# SCaR Dataset Card

[SCaR-Train](https://huggingface.co/datasets/Sony/SCaR-Train) and [SCaR-Eval](https://huggingface.co/datasets/Sony/SCaR-Eval) are the official datasets for the paper "[VIRTUE: Visual-Interactive Text-Image Universal Embedder](https://arxiv.org/abs/2510.00523)", whose models are trained with MMEB-Train and SCaR-Train.

VIRTUE is a visual-interactive text-image universal embedder that combines a VLM with a segmentation model to support visual interaction as an input modality. In addition, we introduce the VIRTUE family ([VIRTUE-2B-SCaR](https://huggingface.co/Sony/VIRTUE-2B-SCaR), [VIRTUE-7B-SCaR](https://huggingface.co/Sony/VIRTUE-7B-SCaR)), trained with MMEB-Train and SCaR-Train to provide visual-interactive embedding capabilities.

SCaR was built from five publicly available datasets: RefCOCO+, RefCOCOg, VisualGenome, COCO-Stuff, and ADE20k. The annotations include images, bounding boxes, and captions that describe entities, relations, and the global scene context. To make reasoning harder, negative distractors are generated by prompting GPT-4V (OpenAI, 2023) to replace one of the three elements of the ground-truth caption, rather than by random sampling. For datasets that lack human captions (e.g., ADE20k), we generated ground-truth captions via carefully designed prompts to GPT-4V. In total, SCaR comprises 1M samples, divided into training and validation sets. A distinguishing characteristic of SCaR is that it evaluates not only visual-interactive reasoning but also compositional scenarios, requiring models to perform fine-grained, context-aware cross-modal reasoning that goes beyond global image matching.

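
The card lists accuracy as its metric and describes each sample as a ground-truth caption competing against generated distractors. A minimal sketch of how such a sample could be scored is shown here. It is an illustration under stated assumptions, not the paper's evaluation code: the helper names (`cosine`, `pick_caption`, `accuracy`), the use of cosine similarity, and the toy vectors are all hypothetical. In practice the embeddings would come from an embedder such as VIRTUE.

```python
import math
from typing import Iterable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pick_caption(query_emb: Vector, candidate_embs: List[Vector]) -> int:
    """Return the index of the candidate caption most similar to the query."""
    scores = [cosine(query_emb, c) for c in candidate_embs]
    return max(range(len(scores)), key=scores.__getitem__)


def accuracy(samples: Iterable[Tuple[Vector, List[Vector], int]]) -> float:
    """Fraction of samples where the top-ranked caption is the ground truth.

    Each sample is (query embedding, candidate caption embeddings,
    index of the ground-truth caption among the candidates).
    """
    samples = list(samples)
    if not samples:
        return 0.0
    correct = sum(pick_caption(q, cands) == gt for q, cands, gt in samples)
    return correct / len(samples)


# Toy data (made up): one ground-truth caption plus two distractors per sample.
toy = [
    ([1.0, 0.0], [[0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]], 0),
    ([0.0, 1.0], [[1.0, 0.0], [0.1, 0.9], [0.5, 0.5]], 1),
    ([1.0, 1.0], [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]], 0),  # distractor wins here
]

print(f"accuracy = {accuracy(toy):.3f}")
```

In this toy set the first two samples are answered correctly and the third is not, because a distractor embedding lies closer to the query than the ground-truth caption does. That is the failure mode hard distractors are meant to expose.
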
The collection pipeline, prompt template, and statistics of SCaR can be found in the paper.

## Model Checkpoints
- [VIRTUE-2B-SCaR](https://huggingface.co/Sony/VIRTUE-2B-SCaR)
- [VIRTUE-7B-SCaR](https://huggingface.co/Sony/VIRTUE-7B-SCaR)

## SCaR Dataset
- [SCaR-Train](https://huggingface.co/datasets/Sony/SCaR-Train)
- [SCaR-Eval](https://huggingface.co/datasets/Sony/SCaR-Eval)

## Experimental Results

### MMEB
- Without SCaR-Train:
![MMEB Results](images/MMEB-results.png)
- With SCaR-Train:
![MMEB Results with SCaR-Train](images/MMEB-results-with-SCaR.png)

### SCaR
![SCaR Results](images/SCaR-results.png)

## Resources
- [Paper](https://arxiv.org/abs/2510.00523)
- [Webpage](https://sony.github.io/virtue/)
- [Repository](https://github.com/sony/virtue)

## Ethical Considerations
_Note: This section is mainly taken from the [AKI](https://huggingface.co/Sony/AKI-4B-phi-3.5-mini) models_.

This release is for research purposes only in support of an academic paper. Our models, datasets, and code are not specifically designed or evaluated for all downstream purposes. We strongly recommend that users evaluate and address potential concerns related to accuracy, safety, and fairness before deploying this model. We encourage users to consider the common limitations of AI, comply with applicable laws, and follow best practices when selecting use cases, particularly for high-risk scenarios where errors or misuse could significantly impact people's lives, rights, or safety.

## Citation
```
@article{wangICLR2026virtue,
  author  = {Wei-Yao Wang and Kazuya Tateishi and Qiyu Wu and Shusuke Takahashi and Yuki Mitsufuji},
  title   = {VIRTUE: Visual-Interactive Text-Image Universal Embedder},
  journal = {arXiv preprint arXiv:2510.00523},
  year    = {2025}
}
```


Provider:

Sony
