Common-O

Name: Common-O
Creator: maas
Published: 2025-12-05 12:14:53
License: No description available

ModelScope community · Updated 2025-12-05 · Indexed 2025-11-08

Download link:

https://modelscope.cn/datasets/facebook/Common-O


Description:

# Common-O

> measuring multimodal reasoning across scenes

Common-O, inspired by cognitive tests for humans, probes multimodal LLMs' ability to reason across scenes by asking "what’s in common?"

![fair conference content copy.001](https://cdn-uploads.huggingface.co/production/uploads/64c17345e82e55936cf971bc/5av7avUrsBjFuMrWuOiCW.jpeg)

Common-O consists of household objects:

![fair conference content copy.003](https://cdn-uploads.huggingface.co/production/uploads/64c17345e82e55936cf971bc/hEvVz2uFR6z-jv1em25eY.jpeg)

We have two subsets: Common-O (3-8 objects) and Common-O Complex (8-16 objects).

## Multimodal LLMs excel at single image perception, but struggle with multi-scene reasoning

![single_vs_multi_image(1)](https://cdn-uploads.huggingface.co/production/uploads/64c17345e82e55936cf971bc/1cB9iXHrSgyvfXgK6gmGu.png)

## Evaluating a Multimodal LLM on Common-O

```python
import datasets

# get a sample
common_o = datasets.load_dataset("facebook/Common-O")["main"]
# common_o_complex = datasets.load_dataset("facebook/Common-O")["complex"]
x = common_o[3]

# `model` is a placeholder for your own multimodal LLM call;
# `check_answer` is defined in the next block
output: str = model(x["image_1"], x["image_2"], x["question"])
check_answer(output, x["answer"])
```

To check the answer, we use an exact match criterion:

```python
import re


def check_answer(generation: str, ground_truth: str) -> bool:
    """
    Args:
        generation: model response, expected to contain "Answer: ..."
        ground_truth: comma-separated string of correct answers

    Returns: bool, whether the prediction matches the ground truth
    """
    # only the last line of the response is treated as the answer
    preds = generation.split("\n")[-1]
    preds = re.sub("Answer:", "", preds)
    preds = preds.split(",")
    preds = [p.strip() for p in preds]
    # sort on the full string so the order matches ground_truth_list;
    # sorting on the first character alone mis-orders items that share
    # an initial and raises IndexError on empty items
    preds = sorted(preds)

    # split into a list
    ground_truth_list = [a.strip() for a in ground_truth.split(",")]
    ground_truth_list = sorted(ground_truth_list)
    return preds == ground_truth_list
```

Some models use their own answer format, e.g. `\boxed{A}` or `Answer: A`. We recommend inspecting a few responses, since such formatting differences can cause slight variations in measured accuracy. This public set also differs slightly from the set used in the original paper, so while the measured capabilities are identical, do not expect to replicate the reported accuracy figures exactly.

If you'd like to use a single image model, here's a handy function to turn `image_1` and `image_2` into a single split image:

```python
from PIL import Image


def concat_images_horizontal(
    image1: Image.Image,
    image2: Image.Image,
    include_space: bool = True,
    space_width: int = 20,
    fill_color: tuple = (0, 0, 0),
) -> Image.Image:
    # from https://note.nkmk.me/en/python-pillow-concat-images/
    if not include_space:
        # paste side by side; the canvas takes the height of image1
        dst = Image.new("RGB", (image1.width + image2.width, image1.height))
        dst.paste(image1, (0, 0))
        dst.paste(image2, (image1.width, 0))
    else:
        # leave a filled gap between the images and center both vertically
        total_width = image1.width + space_width + image2.width
        max_height = max(image1.height, image2.height)

        dst = Image.new("RGB", (total_width, max_height), color=fill_color)
        dst.paste(image1, (0, (max_height - image1.height) // 2))
        dst.paste(image2, (image1.width + space_width, (max_height - image2.height) // 2))
    return dst
```

For more details about Common-O see:

- [dataset card](https://huggingface.co/datasets/facebook/Common-O/blob/main/COMMON_O_DATASET_CARD.md)
- [ArXiv Paper](https://arxiv.org/abs/2511.03768)

Cite:

```
@inproceedings{Ross2025what0s,
  title = {What’s in Common? Multimodal Models Hallucinate When Reasoning Across Scenes},
  author = {Candace Ross and Florian Bordes and Adina Williams and Polina Kirichenko and Mark Ibrahim},
  year = {2025},
  url = {https://openreview.net/attachment?id=d0F0N0cu4n&name=supplementary_material}
}
```
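
As noted in the evaluation section, some models wrap their answer in `\boxed{...}` rather than writing `Answer: ...`, which the exact match check does not parse. The following is a minimal sketch of one way to normalize a response before passing it to `check_answer`. The helper `normalize_generation` and its regular expressions are hypothetical and not part of the Common-O release. It assumes the last `\boxed{...}` expression, when present, holds the final answer.

```python
import re

# Hypothetical helper, not part of the Common-O release.
_BOXED = re.compile(r"\\boxed\{([^{}]*)\}")
_ANSWER = re.compile(r"answer\s*:\s*(.*)", re.IGNORECASE)


def normalize_generation(generation: str) -> str:
    r"""Rewrite a response as a single "Answer: <items>" line.

    Handles \boxed{...} responses and "Answer: ..." responses; anything
    else falls back to the raw last line of the response.
    """
    text = generation.strip()
    boxed = _BOXED.findall(text)
    if boxed:
        # Assumption: the last boxed expression is the final answer.
        items = boxed[-1]
    else:
        last_line = text.split("\n")[-1]
        match = _ANSWER.search(last_line)
        items = match.group(1) if match else last_line
    return "Answer: " + items.strip()
```

The returned string can then be scored with `check_answer(normalize_generation(output), x["answer"])`. Spot-checking a handful of normalized responses by hand is still advisable, since models may use formats this sketch does not cover.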


Provider:

maas

Created:

2025-10-25
