K-LLaVA-W

Name: K-LLaVA-W
Creator: maas
Published: 2025-12-05 16:43:08
License: No description available

ModelScope community. Updated 2025-12-05. Indexed 2025-07-26.

Download link:

https://modelscope.cn/datasets/NCSOFT/K-LLaVA-W


Resource description:

# K-LLaVA-W

We introduce **K-LLaVA-W**, a Korean adaptation of [LLaVA-Bench-in-the-wild](https://arxiv.org/abs/2304.08485) [1], designed for evaluating vision-language models. We translated LLaVA-Bench-in-the-wild into Korean and had human reviewers check the translations for naturalness, producing a robust evaluation benchmark specifically for the Korean language. (Because our goal was a benchmark focused exclusively on Korean, we also replaced the English text inside the images with Korean for localization.) K-LLaVA-W contains 24 images from various domains and 60 daily-life questions, allowing a thorough evaluation of model performance in Korean.

For more details, please refer to the VARCO-VISION technical report.

- **Technical Report:** [VARCO-VISION: Expanding Frontiers in Korean Vision-Language Models](https://arxiv.org/pdf/2411.19103)
- **Blog (Korean):** [VARCO-VISION Technical Report Summary](https://ncsoft.github.io/ncresearch/95ad8712e60063e9ac97538504ac3eea0ac530af)
- **Hugging Face Version Model:** [NCSOFT/VARCO-VISION-14B-HF](https://huggingface.co/NCSOFT/VARCO-VISION-14B-HF)
- **Evaluation Repository:** [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval)

<table>
<tr>
  <th>Image</th>
  <th>LLaVA-Bench-in-the-wild</th>
  <th>K-LLaVA-W</th>
</tr>
<tr>
  <td width=200><img src="https://cdn-uploads.huggingface.co/production/uploads/624ceaa38746b2f5773c2d1c/SQgVUuJ831NQ0Rr9_5Bp0.jpeg"></td>
  <td>
    question: What is the name of this famous sight in the photo?
    <br>
    caption: An aerial view of Diamond Head in the Hawaiian Islands.
    <br>
    gpt_answer: The famous sight in the photo is Diamond Head.
  </td>
  <td>
    question: 사진에 나오는 이 유명한 장소의 이름은 무엇인가요?
    <br>
    caption: 하와이 제도의 다이아몬드 헤드를 공중에서 본 모습입니다.
    <br>
    gpt_answer: 이 사진은 하와이에 있는 다이아몬드 헤드입니다.
  </td>
</tr>
</table>

## Inference Prompt

```
<image>
{question}
```

## Evaluation Prompt

```
[설명]
{caption}

[질문]
{question}

[어시스턴트 1]
{gpt_answer}
[어시스턴트 1 끝]

[어시스턴트 2]
{target_model_answer}
[어시스턴트 2 끝]

[System]
두 인공지능 어시스턴트의 성능을 [질문]에 대한 응답에 기반하여 평가하세요. 해당 [질문]은 특정 이미지를 보고 생성되었습니다. `유용성`, `관련성`, `정확성`, `세부 수준`, `한국어 생성능력`을 기준으로 응답을 평가하세요. 각각의 어시스턴트에게 1에서 10까지의 전반적인 점수를 부여하며, 높은 점수일수록 더 나은 전반적인 성능을 나타냅니다.

# 단계
1. 제공된 이미지 [설명]을 검토하세요.
2. 각 어시스턴트의 응답을 다음 기준으로 분석하세요:
   - `유용성`: 응답이 사용자의 질문을 얼마나 잘 해결하는가?
   - `관련성`: 응답이 사용자의 질문에 얼마나 적절한가?
   - `정확성`: 응답에서 제공한 정보가 얼마나 정확한가?
   - `세부 수준`: 응답이 과하지 않게 충분히 자세한가?
   - `한국어 생성능력`: 생성된 한국어 문장이 자연스럽고 문법적으로 올바른가?
3. 분석에 기반하여 각 어시스턴트에게 1에서 10까지의 점수를 부여하세요.
4. 두 점수를 공백으로 구분하여 한 줄로 제공하세요.
5. 점수에 대한 이유를 강조하면서 포괄적인 평가를 제공하고, 편견을 피하며 응답의 순서가 판단에 영향을 미치지 않도록 하세요.

# 출력 형식
- 첫 번째 줄: `어시스턴트1_점수 어시스턴트2_점수` (예: `8 9`)
- 두 번째 줄: `유용성`, `관련성`, `정확성`, `세부 수준`, `한국어 생성능력` 기준으로 점수를 설명하는 자세한 문단을 제공합니다.

# 주의사항
- 평가 시 잠재적 편견을 방지하여 객관성을 확보하세요.
- 분석과 설명에서 일관성과 명확성을 유지하세요.
```

## Results

Below are the evaluation results of various vision-language models, including [VARCO-VISION-14B](https://huggingface.co/NCSOFT/VARCO-VISION-14B), on K-LLaVA-W.

| | VARCO-VISION-14B | Pangea-7B | Pixtral-12B | Molmo-7B-D-0924 | Qwen2-VL-7B-Instruct | LLaVA-One-Vision-7B |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| K-LLaVA-W | **84.74** | 69.70 | 82.00 | 63.90 | 62.00 | 48.80 |

## References

[1] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36, 2024.

## Citation

If you use K-LLaVA-W in your research, please cite the following:

```bibtex
@misc{ju2024varcovisionexpandingfrontierskorean,
  title={VARCO-VISION: Expanding Frontiers in Korean Vision-Language Models},
  author={Jeongho Ju and Daeyoung Kim and SunYoung Park and Youngjune Kim},
  year={2024},
  eprint={2411.19103},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2411.19103},
}
```
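
## Example: filling the evaluation prompt and parsing judge scores

The evaluation prompt above asks the judge to put two space-separated scores on its first line (for example `8 9`). The following is a minimal sketch of how that template could be filled and its output parsed. It is not the code used by lmms-eval or by the VARCO-VISION authors. The function names `build_eval_prompt`, `parse_scores`, and `relative_score` are hypothetical, and `EVAL_TEMPLATE` is an abbreviated stand-in for the full Korean template. The card does not say how the numbers in the results table are aggregated; `relative_score` assumes the ratio-of-sums convention (target total divided by reference total, times 100), which may differ from the actual procedure.

```python
import re

# Abbreviated stand-in for the Korean judge template shown in the
# "Evaluation Prompt" section; the full [System] instructions would follow.
EVAL_TEMPLATE = (
    "[설명]\n{caption}\n\n"
    "[질문]\n{question}\n\n"
    "[어시스턴트 1]\n{gpt_answer}\n[어시스턴트 1 끝]\n\n"
    "[어시스턴트 2]\n{target_model_answer}\n[어시스턴트 2 끝]\n"
)


def build_eval_prompt(caption, question, gpt_answer, target_model_answer):
    """Fill the four placeholders used by the evaluation prompt."""
    return EVAL_TEMPLATE.format(
        caption=caption,
        question=question,
        gpt_answer=gpt_answer,
        target_model_answer=target_model_answer,
    )


def parse_scores(judge_output):
    """Return (assistant1_score, assistant2_score) from the first line.

    Returns None when the first line is not two numbers in the 1-10 range,
    so malformed judge responses can be skipped or retried by the caller.
    """
    lines = judge_output.strip().splitlines()
    if not lines:
        return None
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s+(\d+(?:\.\d+)?)\s*", lines[0])
    if match is None:
        return None
    first, second = float(match.group(1)), float(match.group(2))
    if not (1 <= first <= 10 and 1 <= second <= 10):
        return None
    return first, second


def relative_score(pairs):
    """ASSUMPTION: aggregate as 100 * sum(target) / sum(reference).

    `pairs` is a list of (reference_score, target_score) tuples, where the
    reference is assistant 1 (gpt_answer) and the target is assistant 2.
    """
    reference_total = sum(reference for reference, _ in pairs)
    target_total = sum(target for _, target in pairs)
    return 100.0 * target_total / reference_total
```

A caller would send the output of `build_eval_prompt` to whichever judge model it uses, pass each response through `parse_scores`, drop the `None` results, and feed the remaining pairs to `relative_score`.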


Provider:

maas

Created:

2025-07-24
