couto/bjj-vqa
Hugging Face · Updated 2026-04-10 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/couto/bjj-vqa
Resource description:
---
pretty_name: BJJ-VQA
license: cc-by-sa-4.0
tags:
- vision-language-models
- visual-question-answering
- benchmark
- inspect-ai
- bjj
- brazilian-jiu-jitsu
task_categories:
- visual-question-answering
language:
- en
size_categories:
- n<1K
configs:
- config_name: default
  data_files:
  - split: test
    path: data/test-*
dataset_info:
  features:
  - name: id
    dtype: string
  - name: image
    dtype: string
  - name: question
    dtype: string
  - name: choices
    list: string
  - name: answer
    dtype: string
  - name: experience_level
    dtype: string
  - name: category
    dtype: string
  - name: subject
    dtype: string
  - name: source
    dtype: string
  splits:
  - name: test
    num_bytes: 4655128
    num_examples: 57
  download_size: 4642073
  dataset_size: 4655128
---
# BJJ-VQA
A Visual Question Answering benchmark that tests whether Vision-Language
Models can reason about Brazilian Jiu-Jitsu mechanics — not just recognize
technique names.
Each question presents a still frame from a CC-licensed instructional video
and asks *why* a specific visible detail matters. The correct answer cannot
be identified from text alone.
## Setup
```bash
uv sync
```
## Run an evaluation
```bash
export ANTHROPIC_AUTH_TOKEN=your-token
uv run inspect eval src/bjj_vqa/task.py --model anthropic/claude-opus-4-5
uv run inspect view
```
Any model supported by [inspect-ai](https://inspect.aisi.org.uk/providers.html)
works. Results go in `.eval_results/` in the model's repo.
## Dataset
1 question · gi only · single video source · intermediate
Images live in `data/images/` and are committed to this repo. The packaged
dataset (images + metadata) is published to Hugging Face Hub on each GitHub
release.
→ [huggingface.co/datasets/couto/bjj-vqa](https://huggingface.co/datasets/couto/bjj-vqa)
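A one-line summary like the one above can be recomputed from the metadata instead of being maintained by hand. The following is only a sketch: it assumes `data/samples.json` holds a JSON array of records with the `category`, `experience_level`, and `source` fields, and the helper names `video_id` and `summarize` are hypothetical, not part of the repo.

```python
import json
from collections import Counter
from pathlib import Path
from urllib.parse import parse_qs, urlparse


def video_id(url):
    """Extract the YouTube `v` parameter so timestamps do not split one video."""
    values = parse_qs(urlparse(url).query).get("v")
    # Fall back to the full URL when there is no `v` parameter.
    return values[0] if values else url


def summarize(samples):
    """Count questions, categories, experience levels, and distinct videos."""
    return {
        "questions": len(samples),
        "categories": Counter(s["category"] for s in samples),
        "experience_levels": Counter(s["experience_level"] for s in samples),
        "videos": len({video_id(s["source"]) for s in samples}),
    }


if __name__ == "__main__":
    # Assumption: run from the repo root; the file is a JSON array of records.
    path = Path("data/samples.json")
    if path.exists():
        print(summarize(json.loads(path.read_text(encoding="utf-8"))))
```

Grouping by video ID rather than by raw URL keeps two frames taken at different timestamps of the same video from being counted as two sources.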
## Contributing
Contributions are pairs of (image + question) submitted as a single PR.
**Image requirements**
- JPEG, extracted manually from a CC BY or CC BY-SA YouTube video
- Filename: next sequential 5-digit ID (e.g. `00006.jpg`)
- Commit the frame directly
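The filename rule above can be computed rather than eyeballed. This is a minimal sketch, assuming all frames sit in one directory and are named `NNNNN.jpg`; the helper `next_image_name` is hypothetical and not part of the repo.

```python
import re
from pathlib import Path

# Matches filenames such as 00006.jpg (exactly five digits).
ID_PATTERN = re.compile(r"^(\d{5})\.jpg$")


def next_image_name(existing_names):
    """Return the next sequential 5-digit JPEG filename given existing ones."""
    ids = [int(m.group(1)) for name in existing_names
           if (m := ID_PATTERN.match(name))]
    # With no matching files, numbering starts at 00001.
    return f"{max(ids, default=0) + 1:05d}.jpg"


if __name__ == "__main__":
    # Assumption: run from the repo root, where data/images/ lives.
    image_dir = Path("data/images")
    names = [p.name for p in image_dir.glob("*.jpg")] if image_dir.is_dir() else []
    print(next_image_name(names))
```

The helper takes the highest existing ID plus one, so gaps left by removed frames are not reused.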
**Question requirements**
- Question text must establish full situational context (position, what both
athletes are doing) so no prior frame is needed
- Ask *why* a visible detail matters — never ask *what* technique is shown
- All 4 choices must be plausible to someone who trains
- Correct answer must not be identifiable from text alone
- Answers distributed across A/B/C/D (no letter more than twice, no repeats
in consecutive questions)
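The answer-distribution rule in the last bullet is easy to check mechanically before opening a PR. The sketch below assumes the rule applies to one batch of questions in submission order (the README does not state the batch size); `check_answer_distribution` is a hypothetical helper, not part of the repo.

```python
from collections import Counter

VALID_LETTERS = ("A", "B", "C", "D")


def check_answer_distribution(answers, max_per_letter=2):
    """Return a list of rule violations for one batch of answer letters."""
    problems = []
    counts = Counter(answers)
    for letter in sorted(counts):
        if letter not in VALID_LETTERS:
            problems.append(f"invalid answer letter: {letter!r}")
        elif counts[letter] > max_per_letter:
            problems.append(
                f"{letter} is correct {counts[letter]} times (max {max_per_letter})"
            )
    # Adjacent questions must not share the same correct letter.
    for i in range(1, len(answers)):
        if answers[i] == answers[i - 1]:
            problems.append(
                f"questions {i} and {i + 1} both have answer {answers[i]}"
            )
    return problems
```

An empty list means the batch passes; for example, `check_answer_distribution(["A", "B", "C", "D", "A"])` reports no problems.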
**JSON fields** (add to `data/samples.json`):
```json
{
  "id": "00006",
  "image": "images/00006.jpg",
  "question": "...",
  "choices": ["...", "...", "...", "..."],
  "answer": "B",
  "experience_level": "beginner",
  "category": "gi",
  "subject": "submissions",
  "source": "https://www.youtube.com/watch?v=EXAMPLE&t=83s"
}
```
Allowed values:
- `experience_level` → `beginner` / `intermediate` / `advanced`
- `category` → `gi` / `no_gi`
- `subject` → `guard` / `passing` / `submissions` / `controls` / `escapes` / `takedowns`
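A record can be checked against the fields and allowed values listed above before it is added to `data/samples.json`. This is a sketch, not the repo's own validation: `validate_sample` is a hypothetical helper, and the rule that `image` must equal `images/<id>.jpg` is an assumption inferred from the example record.

```python
import re

ALLOWED = {
    "experience_level": {"beginner", "intermediate", "advanced"},
    "category": {"gi", "no_gi"},
    "subject": {"guard", "passing", "submissions",
                "controls", "escapes", "takedowns"},
}
REQUIRED_FIELDS = ("id", "image", "question", "choices", "answer",
                   "experience_level", "category", "subject", "source")


def validate_sample(sample):
    """Return a list of human-readable errors; an empty list means valid."""
    missing = [f for f in REQUIRED_FIELDS if f not in sample]
    if missing:
        return [f"missing field: {f}" for f in missing]
    errors = []
    if not re.fullmatch(r"\d{5}", str(sample["id"])):
        errors.append("id must be a 5-digit string")
    # Assumption: the image path mirrors the id, as in the example record.
    if sample["image"] != f"images/{sample['id']}.jpg":
        errors.append("image must be images/<id>.jpg")
    if not (isinstance(sample["choices"], list) and len(sample["choices"]) == 4):
        errors.append("choices must be a list of 4 options")
    if sample["answer"] not in ("A", "B", "C", "D"):
        errors.append("answer must be one of A, B, C, D")
    for field, allowed in ALLOWED.items():
        if sample[field] not in allowed:
            errors.append(f"{field} must be one of {sorted(allowed)}")
    return errors
```

Returning a list of errors instead of raising on the first one lets a contributor fix everything in a single pass.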
**Attribution** — add a line to the Sources section below for any new video.
**Generating candidates** — paste the prompt below into Gemini with a CC
video attached. Output requires your review before submission.
<details>
<summary>Question generation prompt (Gemini)</summary>
```
You are a BJJ black belt with competition experience in gi and no-gi.
Watch the attached video. Generate exactly 5 questions. Be concise.
---
## Context
These questions are for BJJ-VQA, a Visual Question Answering benchmark that
tests whether AI vision models can reason about what is happening on the mat,
not just recognize techniques by name.
A VQA benchmark presents a model with an image and a multiple-choice
question. The model must choose the correct answer by reasoning about what
it sees. This creates a specific failure mode called a language shortcut: if
a model can identify the correct answer by reading the question and options
alone, without processing the image, the question is invalid.
A question is free of language shortcuts when:
- The correct answer cannot be guessed from BJJ knowledge alone
- The correct answer is not identifiable as the longest, most complete, or
most technically worded option
- All 4 options are plausible to someone who trains but has not seen this frame
- The image is the deciding factor
---
## Question Construction
Every question must be self-contained. Write the question so it establishes
full situational context so no prior frame is needed. The image reveals only
the specific visible detail being asked about.
Ask WHY a visible detail matters mechanically. Never ask WHAT technique is
shown. Plain mat language only, no anatomy terms.
If SHORTCUT_RISK is MEDIUM or HIGH, rewrite before submitting.
---
## Answer Distribution
Spread correct answers across A, B, C, D. No letter appears more than twice.
No letter repeats in two consecutive questions. Vary grammatical structure
across options.
---
## Format
TIMESTAMP: [MM:SS]
QUESTION: [self-contained context + specific visible detail]
A) ...
B) ...
C) ...
D) ...
ANSWER: [A / B / C / D]
CONCEPT: [2-5 words, plain mat language]
EXPERIENCE_LEVEL: [beginner / intermediate / advanced]
CATEGORY: [gi / no_gi]
SUBJECT: [guard / passing / submissions / controls / escapes / takedowns]
RATIONALE: [Coach talking to a student. Why correct? Why each wrong option fails?]
SHORTCUT_RISK: [LOW / MEDIUM / HIGH]
---
## Distractor Rules
For each question, the four options must collectively include:
- One option applying a real BJJ principle to the wrong situation
- One option partially correct but wrong about the mechanism
- One option describing the opposite of what is happening
- The correct answer — not the longest or most complete-sounding option
---
## Coverage
After the 5 questions:
- SUBJECT distribution:
- EXPERIENCE_LEVEL distribution:
- Highest SHORTCUT_RISK and why:
- Frame with most occlusion risk:
- What is missing that the next video should cover:
```
</details>
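One of the shortcut criteria in the prompt, that the correct answer must not be identifiable as the longest option, can be screened automatically during review. The sketch below uses character count as a rough proxy for "longest" and assumes `answer` letters index `choices` in order (A is the first choice); `longest_option_hits` is a hypothetical helper, not part of the repo.

```python
def longest_option_hits(samples):
    """Return ids of samples whose correct choice is strictly the longest."""
    hits = []
    for sample in samples:
        lengths = [len(choice) for choice in sample["choices"]]
        # Assumption: "A" maps to choices[0], "B" to choices[1], and so on.
        correct = "ABCD".index(sample["answer"])
        others = lengths[:correct] + lengths[correct + 1:]
        if lengths[correct] > max(others):
            hits.append(sample["id"])
    return hits
```

A flagged question is not necessarily invalid, but if most questions in a batch are flagged, a model could score well by always picking the longest option without looking at the image.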
## Sources
All frames extracted from Creative Commons licensed videos.
| Video | Author | License | Used at |
|-------|--------|---------|---------|
| [Armlock X Triangulo Partindo da Guarda Fechada](https://youtube.com/watch?v=SzL_uObk8fk) | Cobrinha BJJ & Fitness | [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/) | 00001 |
When using this dataset, please attribute:
> Frames from "Armlock X Triangulo Partindo da Guarda Fechada" by Cobrinha
> Brazilian Jiu-Jitsu & Fitness, CC BY 4.0.
## Citation
```bibtex
@dataset{bjj_vqa_2026,
  author = {Matheus Couto},
  title = {BJJ-VQA: Brazilian Jiu-Jitsu Visual Question Answering Benchmark},
  year = {2026},
  url = {https://huggingface.co/datasets/couto/bjj-vqa},
  license = {CC BY-SA 4.0}
}
```
**Code**: GPL-3.0 · **Dataset**: CC BY-SA 4.0
Provided by:
couto
Collected summary
Dataset introduction

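Construction
In the field of visual question answering, the construction of BJJ-VQA follows a rigorous academic approach. Its core method is to manually extract still frames from Creative Commons licensed Brazilian Jiu-Jitsu instructional videos as visual material. Every question is carefully designed by domain experts and strictly follows the "no language shortcut" principle, so that the correct answer cannot be inferred from the text or from jiu-jitsu knowledge alone. The construction process emphasizes self-contained questions: the question text must fully describe the situation, so that the image becomes the deciding factor in answering. Contributors submit data using a standardized template and undergo strict review of answer distribution and plausibility, which safeguards the dataset's scientific rigor and reliability.
Characteristics
The dataset's most notable characteristics are its deep specialization and precise evaluation focus. It concentrates on the specific domain of Brazilian Jiu-Jitsu and requires models to understand complex mechanical principles rather than simply recognize movements. Every question asks "why" a particular visual detail matters, which forces the model to perform deep visual reasoning. Each question has four options that are all plausible to someone who trains, and correct answers are evenly distributed across A, B, C, and D, which avoids evaluation bias. The dataset is also annotated with multi-dimensional metadata such as experience level, category, and subject, which enables fine-grained model analysis.
Usage
As a specialized benchmark, BJJ-VQA is mainly used to evaluate the advanced reasoning ability of vision-language models in a specific domain. Researchers can load the dataset directly from Hugging Face and use its images, questions, and multiple-choice answers for zero-shot or few-shot evaluation. The dataset is compatible with evaluation frameworks such as inspect-ai; users can configure different models for testing, and results are saved automatically for analysis. Use requires following the Creative Commons license and properly attributing the video material used. The dataset is suited to testing whether a model truly understands causal mechanisms in a visual scene, and is an effective tool for advancing domain-specific AI.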
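Background and challenges
Background overview
Against the backdrop of rapid progress in vision-language models, general-purpose VQA benchmarks often struggle to assess a model's deep reasoning in specialized domains. BJJ-VQA was created by Matheus Couto in 2026 and focuses on Brazilian Jiu-Jitsu, a complex combat sport. It aims to examine whether vision-language models can go beyond recognizing techniques and understand the mechanical principles and tactical significance of specific visual details in an image, filling a gap in visual reasoning evaluation for specialized sports analysis. Its core research question is whether a model can perform domain-specific causal reasoning from visual information rather than relying on textual prior knowledge, which matters for making models practical in specialized settings.
Current challenges
The dataset addresses the particular challenges VQA faces in highly specialized domains. Its core aim is to force models to perform genuine cross-modal causal reasoning rather than guess from linguistic statistical regularities or domain common sense. Specifically, the main construction challenges include: ensuring that the correct answer to each question cannot be inferred from BJJ knowledge in the question text and options alone and must depend on the image; manually extracting key frames with clear mechanical principles from Creative Commons licensed instructional videos; designing distractors that are both realistic and sufficiently confusing, with every option plausible to someone who trains and the correct answer showing no distinctive length or wording; and strictly controlling the even distribution of answers across options to avoid patterns. Together these challenges reflect the complexity of building a specialized benchmark that is free of language shortcuts and truly tests visual understanding.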
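Common scenarios
Typical use case
In vision-language model evaluation, BJJ-VQA is designed as a specialized benchmark for testing a model's deep visual reasoning in Brazilian Jiu-Jitsu scenes. In its typical use case, researchers present a still frame from a jiu-jitsu instructional video together with a "why" question about a specific mechanical principle, and the model must make a causal inference from the visual details in the image rather than merely recognize a technique name or rely on textual cues. This setup mirrors the tactical analysis based on visual observation that happens in real training, and provides a standardized platform for evaluating fine-grained understanding in a specific vertical domain.
Practical applications
In practice, the visual reasoning ability that BJJ-VQA targets can transfer directly to sports instruction and training-assistance systems. For example, a model developed on this dataset could analyze a student's training footage, automatically identify key mechanical details in the movements, and give principle-based feedback, forming an intelligent jiu-jitsu coaching aid. Its methodology can also extend to other fields that require visually grounded reasoning about procedural knowledge, such as surgical guidance and industrial operations training, offering a feasibility check for intelligent systems that understand complex physical interactions.
Derived and related work
Work related to BJJ-VQA proceeds mainly in two directions. The first is building fine-tuning and evaluation frameworks for domain-specific vision-language models, where researchers use the benchmark to test generalization and transfer in specialized vertical domains. The second is that its construction methodology has inspired more VQA datasets for specialized skills, for example similar "Why-VQA" benchmarks in martial arts, dance, and handicrafts. Together, these derived works advance vision-language models in real-world applications that require deep domain knowledge and physical reasoning.
The content above was collected and summarized by 遇见数据集.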



