CoE-Wiki
Source: ModelScope community. Updated 2025-11-05; indexed 2025-11-08.
Download link:
https://modelscope.cn/datasets/anonymou1111/CoE-Wiki
Resource description:
# Wiki-CoE Dataset
## Overview
**Wiki-CoE** (Wiki-Chain of Evidence) is the first large-scale visual evidence localization benchmark for multi-hop reasoning, featuring:
- **51,086 multi-hop questions** (40,000 train / 11,086 test)
- **70,088 high-resolution Wikipedia screenshots**
- **126,323 precisely annotated evidence bounding boxes**

Key features:
- Preserves original Wikipedia layouts (tables, infoboxes, images)
- Covers 4 types of complex reasoning questions
- Provides pixel-level evidence attribution for iterative RAG systems
## Data Collection
### 1. Document Acquisition
- **Tool**: Selenium WebDriver (preserves full CSS styling)
- **Sampling Strategy**: Priority-based crawling of high-frequency entities
- **Coverage**: 70,088 valid screenshots from 80k attempted crawls
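
The acquisition step can be sketched roughly as follows. This is a minimal illustration, not the dataset authors' crawler: the helper names `wikipedia_url` and `capture_page`, the use of English Wikipedia, headless Chrome, and the 1280-pixel width are all assumptions. Only the choice of Selenium WebDriver comes from the description above.

```python
from urllib.parse import quote


def wikipedia_url(title: str, lang: str = "en") -> str:
    """Build a page URL from an article title (spaces become underscores)."""
    # Assumption: pages come from <lang>.wikipedia.org; the source does not say.
    return f"https://{lang}.wikipedia.org/wiki/{quote(title.replace(' ', '_'))}"


def capture_page(url: str, out_path: str, width: int = 1280) -> bool:
    """Render a page in headless Chrome and save a full-height PNG screenshot."""
    # Imported lazily so the URL helper works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.set_window_size(width, 1024)
        driver.get(url)
        # Grow the window to the document height so the whole layout,
        # including tables and infoboxes, ends up in a single image.
        height = driver.execute_script(
            "return document.documentElement.scrollHeight"
        )
        driver.set_window_size(width, height)
        return driver.save_screenshot(out_path)
    finally:
        driver.quit()
```

A call such as `capture_page(wikipedia_url("Alan Turing"), "alan_turing.png")` would produce one screenshot. Very long pages may exceed the browser's maximum window height, which is one plausible reason some crawl attempts yield no valid image.
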
### 2. Annotation Pipeline
**Three-stage annotation workflow**:
1. **Text Anchoring**: Precise text matching for evidence sentences
2. **Visual Mapping**: Bounding box generation for cross-modal alignment
3. **Consistency Checks**: Spatial coherence validation for reading order
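
The three stages above might look like the following sketch. Everything here is hypothetical: the `Box` type, the function names, the per-token character spans and boxes (assumed to come from the rendered page), and the 5-pixel line tolerance are illustrative choices, not the dataset's actual pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    x0: float
    y0: float
    x1: float
    y1: float


def anchor_text(page_text: str, evidence: str):
    """Stage 1: exact match of the evidence sentence; returns (start, end) or None."""
    start = page_text.find(evidence)
    return None if start < 0 else (start, start + len(evidence))


def map_to_box(span, token_spans, token_boxes):
    """Stage 2: union of the boxes of all tokens overlapping the character span."""
    start, end = span
    hits = [
        box
        for (tok_start, tok_end), box in zip(token_spans, token_boxes)
        if tok_start < end and tok_end > start
    ]
    if not hits:
        return None
    return Box(
        min(b.x0 for b in hits),
        min(b.y0 for b in hits),
        max(b.x1 for b in hits),
        max(b.y1 for b in hits),
    )


def in_reading_order(boxes, line_tol: float = 5.0) -> bool:
    """Stage 3: boxes must run top to bottom, and left to right within a line."""
    for prev, cur in zip(boxes, boxes[1:]):
        if cur.y0 < prev.y0 - line_tol:
            return False  # current box starts above the previous one
        same_line = abs(cur.y0 - prev.y0) <= line_tol
        if same_line and cur.x0 < prev.x0:
            return False  # same line, but placed to the left
    return True
```

Exact string matching keeps stage 1 simple, at the cost of missing evidence whose on-page wording differs from the annotated sentence.
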
### 3. Quality Control
- **Webpage Integrity**: Validation that JavaScript rendering completed and that no images are missing
- **Annotation Verification**: Text-visual correspondence checks
- **Dynamic Filtering**: HTML diff analysis for stale content removal
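
As an illustration of the dynamic filtering idea, the sketch below compares a crawled HTML snapshot with a later one using Python's standard `difflib`. The function names and the 0.2 change threshold are assumptions made for this example; the source does not specify how the diff is computed or thresholded.

```python
import difflib


def html_change_ratio(old_html: str, new_html: str) -> float:
    """Fraction of line-level content that differs between two HTML snapshots."""
    matcher = difflib.SequenceMatcher(
        None, old_html.splitlines(), new_html.splitlines(), autojunk=False
    )
    # ratio() is 1.0 for identical inputs, so invert it to get "amount changed".
    return 1.0 - matcher.ratio()


def is_stale(old_html: str, new_html: str, max_change: float = 0.2) -> bool:
    """Flag a snapshot as stale when the page changed more than max_change."""
    return html_change_ratio(old_html, new_html) > max_change
```

A line-level comparison is cheap but coarse. A stricter filter could diff only the text nodes that contain annotated evidence.
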
## Dataset Statistics
| Metric | Train | Test | Total |
|-----------------------|------------|------------|------------|
| Questions | 40,000 | 11,086 | 51,086 |
| Avg. Question Length | 13.18 | 13.10 | 13.16 |
| Evidence Boxes | 99,048 | 27,275 | 126,323 |
| Multi-hop Ratio | 76.2% | 77.1% | 76.4% |
| 4+ Hop Questions | 23.7% | 22.8% | 23.5% |
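
The Total column can be cross-checked against the Train and Test columns. The short sketch below assumes that each per-split average and percentage is computed over that split's questions, so the total is a question-weighted mean; the variable and function names are illustrative.

```python
TRAIN_QUESTIONS, TEST_QUESTIONS = 40_000, 11_086
TRAIN_BOXES, TEST_BOXES = 99_048, 27_275

total_questions = TRAIN_QUESTIONS + TEST_QUESTIONS
total_boxes = TRAIN_BOXES + TEST_BOXES


def weighted(train_value: float, test_value: float) -> float:
    """Question-weighted mean of a per-split statistic."""
    return (
        train_value * TRAIN_QUESTIONS + test_value * TEST_QUESTIONS
    ) / total_questions


avg_length = round(weighted(13.18, 13.10), 2)     # Avg. Question Length
multi_hop = round(weighted(76.2, 77.1), 1)        # Multi-hop Ratio (%)
four_plus_hop = round(weighted(23.7, 22.8), 1)    # 4+ Hop Questions (%)
boxes_per_question = round(total_boxes / total_questions, 2)
```

Under that assumption the recomputed values agree with the table, and the counts imply roughly 2.47 evidence boxes per question.
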

**Question Type Distribution**:
- Bridge Comparison: 23.5%
- Inference: 2.01%
- Compositional: 69.4%
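
A quick sum of the listed shares shows how much of the distribution they account for. The dictionary below only restates the three figures above; reading the remainder as the share of question types not listed here is an assumption.

```python
listed_shares = {
    "Bridge Comparison": 23.5,
    "Inference": 2.01,
    "Compositional": 69.4,
}

# Percentage covered by the listed types, and what is left over.
listed_total = round(sum(listed_shares.values()), 2)
unlisted_share = round(100 - listed_total, 2)
```

The three listed types cover 94.91% of questions, leaving 5.09% not broken down in this list.
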
Provider:
maas
Created:
2025-11-05



