FineWeb-PosQ_raw
Source: ModelScope community (updated 2025-12-05, indexed 2025-07-05)
Download link: https://modelscope.cn/datasets/NovaSearch/FineWeb-PosQ_raw
Description:
# Dataset Card for FineWeb-PosQ
- **Paper:** [Benchmarking the Myopic Trap: Positional Bias in Information Retrieval](https://arxiv.org/abs/2505.13950)
- **Repository:** https://github.com/NovaSearch-Team/RAG-Retrieval/tree/master/examples/MyopicTrap
- **License:** ODC-BY
- **Languages:** English
## Dataset Summary
**FineWeb-PosQ** is a synthetic QA dataset designed to evaluate **position-sensitive retrieval**, a task that assesses a retrieval model's robustness to variations in the position of query-relevant information within a passage.
It is constructed using passages sampled from **FineWeb-edu**, a large-scale, high-quality educational web corpus.
We selected 13,902 passages ranging from 500 to 1,024 words in length.
For each passage, we use `gpt-4o-mini` to generate:
* A **global summary** of the entire passage.
* Multiple **position-aware question–answer pairs**, grounded in localized chunks of the passage.
To facilitate position-aware analysis, each passage is segmented into three equal-length parts: **beginning**, **middle**, and **end**.
Each question–answer pair is labeled with the segment(s) corresponding to the answer’s source chunk.
If a chunk spans multiple segments, multiple labels are applied to reflect ambiguity.
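The labeling rule above can be sketched as follows. This is a minimal illustration, not the authors' construction code: the function name `label_span` is hypothetical, and it assumes that segments are equal thirds measured in characters and that a span is a half-open `[start, end)` character range. The card does not state either detail.

```python
def label_span(passage_length: int, start: int, end: int) -> list[str]:
    """Return the tag of every third of the passage that overlaps [start, end).

    Assumption: thirds are measured in characters and the span is half-open.
    """
    if not (0 <= start < end <= passage_length):
        raise ValueError("span must satisfy 0 <= start < end <= passage_length")

    names = ["beginning", "middle", "end"]
    labels = []
    for i, name in enumerate(names):
        seg_start = passage_length * i / 3
        seg_end = passage_length * (i + 1) / 3
        # Two half-open intervals overlap when each starts before the other ends.
        if start < seg_end and end > seg_start:
            labels.append(name)
    return labels
```

Under these assumptions, a chunk at characters 250 to 350 of a 900-character passage straddles the first boundary and receives both `beginning` and `middle`.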
## Dataset Structure
### Data Fields
* `question` (`string`): A position-aware question generated based on a localized chunk of the passage.
* `content` (`string`): The full text of the passage.
* `content_summary` (`string`): A global summary of the passage, generated by a large language model.
* `answer` (`string`): The answer extracted from a specific chunk of the passage.
* `question_level` (`string`): The difficulty level of the question (e.g., simple, complicated).
* `span` (`sequence[int32]`): The start and end character positions of the answer chunk within the passage.
* `span_class` (`sequence[string]`): One or more positional tags indicating where the answer chunk is located in the passage (e.g., beginning, middle, end).
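To make the field layout concrete, the sketch below types one record and shows two small helpers. The names `PosQRecord`, `answer_chunk`, `count_by_position`, and the `toy` record are hypothetical; the toy values are invented for illustration and do not come from the dataset. It also assumes that `span` holds exactly two integers usable as a Python slice, which the card does not confirm.

```python
from collections import Counter
from typing import Iterable, TypedDict


class PosQRecord(TypedDict):
    """One example, with the fields listed in the card."""
    question: str
    content: str
    content_summary: str
    answer: str
    question_level: str
    span: list[int]
    span_class: list[str]


def answer_chunk(record: PosQRecord) -> str:
    """Slice the source chunk out of the passage (assumes span = [start, end))."""
    start, end = record["span"]
    return record["content"][start:end]


def count_by_position(records: Iterable[PosQRecord]) -> Counter:
    """Count examples per positional tag; multi-label examples count once per tag."""
    counts: Counter = Counter()
    for record in records:
        counts.update(record["span_class"])
    return counts


# Toy record with invented values, for illustration only.
toy: PosQRecord = {
    "question": "What does the second sentence say?",
    "content": "Alpha section. Beta section. Gamma section.",
    "content_summary": "Three short sections.",
    "answer": "Beta section.",
    "question_level": "simple",
    "span": [15, 28],
    "span_class": ["middle"],
}
```

Counting per tag rather than per example means the totals can exceed the number of records whenever a chunk carries more than one label.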
### Data Splits
| Split | Examples |
| ----- | -------- |
| train | 265,865 |
## Citation
If you use this dataset in your research, please cite the associated paper:
```bibtex
@misc{zeng2025myopictrap,
title={Benchmarking the Myopic Trap: Positional Bias in Information Retrieval},
author={Ziyang Zeng and Dun Zhang and Jiacheng Li and Panxiang Zou and Yuqing Yang},
year={2025},
eprint={2505.13950},
archivePrefix={arXiv},
primaryClass={cs.IR},
url={https://arxiv.org/abs/2505.13950},
}
```
Provider: maas
Created: 2025-07-03



