FineWeb-PosQ

Source: ModelScope community (updated 2025-06-27, indexed 2025-05-24)
Download link: https://modelscope.cn/datasets/NovaSearch/FineWeb-PosQ
# Dataset Card for FineWeb-PosQ

- **Paper:** [Benchmarking the Myopic Trap: Positional Bias in Information Retrieval](https://arxiv.org/abs/2505.13950)
- **Repository:** https://github.com/NovaSearch-Team/RAG-Retrieval/tree/master/examples/MyopicTrap
- **License:** ODC-BY
- **Languages:** English

## Dataset Summary

**FineWeb-PosQ** is a synthetic QA dataset designed to evaluate **position-sensitive retrieval**, a task that assesses a retrieval model's robustness to variations in the position of query-relevant information within a passage.

It is constructed using passages sampled from **FineWeb-edu**, a large-scale, high-quality educational web corpus. We selected 13,902 passages ranging from 500 to 1,024 words in length.

For each passage, we use `gpt-4o-mini` to generate:

* A **global summary** of the entire passage.
* Multiple **position-aware question–answer pairs**, grounded in localized chunks of the passage.

To facilitate position-aware analysis, each passage is segmented into three equal-length parts: **beginning**, **middle**, and **end**. Each question–answer pair is labeled with the segment(s) corresponding to the answer's source chunk. If a chunk spans multiple segments, multiple labels are applied to reflect ambiguity.

## Dataset Structure

### Data Fields

* `question` (`string`): A position-aware question generated based on a localized chunk of the passage.
* `content` (`string`): The full text of the passage.
* `content_summary` (`string`): A globally generated summary of the passage by a large language model.
* `answer` (`string`): The answer extracted from a specific chunk of the passage.
* `question_level` (`string`): The difficulty level of the question (e.g., simple, complicated).
* `span` (`sequence[int32]`): The start and end character positions of the answer chunk within the passage.
* `span_class` (`sequence[string]`): One or more positional tags indicating where the answer chunk is located in the passage (e.g., beginning, middle, end).
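
The relationship between `span` and `span_class` can be illustrated with a minimal sketch. This is not the authors' labeling code. It assumes that the three segments are equal thirds of the passage measured in characters, that `span` is a half-open `[start, end)` character range, and that the tags are the literal strings `beginning`, `middle`, and `end`. The function name `span_classes` is hypothetical.

```python
from typing import List, Sequence

# Assumed tag names, taken from the examples given for `span_class`.
SEGMENTS = ("beginning", "middle", "end")


def span_classes(span: Sequence[int], content_length: int) -> List[str]:
    """Return every positional tag whose third of the passage overlaps the span.

    Assumptions (not confirmed by the dataset card): thirds are measured in
    characters and `span` is a half-open [start, end) character range.
    """
    start, end = span
    if not (0 <= start < end <= content_length):
        raise ValueError("span must satisfy 0 <= start < end <= content_length")

    labels = []
    for i, name in enumerate(SEGMENTS):
        seg_start = i * content_length / 3
        seg_end = (i + 1) * content_length / 3
        # Two half-open intervals overlap when each starts before the other ends.
        if start < seg_end and end > seg_start:
            labels.append(name)
    return labels


if __name__ == "__main__":
    # A chunk that straddles the first boundary of a 900-character passage
    # receives two tags, mirroring the multi-label rule described above.
    print(span_classes([250, 350], 900))
```

Under these assumptions, a chunk that crosses a segment boundary receives more than one tag, which matches the card's statement that multiple labels reflect positional ambiguity. The tags stored in the dataset may have been computed differently, for example over words or tokens.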
### Data Splits

| Split | Examples |
| ----- | -------- |
| train | 265,865 |

## Citation

If you use this dataset in your research, please cite the associated paper:

```bibtex
@misc{zeng2025myopictrap,
  title={Benchmarking the Myopic Trap: Positional Bias in Information Retrieval},
  author={Ziyang Zeng and Dun Zhang and Jiacheng Li and Panxiang Zou and Yuqing Yang},
  year={2025},
  eprint={2505.13950},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2505.13950},
}
```

Provider: maas
Created: 2025-05-22