NovaSearch/FineWeb-PosQ_raw
Source: Hugging Face. Updated: 2025-05-22. Indexed: 2025-07-05.
Download link:
https://hf-mirror.com/datasets/NovaSearch/FineWeb-PosQ_raw
Description:
FineWeb-PosQ is a synthetic question-answering dataset designed to evaluate position-sensitive retrieval, that is, how robust a retrieval model is to changes in where the query-relevant information appears within a passage. The dataset is built from passages sampled from FineWeb-edu, a large-scale, high-quality educational web corpus; 13,902 passages of 500 to 1,024 words were selected. For each passage, gpt-4o-mini was used to generate a global summary and multiple position-aware question-answer pairs. To support position-aware analysis, each passage is split into three equal-length parts: beginning, middle, and end. Each question-answer pair is labeled with the segment(s) corresponding to the answer’s source chunk.
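The three-way split and the segment labeling described above can be illustrated with a minimal sketch. It is not the dataset authors' code. It assumes that "equal-length" is measured in whitespace-separated words and that an answer is located by a word-index span; the function names `segment_bounds` and `position_labels` are hypothetical.

```python
# Illustrative sketch only: the unit of length (words) and the way the answer
# span is located are assumptions, not details confirmed by the dataset card.

def segment_bounds(num_words: int) -> dict[str, tuple[int, int]]:
    """Return half-open word-index ranges for three (near-)equal segments."""
    first_cut = num_words // 3
    second_cut = 2 * num_words // 3
    return {
        "beginning": (0, first_cut),
        "middle": (first_cut, second_cut),
        "end": (second_cut, num_words),
    }


def position_labels(passage: str, answer_start: int, answer_end: int) -> list[str]:
    """Label every segment that overlaps the word span [answer_start, answer_end)."""
    bounds = segment_bounds(len(passage.split()))
    return [
        name
        for name, (lo, hi) in bounds.items()
        if answer_start < hi and answer_end > lo  # half-open interval overlap
    ]


# A synthetic 600-word passage: segments cover words 0-199, 200-399, 400-599.
passage = " ".join(f"w{i}" for i in range(600))
labels_inside = position_labels(passage, 10, 20)      # lies within one segment
labels_across = position_labels(passage, 190, 210)    # straddles a boundary
```

An answer span that crosses a segment boundary receives two labels here, which is one possible reading of the "segment(s)" wording in the description.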
Provider:
NovaSearch



