reasonir-data
Dataset Overview
Basic Information
- Language: English (en)
- License: CC-BY-NC-4.0
- Task category: text retrieval (text-retrieval)
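Dataset Configurations
Configuration 1: HQ (Hard-Query)
- Features:
  - query: sequence of strings
  - pos: sequence of sequences of strings
  - neg: sequence of sequences of strings
- Splits:
  - train:
    - Examples: 100,521
    - Size: 247,508,395 bytes
- Download size: 119,301,419 bytes
- Dataset size: 247,508,395 bytes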
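Configuration 2: VL (Varied-Length)
- Features:
  - query: sequence of strings
  - pos: sequence of sequences of strings
  - neg: sequence of sequences of strings
- Splits:
  - train:
    - Examples: 244,970
    - Size: 394,291,762 bytes
- Download size: 221,875,294 bytes
- Dataset size: 394,291,762 bytes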
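Both configurations list the same three features. As a minimal sketch of that layout, the check below verifies that a record has a `query` holding a list of strings and `pos`/`neg` holding lists of string lists. The helper names and the toy record are hypothetical and only illustrate the shape; they are not taken from the dataset.

```python
from typing import Any, Mapping


def is_string_list(value: Any) -> bool:
    """True if value is a list whose items are all strings."""
    return isinstance(value, list) and all(isinstance(v, str) for v in value)


def matches_schema(record: Mapping[str, Any]) -> bool:
    """Check a record against the query/pos/neg feature layout."""
    if not is_string_list(record.get("query")):
        return False
    for key in ("pos", "neg"):
        entries = record.get(key)
        if not isinstance(entries, list):
            return False
        if not all(is_string_list(entry) for entry in entries):
            return False
    return True


# Hypothetical toy record; the strings are placeholders, not dataset content.
toy_record = {
    "query": ["toy instruction", "toy question"],
    "pos": [["toy instruction", "toy relevant passage"]],
    "neg": [["toy instruction", "toy unrelated passage"]],
}

print(matches_schema(toy_record))
```

A check like this could be run on a few loaded examples before building a training pipeline on top of them.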
Related Resources
- Paper: https://arxiv.org/abs/2504.20595
- Code: https://github.com/facebookresearch/ReasonIR
- Model: https://huggingface.co/reasonir/ReasonIR-8B
Loading the Data
VL dataset
The VL configuration can be loaded directly:

    from datasets import load_dataset

    vl_dataset = load_dataset("reasonir/reasonir-data", "vl")
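Once a configuration is loaded, each record carries one query with several positive and negative entries. One possible way to consume such a record is to pair every positive with every negative, as sketched below. The function name `to_triplets` and the toy record are hypothetical, and this pairing is an illustration only, not necessarily how the ReasonIR authors build training batches.

```python
from itertools import product
from typing import Dict, List, Tuple

Record = Dict[str, list]
Triplet = Tuple[List[str], List[str], List[str]]


def to_triplets(record: Record) -> List[Triplet]:
    """Pair each positive entry with each negative entry for one query."""
    return [
        (record["query"], pos, neg)
        for pos, neg in product(record["pos"], record["neg"])
    ]


# Hypothetical toy record with one positive and two negatives.
toy_record = {
    "query": ["toy instruction", "toy question"],
    "pos": [["toy instruction", "relevant passage"]],
    "neg": [
        ["toy instruction", "unrelated passage A"],
        ["toy instruction", "unrelated passage B"],
    ],
}

triplets = to_triplets(toy_record)
print(len(triplets))  # 2
```

With a real dataset, the same function could be applied to an example such as `vl_dataset["train"][0]`, assuming its fields follow the feature layout listed in the configuration section.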
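HQ dataset
The original positive documents cannot be re-hosted, so the HQ configuration stores document IDs in place of the positive documents. To recover the text, load the HQ data together with the BRIGHT dataset and replace each ID with the matching BRIGHT document:

    from datasets import load_dataset

    def get_doc_and_ids(doc_pairs):
        # Collect document texts and their IDs (as strings) from one BRIGHT task.
        doc_ids = []
        documents = []
        for dp in doc_pairs:
            doc_ids.append(str(dp["id"]))
            documents.append(dp["content"])
        return documents, doc_ids

    def process_pos_id2doc(entry, id2doc):
        # Each positive is an [instruction, doc_id] pair; swap the ID for the text.
        pos_docs = entry["pos"]
        res = []
        for pos in pos_docs:
            instruction, doc_id = pos[0], pos[1]
            doc = id2doc[doc_id]
            res.append([instruction, doc])
        entry["pos"] = res
        return entry

    hq_dataset = load_dataset("reasonir/reasonir-data", "hq")
    bright_docs = load_dataset("xlangai/BRIGHT", "documents")

    # Gather documents and IDs across all BRIGHT tasks.
    all_docs = []
    all_ids = []
    for task in bright_docs.keys():
        docs, ids = get_doc_and_ids(bright_docs[task])
        all_docs.extend(docs)
        all_ids.extend(ids)

    # Build the ID-to-text lookup and apply it to every HQ example.
    id2doc = dict(zip(all_ids, all_docs))
    hq_dataset = hq_dataset.map(lambda x: process_pos_id2doc(x, id2doc))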
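The ID-to-document substitution above can be tried on toy data without downloading anything. The sketch below is a hypothetical variant, not the authors' code: `build_id2doc` and `resolve_positives` are illustrative names, the IDs and texts are made up, and unlike the snippet above it raises a descriptive error for an unknown ID and leaves the input entry unmodified.

```python
from typing import Dict, List


def build_id2doc(ids: List[str], docs: List[str]) -> Dict[str, str]:
    """Map each document ID to its text; a repeated ID keeps the last text."""
    if len(ids) != len(docs):
        raise ValueError("ids and docs must have the same length")
    return dict(zip(ids, docs))


def resolve_positives(entry: dict, id2doc: Dict[str, str]) -> dict:
    """Return a copy of entry whose positives hold document text, not IDs."""
    resolved = []
    for instruction, doc_id in entry["pos"]:
        if doc_id not in id2doc:
            raise KeyError(f"document ID {doc_id!r} not found in id2doc")
        resolved.append([instruction, id2doc[doc_id]])
    return {**entry, "pos": resolved}


# Hypothetical toy data standing in for BRIGHT documents and one HQ entry.
toy_id2doc = build_id2doc(["d1", "d2"], ["first document", "second document"])
toy_entry = {
    "query": ["", "toy question"],
    "pos": [["", "d2"]],
    "neg": [["", "unrelated text"]],
}

resolved_entry = resolve_positives(toy_entry, toy_id2doc)
print(resolved_entry["pos"])  # [['', 'second document']]
```

Failing loudly on a missing ID makes a mismatch between the HQ IDs and the loaded BRIGHT documents visible immediately instead of surfacing as a bare `KeyError` deep inside `map`.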




