OpenResearcher/OpenResearcher-Indexes

Name: OpenResearcher/OpenResearcher-Indexes
Creator: OpenResearcher
Published: 2026-03-24 12:43:01
License: No description available

Source: Hugging Face. Updated 2026-03-24; indexed 2026-04-05.

Download link:

https://hf-mirror.com/datasets/OpenResearcher/OpenResearcher-Indexes

Description:

<div style="display: flex; align-items: center; justify-content: center; gap: 8px;">
  <img src="imgs/or-logo1.png" style="height: 84px; width: auto;">
  <img src="imgs/openresearcher-title.svg" style="height: 84px; width: auto;">
</div>

<div align="center">
  <a href="https://arxiv.org/abs/2603.20278"><img src="https://img.shields.io/badge/arXiv-B31B1B?style=for-the-badge&logo=arXiv&logoColor=white" alt="arXiv"></a>
  <a href="https://huggingface.co/papers/2603.20278"><img src="https://img.shields.io/badge/Paper-FFD966?style=for-the-badge&logo=huggingface&logoColor=ffffff" alt="Paper"></a>
  <a href="https://github.com/TIGER-AI-Lab/OpenResearcher"><img src="https://img.shields.io/badge/Github-181717?style=for-the-badge&logo=github&logoColor=white" alt="GitHub"></a>
  <a href="https://huggingface.co/datasets/OpenResearcher/OpenResearcher-Dataset"><img src="https://img.shields.io/badge/Dataset-FFB7B2?style=for-the-badge&logo=huggingface&logoColor=ffffff" alt="Dataset"></a>
  <a href="https://huggingface.co/OpenResearcher/Nemotron-3-Nano-30B-A3B"><img src="https://img.shields.io/badge/Model-FFD966?style=for-the-badge&logo=huggingface&logoColor=ffffff" alt="Model"></a>
  <a href="https://huggingface.co/spaces/OpenResearcher/OpenResearcher"><img src="https://img.shields.io/badge/Demo-F97316.svg?style=for-the-badge&logo=gradio&logoColor=white" alt="Demo"></a>
  <a href="https://huggingface.co/datasets/OpenResearcher/OpenResearcher-Eval-Logs/tree/main"><img src="https://img.shields.io/badge/Eval%20Logs-755BB4?style=for-the-badge&logo=google-sheets&logoColor=white" alt="Eval Logs"></a>
</div>

<div align="center" style="padding: 10px 0 -4px; display: flex; align-items: center; justify-content: center; gap: 16px;">
  <div style="width: 60px; height: 2px; background: linear-gradient(90deg, transparent, #E24B4A);"></div>
  <span style="font-size: 22px; font-weight: 600; color: #E24B4A;">Adopted by NVIDIA's Nemotron family of models!</span>
  <div style="width: 60px; height: 2px; background: linear-gradient(90deg, #E24B4A, transparent);"></div>
</div>

<p align="center">
  🤗 <a href="https://huggingface.co/collections/TIGER-Lab/openresearcher" target="_blank">HuggingFace</a> |
  <img src="imgs/slack.png" width="14px" style="display:inline;"> <a href="https://join.slack.com/t/openresearcher/shared_invite/zt-3p0r32cky-PqtZkVjjWIAI14~XwcRMfQ" target="_blank">Slack</a> |
  <img src="imgs/wechat.svg" width="14px" style="display:inline;"> <a href="https://github.com/TIGER-AI-Lab/OpenResearcher/blob/main/assets/imgs/wechat_group.jpg" target="_blank">WeChat</a>
</p>

## OpenResearcher Indexes

This dataset provides embeddings of the [OpenResearcher corpus](https://huggingface.co/datasets/OpenResearcher/OpenResearcher-Corpus), generated with [Qwen3-Embedding-8B](https://huggingface.co/Qwen/Qwen3-Embedding-8B), for building an offline search engine.

## Format

This dataset contains pre-computed embedding indexes stored as pickle files. Each `.pkl` file contains a tuple of:

+ **embeddings** (numpy.ndarray): Dense vector representations of documents, shape `(n_docs, embedding_dim)`. Generated using [Qwen3-Embedding-8B](https://huggingface.co/Qwen/Qwen3-Embedding-8B).
+ **lookup** (list): A list of docids, one per embedding vector and in the same order, used to retrieve the original document from the [corpus](https://huggingface.co/datasets/OpenResearcher/OpenResearcher-Corpus).

## How to use this dataset?

You can use this dataset together with its [corpus](https://huggingface.co/datasets/OpenResearcher/OpenResearcher-Corpus) to build an offline search engine. Below is pseudocode for **demonstration only** (for production use, consider [Faiss-GPU](https://github.com/facebookresearch/faiss/wiki/Faiss-on-the-GPU)).
```bash
# Download the index first
huggingface-cli download OpenResearcher/OpenResearcher-Corpus --repo-type=dataset --include="qwen3-embedding-8b/*" --local-dir ./indexes
```

```python
import glob
import pickle

import faiss
import numpy as np
from datasets import load_dataset
from sentence_transformers import SentenceTransformer

# 1. Load corpus and map each docid to its document
corpus = load_dataset("OpenResearcher/OpenResearcher-Corpus", split="train")
docid_to_doc = {str(doc["docid"]): doc for doc in corpus}

# 2. Load all embedding shards from OpenResearcher-Indexes
index_files = sorted(glob.glob("path/to/indexes/*.pkl"))
all_embeddings = []
all_lookup = []

for file_path in index_files:
    with open(file_path, "rb") as f:
        embeddings, lookup = pickle.load(f)
    all_embeddings.append(embeddings)
    all_lookup.extend(lookup)

all_embeddings = np.vstack(all_embeddings).astype(np.float32)
faiss.normalize_L2(all_embeddings)  # Normalize for cosine similarity

# 3. Build FAISS index (inner product over normalized vectors = cosine similarity)
index = faiss.IndexFlatIP(all_embeddings.shape[1])
index.add(all_embeddings)

# 4. Load model and encode query
model = SentenceTransformer("Qwen/Qwen3-Embedding-8B")
query = "What is machine learning?"
# Normalize the query as well so scores are cosine similarities
query_embedding = model.encode([query], prompt_name="query", normalize_embeddings=True)
query_embedding = np.asarray(query_embedding, dtype=np.float32)

# 5. Search in FAISS
scores, indices = index.search(query_embedding, k=5)

# 6. Print results
for idx, score in zip(indices[0], scores[0]):
    docid = str(all_lookup[idx])
    doc = docid_to_doc.get(docid)
    if doc:
        print(f"Score: {score:.4f}")
        print(f"URL: {doc['url']}")
        print(f"Text: {doc['text'][:200]}...\n")
```

## Citation

```bibtex
@article{li2026openresearcher,
  title={{OpenResearcher: A Fully Open Pipeline for Long-Horizon Deep Research Trajectory Synthesis}},
  author={Li, Zhuofeng and Jiang, Dongfu and Ma, Xueguang and Zhang, Haoxiang and Nie, Ping and Zhang, Yuyu and Zou, Kai and Xie, Jianwen and Zhang, Yu and Chen, Wenhu},
  journal={arXiv preprint arXiv:2603.20278},
  year={2026}
}
```
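
## Inspecting a single shard (sketch)

The Format section describes each `.pkl` file as an `(embeddings, lookup)` tuple. The sketch below is not part of the original pipeline; it writes a tiny synthetic shard in that layout, loads it back with a shape check, and runs a brute-force cosine search in NumPy to show what an inner-product index over normalized vectors computes. The helper names `load_shard` and `cosine_top_k`, the docids, and the 8-dimensional toy vectors are all assumptions made for illustration. Since unpickling can execute arbitrary code, real shards should only be loaded from a source you trust.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np


def load_shard(path):
    """Load one (embeddings, lookup) tuple and check that the two parts line up."""
    with open(path, "rb") as f:
        embeddings, lookup = pickle.load(f)
    embeddings = np.asarray(embeddings, dtype=np.float32)
    if embeddings.ndim != 2 or embeddings.shape[0] != len(lookup):
        raise ValueError(f"{path}: {embeddings.shape} embeddings vs {len(lookup)} docids")
    return embeddings, list(lookup)


def cosine_top_k(embeddings, lookup, query, k=3):
    """Brute-force cosine search: normalize both sides, then rank by inner product."""
    docs = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    scores = docs @ q
    top = np.argsort(-scores)[:k]
    return [(lookup[i], float(scores[i])) for i in top]


# Synthetic stand-in for a real shard: 4 documents, 8 dimensions (hypothetical values).
rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(4, 8)).astype(np.float32)
toy_lookup = ["doc-a", "doc-b", "doc-c", "doc-d"]

with tempfile.TemporaryDirectory() as tmp:
    shard_path = Path(tmp) / "toy_shard.pkl"
    with open(shard_path, "wb") as f:
        pickle.dump((toy_embeddings, toy_lookup), f)
    embeddings, lookup = load_shard(shard_path)

# Querying with a document's own vector should rank that document first.
results = cosine_top_k(embeddings, lookup, embeddings[2], k=2)
print(results)
```

To try this on a downloaded shard, pass its path to `load_shard` and check that `embeddings.shape[1]` matches the dimension of the query vectors you plan to search with.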

Provided by:

OpenResearcher
