OpenResearcher/OpenResearcher-Indexes

Hugging Face | Updated 2026-03-24 | Indexed 2026-04-05
Download link:
https://hf-mirror.com/datasets/OpenResearcher/OpenResearcher-Indexes
Resource description:
<div style="display: flex; align-items: center; justify-content: center; gap: 8px;">
  <img src="imgs/or-logo1.png" style="height: 84px; width: auto;">
  <img src="imgs/openresearcher-title.svg" style="height: 84px; width: auto;">
</div>

<div align="center">
  <a href="https://arxiv.org/abs/2603.20278"><img src="https://img.shields.io/badge/arXiv-B31B1B?style=for-the-badge&logo=arXiv&logoColor=white" alt="Blog"></a>
  <a href="https://huggingface.co/papers/2603.20278"><img src="https://img.shields.io/badge/Paper-FFD966?style=for-the-badge&logo=huggingface&logoColor=ffffff" alt="Model"></a>
  <!-- <a href="https://huggingface.co/papers/2603.20278"><img src="https://img.shields.io/badge/arXiv-B31B1B?style=for-the-badge&logo=arXiv&logoColor=white" alt="Blog"></a> -->
  <!-- <a href="https://x.com/DongfuJiang/status/2020946549422031040"><img src="https://img.shields.io/badge/Twitter-000000?style=for-the-badge&logo=X&logoColor=white" alt="Blog"></a> -->
  <!-- <a href="https://boiled-honeycup-4c7.notion.site/OpenResearcher-A-Fully-Open-Pipeline-for-Long-Horizon-Deep-Research-Trajectory-Synthesis-2f7e290627b5800cb3a0cd7e8d6ec0ea?source=copy_link"><img src="https://img.shields.io/badge/Blog-4285F4?style=for-the-badge&logo=google-chrome&logoColor=white" alt="Blog"></a> -->
  <a href="https://github.com/TIGER-AI-Lab/OpenResearcher"><img src="https://img.shields.io/badge/Github-181717?style=for-the-badge&logo=github&logoColor=white" alt="Blog"></a>
  <a href="https://huggingface.co/datasets/OpenResearcher/OpenResearcher-Dataset"><img src="https://img.shields.io/badge/Dataset-FFB7B2?style=for-the-badge&logo=huggingface&logoColor=ffffff" alt="Dataset"></a>
  <a href="https://huggingface.co/OpenResearcher/Nemotron-3-Nano-30B-A3B"><img src="https://img.shields.io/badge/Model-FFD966?style=for-the-badge&logo=huggingface&logoColor=ffffff" alt="Model"></a>
  <a href="https://huggingface.co/spaces/OpenResearcher/OpenResearcher"><img src="https://img.shields.io/badge/Demo-F97316.svg?style=for-the-badge&logo=gradio&logoColor=white" alt="Demo"></a>
  <!-- <a href="https://wandb.ai/dongfu/nano-v3-sft-search"><img src="https://img.shields.io/badge/WandB%20Logs-48B5A3?style=for-the-badge&logo=weightsandbiases&logoColor=white" alt="WandB Logs"></a> -->
  <a href="https://huggingface.co/datasets/OpenResearcher/OpenResearcher-Eval-Logs/tree/main"><img src="https://img.shields.io/badge/Eval%20Logs-755BB4?style=for-the-badge&logo=google-sheets&logoColor=white" alt="Eval Logs"></a>
</div>

<div align="center" style="padding: 10px 0 -4px; display: flex; align-items: center; justify-content: center; gap: 16px;">
  <div style="width: 60px; height: 2px; background: linear-gradient(90deg, transparent, #E24B4A);"></div>
  <span style="font-size: 22px; font-weight: 600; color: #E24B4A;">Adopted by NVIDIA's Nemotron family of models!</span>
  <div style="width: 60px; height: 2px; background: linear-gradient(90deg, #E24B4A, transparent);"></div>
</div>

<p align="center">
  🤗 <a href="https://huggingface.co/collections/TIGER-Lab/openresearcher" target="_blank">HuggingFace</a> |
  <img src="imgs/slack.png" width="14px" style="display:inline;"> <a href="https://join.slack.com/t/openresearcher/shared_invite/zt-3p0r32cky-PqtZkVjjWIAI14~XwcRMfQ" target="_blank">Slack</a> |
  <img src="imgs/wechat.svg" width="14px" style="display:inline;"> <a href="https://github.com/TIGER-AI-Lab/OpenResearcher/blob/main/assets/imgs/wechat_group.jpg" target="_blank">WeChat</a>
</p>

## OpenResearcher Indexes

This dataset provides [OpenResearcher corpus](https://huggingface.co/datasets/OpenResearcher/OpenResearcher-Corpus) embeddings generated from [Qwen3-Embedding-8B](https://huggingface.co/Qwen/Qwen3-Embedding-8B) for building an offline search engine.

## Format

This dataset contains pre-computed embedding indexes stored as pickle files.
Each `.pkl` file contains a tuple of:

+ **embeddings** (numpy.ndarray): Dense vector representations of documents, shape `(n_docs, embedding_dim)`. Generated using [Qwen3-Embedding-8B](https://huggingface.co/Qwen/Qwen3-Embedding-8B).
+ **lookup** (list): A list of docids, one per embedding vector and in the same order, used to retrieve the original document from the [corpus](https://huggingface.co/datasets/OpenResearcher/OpenResearcher-Corpus).

## How to use this dataset?

You can use this dataset together with its [corpus](https://huggingface.co/datasets/OpenResearcher/OpenResearcher-Corpus) to build an offline search engine. The code below is for **demonstration only** (for production use, consider [Faiss-GPU](https://github.com/facebookresearch/faiss/wiki/Faiss-on-the-GPU)).

```bash
# Download the index files first
huggingface-cli download OpenResearcher/OpenResearcher-Indexes --repo-type=dataset --include="qwen3-embedding-8b/*" --local-dir ./indexes
```

```python
import glob
import pickle

import faiss
import numpy as np
from datasets import load_dataset
from sentence_transformers import SentenceTransformer

# 1. Load the corpus and map each docid to its document
corpus = load_dataset("OpenResearcher/OpenResearcher-Corpus", split="train")
docid_to_doc = {str(doc["docid"]): doc for doc in corpus}

# 2. Load all embedding shards from OpenResearcher-Indexes
#    (point the glob at wherever the .pkl files were downloaded;
#    only unpickle files from a source you trust)
index_files = sorted(glob.glob("path/to/indexes/*.pkl"))
all_embeddings = []
all_lookup = []
for file_path in index_files:
    with open(file_path, "rb") as f:
        embeddings, lookup = pickle.load(f)
    all_embeddings.append(embeddings)
    all_lookup.extend(lookup)

all_embeddings = np.vstack(all_embeddings).astype(np.float32)
faiss.normalize_L2(all_embeddings)  # Normalize so inner product equals cosine similarity

# 3. Build FAISS index (exact inner-product search)
index = faiss.IndexFlatIP(all_embeddings.shape[1])
index.add(all_embeddings)

# 4. Load model and encode query (normalized float32, to match the index)
model = SentenceTransformer("Qwen/Qwen3-Embedding-8B")
query = "What is machine learning?"
query_embedding = model.encode([query], prompt_name="query", normalize_embeddings=True)
query_embedding = np.asarray(query_embedding, dtype=np.float32)

# 5. Search in FAISS
scores, indices = index.search(query_embedding, 5)

# 6. Print results
for idx, score in zip(indices[0], scores[0]):
    if idx < 0:  # FAISS pads with -1 when fewer than k results exist
        continue
    docid = str(all_lookup[idx])
    doc = docid_to_doc.get(docid)
    if doc:
        print(f"Score: {score:.4f}")
        print(f"URL: {doc['url']}")
        print(f"Text: {doc['text'][:200]}...\n")
```

## Citation

```bibtex
@article{li2026openresearcher,
  title={{OpenResearcher: A Fully Open Pipeline for Long-Horizon Deep Research Trajectory Synthesis}},
  author={Li, Zhuofeng and Jiang, Dongfu and Ma, Xueguang and Zhang, Haoxiang and Nie, Ping and Zhang, Yuyu and Zou, Kai and Xie, Jianwen and Zhang, Yu and Chen, Wenhu},
  journal={arXiv preprint arXiv:2603.20278},
  year={2026}
}
```
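
As a complement to the FAISS example, the shard layout described under Format (an `embeddings` array paired with a `lookup` list of docids) can be checked and searched with NumPy alone. The following is a minimal sketch, not part of the OpenResearcher codebase: the helper names `load_shards` and `cosine_top_k` are hypothetical, and it assumes every shard is a pickled `(embeddings, lookup)` tuple as stated above. Brute-force scoring like this is only practical for small subsets of the index.

```python
import pickle
from pathlib import Path

import numpy as np


def load_shards(shard_dir):
    """Load every *.pkl shard in shard_dir and stack them into one matrix.

    Hypothetical helper: assumes each file holds an (embeddings, lookup) tuple.
    """
    matrices, docids = [], []
    for path in sorted(Path(shard_dir).glob("*.pkl")):
        with open(path, "rb") as f:
            embeddings, lookup = pickle.load(f)  # only unpickle trusted files
        embeddings = np.asarray(embeddings, dtype=np.float32)
        # Each row must have exactly one docid, otherwise results are mislabeled.
        if embeddings.ndim != 2 or embeddings.shape[0] != len(lookup):
            raise ValueError(f"{path.name}: embeddings and lookup are misaligned")
        matrices.append(embeddings)
        docids.extend(str(d) for d in lookup)
    if not matrices:
        raise FileNotFoundError(f"no .pkl shards found in {shard_dir}")
    return np.vstack(matrices), docids


def cosine_top_k(embeddings, docids, query_vec, k=5):
    """Return the k (docid, score) pairs with the highest cosine similarity."""
    query_vec = np.asarray(query_vec, dtype=np.float32).ravel()
    doc_norms = np.linalg.norm(embeddings, axis=1)
    # Guard against division by zero for all-zero vectors.
    denom = np.maximum(doc_norms * np.linalg.norm(query_vec), 1e-12)
    scores = (embeddings @ query_vec) / denom
    top = np.argsort(-scores)[:k]
    return [(docids[i], float(scores[i])) for i in top]
```

A typical call would be `embeddings, docids = load_shards(shard_dir)` followed by `cosine_top_k(embeddings, docids, query_vec)`, where `shard_dir` is wherever the `.pkl` files were downloaded and `query_vec` comes from the same Qwen3-Embedding-8B model used to build the index. The alignment check is the main point of the sketch: a shard whose row count differs from its `lookup` length would silently return the wrong documents.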
Provider:
OpenResearcher