PahaII/sft-corpus-qwen3-index

Name: PahaII/sft-corpus-qwen3-index
Creator: PahaII
Published: 2026-04-18 04:56:43
License: no description provided

Hugging Face | Updated 2026-04-18 | Listed 2026-04-26

Download link:

https://hf-mirror.com/datasets/PahaII/sft-corpus-qwen3-index

Description:

---
license: other
language:
- en
pretty_name: SFT Corpus — Qwen3-Embedding-8B FAISS Index
tags:
- retrieval
- faiss
- qwen3-embedding
size_categories:
- 1M<n<10M
---

# SFT Corpus — Qwen3-Embedding-8B FAISS Index

A dense retrieval index for a 7.37 M-document web corpus, embedded with [Qwen/Qwen3-Embedding-8B](https://huggingface.co/Qwen/Qwen3-Embedding-8B) (4096-dim, fp32, L2-normalized).

Designed to be served with [FAISS-GPU](https://github.com/facebookresearch/faiss) behind a small FastAPI orchestrator that pairs it with the Qwen3 embedding model for query-time embedding.

## Contents

| Path                           | Size    | What it is                                               |
|--------------------------------|---------|----------------------------------------------------------|
| `index/corpus.faiss.part-aa`   | ~40 GB  | FAISS flat index (fp32), part 1 of 3                     |
| `index/corpus.faiss.part-ab`   | ~40 GB  | FAISS flat index (fp32), part 2 of 3                     |
| `index/corpus.faiss.part-ac`   | ~33 GB  | FAISS flat index (fp32), part 3 of 3                     |
| `corpus/train-*.parquet`       | ~1.4 GB | 148 parquet files, columns `docid`, `url`, `text`, ...   |

`corpus.faiss` was split because Hugging Face's LFS storage limits a single file to 50 GB. Reassemble with:

```bash
cat corpus.faiss.part-aa corpus.faiss.part-ab corpus.faiss.part-ac > corpus.faiss
sha256sum corpus.faiss   # should match the sha256 in sha256.txt
```

Total reassembled size: ~113 GB.

## Download

```bash
pip install huggingface_hub hf_transfer
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download \
    --repo-type dataset PahaII/sft-corpus-qwen3-index \
    --local-dir ./sft-corpus-qwen3-index
```

## Serving

The 113 GB fp32 index does **not** fit on a single A100-80GB; it must be sharded across multiple GPUs with `faiss.index_cpu_to_gpu_multiple_py(..., co.shard=True)`.
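
The sizing claim above can be checked with a rough estimate. The sketch below assumes a flat index stores one uncompressed value per dimension per document and ignores the small file header; `NUM_DOCS` is the rounded 7.37 M figure from this card, and the helper name `flat_index_bytes` is made up for illustration.

```python
# Back-of-the-envelope memory estimate for a flat (uncompressed) vector index.
# NUM_DOCS is the rounded figure from the card, so all results are approximate.

NUM_DOCS = 7_370_000   # documents in the corpus (rounded)
DIM = 4096             # embedding dimension stated in the card
GIB = 1024 ** 3        # bytes per GiB


def flat_index_bytes(num_vectors: int, dim: int, bytes_per_value: int) -> int:
    """Raw vector storage of a flat index; header overhead is ignored."""
    return num_vectors * dim * bytes_per_value


fp32_bytes = flat_index_bytes(NUM_DOCS, DIM, 4)  # 4 bytes per fp32 value
fp16_bytes = flat_index_bytes(NUM_DOCS, DIM, 2)  # 2 bytes per fp16 value

print(f"fp32 total:            {fp32_bytes / GIB:.1f} GiB")
print(f"fp32 per GPU (4-way):  {fp32_bytes / 4 / GIB:.1f} GiB")
print(f"fp16 total:            {fp16_bytes / GIB:.1f} GiB")
```

Under these assumptions the fp32 vectors come to roughly 112.5 GiB, which is consistent with the ~113 GB total if that figure is read in binary units. A 4-way shard needs about 28 GiB per GPU, and an fp16 copy about 56 GiB, in each case before any scratch memory FAISS-GPU reserves on the device.
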
A ready-to-go 4-GPU launcher lives in the source repo:

- FAISS server with `--gpu_ids 0,1,2,3`: `scripts/faiss_gpu_server.py`
- Qwen3-Embedding-8B vLLM on 4 more GPUs (port 8091)
- Orchestrator that stitches both (port 8090)

See `docs/04172026_search_service_launch.md` in the companion code repo for the full setup.

## Minimum hardware to serve

- 4× A100-80GB (sharded FAISS) or 1× H100-96GB (may fit at fp16 with `co.useFloat16=True`, untested)
- \+ 4× A100-80GB for the Qwen3-Embedding-8B vLLM (TP=4, bf16)

## Corpus provenance

Web documents covering books/media/reference domains. Exact license of the underlying text varies per source URL; use appropriately for research.
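
The reassembly step under Contents relies on `cat` and `sha256sum`. Where those tools are unavailable, the same step could be sketched in Python as below. The function name `reassemble`, the chunk size, and the directory in the usage comment are illustrative assumptions, not part of the dataset or its companion repo. The layout of `sha256.txt` is not described in this card, so the digest is returned for manual comparison rather than parsed against that file.

```python
import hashlib
from pathlib import Path

CHUNK = 16 * 1024 * 1024  # stream in 16 MiB blocks so memory use stays small


def reassemble(parts: list[Path], out_path: Path) -> str:
    """Concatenate parts in the given order and return the SHA-256 hex digest."""
    digest = hashlib.sha256()
    with out_path.open("wb") as out:
        for part in parts:
            with part.open("rb") as src:
                while block := src.read(CHUNK):
                    out.write(block)      # append this block to the output file
                    digest.update(block)  # hash the same bytes in the same pass
    return digest.hexdigest()


# Usage (the path is an assumed download location):
#   index_dir = Path("sft-corpus-qwen3-index/index")
#   parts = sorted(index_dir.glob("corpus.faiss.part-*"))  # aa, ab, ac in order
#   print(reassemble(parts, index_dir / "corpus.faiss"))
```

Hashing while writing avoids a second pass over the ~113 GB output. Sorting the part names lexicographically gives the `aa`, `ab`, `ac` order that the `cat` command uses.
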


Provider: PahaII
