
carsondial/qwen-8b-embed

Hugging Face · Updated 2026-04-21 · Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/carsondial/qwen-8b-embed
Description:
---
language:
- en
license: mit
size_categories:
- 10M<n<100M
task_categories:
- sentence-similarity
- feature-extraction
tags:
- embeddings
- distillation
- qwen3
- retrieval
- pynife
- leaf
configs:
- config_name: english-words-definitions
  data_files: english-words-definitions/*.parquet
- config_name: fineweb
  data_files: fineweb/*.parquet
- config_name: gooaq
  data_files: gooaq/*.parquet
- config_name: miracl
  data_files: miracl/*.parquet
- config_name: lotte
  data_files: lotte/*.parquet
- config_name: snli
  data_files: snli/*.parquet
- config_name: paws
  data_files: paws/*.parquet
- config_name: squad
  data_files: squad/*.parquet
- config_name: mldr
  data_files: mldr/*.parquet
- config_name: msmarco
  data_files: msmarco/*.parquet
- config_name: msmarco_docs
  data_files: msmarco_docs/*.parquet
- config_name: PubMedQA
  data_files: PubMedQA/*.parquet
- config_name: swim-ir-monolingual
  data_files: swim-ir-monolingual/*.parquet
- config_name: trivia_qa
  data_files: trivia_qa/*.parquet
- config_name: mr-tydi
  data_files: mr-tydi/*.parquet
dataset_info:
- config_name: english-words-definitions
  features:
  - name: text
    dtype: string
  - name: embedding
    sequence:
      dtype: float32
      length: 1024
  - name: role
    dtype: string
  splits:
  - name: train
    num_examples: 466357
- config_name: fineweb
  features:
  - name: text
    dtype: string
  - name: embedding
    sequence:
      dtype: float32
      length: 1024
  - name: role
    dtype: string
  splits:
  - name: train
    num_examples: 2100000
- config_name: gooaq
  features:
  - name: text
    dtype: string
  - name: embedding
    sequence:
      dtype: float32
      length: 1024
  - name: role
    dtype: string
  splits:
  - name: train
    num_examples: 3012496
- config_name: miracl
  features:
  - name: text
    dtype: string
  - name: embedding
    sequence:
      dtype: float32
      length: 1024
  - name: role
    dtype: string
  splits:
  - name: train
    num_examples: 2863
- config_name: lotte
  features:
  - name: text
    dtype: string
  - name: embedding
    sequence:
      dtype: float32
      length: 1024
  - name: role
    dtype: string
  splits:
  - name: train
    num_examples: 13028
- config_name: snli
  features:
  - name: text
    dtype: string
  - name: embedding
    sequence:
      dtype: float32
      length: 1024
  - name: role
    dtype: string
  splits:
  - name: train
    num_examples: 629334
- config_name: paws
  features:
  - name: text
    dtype: string
  - name: embedding
    sequence:
      dtype: float32
      length: 1024
  - name: role
    dtype: string
  splits:
  - name: train
    num_examples: 1291304
- config_name: squad
  features:
  - name: text
    dtype: string
  - name: embedding
    sequence:
      dtype: float32
      length: 1024
  - name: role
    dtype: string
  splits:
  - name: train
    num_examples: 87599
- config_name: mldr
  features:
  - name: text
    dtype: string
  - name: embedding
    sequence:
      dtype: float32
      length: 1024
  - name: role
    dtype: string
  splits:
  - name: train
    num_examples: 10000
- config_name: msmarco
  features:
  - name: text
    dtype: string
  - name: embedding
    sequence:
      dtype: float32
      length: 1024
  - name: role
    dtype: string
  splits:
  - name: train
    num_examples: 1010916
- config_name: msmarco_docs
  features:
  - name: text
    dtype: string
  - name: embedding
    sequence:
      dtype: float32
      length: 1024
  - name: role
    dtype: string
  splits:
  - name: train
    num_examples: 2000000
- config_name: PubMedQA
  features:
  - name: text
    dtype: string
  - name: embedding
    sequence:
      dtype: float32
      length: 1024
  - name: role
    dtype: string
  splits:
  - name: train
    num_examples: 272518
- config_name: swim-ir-monolingual
  features:
  - name: text
    dtype: string
  - name: embedding
    sequence:
      dtype: float32
      length: 1024
  - name: role
    dtype: string
  splits:
  - name: train
    num_examples: 501371
- config_name: trivia_qa
  features:
  - name: text
    dtype: string
  - name: embedding
    sequence:
      dtype: float32
      length: 1024
  - name: role
    dtype: string
  splits:
  - name: train
    num_examples: 87622
- config_name: mr-tydi
  features:
  - name: text
    dtype: string
  - name: embedding
    sequence:
      dtype: float32
      length: 1024
  - name: role
    dtype: string
  splits:
  - name: train
    num_examples: 3547
---

# Qwen3-Embedding-8B @ 1024d: full PyNIFE distillation corpus

Pre-computed teacher embeddings across **15 source datasets** spanning documents, queries, and symmetric sentence pairs. Designed to replicate the full PyNIFE / LEAF two-stage training recipe against `Qwen/Qwen3-Embedding-8B` as the teacher.

## Headline numbers

- **Total rows**: 11,488,955
- **Total tokens embedded**: 1469.8M
- **Total cost**: ~$146.98 on Fireworks serverless at $0.10/1M tokens
- **Embedding dim**: 1024 (MRL-native; truncate to 256/512 as needed downstream)
- **Max input tokens**: 2048 (client-side truncation via Qwen3 tokenizer)
- **Normalization**: L2-normalized unit vectors

## Schema (all configs identical)

| column    | type                                        |
|-----------|---------------------------------------------|
| text      | string (possibly truncated to ≤2048 tokens) |
| embedding | float32[1024], unit-norm                    |
| role      | "doc" \| "query" \| "symmetric"             |

## Per-source breakdown

| config | upstream | role | rows | tokens |
|---|---|---|---|---|
| `english-words-definitions` | [MongoDB/english-words-definitions](https://huggingface.co/datasets/MongoDB/english-words-definitions) | doc | 466,357 | 15.6M |
| `fineweb` | [HuggingFaceFW/fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) | doc | 2,100,000 | 1204.8M |
| `gooaq` | [sentence-transformers/gooaq](https://huggingface.co/datasets/sentence-transformers/gooaq) | query | 3,012,496 | 29.8M |
| `miracl` | [sentence-transformers/miracl](https://huggingface.co/datasets/sentence-transformers/miracl) | query | 2,863 | 0.0M |
| `lotte` | [mteb/lotte](https://huggingface.co/datasets/mteb/lotte) | query | 13,028 | 0.2M |
| `snli` | [stanfordnlp/snli](https://huggingface.co/datasets/stanfordnlp/snli) | symmetric | 629,334 | 6.4M |
| `paws` | [google-research-datasets/paws](https://huggingface.co/datasets/google-research-datasets/paws) | symmetric | 1,291,304 | 34.5M |
| `squad` | [sentence-transformers/squad](https://huggingface.co/datasets/sentence-transformers/squad) | query | 87,599 | 1.1M |
| `mldr` | [sentence-transformers/mldr](https://huggingface.co/datasets/sentence-transformers/mldr) | doc | 10,000 | 0.1M |
| `msmarco` | [sentence-transformers/msmarco-corpus](https://huggingface.co/datasets/sentence-transformers/msmarco-corpus) | query | 1,010,916 | 7.6M |
| `msmarco_docs` | [sentence-transformers/msmarco-corpus](https://huggingface.co/datasets/sentence-transformers/msmarco-corpus) | doc | 2,000,000 | 154.9M |
| `PubMedQA` | [qiaojin/PubMedQA](https://huggingface.co/datasets/qiaojin/PubMedQA) | query | 272,518 | 6.2M |
| `swim-ir-monolingual` | [nthakur/swim-ir-monolingual](https://huggingface.co/datasets/nthakur/swim-ir-monolingual) | query | 501,371 | 6.9M |
| `trivia_qa` | [mandarjoshi/trivia_qa](https://huggingface.co/datasets/mandarjoshi/trivia_qa) | query | 87,622 | 1.7M |
| `mr-tydi` | [sentence-transformers/mr-tydi](https://huggingface.co/datasets/sentence-transformers/mr-tydi) | query | 3,547 | 0.0M |

## Two-stage training recipe (following LEAF / PyNIFE)

Interleaving docs and queries during distillation does **not** work well (see Tulkens' PyNIFE README). The recommended recipe is:

1. **Pretrain** on doc-like sources: concatenate the configs where `role == "doc"` (`msmarco_docs`, `mldr`, `fineweb`, `english-words-definitions`).
2. **Finetune** with a lower learning rate on query-like sources: concatenate the configs where `role == "query"` (`msmarco`, `gooaq`, `squad`, `swim-ir-monolingual`, `trivia_qa`, `PubMedQA`, `miracl`, `mr-tydi`, `lotte`).

The `symmetric` sources (`snli`, `paws`) are sentence-pair corpora useful for STS-style alignment; use at your discretion.

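
The role grouping behind the two stages can be tabulated from the per-source breakdown to see how many rows each stage trains on. This is a minimal sketch and not part of `build_corpus.py`: the `SOURCES` list is copied by hand from the table, and the helper names `rows_by_role` and `configs_for` are hypothetical.

```python
from collections import defaultdict

# (config, role, rows), copied by hand from the per-source breakdown table.
SOURCES = [
    ("english-words-definitions", "doc", 466_357),
    ("fineweb", "doc", 2_100_000),
    ("gooaq", "query", 3_012_496),
    ("miracl", "query", 2_863),
    ("lotte", "query", 13_028),
    ("snli", "symmetric", 629_334),
    ("paws", "symmetric", 1_291_304),
    ("squad", "query", 87_599),
    ("mldr", "doc", 10_000),
    ("msmarco", "query", 1_010_916),
    ("msmarco_docs", "doc", 2_000_000),
    ("PubMedQA", "query", 272_518),
    ("swim-ir-monolingual", "query", 501_371),
    ("trivia_qa", "query", 87_622),
    ("mr-tydi", "query", 3_547),
]


def rows_by_role(sources):
    """Sum the row counts of all configs that share a role."""
    totals = defaultdict(int)
    for _config, role, rows in sources:
        totals[role] += rows
    return dict(totals)


def configs_for(role, sources=SOURCES):
    """List the config names tagged with the given role, in table order."""
    return [config for config, r, _rows in sources if r == role]


stage_rows = rows_by_role(SOURCES)
for role in ("doc", "query", "symmetric"):
    print(f"{role:<10} {len(configs_for(role)):>2} configs  {stage_rows[role]:>10,} rows")
print(f"{'total':<10} {len(SOURCES):>2} configs  {sum(stage_rows.values()):>10,} rows")
```

With the table's numbers, the doc stage covers 4,576,357 rows, the query stage 4,991,960, and the symmetric sources 1,920,638, which together account for the 11,488,955 total rows.
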
```python
from datasets import load_dataset, concatenate_datasets

# Hub repo ID of this dataset, as given in the page title and download link.
REPO = "carsondial/qwen-8b-embed"

# Stage 1: documents
doc_configs = ["msmarco_docs", "mldr", "fineweb", "english-words-definitions"]
doc_train = concatenate_datasets([
    load_dataset(REPO, c, split="train") for c in doc_configs
])

# Stage 2: queries
query_configs = ["msmarco", "gooaq", "squad", "swim-ir-monolingual",
                 "trivia_qa", "PubMedQA", "miracl", "mr-tydi", "lotte"]
query_train = concatenate_datasets([
    load_dataset(REPO, c, split="train") for c in query_configs
])
```

## Why no instruction prompt?

Per PyNIFE's empirical finding, static models cannot use instructions meaningfully: with no cross-token interaction, the instruction prompt can only produce a constant offset in embedding space, which is invisible to cosine similarity ranking. The teacher embeddings here are therefore computed on plain text.

## Asymmetric retrieval pattern

This corpus is raw material for an **asymmetric** architecture: an expensive teacher for document indexing and a cheap distilled student for online queries. See [PyNIFE](https://github.com/stephantul/pynife) and [LEAF](https://arxiv.org/abs/2509.12539) for the theory.

## Reproducibility

Generated by `build_corpus.py`. Deterministic within a given set of upstream dataset snapshots. Embeddings come from Fireworks `accounts/fireworks/models/qwen3-embedding-8b` with `dimensions=1024`; vectors are L2-re-normalized client-side after receipt, because MRL truncation returns non-unit vectors.
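
The same re-normalization applies when truncating the stored 1024-d vectors to 256 or 512 dimensions downstream: the leading components of a unit vector are no longer unit-norm on their own. The sketch below illustrates this on a synthetic vector using only the standard library. The function names `l2_norm` and `truncate_and_renormalize` are hypothetical, and the random vector is a stand-in for a real `embedding` value, not data from this corpus.

```python
import math
import random


def l2_norm(vec):
    """Euclidean length of a vector given as a sequence of floats."""
    return math.sqrt(sum(x * x for x in vec))


def truncate_and_renormalize(vec, dim):
    """Keep the first `dim` components and rescale them to unit L2 norm."""
    if not 0 < dim <= len(vec):
        raise ValueError(f"dim must be in 1..{len(vec)}, got {dim}")
    head = list(vec[:dim])
    norm = l2_norm(head)
    if norm == 0.0:
        # An all-zero prefix has no direction to preserve; return it unchanged.
        return head
    return [x / norm for x in head]


# Synthetic stand-in for one 1024-d unit-norm embedding (not real corpus data).
rng = random.Random(0)
raw = [rng.gauss(0.0, 1.0) for _ in range(1024)]
raw_norm = l2_norm(raw)
full = [x / raw_norm for x in raw]

small = truncate_and_renormalize(full, 256)

print("norm of full vector:       ", round(l2_norm(full), 6))
print("norm of raw 256-d prefix:  ", round(l2_norm(full[:256]), 6))
print("norm after renormalization:", round(l2_norm(small), 6))
```

For whole columns rather than single vectors, the same slice-then-divide step is usually done in bulk with an array library; the per-vector logic is unchanged.
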
Provided by: carsondial