carsondial/qwen-8b-embed
Source: Hugging Face. Updated 2026-04-21, indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/carsondial/qwen-8b-embed
Description:
---
language:
- en
license: mit
size_categories:
- 10M<n<100M
task_categories:
- sentence-similarity
- feature-extraction
tags:
- embeddings
- distillation
- qwen3
- retrieval
- pynife
- leaf
configs:
- config_name: english-words-definitions
  data_files: english-words-definitions/*.parquet
- config_name: fineweb
  data_files: fineweb/*.parquet
- config_name: gooaq
  data_files: gooaq/*.parquet
- config_name: miracl
  data_files: miracl/*.parquet
- config_name: lotte
  data_files: lotte/*.parquet
- config_name: snli
  data_files: snli/*.parquet
- config_name: paws
  data_files: paws/*.parquet
- config_name: squad
  data_files: squad/*.parquet
- config_name: mldr
  data_files: mldr/*.parquet
- config_name: msmarco
  data_files: msmarco/*.parquet
- config_name: msmarco_docs
  data_files: msmarco_docs/*.parquet
- config_name: PubMedQA
  data_files: PubMedQA/*.parquet
- config_name: swim-ir-monolingual
  data_files: swim-ir-monolingual/*.parquet
- config_name: trivia_qa
  data_files: trivia_qa/*.parquet
- config_name: mr-tydi
  data_files: mr-tydi/*.parquet
dataset_info:
- config_name: english-words-definitions
  features:
  - name: text
    dtype: string
  - name: embedding
    sequence:
      dtype: float32
    length: 1024
  - name: role
    dtype: string
  splits:
  - name: train
    num_examples: 466357
- config_name: fineweb
  features:
  - name: text
    dtype: string
  - name: embedding
    sequence:
      dtype: float32
    length: 1024
  - name: role
    dtype: string
  splits:
  - name: train
    num_examples: 2100000
- config_name: gooaq
  features:
  - name: text
    dtype: string
  - name: embedding
    sequence:
      dtype: float32
    length: 1024
  - name: role
    dtype: string
  splits:
  - name: train
    num_examples: 3012496
- config_name: miracl
  features:
  - name: text
    dtype: string
  - name: embedding
    sequence:
      dtype: float32
    length: 1024
  - name: role
    dtype: string
  splits:
  - name: train
    num_examples: 2863
- config_name: lotte
  features:
  - name: text
    dtype: string
  - name: embedding
    sequence:
      dtype: float32
    length: 1024
  - name: role
    dtype: string
  splits:
  - name: train
    num_examples: 13028
- config_name: snli
  features:
  - name: text
    dtype: string
  - name: embedding
    sequence:
      dtype: float32
    length: 1024
  - name: role
    dtype: string
  splits:
  - name: train
    num_examples: 629334
- config_name: paws
  features:
  - name: text
    dtype: string
  - name: embedding
    sequence:
      dtype: float32
    length: 1024
  - name: role
    dtype: string
  splits:
  - name: train
    num_examples: 1291304
- config_name: squad
  features:
  - name: text
    dtype: string
  - name: embedding
    sequence:
      dtype: float32
    length: 1024
  - name: role
    dtype: string
  splits:
  - name: train
    num_examples: 87599
- config_name: mldr
  features:
  - name: text
    dtype: string
  - name: embedding
    sequence:
      dtype: float32
    length: 1024
  - name: role
    dtype: string
  splits:
  - name: train
    num_examples: 10000
- config_name: msmarco
  features:
  - name: text
    dtype: string
  - name: embedding
    sequence:
      dtype: float32
    length: 1024
  - name: role
    dtype: string
  splits:
  - name: train
    num_examples: 1010916
- config_name: msmarco_docs
  features:
  - name: text
    dtype: string
  - name: embedding
    sequence:
      dtype: float32
    length: 1024
  - name: role
    dtype: string
  splits:
  - name: train
    num_examples: 2000000
- config_name: PubMedQA
  features:
  - name: text
    dtype: string
  - name: embedding
    sequence:
      dtype: float32
    length: 1024
  - name: role
    dtype: string
  splits:
  - name: train
    num_examples: 272518
- config_name: swim-ir-monolingual
  features:
  - name: text
    dtype: string
  - name: embedding
    sequence:
      dtype: float32
    length: 1024
  - name: role
    dtype: string
  splits:
  - name: train
    num_examples: 501371
- config_name: trivia_qa
  features:
  - name: text
    dtype: string
  - name: embedding
    sequence:
      dtype: float32
    length: 1024
  - name: role
    dtype: string
  splits:
  - name: train
    num_examples: 87622
- config_name: mr-tydi
  features:
  - name: text
    dtype: string
  - name: embedding
    sequence:
      dtype: float32
    length: 1024
  - name: role
    dtype: string
  splits:
  - name: train
    num_examples: 3547
---
# Qwen3-Embedding-8B @ 1024d — full PyNIFE distillation corpus
Pre-computed teacher embeddings across **15 source datasets** spanning documents,
queries, and symmetric sentence pairs. Designed to replicate the full PyNIFE /
LEAF two-stage training recipe against `Qwen/Qwen3-Embedding-8B` as the teacher.
## Headline numbers
- **Total rows**: 11,488,955
- **Total tokens embedded**: 1469.8M
- **Total cost**: ~$146.98 on Fireworks serverless at $0.10/1M tokens
- **Embedding dim**: 1024 (MRL-native; truncate to 256/512 downstream as needed, then re-normalize, since truncated vectors are no longer unit-norm)
- **Max input tokens**: 2048 (client-side truncation via Qwen3 tokenizer)
- **Normalization**: L2-normalized unit vectors
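
Truncating an MRL embedding to fewer dimensions keeps the leading components but breaks the unit norm, so the vectors need to be re-normalized before cosine or dot-product scoring. A minimal sketch, assuming NumPy; the helper name `truncate_mrl` and the random stand-in vectors are illustrative and not part of this dataset's tooling:

```python
import numpy as np


def truncate_mrl(embeddings: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components of each row and L2-renormalize."""
    if not 0 < dim <= embeddings.shape[-1]:
        raise ValueError(f"dim must be in 1..{embeddings.shape[-1]}, got {dim}")
    truncated = embeddings[..., :dim]
    norms = np.linalg.norm(truncated, axis=-1, keepdims=True)
    # Guard against division by zero for degenerate all-zero prefixes.
    return truncated / np.clip(norms, 1e-12, None)


# Stand-in for rows of the `embedding` column: random unit vectors at 1024d.
rng = np.random.default_rng(0)
full = rng.normal(size=(4, 1024)).astype(np.float32)
full /= np.linalg.norm(full, axis=1, keepdims=True)

small = truncate_mrl(full, 256)
```

The same call with `dim=512` gives the other truncation mentioned above; a student distilled at a reduced dimension would be trained against the truncated, re-normalized targets.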
## Schema (all configs identical)
| column | type |
|-----------|-------------------------------------|
| text | string (possibly truncated to ≤2048 tokens) |
| embedding | float32[1024], unit-norm |
| role | "doc" \| "query" \| "symmetric" |
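
A quick way to confirm that a loaded row matches this schema is to check the three columns directly. The sketch below assumes NumPy and a row represented as a plain `dict` (which is what indexing a `datasets` split returns); `check_row`, the tolerance, and the synthetic example row are hypothetical:

```python
import numpy as np

VALID_ROLES = {"doc", "query", "symmetric"}


def check_row(row: dict, dim: int = 1024, tol: float = 1e-3) -> None:
    """Raise ValueError if `row` does not match the documented schema."""
    if not isinstance(row["text"], str):
        raise ValueError("text must be a string")
    vec = np.asarray(row["embedding"], dtype=np.float32)
    if vec.shape != (dim,):
        raise ValueError(f"embedding must have shape ({dim},), got {vec.shape}")
    if abs(float(np.linalg.norm(vec)) - 1.0) > tol:
        raise ValueError("embedding is not unit-norm")
    if row["role"] not in VALID_ROLES:
        raise ValueError(f"unexpected role: {row['role']!r}")


# Synthetic row for illustration only; real rows come from the parquet files.
vec = np.ones(1024, dtype=np.float32) / np.sqrt(1024.0)
example_row = {"text": "what is a static embedding?", "embedding": vec.tolist(), "role": "query"}
check_row(example_row)
```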
## Per-source breakdown
| config | upstream | role | rows | tokens |
|---|---|---|---|---|
| `english-words-definitions` | [MongoDB/english-words-definitions](https://huggingface.co/datasets/MongoDB/english-words-definitions) | doc | 466,357 | 15.6M |
| `fineweb` | [HuggingFaceFW/fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) | doc | 2,100,000 | 1204.8M |
| `gooaq` | [sentence-transformers/gooaq](https://huggingface.co/datasets/sentence-transformers/gooaq) | query | 3,012,496 | 29.8M |
| `miracl` | [sentence-transformers/miracl](https://huggingface.co/datasets/sentence-transformers/miracl) | query | 2,863 | 0.0M |
| `lotte` | [mteb/lotte](https://huggingface.co/datasets/mteb/lotte) | query | 13,028 | 0.2M |
| `snli` | [stanfordnlp/snli](https://huggingface.co/datasets/stanfordnlp/snli) | symmetric | 629,334 | 6.4M |
| `paws` | [google-research-datasets/paws](https://huggingface.co/datasets/google-research-datasets/paws) | symmetric | 1,291,304 | 34.5M |
| `squad` | [sentence-transformers/squad](https://huggingface.co/datasets/sentence-transformers/squad) | query | 87,599 | 1.1M |
| `mldr` | [sentence-transformers/mldr](https://huggingface.co/datasets/sentence-transformers/mldr) | doc | 10,000 | 0.1M |
| `msmarco` | [sentence-transformers/msmarco-corpus](https://huggingface.co/datasets/sentence-transformers/msmarco-corpus) | query | 1,010,916 | 7.6M |
| `msmarco_docs` | [sentence-transformers/msmarco-corpus](https://huggingface.co/datasets/sentence-transformers/msmarco-corpus) | doc | 2,000,000 | 154.9M |
| `PubMedQA` | [qiaojin/PubMedQA](https://huggingface.co/datasets/qiaojin/PubMedQA) | query | 272,518 | 6.2M |
| `swim-ir-monolingual` | [nthakur/swim-ir-monolingual](https://huggingface.co/datasets/nthakur/swim-ir-monolingual) | query | 501,371 | 6.9M |
| `trivia_qa` | [mandarjoshi/trivia_qa](https://huggingface.co/datasets/mandarjoshi/trivia_qa) | query | 87,622 | 1.7M |
| `mr-tydi` | [sentence-transformers/mr-tydi](https://huggingface.co/datasets/sentence-transformers/mr-tydi) | query | 3,547 | 0.0M |
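
The headline totals can be re-derived from the table above. The following is a small arithmetic check that copies the row and token counts from the table and applies the stated price of $0.10 per 1M tokens; the variable names are illustrative:

```python
# (config, rows, tokens in millions), copied from the per-source table.
SOURCES = [
    ("english-words-definitions", 466_357, 15.6),
    ("fineweb", 2_100_000, 1204.8),
    ("gooaq", 3_012_496, 29.8),
    ("miracl", 2_863, 0.0),
    ("lotte", 13_028, 0.2),
    ("snli", 629_334, 6.4),
    ("paws", 1_291_304, 34.5),
    ("squad", 87_599, 1.1),
    ("mldr", 10_000, 0.1),
    ("msmarco", 1_010_916, 7.6),
    ("msmarco_docs", 2_000_000, 154.9),
    ("PubMedQA", 272_518, 6.2),
    ("swim-ir-monolingual", 501_371, 6.9),
    ("trivia_qa", 87_622, 1.7),
    ("mr-tydi", 3_547, 0.0),
]
PRICE_PER_MILLION_TOKENS = 0.10  # USD, as stated in the headline numbers

total_rows = sum(rows for _, rows, _ in SOURCES)
total_tokens_m = round(sum(tokens for _, _, tokens in SOURCES), 1)
total_cost = round(total_tokens_m * PRICE_PER_MILLION_TOKENS, 2)

print(total_rows, total_tokens_m, total_cost)
```

Because the per-config token counts are rounded to 0.1M, the two `0.0M` entries contribute nothing to this sum even though those configs are not empty.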
## Two-stage training recipe (following LEAF / PyNIFE)
Interleaving docs and queries during distillation does **not** work well
(see the README of Stephan Tulkens' PyNIFE repository). The recommended recipe is:
1. **Pretrain** on doc-like sources: concatenate the configs where
`role == "doc"` (`msmarco_docs`, `mldr`, `fineweb`, `english-words-definitions`).
2. **Finetune** with a lower learning rate on query-like sources:
concatenate the configs where `role == "query"` (`msmarco`, `gooaq`, `squad`,
`swim-ir-monolingual`, `trivia_qa`, `PubMedQA`, `miracl`, `mr-tydi`, `lotte`).
The `symmetric` sources (`snli`, `paws`) are sentence-pair corpora useful for
STS-style alignment; use at your discretion.
```python
from datasets import load_dataset, concatenate_datasets

# Repo ID of this dataset on the Hugging Face Hub.
REPO = "carsondial/qwen-8b-embed"

# Stage 1: documents
doc_configs = ["msmarco_docs", "mldr", "fineweb", "english-words-definitions"]
doc_train = concatenate_datasets([
    load_dataset(REPO, c, split="train") for c in doc_configs
])

# Stage 2: queries
query_configs = ["msmarco", "gooaq", "squad", "swim-ir-monolingual",
                 "trivia_qa", "PubMedQA", "miracl", "mr-tydi", "lotte"]
query_train = concatenate_datasets([
    load_dataset(REPO, c, split="train") for c in query_configs
])
```
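
In both stages the student is fit to the stored `embedding` column. This card does not say which loss PyNIFE or LEAF use, so the sketch below assumes a plain cosine distillation objective, written with NumPy for clarity; the function name and the toy arrays are hypothetical:

```python
import numpy as np


def cosine_distillation_loss(student: np.ndarray, teacher: np.ndarray) -> float:
    """Mean of (1 - cosine similarity) over matching student/teacher rows."""
    s = student / np.linalg.norm(student, axis=1, keepdims=True)
    t = teacher / np.linalg.norm(teacher, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(s * t, axis=1)))


# Toy targets standing in for a batch of teacher embeddings.
teacher = np.array([[1.0, 0.0], [0.0, 1.0]])
perfect_student = 3.0 * teacher   # same direction, different scale
opposite_student = -teacher       # pointing the wrong way

loss_perfect = cosine_distillation_loss(perfect_student, teacher)
loss_opposite = cosine_distillation_loss(opposite_student, teacher)
```

The loss is 0 when the student points in the teacher's direction regardless of scale and 2 when it points the opposite way. Under this assumption, stage 2 would reuse the same objective on `query_train` with a smaller learning rate.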
## Why no instruction prompt?
Per PyNIFE's empirical finding: static models cannot use instructions
meaningfully, because with no cross-token interaction the instruction prompt
can only produce a constant offset in embedding space — invisible to cosine
similarity ranking. So teacher embeddings here are computed on plain text.
## Asymmetric retrieval pattern
This corpus is raw material for an **asymmetric** architecture: expensive
teacher for document indexing, cheap distilled student for online queries.
See [PyNIFE](https://github.com/stephantul/pynife) and
[LEAF](https://arxiv.org/abs/2509.12539) for the theory.
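
The asymmetric pattern can be sketched as a dot-product search of a student query vector against a teacher-built document index. The example below assumes NumPy and unit-norm vectors; the random index and the "noisy copy" query are stand-ins for real teacher and student outputs, and `rank_documents` is a hypothetical helper:

```python
import numpy as np


def rank_documents(query_vec: np.ndarray, doc_matrix: np.ndarray, top_k: int = 3):
    """Return indices and scores of the top_k documents by dot product."""
    scores = doc_matrix @ query_vec
    order = np.argsort(-scores)[:top_k]
    return order, scores[order]


rng = np.random.default_rng(0)

# Offline: document index built from (stand-in) teacher embeddings.
doc_matrix = rng.normal(size=(100, 1024)).astype(np.float32)
doc_matrix /= np.linalg.norm(doc_matrix, axis=1, keepdims=True)

# Online: a (stand-in) student query vector that approximates document 42.
query_vec = doc_matrix[42] + 0.01 * rng.normal(size=1024).astype(np.float32)
query_vec /= np.linalg.norm(query_vec)

top_idx, top_scores = rank_documents(query_vec, doc_matrix)
```

Because both sides are unit-norm, the dot product equals cosine similarity, which is why the student has to land in the same space as the teacher for this to work.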
## Reproducibility
Generated by `build_corpus.py`; the output is deterministic for a given set of
upstream dataset snapshots. Embeddings were requested from the Fireworks model
`accounts/fireworks/models/qwen3-embedding-8b` with `dimensions=1024`, then
L2-normalized again on the client, because MRL truncation returns non-unit vectors.
Provider:
carsondial



