erikkaum/nomic-unsupervised-embedded-10M
Source: Hugging Face. Updated 2026-04-16; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/erikkaum/nomic-unsupervised-embedded-10M
Resource description:
---
pretty_name: Nomic Unsupervised Embedded 10M
license: apache-2.0
task_categories:
- feature-extraction
language:
- en
tags:
- embeddings
- retrieval
- text-embeddings
- synthetic
size_categories:
- 10M<n<100M
---
# Nomic Unsupervised Embedded 10M
This is a derived embedding dataset built from 10 million text pairs drawn from `nomic-ai/nomic-embed-unsupervised-data`.
It contains text pairs (`query`, `document`) and their corresponding dense vector representations:
- `query_embedding`
- `document_embedding`
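A minimal sketch of reading a few rows and scoring each pair, assuming the repository exposes a split named `train` (the card does not name its splits) and that the `datasets` library is installed. The helper names `stream_rows`, `pair_similarity`, and `cosine_similarity` are hypothetical and used only for illustration.

```python
import math
from typing import Iterator, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def stream_rows(limit: int, split: str = "train") -> Iterator[dict]:
    """Yield up to `limit` rows without downloading the full dataset.

    Assumption: the repository exposes a split named "train".
    """
    # Imported lazily so the pure helpers work without `datasets` installed.
    from datasets import load_dataset

    ds = load_dataset(
        "erikkaum/nomic-unsupervised-embedded-10M", split=split, streaming=True
    )
    for i, row in enumerate(ds):
        if i >= limit:
            break
        yield row


def pair_similarity(row: dict) -> float:
    """Similarity between a row's query embedding and document embedding."""
    return cosine_similarity(row["query_embedding"], row["document_embedding"])
```

With network access, `for row in stream_rows(3): print(pair_similarity(row))` would print one score per pair. Streaming avoids materialising roughly 10M rows of `float32` vectors on disk before the first row can be inspected.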
## Dataset Summary
- **Base dataset**: `nomic-ai/nomic-embed-unsupervised-data`
- **Rows**: ~10M
- **Embedding model**: `Qwen/Qwen3-Embedding-0.6B`
- **Vector dtype**: `float32`
- **Intended use**: retrieval training/evaluation, similarity search, hard-negative mining, and embedding analysis
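As one illustration of the hard-negative mining use case, the sketch below picks, for each query, the highest-scoring document that is not its own pair. It assumes that row `i`'s document is the positive for row `i`'s query and that vectors are L2-normalised so the dot product equals cosine similarity; neither is stated on this card. `mine_hard_negatives` is a hypothetical helper, and the brute-force loop is only suitable for small batches.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mine_hard_negatives(
    query_embeddings: List[Sequence[float]],
    document_embeddings: List[Sequence[float]],
) -> List[int]:
    """For each query i, return the index j != i of the highest-scoring document."""
    if len(query_embeddings) != len(document_embeddings):
        raise ValueError("expected one document embedding per query embedding")
    if len(query_embeddings) < 2:
        raise ValueError("need at least two rows to mine a negative")

    negatives = []
    for i, q in enumerate(query_embeddings):
        best_j, best_score = -1, float("-inf")
        for j, d in enumerate(document_embeddings):
            if j == i:
                continue  # skip the query's own paired (positive) document
            score = dot(q, d)
            if score > best_score:
                best_j, best_score = j, score
        negatives.append(best_j)
    return negatives
```

A document mined this way may in fact be relevant to the query (a false negative), so a score threshold or manual spot check is usually worth adding before training on the result.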
## How It Was Created
Embeddings were generated by streaming rows from the source dataset and encoding:
- `query` with instruction-formatted query prompts (split-aware prompt selection)
- `document` as the candidate passage text
## Schema
Expected columns include:
- original source fields (including `query`, `document`, split metadata, etc.)
- `query_embedding: list<float32>`
- `document_embedding: list<float32>`
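A quick sanity check against this schema can catch loading problems early. The sketch below is an assumption-laden example, not part of the dataset tooling: `validate_row` is a hypothetical helper, and it assumes both embeddings have the same length because they come from the same model. The embedding size is not stated on this card, so `expected_dim` is left optional.

```python
import numbers
from typing import Mapping, Optional

# Columns named on this card; other source fields may also be present.
REQUIRED_COLUMNS = ("query", "document", "query_embedding", "document_embedding")


def validate_row(row: Mapping, expected_dim: Optional[int] = None) -> int:
    """Check one row against the documented columns and return the embedding size.

    Raises ValueError on the first problem found.
    """
    missing = [c for c in REQUIRED_COLUMNS if c not in row]
    if missing:
        raise ValueError(f"missing columns: {missing}")

    q, d = row["query_embedding"], row["document_embedding"]
    if len(q) == 0 or len(q) != len(d):
        raise ValueError(f"embedding sizes are empty or differ: {len(q)} vs {len(d)}")
    if expected_dim is not None and len(q) != expected_dim:
        raise ValueError(f"expected {expected_dim} dimensions, got {len(q)}")
    if not all(isinstance(x, numbers.Real) for x in list(q) + list(d)):
        raise ValueError("embeddings must contain only real numbers")
    return len(q)
```

Running `validate_row` on the first few streamed rows reports the actual vector size, which can then be passed back as `expected_dim` for later checks.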
## Prompting Strategy for Query Embeddings
Prompt templates are selected by source split type:
- QA: "Given a web search query, retrieve relevant passages that answer the query"
- Duplicate: "Given a question, retrieve similar questions that ask the same thing"
- Semantic: "Given a text, retrieve passages about the same topic"
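The selection above could be sketched as a lookup followed by string formatting. The `Instruct: ...\nQuery:...` layout follows the usage example on the `Qwen/Qwen3-Embedding-0.6B` model card; whether this pipeline used exactly that layout is an assumption. The keys `qa`, `duplicate`, and `semantic`, and the helper `format_query`, are hypothetical, and the mapping from each source split to a type is not given on this card.

```python
# Task instructions quoted from this card, keyed by a hypothetical split-type label.
TASK_INSTRUCTIONS = {
    "qa": "Given a web search query, retrieve relevant passages that answer the query",
    "duplicate": "Given a question, retrieve similar questions that ask the same thing",
    "semantic": "Given a text, retrieve passages about the same topic",
}


def format_query(query: str, split_type: str) -> str:
    """Prefix a query with the task instruction for its split type.

    Assumption: the layout matches the Qwen3-Embedding model card example.
    Raises KeyError for an unknown split type.
    """
    task = TASK_INSTRUCTIONS[split_type]
    return f"Instruct: {task}\nQuery:{query}"
```

Documents receive no instruction prefix; they are encoded as plain passage text. To embed new queries that are comparable with the stored vectors, the same instruction for the matching split type should be applied.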
## Limitations
- This is an automatically generated embedding dataset; no manual quality annotation was added here.
- Embeddings inherit biases and failure modes of the base model and source data.
- Some source texts may be truncated during encoding according to pipeline max-length settings.
## Citation
If you use this dataset, please cite:
- the original source dataset: `nomic-ai/nomic-embed-unsupervised-data`
- the embedding model: `Qwen/Qwen3-Embedding-0.6B`
- this derived dataset repository
Provider:
erikkaum



