
erikkaum/nomic-unsupervised-embedded-10M

Source: Hugging Face. Updated 2026-04-16. Indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/erikkaum/nomic-unsupervised-embedded-10M
Description:
---
pretty_name: Nomic Unsupervised Embedded 10M
license: apache-2.0
task_categories:
- feature-extraction
language:
- en
tags:
- embeddings
- retrieval
- text-embeddings
- synthetic
size_categories:
- 10M<n<100M
---

# Nomic Unsupervised Embedded 10M

This dataset is a derived embedding dataset built from 10 million pairs from `nomic-ai/nomic-embed-unsupervised-data`.

It contains text pairs (`query`, `document`) and their corresponding dense vector representations:

- `query_embedding`
- `document_embedding`

## Dataset Summary

- **Base dataset**: `nomic-ai/nomic-embed-unsupervised-data`
- **Rows**: ~10M
- **Embedding model**: `Qwen/Qwen3-Embedding-0.6B`
- **Vector dtype**: `float32`
- **Intended use**: retrieval training/evaluation, similarity search, hard-negative mining, and embedding analysis

## How It Was Created

Embeddings were generated by streaming rows from the source dataset and encoding:

- `query` with instruction-formatted query prompts (split-aware prompt selection)
- `document` as the candidate passage text

## Schema

Expected columns include:

- original source fields (including `query`, `document`, split metadata, etc.)
- `query_embedding: list<float32>`
- `document_embedding: list<float32>`

## Prompting Strategy for Query Embeddings

Prompt templates are selected by source split type:

- QA: "Given a web search query, retrieve relevant passages that answer the query"
- Duplicate: "Given a question, retrieve similar questions that ask the same thing"
- Semantic: "Given a text, retrieve passages about the same topic"

## Limitations

- This is an automatically generated embedding dataset; no manual quality annotation was added here.
- Embeddings inherit biases and failure modes of the base model and source data.
- Some source texts may be truncated during encoding according to pipeline max-length settings.
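
## Example: Split-Aware Query Prompt Selection

The split-aware prompt selection described under "Prompting Strategy for Query Embeddings" could look roughly like the sketch below. The three task strings are quoted from this card. Everything else is an assumption made for illustration: the split-type keys (`qa`, `duplicate`, `semantic`), the function name `format_query`, and the `Instruct: ... / Query: ...` wrapper, which is a common layout for instruction-tuned embedding models. The card does not state the exact template the pipeline used.

```python
# Task descriptions quoted from the dataset card, keyed by split type.
# The key names themselves are hypothetical; the card does not name them.
TASK_PROMPTS = {
    "qa": "Given a web search query, retrieve relevant passages that answer the query",
    "duplicate": "Given a question, retrieve similar questions that ask the same thing",
    "semantic": "Given a text, retrieve passages about the same topic",
}


def format_query(query: str, split_type: str) -> str:
    """Wrap a raw query in an instruction prompt chosen by split type.

    The "Instruct: ...\\nQuery: ..." layout is an assumed template,
    not one confirmed by the card.
    """
    if split_type not in TASK_PROMPTS:
        raise KeyError(f"unknown split type: {split_type!r}")
    task = TASK_PROMPTS[split_type]
    return f"Instruct: {task}\nQuery: {query}"


# Only the query side receives a prompt; documents are encoded as plain text.
formatted = format_query("how do vector indexes work", "qa")
print(formatted)
```

Keeping the prompts in a single mapping makes it easy to check that a new query is formatted the same way as the stored `query_embedding` values, which matters when comparing fresh embeddings against this dataset.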
## Citation

If you use this dataset, please cite:

- the original source dataset: `nomic-ai/nomic-embed-unsupervised-data`
- the embedding model: `Qwen/Qwen3-Embedding-0.6B`
- this derived dataset repository
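
## Example: Similarity Search over Embedding Columns

For the similarity-search use listed in the summary, a minimal sketch of ranking documents by cosine similarity is shown below. It uses only the column names from the schema (`query`, `document`, `query_embedding`, `document_embedding`). The rows are tiny made-up vectors for illustration, not data from this dataset, and the helper names `cosine_similarity` and `rank_documents` are hypothetical. A real workload over ~10M rows would normally use a vectorized library or an approximate nearest-neighbor index rather than pure Python.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(
    query_embedding: Sequence[float], rows: List[Dict]
) -> List[Tuple[float, str]]:
    """Score every row's document_embedding against the query, best first."""
    scored = [
        (cosine_similarity(query_embedding, row["document_embedding"]), row["document"])
        for row in rows
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Toy rows that mimic the schema; real embeddings are much longer float32 lists.
toy_rows = [
    {"query": "q0", "document": "d0",
     "query_embedding": [1.0, 0.0, 0.0], "document_embedding": [0.9, 0.1, 0.0]},
    {"query": "q1", "document": "d1",
     "query_embedding": [0.0, 1.0, 0.0], "document_embedding": [0.0, 0.8, 0.6]},
]

ranking = rank_documents(toy_rows[0]["query_embedding"], toy_rows)
for score, doc in ranking:
    print(f"{doc}: {score:.4f}")
```

The same ranking can serve as a starting point for hard-negative mining: documents that score highly for a query but are not its paired `document` are candidate hard negatives.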
Provider:
erikkaum