
RyanMarcus/qwen-8B-embedding-distill

Source: Hugging Face. Updated 2025-12-31. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/RyanMarcus/qwen-8B-embedding-distill
Description:
---
task_categories:
- feature-extraction
language:
- en
size_categories:
- 10M<n<100M
---

# Data for training distilled text embedding models

This is a dataset of ~12M text-vector pairs produced by the [Qwen3-Embedding-8b](https://huggingface.co/Qwen/Qwen3-Embedding-8B) model. This dataset is intended to support training embedding models to match Qwen's outputs ("distillation").

Texts are truncated to 2048 characters. The texts used in this dataset are a mix of random samples from these datasets:

* `sentence-transformers/amazon-reviews`
* `sentence-transformers/msmarco-bm25`
* `sentence-transformers/eli5`
* `sentence-transformers/s2orc`
* `sentence-transformers/specter`
* `sentence-transformers/sentence-compression`
* `sentence-transformers/npr`

If the dataset size was greater than 2M, then 2M random samples were taken. If the dataset size was less than 2M, the entire dataset was used.

Embeddings were produced in around 20 hours with 8xRTX 6000 Pros.
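
The construction rules above (truncate each text to 2048 characters, cap each source at 2M random samples) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the author's pipeline: the function names, the seed, and the choice of cosine distance as a distillation objective are all hypothetical, since the card does not say which loss or sampling code was used.

```python
import math
import random

MAX_CHARS = 2048         # truncation limit stated in the card
MAX_SAMPLES = 2_000_000  # per-source cap stated in the card


def truncate(text, max_chars=MAX_CHARS):
    """Keep at most max_chars characters (characters, not tokens)."""
    return text[:max_chars]


def sample_source(texts, cap=MAX_SAMPLES, seed=0):
    """Return every text if the source fits under the cap, else `cap` random ones.

    The seed is a hypothetical parameter added here for reproducibility.
    """
    texts = list(texts)
    if len(texts) <= cap:
        return texts
    return random.Random(seed).sample(texts, cap)


def cosine_distance(student_vec, teacher_vec):
    """1 - cosine similarity between a student and a teacher embedding.

    One possible distillation objective (an assumption, not from the card).
    Both vectors must be non-zero, otherwise this divides by zero.
    """
    dot = sum(s * t for s, t in zip(student_vec, teacher_vec))
    norm_s = math.sqrt(sum(s * s for s in student_vec))
    norm_t = math.sqrt(sum(t * t for t in teacher_vec))
    return 1.0 - dot / (norm_s * norm_t)
```

In a training loop, `teacher_vec` would be the stored Qwen vector for a text and `student_vec` the smaller model's output for the same (truncated) text; minimizing the distance pushes the student toward the teacher's direction in embedding space.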

Provider:
RyanMarcus