stephantulkens/msmarco-mxbai-pooled

Name: stephantulkens/msmarco-mxbai-pooled
Creator: stephantulkens
Published: 2025-10-11 07:01:01
License: not specified

Source: Hugging Face. Updated 2025-10-11; indexed 2025-10-25.

Download link:

https://hf-mirror.com/datasets/stephantulkens/msmarco-mxbai-pooled

Description:

This is the full MS MARCO corpus, embedded with Mixedbread AI's mixedbread-ai/mxbai-embed-large-v1. For each document, we take the first 510 tokens (the model's maximum length minus 2 special tokens) and embed them without any instruction prompt. Because the model was trained with Matryoshka Representation Learning, the embeddings can safely be truncated to fewer dimensions. They are mainly useful for large-scale knowledge distillation. The dataset consists of 8.8 million rows, each with three keys: `id` (the original id), `embedding` (the 1024-dimensional embedding), and `text` (the original text, _truncated to the slice that was actually seen by the model_). Because the text is already truncated, the dataset can be used directly for training in, e.g., `sentence-transformers`, without manually truncating texts or matching them to embeddings.
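As a rough illustration of what truncating a Matryoshka embedding can look like, the sketch below keeps the first `dim` values of a vector and rescales the result to unit length. The helper name `truncate_embedding`, the toy 4-dimensional row, and the renormalization step are assumptions made for illustration; they are not taken from the dataset card. Real rows carry 1024-dimensional embeddings, and renormalizing only matters if downstream code expects unit-length vectors (for example, when using dot products as cosine similarity).

```python
import math


def truncate_embedding(embedding, dim):
    """Keep the first `dim` values of `embedding` and rescale to unit L2 norm.

    Hypothetical helper for illustration; not part of the dataset or any library.
    """
    if not 0 < dim <= len(embedding):
        raise ValueError("dim must be between 1 and len(embedding)")
    head = list(embedding[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        # An all-zero prefix cannot be normalized; return it unchanged.
        return head
    return [x / norm for x in head]


# Toy row with the three keys described above (values are made up;
# real embeddings have 1024 dimensions).
row = {"id": "0", "embedding": [3.0, 4.0, 12.0, 0.0], "text": "example passage"}

small = truncate_embedding(row["embedding"], 2)
```

Applied to a real row, a call such as `truncate_embedding(row["embedding"], 256)` would give a smaller distillation target; which sizes work well is a choice to validate on your own task.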

Provider:

stephantulkens
