stephantulkens/msmarco-mxbai-pooled
Source: Hugging Face. Updated 2025-10-11. Indexed 2025-10-25.
Download link:
https://hf-mirror.com/datasets/stephantulkens/msmarco-mxbai-pooled
Description:
This is the full MS MARCO corpus, embedded with Mixedbread AI's mixedbread-ai/mxbai-embed-large-v1. For each document, we take the first 510 tokens (the model's maximum length minus 2 special tokens) and embed them without any instruction prefix. Because the model was trained using Matryoshka Representation Learning, these embeddings can safely be truncated to fewer dimensions. They are mainly useful for large-scale knowledge distillation. The dataset consists of 8.8 million rows, and each row has three keys: `id` (the original id), `embedding` (the 1024-dimensional embedding), and `text` (the original text, _truncated to the slice that was actually seen by the model_). Because the original text is already truncated, the dataset can be used directly for training in, e.g., `sentence-transformers`, without having to truncate text manually or match texts to embeddings.
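As a rough illustration of what truncating a Matryoshka embedding can look like, the sketch below keeps the first `dim` components of a vector and rescales the result to unit length. The helper `truncate_embedding` and the sample row are hypothetical and not part of the dataset or any library. Rescaling is an assumption made here so that dot products behave like cosine similarity; the description does not say whether the stored embeddings are normalized.

```python
import math


def truncate_embedding(embedding, dim):
    """Keep the first `dim` components and rescale to unit L2 norm.

    Hypothetical helper for illustration only.
    """
    if not 0 < dim <= len(embedding):
        raise ValueError("dim must be between 1 and len(embedding)")
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        # An all-zero prefix cannot be normalized; return it unchanged.
        return list(head)
    return [x / norm for x in head]


# Made-up row with the same three keys as the dataset rows. Real embeddings
# have 1024 dimensions; 4 are used here to keep the example readable.
row = {"id": "0", "embedding": [3.0, 4.0, 1.0, 2.0], "text": "example passage"}

small = truncate_embedding(row["embedding"], 2)
print(len(small), small)
```

The same function would apply to a real row by passing its `embedding` list and a target size such as 256; which sizes retain enough quality for a given task is something to check empirically.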
Provided by:
stephantulkens



