stephantulkens/fineweb-10bt-mxbai-pooled
Source: Hugging Face | Updated: 2025-10-11 | Indexed: 2025-10-25
Download link:
https://hf-mirror.com/datasets/stephantulkens/fineweb-10bt-mxbai-pooled
Description:
This is the 10Bt sample of [Fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb), embedded with [Mixedbread AI](https://www.mixedbread.com/)'s [mixedbread-ai/mxbai-embed-large-v1](https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1). For each document, we take the first 510 tokens (the model's maximum length of 512, minus 2 special tokens) and embed them without any instruction prompt. Because the model was trained with Matryoshka Representation Learning, the embeddings can safely be truncated to fewer dimensions. They are mainly useful for large-scale knowledge distillation.

The dataset consists of 14.9 million rows. Each row has three keys:

* `id`: the original id in the Fineweb sample
* `embedding`: the 1024-dimensional embedding
* `text`: the original text, _truncated to the slice that was actually seen by the model_

Because the text is already truncated, the dataset can be used directly for training with, e.g., `sentence-transformers`, without manually truncating texts or matching them to their embeddings.
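As a rough illustration of the truncation mentioned above, the sketch below keeps the first `dim` values of an embedding and rescales the result to unit length, which is a common choice when the shortened vectors are compared by cosine similarity. The helper names `truncate_embedding` and `iter_truncated_rows`, the target size of 256, and the split name `"train"` are assumptions made for this example; the dataset description does not specify them, nor does it say whether the stored embeddings are normalized.

```python
import math
from typing import Iterator, Sequence


def truncate_embedding(embedding: Sequence[float], dim: int) -> list[float]:
    """Keep the first `dim` values and rescale the result to unit L2 norm."""
    if not 0 < dim <= len(embedding):
        raise ValueError(f"dim must be in 1..{len(embedding)}, got {dim}")
    head = [float(x) for x in embedding[:dim]]
    norm = math.sqrt(sum(x * x for x in head))
    # Leave an all-zero vector unchanged to avoid dividing by zero.
    return head if norm == 0.0 else [x / norm for x in head]


def iter_truncated_rows(dim: int = 256, split: str = "train") -> Iterator[dict]:
    """Stream rows from the Hub and yield them with shortened embeddings.

    Assumptions: the `datasets` package is installed and the split is
    named "train"; neither is stated in the dataset description.
    """
    from datasets import load_dataset  # imported lazily: optional dependency

    rows = load_dataset(
        "stephantulkens/fineweb-10bt-mxbai-pooled", split=split, streaming=True
    )
    for row in rows:
        yield {
            "id": row["id"],
            "text": row["text"],
            "embedding": truncate_embedding(row["embedding"], dim),
        }
```

Streaming avoids downloading all 14.9 million rows before the first example is available; whether to renormalize after truncation depends on the similarity measure used in the downstream distillation setup.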
Provided by:
stephantulkens



