stephantulkens/fineweb-10bt-mxbai-pooled

Name: stephantulkens/fineweb-10bt-mxbai-pooled
Creator: stephantulkens
Published: 2025-10-11 07:00:34
License: No description available

Source: Hugging Face. Updated: 2025-10-11. Indexed: 2025-10-25.

Download link:

https://hf-mirror.com/datasets/stephantulkens/fineweb-10bt-mxbai-pooled

Description:

This is the 10Bt sample of [Fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb), embedded with [Mixedbread AI](https://www.mixedbread.com/)'s [mixedbread-ai/mxbai-embed-large-v1](https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1). For each document, we take the first 510 tokens (the model's maximum length minus 2 special tokens) and embed them without using any instructions. Because the model was trained using Matryoshka Representation Learning, these embeddings can safely be truncated. They are mainly useful for large-scale knowledge distillation.

The dataset consists of 14.9 million rows. Each row has three keys:

* `id`: the original id in the Fineweb sample
* `embedding`: the 1024-dimensional embedding
* `text`: the original text, _truncated to the slice that was actually seen by the model_

Because the original text is already truncated, the dataset can be used directly for training in, e.g., `sentence-transformers`, without having to worry about manually truncating text, matching texts to embeddings, etc.
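The note that these embeddings can safely be truncated can be illustrated with a minimal sketch. It keeps the leading components of a vector and rescales the result to unit length. The renormalization step is a common convention for Matryoshka-style embeddings rather than something the dataset card specifies, and the function names and the toy 8-dimensional row below are hypothetical stand-ins for a real 1024-dimensional row.

```python
import math


def truncate_embedding(embedding, dim):
    """Keep the first `dim` components and rescale them to unit L2 norm.

    Assumption: renormalizing after truncation is a common practice for
    Matryoshka-style embeddings; the dataset card does not prescribe it.
    """
    if not 0 < dim <= len(embedding):
        raise ValueError("dim must be between 1 and len(embedding)")
    head = [float(x) for x in embedding[:dim]]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # an all-zero slice cannot be normalized
    return [x / norm for x in head]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy row with the same three keys as the dataset
# (real embeddings have 1024 dimensions, not 8).
row = {
    "id": "example-id",
    "embedding": [0.5, -0.5, 0.5, 0.5, 0.1, -0.1, 0.0, 0.2],
    "text": "example text",
}

small = truncate_embedding(row["embedding"], 4)
small_norm = math.sqrt(sum(x * x for x in small))
```

In a distillation setup, the truncated vectors would serve as lower-dimensional targets for a smaller student model. Which truncation width preserves enough quality is an empirical question that this sketch does not answer.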

Provider:

stephantulkens
