stephantulkens/lotte-query-mxbai-pooled

Name: stephantulkens/lotte-query-mxbai-pooled
Creator: stephantulkens
Published: 2025-10-13 10:06:32
License: not specified

Source: Hugging Face. Updated 2025-10-13; indexed 2025-10-25.

Download link:

https://hf-mirror.com/datasets/stephantulkens/lotte-query-mxbai-pooled

Resource summary:

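The Embedpress dataset is a subset of the LOTTE dataset and contains document embeddings generated by Mixedbread AI's mixedbread-ai/mxbai-embed-large-v1 model. Each document is truncated to its first 510 tokens and embedded without any instructions. The dataset is useful for large-scale knowledge distillation and consists of 13k rows, each with three keys: `id`, `embedding`, and `text`. The text is truncated to the slice the model actually saw, so it can be used directly for training in sentence-transformers without manually truncating the text.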

Embedpress: mixedbread large on the LOTTE queries dataset

This is the query portion of the [LOTTE](colbertv2/lotte) dataset, embedded with [Mixedbread AI](https://www.mixedbread.com/)'s [mixedbread-ai/mxbai-embed-large-v1](https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1).

For each document, we take the first 510 tokens (the model's maximum length minus 2 special tokens) and embed it without using any instructions. Because the model was trained using Matryoshka Representation Learning, these embeddings can safely be truncated to a shorter prefix of their dimensions.

These embeddings are mainly useful for large-scale knowledge distillation.

The dataset consists of 13k rows. Each row has three keys:

* `id`: the original id in the fineweb sample
* `embedding`: the 1024-dimensional embedding
* `text`: the original text, _truncated to the slice that was actually seen by the model_

Because the original text is truncated, it can be used directly for training in, e.g., `sentence-transformers`, without having to worry about manually truncating text, matching, etc.
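To illustrate what truncating a Matryoshka embedding might look like in practice, here is a minimal sketch in plain Python. It keeps the first `dim` values of a row's `embedding` and rescales the result to unit length, which is a common convention when the vectors are compared by cosine similarity; the card itself does not say whether to renormalize, so treat that step as an assumption. The helper names `truncate_embedding` and `load_rows`, the `"train"` split name, and the synthetic row are all hypothetical and only serve the example.

```python
import math
from typing import List, Sequence


def truncate_embedding(embedding: Sequence[float], dim: int) -> List[float]:
    """Keep the first `dim` values and rescale them to unit length.

    Renormalizing is an assumption (common for cosine similarity),
    not something the dataset card prescribes.
    """
    if not 0 < dim <= len(embedding):
        raise ValueError(f"dim must be in 1..{len(embedding)}, got {dim}")
    head = [float(x) for x in embedding[:dim]]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # an all-zero prefix cannot be rescaled
    return [x / norm for x in head]


def load_rows(split: str = "train"):
    """Hypothetical loader; the split name "train" is an assumption."""
    from datasets import load_dataset  # imported lazily, needs `datasets` installed

    return load_dataset("stephantulkens/lotte-query-mxbai-pooled", split=split)


# Synthetic row with the same three keys as the dataset (values are made up,
# and far shorter than the real 1024-dimensional embeddings).
row = {"id": "example-0", "embedding": [3.0, 4.0, 12.0, 0.5], "text": "example query"}

short = truncate_embedding(row["embedding"], 2)
```

With real rows, the same call would be `truncate_embedding(row["embedding"], 256)` or any other prefix length up to 1024. The truncated vectors can then serve as lower-dimensional distillation targets alongside the already-truncated `text` field.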

Provider:

stephantulkens
