
stephantulkens/lotte-query-mxbai-pooled

Source: Hugging Face. Updated 2025-10-13; indexed 2025-10-25.
Download link:
https://hf-mirror.com/datasets/stephantulkens/lotte-query-mxbai-pooled
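Overview:
The Embedpress dataset is a subset of the LOTTE dataset containing document embeddings generated by Mixedbread AI's mixedbread-ai/mxbai-embed-large-v1 model. Each document is truncated to its first 510 tokens and embedded without any instructions. The dataset is useful for large-scale knowledge distillation and consists of 13k rows, each with three keys: `id`, `embedding`, and `text`. The text is truncated to the slice the model actually saw, so it can be used directly for training in sentence-transformers without manual truncation.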

Embedpress: mixedbread large on the LOTTE queries dataset

This is the query portion of the [LOTTE](colbertv2/lotte) dataset, embedded with [Mixedbread AI](https://www.mixedbread.com/)'s [mixedbread-ai/mxbai-embed-large-v1](https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1).

For each document, we take the first 510 tokens (the model's maximum length minus 2 special tokens) and embed them without using any instructions. Because the model was trained using Matryoshka Representation Learning, these embeddings can safely be truncated to a smaller dimensionality.

These are mainly useful for large-scale knowledge distillation.

The dataset consists of 13k rows. Each row has three keys:

* `id`: the original id in the fineweb sample
* `embedding`: the 1024-dimensional embedding
* `text`: the original text, _truncated to the slice that was actually seen by the model_

Because we truncate the original text, this can be used directly for training in, e.g., `sentence-transformers`, without having to worry about manually truncating text, matching, etc.
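As an illustration of what truncating a Matryoshka-style embedding can look like, the sketch below keeps the leading components of a vector and rescales the result to unit length. The helper name `truncate_embedding`, the re-normalization step, and the tiny synthetic row are assumptions made for this example and are not taken from the dataset card; real rows hold 1024 floats, and the target size of 4 is purely illustrative.

```python
import math
from typing import Sequence


def truncate_embedding(embedding: Sequence[float], dim: int) -> list[float]:
    """Keep the first `dim` components and rescale them to unit L2 norm.

    Hypothetical helper: re-normalizing after truncation is a common
    convention when cosine similarity is used, not something the
    dataset card prescribes.
    """
    if not 0 < dim <= len(embedding):
        raise ValueError(f"dim must be in 1..{len(embedding)}, got {dim}")
    head = [float(x) for x in embedding[:dim]]
    norm = math.sqrt(sum(x * x for x in head))
    # Leave an all-zero vector untouched rather than dividing by zero.
    return head if norm == 0.0 else [x / norm for x in head]


# Synthetic stand-in for one dataset row (same three keys, toy values).
row = {
    "id": "example-0",
    "text": "an example query",
    "embedding": [0.5, -0.5, 0.5, 0.5, 0.1, 0.0, -0.1, 0.2],
}

small = truncate_embedding(row["embedding"], 4)
print(len(small), small)
```

With the real data, the same function would be applied to each row's `embedding` field, and the truncated vectors would serve as lower-dimensional distillation targets paired with the already-truncated `text`.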
Provider:
stephantulkens