Rubin-Wei/kNN-Targets-wikipedia-mistral

Name: Rubin-Wei/kNN-Targets-wikipedia-mistral
Creator: Rubin-Wei
Published: 2025-10-22 01:03:44
License: No description available

Source: Hugging Face. Updated: 2025-10-22. Indexed: 2025-11-15.

Download link:

https://hf-mirror.com/datasets/Rubin-Wei/kNN-Targets-wikipedia-mistral

Description:

This dataset provides k-nearest neighbor (kNN) target distributions for language modeling. Each token in the Wikipedia corpus is associated with a soft probability distribution over its top-k nearest neighbors in the representation space of a frozen language model. These targets can be used to train MLP Memory. The dataset includes five fields: query_ids (a globally unique and sequentially ordered identifier for each query token), id_cnt (the number of tokens in the kNN distribution), token_id (an array of vocabulary indices corresponding to the top-k neighbor tokens), prob (an array of probabilities associated with each token_id), and label (the ground-truth token ID that the kNN distribution is intended to predict or augment).
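As a rough illustration of how one record might be consumed, the sketch below expands the sparse (token_id, prob) pairs into a dense vector over the vocabulary and scores it against a model's log-probabilities with a soft cross-entropy. The field names come from the description above; the record values, the vocabulary size, the helper names `dense_target` and `soft_cross_entropy`, and the choice of loss are all assumptions made for illustration, not the dataset author's training procedure. It also assumes that the prob values of a record sum to 1.

```python
import math


def dense_target(record, vocab_size):
    """Scatter the sparse (token_id, prob) pairs into a dense list of length vocab_size."""
    k = record["id_cnt"]
    if len(record["token_id"]) != k or len(record["prob"]) != k:
        raise ValueError("id_cnt does not match the length of token_id/prob")
    target = [0.0] * vocab_size
    for tok, p in zip(record["token_id"], record["prob"]):
        target[tok] += p  # accumulate in case a token id appears more than once
    return target


def soft_cross_entropy(target, log_probs):
    """Cross-entropy between a soft target and model log-probabilities."""
    return -sum(t * lp for t, lp in zip(target, log_probs) if t > 0.0)


# Hypothetical record: the values are made up, only the field names follow the description.
record = {
    "query_ids": 0,
    "id_cnt": 3,
    "token_id": [5, 2, 7],
    "prob": [0.5, 0.25, 0.25],
    "label": 5,
}

vocab_size = 8  # toy vocabulary size, far smaller than a real tokenizer's
target = dense_target(record, vocab_size)

# Stand-in for model output: a uniform distribution over the toy vocabulary.
uniform_log_probs = [math.log(1.0 / vocab_size)] * vocab_size
loss = soft_cross_entropy(target, uniform_log_probs)
```

With a real tokenizer, the same scatter step would use the model's actual vocabulary size, and the uniform log-probabilities would be replaced by the output of whatever network is being trained.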

Provider:

Rubin-Wei
