
Rubin-Wei/kNN-Targets-wikipedia-mistral

Source: Hugging Face. Updated: 2025-10-22. Indexed: 2025-11-15.
Download link:
https://hf-mirror.com/datasets/Rubin-Wei/kNN-Targets-wikipedia-mistral
Description:
This dataset provides k-nearest-neighbor (kNN) target distributions for language modeling. Each token in the Wikipedia corpus is paired with a soft probability distribution over its top-k nearest neighbors in the representation space of a frozen language model. These targets can be used to train MLP Memory. Each record has five fields: query_ids (a globally unique, sequentially ordered identifier for the query token), id_cnt (the number of tokens in the kNN distribution), token_id (an array of vocabulary indices for the top-k neighbor tokens), prob (an array of probabilities, one per entry in token_id), and label (the ground-truth token ID that the kNN distribution is intended to predict or augment).
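To make the record layout concrete, here is a minimal sketch of how one record might be turned into a training signal. It is not the dataset authors' training code. It assumes that id_cnt counts the valid leading entries of token_id and prob, and that the probabilities sum to 1. The record values, the toy vocabulary size, and the helper names dense_target and soft_cross_entropy are all hypothetical, chosen only for illustration.

```python
import math


def dense_target(record, vocab_size):
    """Scatter a sparse kNN record into a dense distribution over the vocabulary."""
    n = record["id_cnt"]  # assumption: number of valid entries in token_id / prob
    dense = [0.0] * vocab_size
    for tok, p in zip(record["token_id"][:n], record["prob"][:n]):
        dense[tok] += p  # += tolerates repeated ids, should any occur
    return dense


def soft_cross_entropy(target, log_probs):
    """Cross-entropy -sum_i p_i * log q_i between a soft target and log-probabilities."""
    return -sum(p * lq for p, lq in zip(target, log_probs) if p > 0.0)


# Hypothetical record; the values are made up for illustration only.
record = {
    "query_ids": 0,
    "id_cnt": 3,
    "token_id": [5, 2, 7],
    "prob": [0.5, 0.25, 0.25],
    "label": 5,
}

VOCAB_SIZE = 10  # toy value; a real tokenizer's vocabulary is far larger
target = dense_target(record, VOCAB_SIZE)

# Stand-in for a model's output: uniform log-probabilities over the toy vocabulary.
uniform_log_probs = [math.log(1.0 / VOCAB_SIZE)] * VOCAB_SIZE
loss = soft_cross_entropy(target, uniform_log_probs)
print(target, loss)
```

In practice the same scatter-then-cross-entropy step would run on batched tensors, with the log-probabilities coming from the model being trained rather than a uniform stand-in.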
Provided by:
Rubin-Wei