stephantulkens/msmarco-vocab

Name: stephantulkens/msmarco-vocab
Creator: stephantulkens
Published: 2025-10-15 13:05:38
License: not specified

Source: Hugging Face. Updated: 2025-10-15. Indexed: 2025-10-25.

Download link:

https://hf-mirror.com/datasets/stephantulkens/msmarco-vocab


Description:

This is the vocabulary of the document part of the MS MARCO dataset. The vocabulary was normalized and pretokenized with the bert-base-uncased tokenizer. It contains 1.85 million tokens, each with its frequency and document frequency, sorted by frequency. The dataset can be used to obtain probabilities for subparts of a corpus, to define tokenizer extensions, and to analyze semantic content. The download size is 19.6 MB and the dataset size is 53.3 MB. It consists of a single split named train with 1.85 million examples. Thanks to Mixedbread AI for a GPU grant for research on small retrieval models.
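As an illustration of the "obtaining probabilities" use case, the sketch below turns per-token frequency and document-frequency counts into unigram probabilities and a smoothed inverse document frequency. The column names (`token`, `frequency`, `document_frequency`), the toy rows, and the document count are assumptions made for this example, not values taken from the dataset; check the dataset card for the actual schema.

```python
import math

# Column names are assumptions for illustration; the real schema may differ.
TOKEN_COL = "token"
FREQ_COL = "frequency"
DOC_FREQ_COL = "document_frequency"


def unigram_probabilities(rows):
    """Map each token to its frequency divided by the total frequency."""
    rows = list(rows)
    total = sum(row[FREQ_COL] for row in rows)
    return {row[TOKEN_COL]: row[FREQ_COL] / total for row in rows}


def inverse_document_frequency(rows, n_documents):
    """Smoothed IDF per token: log((1 + N) / (1 + df)) + 1."""
    return {
        row[TOKEN_COL]: math.log((1 + n_documents) / (1 + row[DOC_FREQ_COL])) + 1.0
        for row in rows
    }


# Toy rows standing in for the real data (all values are invented).
toy_rows = [
    {"token": "the", "frequency": 6, "document_frequency": 4},
    {"token": "retrieval", "frequency": 3, "document_frequency": 2},
    {"token": "msmarco", "frequency": 1, "document_frequency": 1},
]

probs = unigram_probabilities(toy_rows)
# n_documents=4 is a made-up corpus size matching the toy rows.
idf = inverse_document_frequency(toy_rows, n_documents=4)

for token in probs:
    print(f"{token}: p={probs[token]:.3f} idf={idf[token]:.3f}")
```

With the real data, the rows could presumably be obtained with `datasets.load_dataset("stephantulkens/msmarco-vocab", split="train")` from the Hugging Face `datasets` library, after adjusting the column constants to match the actual schema. The number of documents is not part of the description above and would have to be supplied separately.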

Provided by:

stephantulkens
