stephantulkens/msmarco-vocab
Source: Hugging Face. Updated 2025-10-15. Indexed 2025-10-25.
Download link:
https://hf-mirror.com/datasets/stephantulkens/msmarco-vocab
Description:
This is the vocabulary of the document portion of the MsMarco dataset. The text was normalized and pre-tokenized with the bert-base-uncased tokenizer. The vocabulary contains 1.85 million tokens, each with its frequency and document frequency, sorted by frequency. It can be used to estimate probabilities for subparts of a corpus, to define tokenizer extensions, and to analyze semantic content. The download size is 19.6 MB and the dataset size is 53.3 MB. It consists of a single split named train with 1.85 million examples (53.3 MB). Thanks to Mixedbread AI for a GPU grant for research on small retrieval models.
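As a rough illustration of the "probabilities of subparts of a corpus" use case, the sketch below turns per-token frequencies into unigram probabilities and sums log-probabilities over a token subset. It is only a sketch under stated assumptions: the column names `token`, `frequency`, and `document_frequency` are guesses based on the description, the three sample rows are invented for illustration and are not taken from the dataset, and the helper names `unigram_probabilities` and `subset_log_probability` are hypothetical.

```python
import math

# Hypothetical sample rows. The real dataset has about 1.85 million rows;
# the column names used here are assumptions, not confirmed by the listing.
rows = [
    {"token": "the", "frequency": 50, "document_frequency": 10},
    {"token": "retrieval", "frequency": 30, "document_frequency": 6},
    {"token": "model", "frequency": 20, "document_frequency": 4},
]


def unigram_probabilities(rows):
    """Map each token to its relative frequency over all rows."""
    total = sum(row["frequency"] for row in rows)
    return {row["token"]: row["frequency"] / total for row in rows}


def subset_log_probability(tokens, probs):
    """Sum of log unigram probabilities for a list of tokens.

    Assumes tokens are independent (a plain unigram model) and that
    every token is present in the vocabulary.
    """
    return sum(math.log(probs[token]) for token in tokens)


probs = unigram_probabilities(rows)
log_p = subset_log_probability(["the", "retrieval"], probs)
```

With the real data, the same functions could be applied to rows loaded from the `train` split, for example through the Hugging Face `datasets` library, after checking the actual column names.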
Provided by:
stephantulkens



