WebBase
arXiv, indexed 2025-09-30
Download link:
http://ebiquity.umbc.edu/redirect/to/resource/id/351/UMBC-webbase-corpus
Description:
This dataset is a large-scale English corpus of approximately 3 billion tokens, used to evaluate the proposed WEQ method. After preprocessing, its vocabulary contains 277,704 words. The corpus is suited to a range of tasks, including word similarity calculation, text classification, and named entity recognition.
Provided by:
UMBC



