WebBase

Name: WebBase
Creator: UMBC
License: No description available

Indexed from arXiv on 2025-09-30

Download link:

http://ebiquity.umbc.edu/redirect/to/resource/id/351/UMBC-webbase-corpus


Description:

WebBase is a large-scale English corpus of approximately 3 billion tokens. It was used to evaluate the proposed WEQ method; after preprocessing, its vocabulary contains 277,704 words. The corpus is suitable for a variety of tasks, including word similarity calculation, text classification, and named entity recognition.
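
The token count and vocabulary size quoted above depend on how the text is preprocessed, which this listing does not specify. As a rough illustration only, the sketch below counts tokens and builds a vocabulary from an iterable of text lines. The function name `corpus_stats`, the lowercase alphabetic tokenizer, and the `min_count` cutoff are all assumptions made for this example, not the procedure used to obtain the 277,704-word figure.

```python
from collections import Counter
import re


def corpus_stats(lines, min_count=1):
    """Return (total_tokens, vocabulary) for an iterable of text lines.

    Assumed preprocessing (hypothetical): lowercase the text and keep
    runs of ASCII letters as tokens.
    """
    counts = Counter()
    for line in lines:
        counts.update(re.findall(r"[a-z]+", line.lower()))
    total_tokens = sum(counts.values())
    # Keep only words that occur at least min_count times.
    vocab = {word for word, count in counts.items() if count >= min_count}
    return total_tokens, vocab


# Tiny in-memory sample standing in for the real corpus files.
sample = ["The cat sat on the mat.", "The dog sat too."]
total, vocab = corpus_stats(sample, min_count=2)
```

For the full corpus, the same function could be fed an open file object, which yields lines lazily and so avoids loading all of the text into memory at once.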

Provider:

UMBC
