WMT News 2018

Indexed from arXiv: 2025-09-30

Download link:

http://data.statmt.org/news-crawl/en/

Description:

This dataset is a subset of WMT News 2018 made up of randomly selected sentences, intended for analyzing the concepts encoded in Transformer language models. During preprocessing, words occurring fewer than 10 times were removed, and each remaining word type was capped at 10 occurrences. The result covers 25,000 word types with an average of 10 contexts each. The subset comprises 250,000 sentences (approximately 5 million tokens), and its intended task is the clustering and analysis of concepts encoded in language models.
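The frequency filter and per-type cap described above can be sketched roughly as follows. This is only an illustration, not the dataset authors' script: the function name `select_word_occurrences` is hypothetical, sentences are assumed to be pre-tokenized lists of strings, and keeping the first occurrences up to the cap is an assumption (the description does not say how the retained contexts are chosen).

```python
from collections import Counter


def select_word_occurrences(sentences, min_freq=10, max_per_type=10):
    """Return (word, sentence_index, token_index) triples that survive filtering.

    Hypothetical helper: drops word types seen fewer than `min_freq` times
    and keeps at most `max_per_type` occurrences of each remaining type.
    """
    # Count how often each word type occurs across the whole sample.
    freq = Counter(tok for sent in sentences for tok in sent)

    kept = []
    used = Counter()  # occurrences already kept per word type
    for si, sent in enumerate(sentences):
        for ti, tok in enumerate(sent):
            if freq[tok] < min_freq:
                continue  # rare word type: removed entirely
            if used[tok] >= max_per_type:
                continue  # cap reached (assumption: first occurrences win)
            used[tok] += 1
            kept.append((tok, si, ti))
    return kept


if __name__ == "__main__":
    # Toy input with small thresholds so the effect is visible.
    toy = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
    for item in select_word_occurrences(toy, min_freq=2, max_per_type=2):
        print(item)
```

With the default thresholds of 10 and 10, every retained word type ends up with exactly 10 contexts in this sketch, which is in line with the stated average of 10 contexts per word type.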

Provided by:

WMT
