CMU-SE dataset
Source: arXiv. Indexed 2025-09-30.
Download link:
https://github.com/clab/sp2016.11-731/tree/master/hw4/data
Description:
CMU-SE is a preprocessed collection of simple English sentences: 44,016 sentences with a vocabulary of 3,122 unique word types. It is intended for the task of generating coherent sentences. During preprocessing, sentences with fewer than seven words were discarded and sentences with more than seven words were truncated to seven words, so every sentence has the same length.
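The length filtering described above can be sketched as follows. This is a minimal illustration, not the dataset authors' actual script: whitespace tokenization, the function names `preprocess` and `vocabulary`, and the sample sentences are all assumptions made for the example.

```python
from collections import Counter

SENTENCE_LENGTH = 7  # fixed sentence length stated in the description


def preprocess(lines, length=SENTENCE_LENGTH):
    """Drop sentences shorter than `length` words and truncate longer ones."""
    kept = []
    for line in lines:
        tokens = line.split()  # assumption: simple whitespace tokenization
        if len(tokens) < length:
            continue  # too short: discarded
        kept.append(tokens[:length])  # longer sentences are cut to `length`
    return kept


def vocabulary(sentences):
    """Count word types across all kept sentences."""
    return Counter(token for sentence in sentences for token in sentence)


# Hypothetical sample sentences, not taken from the dataset.
sample = [
    "the cat sat on the mat",              # 6 words: discarded
    "she gave him a book about old maps",  # 8 words: truncated to 7
    "we will go to the park today",        # 7 words: kept unchanged
]

processed = preprocess(sample)
vocab = vocabulary(processed)
print(len(processed), "sentences,", len(vocab), "word types")
```

Applied to the raw files at the download link, the same two counts (number of kept sentences and number of word types) could be compared against the 44,016 and 3,122 figures quoted above, though the result depends on the tokenization actually used.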



