Dataset of sentence pairs from renowned authors in North American literature

Source: Mendeley Data (indexed 2026-04-18)

Download link:

https://data.mendeley.com/datasets/tg6pxsnxr5

Description:

This dataset contains sentence pairs drawn from 35 literary works by three renowned authors of North American literature: William Cuthbert Faulkner, Ernest Miller Hemingway, and Philip Milton Roth. It comes in three versions: (1) sentence pairs pre-processed by removing punctuation, removing alphanumeric characters, and lowercasing all words; (2) pre-processed sentence pairs with stopwords also removed; (3) pre-processed sentence pairs with stopwords removed and words lemmatized. The dataset was inspired by a Kaggle challenge on identifying duplicate sentence pairs. Each version contains 72,600 sentences per author: 72,000 reserved for training and validating the LSTM Siamese neural network used, and 600 for testing/prediction.
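The three versions described above can be pictured as cumulative pre-processing stages. The following is a minimal sketch, not the authors' actual pipeline: the stopword set and lemma table are tiny hypothetical stand-ins (the description does not say which stopword list or lemmatizer was used), all function names are invented for illustration, and the character-removal step is read here as keeping only letters, which is an assumption.

```python
import re

# Hypothetical, deliberately tiny stand-ins. The dataset description does not
# name the stopword list or lemmatizer that was actually used.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "was", "he", "it"}
LEMMAS = {"went": "go", "boats": "boat", "fished": "fish", "days": "day"}


def basic_clean(sentence: str) -> list[str]:
    """Version 1: lowercase the text, then keep only runs of ASCII letters.

    Assumption: the punctuation and character removal steps are interpreted
    as dropping everything that is not a letter (digits included).
    """
    return re.findall(r"[a-z]+", sentence.lower())


def remove_stopwords(tokens: list[str]) -> list[str]:
    """Version 2 step: drop tokens found in the stopword set."""
    return [t for t in tokens if t not in STOPWORDS]


def lemmatize(tokens: list[str]) -> list[str]:
    """Version 3 step: map each token to its lemma via a toy lookup table."""
    return [LEMMAS.get(t, t) for t in tokens]


def preprocess(sentence: str, version: int) -> str:
    """Apply the stages for dataset version 1, 2, or 3 to one sentence."""
    if version not in (1, 2, 3):
        raise ValueError("version must be 1, 2, or 3")
    tokens = basic_clean(sentence)
    if version >= 2:
        tokens = remove_stopwords(tokens)
    if version == 3:
        tokens = lemmatize(tokens)
    return " ".join(tokens)


def preprocess_pair(first: str, second: str, version: int) -> tuple[str, str]:
    """Pre-process both sentences of a pair with the same version."""
    return preprocess(first, version), preprocess(second, version)


if __name__ == "__main__":
    sample = "He went out in the boats, and fished for 84 days."
    for v in (1, 2, 3):
        print(v, preprocess(sample, v))
```

In practice the stand-in tables would be replaced by a full stopword list and a real lemmatizer from an NLP library; the staged structure is the only point of the sketch.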

Created:

2021-09-23