
Dataset of sentence pairs from renowned authors in North American literature

Source: NIAID Data Ecosystem (indexed 2026-03-12)
Download link:
https://data.mendeley.com/datasets/tg6pxsnxr5
Description:
This dataset contains pairs of sentences taken from 35 literary works by three renowned authors of North American literature: William Cuthbert Faulkner, Ernest Miller Hemingway, and Philip Milton Roth. The dataset comes in three versions: (1) pairs of pre-processed sentences, with removal of punctuation, removal of alphanumeric characters, and normalization of words to lowercase; (2) pairs of pre-processed sentences with stopwords also removed; (3) pairs of pre-processed sentences with stopwords removed and words lemmatized. The datasets were inspired by a Kaggle challenge on identifying duplicate sentence pairs. Each version contains 72,600 sentences per author: 72,000 reserved for training and validating the Siamese LSTM neural network used, and 600 for testing/prediction.
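The three versions described above can be pictured as successive preprocessing stages. The following is a minimal sketch under stated assumptions, not the authors' actual pipeline: the function names, the tiny `STOPWORDS` set, and the toy `LEMMAS` table are all hypothetical stand-ins, since the description does not say which stopword list or lemmatizer was used. The "removal of alphanumeric characters" step is left out because the description does not make its exact meaning clear.

```python
import string

# Hypothetical, deliberately tiny resources for illustration only.
# The dataset description does not name the real stopword list or lemmatizer.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "was", "were"}
LEMMAS = {"dogs": "dog", "ran": "run", "running": "run", "hills": "hill"}

# Translation table that deletes ASCII punctuation characters.
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def version1(sentence: str) -> str:
    """Approximate version 1: lowercase and strip ASCII punctuation."""
    cleaned = sentence.lower().translate(_STRIP_PUNCT)
    return " ".join(cleaned.split())  # collapse repeated whitespace


def version2(sentence: str) -> str:
    """Approximate version 2: version 1 plus stopword removal."""
    return " ".join(w for w in version1(sentence).split() if w not in STOPWORDS)


def version3(sentence: str) -> str:
    """Approximate version 3: version 2 plus a toy table-lookup lemmatizer."""
    return " ".join(LEMMAS.get(w, w) for w in version2(sentence).split())


if __name__ == "__main__":
    example = "The dogs ran, and ran!"
    for stage in (version1, version2, version3):
        print(stage.__name__, "->", stage(example))
```

In practice a full stopword list and a real lemmatizer from an NLP library would replace the two toy tables; the staging of the three versions is the only thing this sketch is meant to show.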
Created:
2021-09-23