DisSent Sentence Pair Dataset
Source: arXiv. Indexed 2025-09-30.
Download link:
https://github.com/windweller/DisExtract
Description:
This dataset contains 4,706,292 carefully filtered sentence pairs. Each pair is linked by one of 15 discourse markers, and the pairs are extracted from explicit discourse relations using dependency parsing. The dataset is split into training, validation, and test sets in a 0.9 / 0.05 / 0.05 ratio. The class distribution is imbalanced, but this split is still intended to support learning of the rarer classes. The task is to learn sentence representations from discourse markers and to predict the relation between the two sentences in each pair.
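As a rough illustration of the 0.9 / 0.05 / 0.05 split described above, the sketch below computes split sizes for the stated pair count and partitions a shuffled index range. The function names, the seed, and the choice to give the rounding remainder to the training set are assumptions made for this example; the exact split sizes and procedure in the released data may differ.

```python
import random

TOTAL_PAIRS = 4_706_292  # pair count stated in the description


def split_sizes(total, val_pct=5, test_pct=5):
    """Return (train, val, test) sizes.

    Integer arithmetic avoids float rounding; any remainder goes to the
    training set (an assumption, not the dataset's documented behavior).
    """
    n_val = total * val_pct // 100
    n_test = total * test_pct // 100
    return total - n_val - n_test, n_val, n_test


def split_indices(total, seed=0):
    """Shuffle indices 0..total-1 and cut them into train/val/test lists."""
    n_train, n_val, _ = split_sizes(total)
    indices = list(range(total))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


if __name__ == "__main__":
    print(split_sizes(TOTAL_PAIRS))
```

Because the labels are imbalanced, a stratified split per discourse marker would be a reasonable alternative; the description does not say which approach was used.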
Provided by:
Curated by the authors using dependency parsing



