Semantic Coherence Dataset - SCD

Source: NIAID Data Ecosystem (indexed 2026-03-14)
Download link:
https://data.mendeley.com/datasets/s4dtmfmzxw
Description:
Textual data are central to assessing metrics built on top of language models. The dataset contains speech transcripts arranged into two main classes: one intended for experiments on intra-subject semantic coherence, the other for experiments on inter-subject semantic coherence. The transcripts were extracted from talks totalling almost 13 hours (12:45:17 overall) for the former class and almost 30 hours (29:47:34) for the latter. The data have been used to investigate whether the perplexity metric provides reliable results in both the within-subject and the across-subject condition. Perplexity is a measure originally conceived to assess the probabilistic inference properties of language models; it has recently been shown to be an appropriate device for distinguishing speech transcripts of healthy subjects from those of subjects suffering from Alzheimer's disease. The data can be reused to analyze other measures that rely on probabilistic models and are aimed at the linguistic features of text documents.
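For readers unfamiliar with the metric, perplexity is commonly defined as the exponential of the average negative log-probability a language model assigns to each token of a text. The sketch below illustrates that standard definition only; it is not taken from the dataset or its authors, and the function name and the example probabilities are hypothetical.

```python
import math


def perplexity(token_probs):
    """Return exp of the mean negative log-probability per token.

    token_probs: probabilities (each in (0, 1]) that some language model
    assigned to the successive tokens of a transcript. Hypothetical input;
    the dataset itself ships transcripts, not probabilities.
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    if any(p <= 0.0 or p > 1.0 for p in token_probs):
        raise ValueError("probabilities must lie in (0, 1]")
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)


# A model that is equally unsure among 4 options at every step
# has a perplexity of about 4.
uniform = [0.25] * 8
print(perplexity(uniform))
```

Lower values mean the model found the text more predictable. In practice the per-token probabilities would come from a trained language model scored against each transcript.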
Created:
2022-09-16