TEDTalk Corpus (2016)
Source: arXiv. Indexed: 2025-09-30
Download link:
https://github.com/isl-mt/SemanticWordReplacement-LREC2018
Description:
The TEDTalk Corpus contains transcripts of more than 2,600 TED Talks delivered between 2007 and 2016. To support the evaluation of text understanding models, the transcripts have been modified to include out-of-context errors. The dataset also includes more than 25,000 context-aware word substitutions, which yields a challenging collection of modified text passages. Its main tasks are text understanding and error detection.
Provided by:
IWSLT



