
MultiTurnCleanup

Source: arXiv. Updated 2023-10-28. Indexed 2024-06-21.
Download link:
https://github.com/huashen218/MultiTurnCleanup.git
Description:
MultiTurnCleanup is a dataset created jointly by the University of Michigan and Google Research for the task of cleaning up multi-turn spoken conversational transcripts. It contains 143,000 records and is built on the Switchboard Corpus. The goal is to make transcripts more readable by identifying and removing discontinuities in multi-turn conversations. The data were labeled by human annotators following a detailed annotation scheme to ensure quality. The dataset is used mainly in natural language processing, in particular to improve the performance of machine translation and question answering systems.
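
To make the cleanup task concrete, here is a minimal Python sketch of how token-level removal labels might be applied to a multi-turn transcript. The data layout is an assumption for illustration only: the function name `clean_transcript`, the representation of a turn as a list of tokens, and the 0/1 flags (1 meaning "remove this token") are hypothetical and are not taken from the dataset's actual schema, which should be checked in the repository.

```python
from typing import List, Sequence


def clean_transcript(
    turns: Sequence[Sequence[str]],
    labels: Sequence[Sequence[int]],
) -> List[List[str]]:
    """Drop every token flagged for removal; drop turns left empty.

    Hypothetical layout (not the dataset's confirmed format):
    turns[i] is the token list of turn i, and labels[i][j] is 1 when
    token j of turn i is a discontinuity to remove, else 0.
    """
    if len(turns) != len(labels):
        raise ValueError("turns and labels must have the same length")

    cleaned: List[List[str]] = []
    for tokens, flags in zip(turns, labels):
        if len(tokens) != len(flags):
            raise ValueError("each turn needs one label per token")
        kept = [tok for tok, flag in zip(tokens, flags) if not flag]
        if kept:  # a turn made only of flagged tokens disappears entirely
            cleaned.append(kept)
    return cleaned


# Illustrative input written for this sketch, not real dataset content.
example_turns = [
    ["um", "i", "i", "think", "so"],
    ["yeah"],
    ["it", "was", "great"],
]
example_labels = [
    [1, 1, 0, 0, 0],
    [1],
    [0, 0, 0],
]
example_cleaned = clean_transcript(example_turns, example_labels)
```

In this sketch a whole turn can vanish when all of its tokens are flagged, which is one way cleanup that spans multiple turns can differ from cleanup applied to a single sentence.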
Provider:
University of Michigan
Created:
2023-05-20