MultiTurnCleanup
Source: arXiv | Updated: 2023-10-28 | Indexed: 2024-06-21
Download link:
https://github.com/huashen218/MultiTurnCleanup.git
Description:
The MultiTurnCleanup dataset was created jointly by the University of Michigan and Google Research for the task of cleaning up multi-turn spoken dialogue transcripts. It contains 143,000 records built on the Switchboard Corpus, and it aims to make transcripts more readable by identifying and removing discontinuity phenomena that span multiple turns of a conversation. The dataset was produced with a detailed annotation scheme and human annotation to ensure data quality. It is used mainly in natural language processing, in particular to improve the performance of machine translation and question answering systems.
Provider:
University of Michigan
Created:
2023-05-20