RadioTalk
Source: arXiv | Updated 2019-07-16 | Indexed 2024-06-21
Download link:
https://github.com/social-machines/RadioTalk
Resource overview:
RadioTalk is a large-scale corpus created by the Social Machines Lab at the MIT Media Lab. It contains automatic speech recognition (ASR) transcripts of American talk radio broadcasts aired between October 2018 and March 2019. The dataset covers approximately 284,000 hours of broadcast content, or roughly 2.8 billion words, and is suitable for natural language processing (NLP), conversational analysis, and social science research. Its metadata includes geographic location, speaker turn boundaries, gender, and radio program information. The corpus was built in three stages: audio ingestion, transcription, and post-processing. It is intended to capture and support analysis of media consumption across different age groups and among populations with lower rates of social media use.
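As a rough sanity check on the two totals quoted above, the implied average transcript density can be worked out directly. This is a minimal sketch: the constant and function names are illustrative, and because both totals are approximations, the result is only indicative.

```python
# Approximate totals quoted in the overview above.
TOTAL_HOURS = 284_000          # hours of broadcast content
TOTAL_WORDS = 2_800_000_000    # transcribed words


def words_per_hour(total_words: int, total_hours: int) -> float:
    """Average number of transcribed words per hour of audio."""
    return total_words / total_hours


rate_per_hour = words_per_hour(TOTAL_WORDS, TOTAL_HOURS)
rate_per_minute = rate_per_hour / 60

print(f"{rate_per_hour:,.0f} words/hour, {rate_per_minute:.0f} words/minute")
# prints "9,859 words/hour, 164 words/minute"
```

The average is taken over all ingested audio, so it reflects the corpus as a whole rather than the speaking rate of any individual speaker or program.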
Provided by:
Social Machines Lab, MIT Media Lab
Created:
2019-07-16



