Corpus of Australian and New Zealand Spoken English

Source: DataONE. Updated 2025-05-21; indexed 2025-11-15.

Download link:

https://search.dataone.org/view/sha256:f469e47e0017def82841dd1682e6eeb7a65186edf309711e648cae7e7de96406

Description:

The Corpus of Australian and New Zealand Spoken English (CoANZSE) is a 196-million-word corpus of geolocated automatic speech recognition (ASR) YouTube transcripts from local government channels in Australia and New Zealand. It was created for the study of lexical, grammatical, and discourse-pragmatic phenomena of spoken language, as well as for content and language analysis in the digital humanities and social sciences. Annotation includes individual word timings and the video IDs of transcripts, so the video(s) behind any search result can be viewed directly. The corpus was built from 55,896 ASR transcripts from 472 YouTube channels, corresponding to approximately 24,007 hours of video, and contains 195,583,873 tokens. The sampled channels are associated with local government entities such as local, city, county, district, and regional councils, and the transcripts cover a range of video types; recordings of public meetings are well represented. Related resources are the Corpus of North American Spoken English and the Corpus of British Isles Spoken English. A searchable online version of this data is available at coanzse.org. The resource also includes audio and forced alignments.
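
Because each token carries a word timing and a video ID, a search hit can in principle be turned into a link that opens the source video at the matching moment. The following is a minimal sketch of that idea, not the corpus's own tooling: the `youtube_link` helper, the record layout, and the sample video ID are hypothetical, and the actual file format of the corpus may differ. It relies only on YouTube's standard `watch?v=...&t=...s` URL form.

```python
from urllib.parse import urlencode


def youtube_link(video_id: str, start_seconds: float) -> str:
    """Build a YouTube URL that starts playback at the given offset."""
    # The `t` parameter takes whole seconds, so drop the fractional part.
    query = urlencode({"v": video_id, "t": f"{int(start_seconds)}s"})
    return f"https://www.youtube.com/watch?{query}"


# Hypothetical token record: (word, start time in seconds, video ID).
hit = ("council", 754.32, "abc123XYZ_0")
print(youtube_link(hit[2], hit[1]))
```

Truncating to whole seconds starts playback slightly before the word rather than after it, which is usually what a reader checking a transcript against the audio wants.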

Created:

2025-10-29
