
SouthernCrossAI/CoANZSE_Corpus_of_Australian_and_New_Zealand_Spoken_English

Source: Hugging Face. Updated 2024-08-16. Indexed 2025-04-12.
Download link:
https://hf-mirror.com/datasets/SouthernCrossAI/CoANZSE_Corpus_of_Australian_and_New_Zealand_Spoken_English
Resource description:
---
license: mit
language:
- en
tags:
- australia
- new-zealand
- corpus
- english
size_categories:
- 10K<n<100K
---

# Corpus of Australian and New Zealand Spoken English (CoANZSE)

## Announcement

The dataset has limited access and requires access/download permission from [Harvard Dataverse - Corpus of Australian and New Zealand Spoken English](https://doi.org/10.7910/DVN/GW35AK). Please acknowledge that the dataset owners are [Steven Coats](https://cc.oulu.fi/~scoats/) and [Jeremy Yuenger](https://www.iq.harvard.edu/people/jeremy-yuenger).

This repository is only used for study and research purposes for [Southern Cross AI](https://github.com/southern-cross-ai). Any commercial use is not permitted by the dataset owner. Any distribution of this repository is not recommended. For more information, please read **License and Terms of Use**, or visit [Harvard Dataverse - Community Norms](https://dataverse.org/best-practices/dataverse-community-norms) and the [original dataset page](https://doi.org/10.7910/DVN/GW35AK).

## Overview

**Subjects**: Arts and Humanities; Computer and Information Science; Social Sciences; Other

**Keywords**: Corpus Linguistics; Dialectology; Spoken Language; Speech Transcripts; Australia; New Zealand

The [Corpus of Australian and New Zealand Spoken English (CoANZSE)](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi%3A10.7910%2FDVN%2FGW35AK) is a **196-million-word corpus** of **geolocated automatic speech recognition (ASR) YouTube transcripts** from **local government channels** in **Australia** and **New Zealand**, created for the study of lexical, grammatical, and discourse-pragmatic phenomena of spoken language, as well as for content and language analysis in digital humanities and social science fields. Annotation includes **individual word timings** and **video IDs** of transcripts, making it easy to instantly view the video(s) for any given search.

The corpus was created from **55,896 ASR transcripts** from **472 YouTube channels**, corresponding to almost **24,007 hours of video**. The size of the corpus is **195,583,873 tokens**. The channels sampled in the corpus are associated with **local government entities** such as **local, city, county, district, and regional councils**, and transcripts are from a range of video types. Recordings of public meetings are well-represented.

Related resources are the [Corpus of North American Spoken English](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/X8QJJV) and the [Corpus of British Isles Spoken English](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/UGIIWD).

## Data Source

The current dataset is cleaned by [Xinyu Mao](https://huggingface.co/XinyuMao) and [Yifan Luo](https://huggingface.co/yifan-luo). The dataset can also be found on [GitHub](https://github.com/southern-cross-ai/CoANZSE).
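
The Overview notes that word timings and video IDs make it possible to jump straight to the video for a search hit. The sketch below shows one way that lookup could work. It assumes a hit carries a video ID and a word start time in seconds; the `hit` record, its field names, and the `youtube_link` helper are hypothetical and are not taken from the corpus files, whose actual layout is not described here.

```python
from urllib.parse import urlencode


def youtube_link(video_id: str, start_seconds: float) -> str:
    """Build a YouTube watch URL that starts playback at a given offset."""
    # Use whole seconds, rounded down and clamped at zero, so the word is not skipped.
    offset = max(0, int(start_seconds))
    query = urlencode({"v": video_id, "t": f"{offset}s"})
    return f"https://www.youtube.com/watch?{query}"


# Hypothetical search hit: field names and values are illustrative assumptions.
hit = {"video_id": "abc123XYZ_0", "word": "council", "start": 754.92}
print(youtube_link(hit["video_id"], hit["start"]))
```

Rounding the offset down rather than to the nearest second keeps playback from starting after the matched word has already been spoken.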
Provider:
SouthernCrossAI