SouthernCrossAI/CoANZSE_Corpus_of_Australian_and_New_Zealand_Spoken_English
Source: Hugging Face. Updated: 2024-08-16. Indexed: 2025-04-12.
Download link:
https://hf-mirror.com/datasets/SouthernCrossAI/CoANZSE_Corpus_of_Australian_and_New_Zealand_Spoken_English
Resource description:
---
license: mit
language:
- en
tags:
- australia
- new-zealand
- corpus
- english
size_categories:
- 10K<n<100K
---
# Corpus of Australian and New Zealand Spoken English (CoANZSE)
## Announcement
Access to this dataset is restricted: viewing or downloading it requires permission from [Harvard Dataverse - Corpus of Australian and New Zealand Spoken English](https://doi.org/10.7910/DVN/GW35AK).
Please acknowledge that the dataset owners are [Steven Coats](https://cc.oulu.fi/~scoats/) and [Jeremy Yuenger](https://www.iq.harvard.edu/people/jeremy-yuenger). This repository is used only for study and research purposes at [Southern Cross AI](https://github.com/southern-cross-ai).
The dataset owners do not permit commercial use, and redistribution of this repository is not recommended. For more information, please read **License and Terms of Use**, or visit [Harvard Dataverse - Community Norms](https://dataverse.org/best-practices/dataverse-community-norms) and the [original dataset page](https://doi.org/10.7910/DVN/GW35AK).
## Overview
**Subjects**: Arts and Humanities; Computer and Information Science; Social Sciences; Other
**Keywords**: Corpus Linguistics; Dialectology; Spoken Language; Speech Transcripts; Australia; New Zealand
The [Corpus of Australian and New Zealand Spoken English (CoANZSE)](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi%3A10.7910%2FDVN%2FGW35AK) is a **196-million-word corpus** of **geolocated automatic speech recognition (ASR) YouTube transcripts** from **local government channels** in **Australia** and **New Zealand**, created for the study of lexical, grammatical, and discourse-pragmatic phenomena of spoken language, as well as for content and language analysis in digital humanities and social science fields.
Annotation includes **individual word timings** and the **video IDs** of transcripts, making it easy to jump straight to the source video(s) for any given search. The corpus was created from **55,896 ASR transcripts** from **472 YouTube channels**, corresponding to approximately **24,007 hours of video**. The size of the corpus is **195,583,873 tokens**. The channels sampled in the corpus are associated with **local government entities** such as **local, city, county, district, and regional councils**, and the transcripts come from a range of video types, with recordings of public meetings well represented. Related resources are the [Corpus of North American Spoken English](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/X8QJJV) and the [Corpus of British Isles Spoken English](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/UGIIWD).
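As a rough illustration of how word timings and video IDs can be combined, the sketch below builds a YouTube link that starts playback at the moment a matched word is spoken. The function name `youtube_url_at`, the record keys (`video_id`, `word`, `start`), and the sample values are assumptions made for this example; they are not the corpus's documented schema, so adapt them to the actual column names in the files you receive.

```python
from urllib.parse import urlencode


def youtube_url_at(video_id: str, start_seconds: float) -> str:
    """Build a YouTube watch URL that starts playback at a given offset."""
    if start_seconds < 0:
        raise ValueError("start_seconds must be non-negative")
    # The `t` query parameter takes whole seconds, so truncate the timing.
    query = urlencode({"v": video_id, "t": f"{int(start_seconds)}s"})
    return f"https://www.youtube.com/watch?{query}"


# Hypothetical search hit: a made-up video ID and the start time (in seconds)
# of the matched word. Real records may use different field names.
hit = {"video_id": "abc123XYZ_0", "word": "council", "start": 754.62}
print(youtube_url_at(hit["video_id"], hit["start"]))
# prints https://www.youtube.com/watch?v=abc123XYZ_0&t=754s
```

Truncating to whole seconds starts playback slightly before the word rather than after it, which is usually the more useful behaviour when checking a transcript against the audio.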
## Data Source
The current version of the dataset was cleaned by [Xinyu Mao](https://huggingface.co/XinyuMao) and [Yifan Luo](https://huggingface.co/yifan-luo). It can also be found on [GitHub](https://github.com/southern-cross-ai/CoANZSE).
Provider:
SouthernCrossAI



