SouthernCrossAI/CoANZSE_Corpus_of_Australian_and_New_Zealand_Spoken_English

Name: SouthernCrossAI/CoANZSE_Corpus_of_Australian_and_New_Zealand_Spoken_English
Creator: SouthernCrossAI
Published: 2024-08-16 04:02:23
License: No description available

Source: Hugging Face. Updated 2024-08-16; indexed 2025-04-12.

Download link:

https://hf-mirror.com/datasets/SouthernCrossAI/CoANZSE_Corpus_of_Australian_and_New_Zealand_Spoken_English

Resource description:

---
license: mit
language:
- en
tags:
- australia
- new-zealand
- corpus
- english
size_categories:
- 10K<n<100K
---

# Corpus of Australian and New Zealand Spoken English (CoANZSE)

## Announcement

The dataset has limited access and requires access/download permission from [Harvard Dataverse - Corpus of Australian and New Zealand Spoken English](https://doi.org/10.7910/DVN/GW35AK). Please acknowledge that the dataset owners are [Steven Coats](https://cc.oulu.fi/~scoats/) and [Jeremy Yuenger](https://www.iq.harvard.edu/people/jeremy-yuenger).

This repository is only used for study and research purposes for [Southern Cross AI](https://github.com/southern-cross-ai). Any commercial use is not permitted by the dataset owner. Any distribution of this repository is not recommended.

For more information, please read **License and Terms of Use**, or visit [Harvard Dataverse - Community Norms](https://dataverse.org/best-practices/dataverse-community-norms) and the [original dataset page](https://doi.org/10.7910/DVN/GW35AK).

## Overview

**Subjects**: Arts and Humanities; Computer and Information Science; Social Sciences; Other

**Keywords**: Corpus Linguistics; Dialectology; Spoken Language; Speech Transcripts; Australia; New Zealand

The [Corpus of Australian and New Zealand Spoken English (CoANZSE)](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi%3A10.7910%2FDVN%2FGW35AK) is a **196-million-word corpus** of **geolocated automatic speech recognition (ASR) YouTube transcripts** from **local government channels** in **Australia** and **New Zealand**, created for the study of lexical, grammatical, and discourse-pragmatic phenomena of spoken language, as well as for content and language analysis in digital humanities and social science fields. Annotation includes **individual word timings** and **video IDs** of transcripts, making it easy to instantly view the video(s) for any given search.
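As a rough illustration of how a word timing and a video ID could be combined to jump to the matching moment in a video, here is a minimal Python sketch. The function name `youtube_url_at`, the placeholder video ID, and the assumption that word timings are expressed in seconds are all hypothetical; this card does not document the corpus's actual field names or timing units.

```python
from urllib.parse import urlencode


def youtube_url_at(video_id: str, start_seconds: float) -> str:
    """Build a YouTube watch URL that starts playback at the given offset.

    Assumes (hypothetically) that the word timing is given in seconds.
    """
    # The `t` parameter takes whole seconds, so truncate and clamp at zero.
    offset = max(0, int(start_seconds))
    query = urlencode({"v": video_id, "t": f"{offset}s"})
    return f"https://www.youtube.com/watch?{query}"


if __name__ == "__main__":
    # Placeholder values for illustration only, not taken from the corpus.
    print(youtube_url_at("VIDEO_ID", 83.4))
```

Truncating to whole seconds starts playback slightly before the word rather than after it, which is usually what is wanted when checking a search hit against the audio.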
The corpus was created from **55,896 ASR transcripts** from **472 YouTube channels**, corresponding to almost **24,007 hours of video**. The size of the corpus is **195,583,873 tokens**. The channels sampled in the corpus are associated with **local government entities** such as **local, city, county, district, and regional councils**, and transcripts are from a range of video types. Recordings of public meetings are well-represented.

Related resources are the [Corpus of North American Spoken English](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/X8QJJV) and the [Corpus of British Isles Spoken English](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/UGIIWD).

## Data Source

The current dataset is cleaned by [Xinyu Mao](https://huggingface.co/XinyuMao) and [Yifan Luo](https://huggingface.co/yifan-luo). The dataset can also be found on [GitHub](https://github.com/southern-cross-ai/CoANZSE).

Provided by:

SouthernCrossAI
