AlienKevin/sbs_cantonese

Name: AlienKevin/sbs_cantonese
Creator: AlienKevin
Published: 2023-10-15 21:57:53
License: not specified on this listing (the dataset card states cc-by-nc-4.0)

Source: Hugging Face. Updated 2023-10-15; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/AlienKevin/sbs_cantonese

Description:

---
license: cc-by-nc-4.0
language:
- yue
pretty_name: SBS Cantonese Speech Corpus
size_categories:
- 100K<n<1M
---

# SBS Cantonese Speech Corpus

This speech corpus contains **435 hours** of [SBS Cantonese](https://www.sbs.com.au/language/chinese/zh-hant/podcast/sbs-cantonese) podcasts from August 2022 to October 2023. There are **2,519 episodes**, and each episode is split into segments that are at most 10 seconds long. In total, there are **189,216 segments** in this corpus.

Here is a breakdown of the episode categories present in this dataset:

<style>
table th:first-of-type { width: 5%; }
table th:nth-of-type(2) { width: 15%; }
table th:nth-of-type(3) { width: 50%; }
</style>

| Category      | SBS Channels       | Episodes |
|---------------|--------------------|----------|
| news          | 中文新聞, 新聞簡報 | 622      |
| business      | 寰宇金融           | 148      |
| vaccine       | 疫苗快報           | 71       |
| gardening     | 園藝趣談           | 58       |
| tech          | 科技世界           | 56       |
| health        | 健康快樂人         | 53       |
| culture       | 文化360            | 49       |
| english       | 學英語             | 41       |
| expert        | 專家話你知         | 37       |
| interview     | 我不是名人         | 20       |
| career        | 澳洲招職           | 18       |
| food          | 美食速遞           | 18       |
| uncategorized | n/a                | 1328     |

* Uncategorized episodes are mostly news but also contain episodes from the other categories listed above.

## Dataset Details

### Dataset Description

- **Curated by:** Kevin Li
- **Language(s):** Cantonese, English (only in podcasts categorized as "english")
- **License:** Creative Commons Attribution Non-Commercial 4.0

### Scraper

- **Repository:** https://github.com/AlienKevin/sbs_cantonese

## Uses

Each episode is split into segments using [silero-vad](https://github.com/snakers4/silero-vad). Since silero-vad is not trained on Cantonese data, the segmentation is not ideal and often breaks sentences in the middle. Hence, this dataset is not intended to be used for supervised ASR. Instead, it is intended to be used for self-supervised speech pretraining, such as training WavLM, HuBERT, and Wav2Vec.

### Format

Each segment is stored as a mono FLAC file with a sample rate of 16 kHz. You can find the segments under the `audio/` folder, where groups of segments are bundled into .tar.gz files for ease of distribution.

The filename of a segment shows which episode it belongs to and its position within that episode. For example, here's a filename:

```
0061gy0w8_0000_5664_81376
```

where

* `0061gy0w8` is the episode id
* `0000` means that it is the first segment of that episode
* `5664` is the starting sample of this segment. Remember that all episodes are sampled at 16 kHz, so the total number of samples in an episode is (the duration in seconds * 16,000).
* `81376` is the ending (exclusive) sample of this segment.

### Metadata

Metadata for each episode is stored in the `metadata.jsonl` file, where each line stores the metadata for one episode.

Here's the metadata for one of the episodes (split into multiple lines for clarity):

```json
{
  "title": "SBS 中文新聞（7月5日）",
  "date": "05/07/2023",
  "view_more_link": "https://www.sbs.com.au/language/chinese/zh-hant/podcast-episode/chinese-news-5-7-2023/tl6s68rdk",
  "download_link": "https://sbs-podcast.streamguys1.com/sbs-cantonese/20230705105920-cantonese-0288b7c2-cb6d-4e0e-aec2-2680dd8738e0.mp3?awCollectionId=sbs-cantonese&awGenre=News&awEpisodeId=20230705105920-cantonese-0288b7c2-cb6d-4e0e-aec2-2680dd8738e0"
}
```

where

* `title` is the title of the episode
* `date` is the date when the episode was published
* `view_more_link` is a link to the associated article/description for this episode. Many news episodes have extremely detailed manuscripts written in Traditional Chinese, while others have briefer summaries or key points available.
* `download_link` is the link to download the audio for this episode. It is usually hosted on [streamguys](https://www.streamguys.com/), but some earlier episodes are stored on SBS's own server at https://images.sbs.com.au.

The id of each episode appears at the end of its `view_more_link`. It appears to be a precomputed hash that is unique to each episode.

```python
# view_more_link is the "view_more_link" field of one metadata entry
episode_id = view_more_link.split("/")[-1]
```
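The Format section says segments are bundled into .tar.gz files under `audio/`. As a minimal sketch of inspecting one bundle without unpacking it, the helper below lists the FLAC members of an archive using Python's standard `tarfile` module. The function name `list_flac_segments` is hypothetical, and the `.flac` extension on member names is an assumption: the card states the segments are FLAC files but shows the example filename without an extension.

```python
import tarfile
from pathlib import PurePosixPath


def list_flac_segments(archive_path):
    """Return the names of FLAC members in a .tar.gz bundle without extracting them."""
    with tarfile.open(archive_path, mode="r:gz") as archive:
        return [
            member.name
            for member in archive.getmembers()
            # Assumption: segment files inside the bundle end in ".flac".
            if member.isfile() and PurePosixPath(member.name).suffix == ".flac"
        ]
```

Calling `list_flac_segments` on a bundle downloaded from `audio/` would return the member names, which can then be matched against the filename convention described above.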

Provider:

AlienKevin

Summary of Original Information

SBS Cantonese Speech Corpus

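Dataset Overview

Duration: 435 hours
Time span: August 2022 to October 2023
Source: SBS Cantonese podcasts
Episodes: 2,519
Segments: each episode is split into segments of at most 10 seconds, for 189,216 segments in total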

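Category Breakdown

Category	SBS Channels	Episodes
news	中文新聞, 新聞簡報	622
business	寰宇金融	148
vaccine	疫苗快報	71
gardening	園藝趣談	58
tech	科技世界	56
health	健康快樂人	53
culture	文化360	49
english	學英語	41
expert	專家話你知	37
interview	我不是名人	20
career	澳洲招職	18
food	美食速遞	18
uncategorized	n/a	1328

Uncategorized episodes are mostly news but also include episodes from the other categories listed above.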

Dataset Details

Dataset Description

Curated by: Kevin Li
Languages: Cantonese, English (only in podcasts categorized as "english")
License: Creative Commons Attribution Non-Commercial 4.0

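Uses

Segmentation tool: Episodes are segmented with silero-vad. Because silero-vad is not trained on Cantonese data, the segmentation is not ideal and often breaks sentences in the middle.
Intended use: Not intended for supervised automatic speech recognition (ASR); better suited to self-supervised speech pretraining, such as training WavLM, HuBERT, and Wav2Vec.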

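Format

Audio format: Each segment is stored as a mono FLAC file with a sample rate of 16 kHz.
File naming: The filename shows which episode a segment belongs to and its position within that episode. For example, in 0061gy0w8_0000_5664_81376, 0061gy0w8 is the episode ID, 0000 marks the first segment of that episode, 5664 is the segment's starting sample, and 81376 is its ending sample (exclusive).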
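The naming convention above can be decoded mechanically. The following is a minimal sketch, not code from the dataset's repository: `Segment` and `parse_segment_name` are hypothetical names, and stripping a trailing `.flac` is an assumption about how files are named on disk. The duration follows from the stated 16 kHz sample rate and the exclusive end sample.

```python
from typing import NamedTuple

SAMPLE_RATE = 16_000  # Hz, the sample rate stated for every segment


class Segment(NamedTuple):
    episode_id: str
    index: int          # zero-based position of the segment within its episode
    start_sample: int
    end_sample: int     # exclusive

    @property
    def duration_seconds(self) -> float:
        return (self.end_sample - self.start_sample) / SAMPLE_RATE


def parse_segment_name(name: str) -> Segment:
    # Assumption: files on disk may carry a ".flac" extension; drop it if present.
    if name.endswith(".flac"):
        name = name[: -len(".flac")]
    # Split from the right so only the last three fields are treated as numbers.
    episode_id, index, start, end = name.rsplit("_", 3)
    return Segment(episode_id, int(index), int(start), int(end))


segment = parse_segment_name("0061gy0w8_0000_5664_81376")
print(segment.episode_id, segment.index, segment.duration_seconds)
```

For the example filename, the segment spans 81376 - 5664 = 75712 samples, or about 4.73 seconds, which is consistent with the 10-second cap on segment length.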

Metadata

Metadata file: metadata.jsonl, where each line stores the metadata for one episode.
Metadata example:

    {
      "title": "SBS 中文新聞（7月5日）",
      "date": "05/07/2023",
      "view_more_link": "https://www.sbs.com.au/language/chinese/zh-hant/podcast-episode/chinese-news-5-7-2023/tl6s68rdk",
      "download_link": "https://sbs-podcast.streamguys1.com/sbs-cantonese/20230705105920-cantonese-0288b7c2-cb6d-4e0e-aec2-2680dd8738e0.mp3?awCollectionId=sbs-cantonese&awGenre=News&awEpisodeId=20230705105920-cantonese-0288b7c2-cb6d-4e0e-aec2-2680dd8738e0"
    }

- title: the episode title
- date: the publication date
- view_more_link: a link to the associated article/description
- download_link: the audio download link
Episode ID: The episode ID appears at the end of view_more_link. It appears to be a precomputed hash that is unique to each episode and can be extracted in Python with view_more_link.split("/")[-1].
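To connect segments back to their episodes, the metadata can be indexed by episode ID. This is a minimal sketch under stated assumptions: `episode_id_from_link` and `load_episodes` are hypothetical helper names, and the `rstrip("/")` is a defensive addition that the original one-liner does not include.

```python
import json


def episode_id_from_link(view_more_link: str) -> str:
    # The ID is the last path component of the link.
    # rstrip("/") is a defensive addition in case a link has a trailing slash.
    return view_more_link.rstrip("/").split("/")[-1]


def load_episodes(lines):
    """Map episode ID -> metadata dict, given an iterable of JSONL lines."""
    episodes = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        episodes[episode_id_from_link(record["view_more_link"])] = record
    return episodes
```

A typical call would pass an open file, for example `with open("metadata.jsonl", encoding="utf-8") as f: episodes = load_episodes(f)`. For the example entry shown above, the resulting key is `tl6s68rdk`.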

The content above was collected and summarized by 遇见数据集.
