
AlienKevin/sbs_cantonese

Hugging Face · Updated 2023-10-15 · Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/AlienKevin/sbs_cantonese
Resource description:
---
license: cc-by-nc-4.0
language:
- yue
pretty_name: SBS Cantonese Speech Corpus
size_categories:
- 100K<n<1M
---

# SBS Cantonese Speech Corpus

This speech corpus contains **435 hours** of [SBS Cantonese](https://www.sbs.com.au/language/chinese/zh-hant/podcast/sbs-cantonese) podcasts from August 2022 to October 2023. There are **2,519 episodes**, and each episode is split into segments that are at most 10 seconds long. In total, there are **189,216 segments** in this corpus.

Here is a breakdown of the episode categories in this dataset:

<style>
table th:first-of-type { width: 5%; }
table th:nth-of-type(2) { width: 15%; }
table th:nth-of-type(3) { width: 50%; }
</style>

| Category          | SBS Channels         | Episodes |
|-------------------|----------------------|-------|
| news              | 中文新聞, 新聞簡報   | 622   |
| business          | 寰宇金融             | 148   |
| vaccine           | 疫苗快報             | 71    |
| gardening         | 園藝趣談             | 58    |
| tech              | 科技世界             | 56    |
| health            | 健康快樂人           | 53    |
| culture           | 文化360              | 49    |
| english           | 學英語               | 41    |
| expert            | 專家話你知           | 37    |
| interview         | 我不是名人           | 20    |
| career            | 澳洲招職             | 18    |
| food              | 美食速遞             | 18    |
| uncategorized     | n/a                  | 1328  |

* Uncategorized episodes are mostly news but also include episodes from the other categories listed above.

## Dataset Details

### Dataset Description

- **Curated by:** Kevin Li
- **Language(s):** Cantonese, English (only in podcasts categorized as "english")
- **License:** Creative Commons Attribution Non-Commercial 4.0

### Scraper

- **Repository:** https://github.com/AlienKevin/sbs_cantonese

## Uses

Each episode is split into segments using [silero-vad](https://github.com/snakers4/silero-vad). Since silero-vad is not trained on Cantonese data, the segmentation is not ideal and often breaks sentences in the middle. Hence, this dataset is not intended to be used for supervised ASR.
Instead, it is intended for self-supervised speech pretraining, such as training WavLM, HuBERT, and Wav2Vec models.

### Format

Each segment is stored as a mono (single-channel) FLAC file with a sample rate of 16 kHz. You can find the segments under the `audio/` folder, where groups of segments are bundled into .tar.gz files for ease of distribution.

The filename of a segment shows which episode it belongs to and its position within that episode. For example, here's a filename:

```
0061gy0w8_0000_5664_81376
```

where

* `0061gy0w8` is the episode id
* `0000` means that it is the first segment of that episode
* `5664` is the starting sample of this segment. Remember that all episodes are sampled at 16 kHz, so the total number of samples in an episode is (the duration in seconds * 16,000).
* `81376` is the ending (exclusive) sample of this segment.

### Metadata

Metadata for each episode is stored in the `metadata.jsonl` file, where each line stores the metadata for one episode. Here's the metadata for one of the episodes (split into multiple lines for clarity):

```json
{
  "title": "SBS 中文新聞 (7月5日)",
  "date": "05/07/2023",
  "view_more_link": "https://www.sbs.com.au/language/chinese/zh-hant/podcast-episode/chinese-news-5-7-2023/tl6s68rdk",
  "download_link": "https://sbs-podcast.streamguys1.com/sbs-cantonese/20230705105920-cantonese-0288b7c2-cb6d-4e0e-aec2-2680dd8738e0.mp3?awCollectionId=sbs-cantonese&awGenre=News&awEpisodeId=20230705105920-cantonese-0288b7c2-cb6d-4e0e-aec2-2680dd8738e0"
}
```

where

* `title` is the title of the episode
* `date` is the date when the episode was published
* `view_more_link` is a link to the associated article/description for this episode. Many news episodes have extremely detailed manuscripts written in Traditional Chinese, while others have briefer summaries or key points available.
* `download_link` is the link to download the audio for this episode.
  It is usually hosted on [streamguys](https://www.streamguys.com/), but some earlier episodes are stored on SBS's own server at https://images.sbs.com.au.

The id of each episode appears at the end of its `view_more_link`. It appears to be a precomputed hash that is unique to each episode.

```python
# Take the last path component of the link as the episode id
episode_id = view_more_link.split("/")[-1]
```
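As a minimal sketch of how the metadata could be indexed by episode id, the following reads JSON lines and keys each record by the last path component of its `view_more_link`. The helper names `episode_id_from_link` and `index_metadata` and the added `parsed_date` field are hypothetical, and the `DD/MM/YYYY` date format is an assumption inferred from the single example above (title "7月5日", date "05/07/2023").

```python
import json
from datetime import datetime
from urllib.parse import urlparse


def episode_id_from_link(view_more_link: str) -> str:
    """Return the last path component, ignoring a trailing slash or query string."""
    path = urlparse(view_more_link).path
    return path.rstrip("/").split("/")[-1]


def index_metadata(lines):
    """Map episode id -> metadata dict, given an iterable of JSON lines."""
    index = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        # Assumption: dates are DD/MM/YYYY, as in the example record.
        record["parsed_date"] = datetime.strptime(record["date"], "%d/%m/%Y").date()
        index[episode_id_from_link(record["view_more_link"])] = record
    return index


# One record copied from the example above (download_link omitted for brevity).
sample = json.dumps({
    "title": "SBS 中文新聞 (7月5日)",
    "date": "05/07/2023",
    "view_more_link": "https://www.sbs.com.au/language/chinese/zh-hant/podcast-episode/chinese-news-5-7-2023/tl6s68rdk",
})
index = index_metadata([sample])
print(index["tl6s68rdk"]["parsed_date"])  # 2023-07-05
```

To index the real file, an open file handle can be passed directly, for example `index_metadata(open("metadata.jsonl", encoding="utf-8"))`, since iterating a text file yields one line at a time.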
Provider:
AlienKevin
Summary of original information

SBS Cantonese Speech Corpus

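Dataset overview

  • Duration: 435 hours
  • Period: August 2022 to October 2023
  • Source: SBS Cantonese podcasts
  • Episodes: 2,519
  • Segments: each episode is split into segments of at most 10 seconds, for 189,216 segments in total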

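Category statistics

| Category      | SBS Channels       | Episodes |
|---------------|--------------------|----------|
| news          | 中文新聞, 新聞簡報 | 622      |
| business      | 寰宇金融           | 148      |
| vaccine       | 疫苗快報           | 71       |
| gardening     | 園藝趣談           | 58       |
| tech          | 科技世界           | 56       |
| health        | 健康快樂人         | 53       |
| culture       | 文化360            | 49       |
| english       | 學英語             | 41       |
| expert        | 專家話你知         | 37       |
| interview     | 我不是名人         | 20       |
| career        | 澳洲招職           | 18       |
| food          | 美食速遞           | 18       |
| uncategorized | n/a                | 1328     |

  • Uncategorized episodes are mostly news but also include episodes from the other categories listed above.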
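As a quick consistency check on the figures above, the per-category counts can be summed and compared with the stated episode total, and the stated duration divided by the segment count gives a rough mean segment length. This is only a sketch: the dictionary keys are the English category labels from the dataset card, and the mean is approximate because the 435-hour figure is rounded.

```python
# Episode counts per category, copied from the table above.
episodes_per_category = {
    "news": 622, "business": 148, "vaccine": 71, "gardening": 58,
    "tech": 56, "health": 53, "culture": 49, "english": 41,
    "expert": 37, "interview": 20, "career": 18, "food": 18,
    "uncategorized": 1328,
}

total_episodes = sum(episodes_per_category.values())
print(total_episodes)  # 2519, matching the stated episode count

total_hours = 435          # stated corpus duration (rounded)
total_segments = 189_216   # stated segment count
mean_segment_seconds = total_hours * 3600 / total_segments
print(round(mean_segment_seconds, 2))  # 8.28, below the 10-second cap
```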

Dataset details

Dataset description

  • Curated by: Kevin Li
  • Languages: Cantonese, English (only in podcasts categorized as "english")
  • License: Creative Commons Attribution Non-Commercial 4.0

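Uses

  • Segmentation tool: episodes are segmented with silero-vad. Because the tool is not trained on Cantonese data, the segmentation is not ideal and often breaks sentences in the middle.
  • Intended use: not recommended for supervised automatic speech recognition (ASR); better suited to self-supervised speech pretraining, such as training WavLM, HuBERT, and Wav2Vec.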

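Format

  • Audio format: each segment is stored as a mono FLAC file with a sample rate of 16 kHz.
  • File naming: the filename shows which episode a segment belongs to and its position within that episode. For example, in 0061gy0w8_0000_5664_81376, 0061gy0w8 is the episode ID, 0000 marks the first segment of that episode, 5664 is the segment's starting sample, and 81376 is its ending sample (exclusive).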
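The naming scheme above can be decoded mechanically. The following is a minimal sketch, not the author's tooling: `Segment` and `parse_segment_name` are hypothetical names, and the handling of a `.flac` extension is an assumption, since the example shows only the filename stem.

```python
from typing import NamedTuple

SAMPLE_RATE = 16_000  # Hz, as stated in the Format section


class Segment(NamedTuple):
    episode_id: str
    index: int          # position of the segment within its episode
    start_sample: int
    end_sample: int     # exclusive

    @property
    def start_seconds(self) -> float:
        return self.start_sample / SAMPLE_RATE

    @property
    def duration_seconds(self) -> float:
        return (self.end_sample - self.start_sample) / SAMPLE_RATE


def parse_segment_name(name: str) -> Segment:
    # Drop a ".flac" extension if present (assumed; the example shows only the stem).
    stem = name[:-5] if name.lower().endswith(".flac") else name
    # Split from the right so an episode id containing "_" would still parse.
    episode_id, index, start, end = stem.rsplit("_", 3)
    return Segment(episode_id, int(index), int(start), int(end))


seg = parse_segment_name("0061gy0w8_0000_5664_81376")
print(seg.episode_id, seg.index)  # 0061gy0w8 0
print(seg.start_seconds)          # 0.354
print(seg.duration_seconds)       # 4.732
```

Because the end sample is exclusive, the segment length in samples is simply `end_sample - start_sample`, here 75,712 samples or about 4.7 seconds.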

Metadata

  • Metadata file: metadata.jsonl; each line stores the metadata for one episode.

  • Metadata example:

        {
          "title": "SBS 中文新聞 (7月5日)",
          "date": "05/07/2023",
          "view_more_link": "https://www.sbs.com.au/language/chinese/zh-hant/podcast-episode/chinese-news-5-7-2023/tl6s68rdk",
          "download_link": "https://sbs-podcast.streamguys1.com/sbs-cantonese/20230705105920-cantonese-0288b7c2-cb6d-4e0e-aec2-2680dd8738e0.mp3?awCollectionId=sbs-cantonese&awGenre=News&awEpisodeId=20230705105920-cantonese-0288b7c2-cb6d-4e0e-aec2-2680dd8738e0"
        }

    • title: episode title
    • date: publication date
    • view_more_link: link to the associated article/description
    • download_link: audio download link
  • Episode ID: the episode ID appears at the end of view_more_link; it appears to be a precomputed hash that is unique to each episode. In Python: episode_id = view_more_link.split("/")[-1]

The content above was collected and summarized by 遇见数据集.