TingChen-ppmc/Changsha_Dialect_Conversational_Speech_Corpus

Name: TingChen-ppmc/Changsha_Dialect_Conversational_Speech_Corpus
Creator: TingChen-ppmc
Published: 2023-12-20 15:49:23
License: no description provided

Source: Hugging Face. Updated 2023-12-20; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/TingChen-ppmc/Changsha_Dialect_Conversational_Speech_Corpus

Resource description:

---
dataset_info:
  features:
  - name: audio
    dtype: audio
  - name: gender
    dtype: string
  - name: speaker_id
    dtype: string
  - name: transcription
    dtype: string
  splits:
  - name: train
    num_bytes: 223664136.256
    num_examples: 1488
  download_size: 215320750
  dataset_size: 223664136.256
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

# Corpus

This dataset is built from Magicdata [ASR-CCHSHDIACSC: A CHINESE CHANGSHA DIALECT CONVERSATIONAL SPEECH CORPUS](https://magichub.com/datasets/changsha-dialect-conversational-speech-corpus/).

This corpus is licensed under a [Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License](http://creativecommons.org/licenses/by-nc-nd/4.0/). Please refer to the license for further information.

Modifications: The audio is split into sentences based on the time spans in the transcription file. Sentences shorter than 1 second are discarded. Topics of conversation are removed.

# Usage

To load this dataset, use

```python
from datasets import load_dataset

dialect_corpus = load_dataset("TingChen-ppmc/Changsha_Dialect_Conversational_Speech_Corpus")
```

This dataset has only a train split. To carve out a test split, use

```python
from datasets import load_dataset

train_split = load_dataset("TingChen-ppmc/Changsha_Dialect_Conversational_Speech_Corpus", split="train")
# test_size=0.5 means half of the examples go to the test split
corpus = train_split.train_test_split(test_size=0.5)
```

A sample record looks like

```python
# note: this record is from the Nanchang Dialect corpus; the data format is shared
{'audio':
    {'path': 'A0001_S001_0_G0001_0.WAV',
     'array': array([-0.00030518, -0.00039673, -0.00036621, ..., -0.00064087,
         -0.00015259, -0.00042725]),
     'sampling_rate': 16000},
 'gender': '女',
 'speaker_id': 'G0001',
 'transcription': '北京爱数智慧语音采集'
}
```

[More Information needed](https://github.com/huggingface/datasets/blob/main/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
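
The segmentation step described under "Modifications" can be illustrated with a minimal sketch. The card does not document the layout of the transcription file, so the `(start_s, end_s, text)` span tuples, the `split_by_spans` helper, and the toy waveform below are all assumptions made for illustration, not the author's actual preprocessing code.

```python
# Hypothetical sketch: cut a waveform into per-sentence clips and drop short ones.
MIN_DURATION_S = 1.0  # threshold stated in the card: clips under 1 second are discarded


def split_by_spans(samples, sampling_rate, spans):
    """Return one (clip, text) pair per span that lasts at least MIN_DURATION_S.

    `spans` is an assumed format: a list of (start_s, end_s, text) tuples.
    """
    clips = []
    for start_s, end_s, text in spans:
        if end_s - start_s < MIN_DURATION_S:
            continue  # too short, discarded
        start = int(round(start_s * sampling_rate))
        end = int(round(end_s * sampling_rate))
        clips.append((samples[start:end], text))
    return clips


# Toy input: 5 seconds of silence at 16 kHz and three made-up spans.
sampling_rate = 16000
samples = [0.0] * (sampling_rate * 5)
spans = [(0.0, 2.5, "first"), (2.5, 3.0, "too short"), (3.0, 5.0, "third")]

clips = split_by_spans(samples, sampling_rate, spans)
for clip, text in clips:
    print(text, len(clip) / sampling_rate, "seconds")
```

Only the 1-second threshold comes from the card; everything else here is a stand-in that shows the shape of the operation.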

Provided by:

TingChen-ppmc

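Summary of original information

Dataset overview

Dataset information

Features:
- audio: audio data
- gender: speaker gender
- speaker_id: speaker identifier
- transcription: transcribed text
Splits:
- train: training set with 1488 examples, 223664136.256 bytes
Sizes:
- Download size: 215320750 bytes
- Dataset size: 223664136.256 bytes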
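
For orientation, the byte counts above can be converted to more readable units. The following is a small arithmetic sketch; the numbers are copied from the listing, while the variable names are illustrative only.

```python
# Figures taken from the dataset metadata above.
NUM_EXAMPLES = 1488
DATASET_SIZE_BYTES = 223664136.256
DOWNLOAD_SIZE_BYTES = 215320750

MIB = 1024 ** 2  # bytes per mebibyte

dataset_mib = DATASET_SIZE_BYTES / MIB
download_mib = DOWNLOAD_SIZE_BYTES / MIB
avg_bytes_per_example = DATASET_SIZE_BYTES / NUM_EXAMPLES

print(f"dataset size:  {dataset_mib:.1f} MiB")
print(f"download size: {download_mib:.1f} MiB")
print(f"average size per example: {avg_bytes_per_example / 1024:.1f} KiB")
```

This works out to roughly 213 MiB on disk, about 205 MiB to download, and on the order of 150 KB per example.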

Configuration

Default configuration:
- Data file path: data/train-*

Loading the dataset

```python
from datasets import load_dataset

dialect_corpus = load_dataset("TingChen-ppmc/Changsha_Dialect_Conversational_Speech_Corpus")
```

Sample data

```python
{'audio':
    {'path': 'A0001_S001_0_G0001_0.WAV',
     'array': array([-0.00030518, -0.00039673, -0.00036621, ..., -0.00064087,
         -0.00015259, -0.00042725]),
     'sampling_rate': 16000},
 'gender': '女',
 'speaker_id': 'G0001',
 'transcription': '北京爱数智慧语音采集'
}
```
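
Because every record carries a `speaker_id`, one possible alternative to a random sentence-level split is to hold out whole speakers, so that no voice appears in both partitions. The sketch below is an assumption-laden illustration, not part of the dataset card: `split_by_speaker` is a hypothetical helper, and the toy records use made-up speaker IDs and placeholder transcriptions.

```python
import random
from collections import defaultdict


def split_by_speaker(records, test_fraction=0.5, seed=0):
    """Split records so that each speaker lands entirely in train or in test."""
    by_speaker = defaultdict(list)
    for record in records:
        by_speaker[record["speaker_id"]].append(record)

    # Shuffle speaker IDs deterministically, then hold out a fraction of them.
    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)
    n_test = max(1, int(len(speakers) * test_fraction))
    test_speakers = set(speakers[:n_test])

    train = [r for r in records if r["speaker_id"] not in test_speakers]
    test = [r for r in records if r["speaker_id"] in test_speakers]
    return train, test


# Toy records with hypothetical speaker IDs; real records also carry `audio`.
records = [
    {"speaker_id": "G0001", "transcription": "example 1"},
    {"speaker_id": "G0001", "transcription": "example 2"},
    {"speaker_id": "G0002", "transcription": "example 3"},
    {"speaker_id": "G0003", "transcription": "example 4"},
    {"speaker_id": "G0004", "transcription": "example 5"},
    {"speaker_id": "G0004", "transcription": "example 6"},
]

train, test = split_by_speaker(records, test_fraction=0.5)
print(sorted({r["speaker_id"] for r in train}))
print(sorted({r["speaker_id"] for r in test}))
```

The same grouping could be applied to rows of the loaded dataset; whether a speaker-disjoint split is appropriate depends on the evaluation goal.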
