TingChen-ppmc/Changsha_Dialect_Conversational_Speech_Corpus
Source: Hugging Face. Updated 2023-12-20; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/TingChen-ppmc/Changsha_Dialect_Conversational_Speech_Corpus
Description:
---
dataset_info:
  features:
  - name: audio
    dtype: audio
  - name: gender
    dtype: string
  - name: speaker_id
    dtype: string
  - name: transcription
    dtype: string
  splits:
  - name: train
    num_bytes: 223664136.256
    num_examples: 1488
  download_size: 215320750
  dataset_size: 223664136.256
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---
# Corpus
This dataset is built from the Magicdata corpus [ASR-CCHSHDIACSC: A CHINESE CHANGSHA DIALECT CONVERSATIONAL SPEECH CORPUS](https://magichub.com/datasets/changsha-dialect-conversational-speech-corpus/).
This corpus is licensed under a [Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License](http://creativecommons.org/licenses/by-nc-nd/4.0/). Please refer to the license for further information.
Modifications: The audio is split into sentences based on the time spans in the transcription file. Sentences shorter than 1 second are discarded. Conversation topics are removed.
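
The modifications above can be illustrated with a minimal sketch. It is not the script used to build this dataset: the function name `split_by_spans` and the `(start_seconds, end_seconds, text)` span layout are assumptions made for illustration, and the actual transcription file format may differ.

```python
def split_by_spans(samples, sampling_rate, spans, min_duration=1.0):
    """Cut one recording into sentence-level clips.

    samples: a sliceable sequence of audio samples (list or NumPy array).
    spans: iterable of (start_seconds, end_seconds, text) tuples.
           This layout is hypothetical, not the corpus's own format.
    Spans shorter than min_duration seconds are dropped.
    """
    clips = []
    for start, end, text in spans:
        if end - start < min_duration:
            continue  # discard sentences shorter than the threshold
        first = int(round(start * sampling_rate))
        last = int(round(end * sampling_rate))
        clips.append({
            "array": samples[first:last],
            "sampling_rate": sampling_rate,
            "transcription": text,
        })
    return clips


# Five seconds of silence at 16 kHz stands in for a real recording.
recording = [0.0] * (16000 * 5)
spans = [
    (0.0, 2.5, "first sentence"),
    (2.5, 3.0, "short"),           # 0.5 s, so it is discarded
    (3.0, 5.0, "second sentence"),
]
clips = split_by_spans(recording, 16000, spans)
```

Here `clips` holds the two spans that last at least one second; the half-second span is skipped.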
# Usage
To load this dataset, use:
```python
from datasets import load_dataset
dialect_corpus = load_dataset("TingChen-ppmc/Changsha_Dialect_Conversational_Speech_Corpus")
```
This dataset has only a train split. To carve out a test split, use:
```python
from datasets import load_dataset
train_split = load_dataset("TingChen-ppmc/Changsha_Dialect_Conversational_Speech_Corpus", split="train")
# test_size=0.5 sends half of the examples to the test split
corpus = train_split.train_test_split(test_size=0.5)
```
A sample record looks like this:
```python
# Note: this sample is from the Nanchang Dialect corpus; the data format is shared.
{'audio':
    {'path': 'A0001_S001_0_G0001_0.WAV',
     'array': array([-0.00030518, -0.00039673,
                     -0.00036621, ..., -0.00064087,
                     -0.00015259, -0.00042725]),
     'sampling_rate': 16000},
 'gender': '女',
 'speaker_id': 'G0001',
 'transcription': '北京爱数智慧语音采集'
}
```
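
Given the record layout above, a clip's duration follows from the sample count and the sampling rate. The sketch below uses a synthetic record with the same keys rather than real corpus data; the helper name `clip_duration_seconds` is illustrative only.

```python
def clip_duration_seconds(record):
    """Return the clip length in seconds: number of samples / sampling rate."""
    audio = record["audio"]
    return len(audio["array"]) / audio["sampling_rate"]


# Synthetic stand-in for one dataset record (not real corpus data).
record = {
    "audio": {
        "path": "example.wav",
        "array": [0.0] * 24000,
        "sampling_rate": 16000,
    },
    "gender": "女",
    "speaker_id": "G0001",
    "transcription": "placeholder",
}
duration = clip_duration_seconds(record)
```

The same function should work on a record loaded with `load_dataset`, since the decoded `array` supports `len()`.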
Provider:
TingChen-ppmc
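Summary of original information
Dataset overview
Dataset information
- Features:
  - audio: audio data
  - gender: speaker gender
  - speaker_id: speaker identifier
  - transcription: transcribed text
- Splits:
  - train: training set, 1488 examples, 223664136.256 bytes
- Sizes:
  - Download size: 215320750 bytes
  - Dataset size: 223664136.256 bytes
Configuration
- Default configuration:
  - Data file path: data/train-*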



