kuanhuggingface/google_tts_speech_tokenizer
Source: Hugging Face. Updated: 2023-11-14. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/kuanhuggingface/google_tts_speech_tokenizer
Description:
---
dataset_info:
  features:
  - name: file_id
    dtype: string
  - name: instruction
    dtype: string
  - name: transcription
    dtype: string
  - name: src_speech_tokenizer_0
    sequence: int64
  - name: src_speech_tokenizer_1
    sequence: int64
  - name: src_speech_tokenizer_2
    sequence: int64
  - name: src_speech_tokenizer_3
    sequence: int64
  - name: src_speech_tokenizer_4
    sequence: int64
  - name: src_speech_tokenizer_5
    sequence: int64
  - name: src_speech_tokenizer_6
    sequence: int64
  - name: src_speech_tokenizer_7
    sequence: int64
  - name: tgt_speech_tokenizer_0
    sequence: int64
  - name: tgt_speech_tokenizer_1
    sequence: int64
  - name: tgt_speech_tokenizer_2
    sequence: int64
  - name: tgt_speech_tokenizer_3
    sequence: int64
  - name: tgt_speech_tokenizer_4
    sequence: int64
  - name: tgt_speech_tokenizer_5
    sequence: int64
  - name: tgt_speech_tokenizer_6
    sequence: int64
  - name: tgt_speech_tokenizer_7
    sequence: int64
  splits:
  - name: train
    num_bytes: 2475675704
    num_examples: 90000
  - name: validation
    num_bytes: 135727316
    num_examples: 5000
  - name: test
    num_bytes: 139731511
    num_examples: 5000
  download_size: 147517599
  dataset_size: 2751134531
---
# Dataset Card for "google_tts_speech_tokenizer"
[More Information needed](https://github.com/huggingface/datasets/blob/main/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
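
The card does not say how the sixteen token columns relate to each other. The following is a minimal sketch, assuming that `src_speech_tokenizer_0` through `src_speech_tokenizer_7` are eight parallel token layers for one source utterance, and likewise for the `tgt_` columns. The helper names `stack_layers` and `layers_aligned` and the synthetic record are hypothetical and used only for illustration; a real record would have the same keys but different values.

```python
# Sketch only: groups the eight per-layer columns of one record into a
# list of layers. The column names come from the dataset metadata; the
# interpretation as "parallel layers" is an assumption, not documented.

def stack_layers(example, prefix, num_layers=8):
    """Return [layer_0, ..., layer_7] for prefix 'src' or 'tgt'."""
    return [
        list(example[f"{prefix}_speech_tokenizer_{i}"])
        for i in range(num_layers)
    ]


def layers_aligned(layers):
    """True if every layer has the same number of tokens."""
    return len({len(layer) for layer in layers}) <= 1


# Synthetic record with the same keys as the dataset (values are made up).
example = {
    "file_id": "demo_0001",
    "instruction": "placeholder instruction",
    "transcription": "placeholder transcription",
}
for i in range(8):
    example[f"src_speech_tokenizer_{i}"] = [i * 10 + t for t in range(4)]
    example[f"tgt_speech_tokenizer_{i}"] = [i * 10 + t for t in range(5)]

src_layers = stack_layers(example, "src")
tgt_layers = stack_layers(example, "tgt")
print(len(src_layers), len(src_layers[0]), layers_aligned(src_layers))
print(len(tgt_layers), len(tgt_layers[0]), layers_aligned(tgt_layers))
```

With a real record (for example one obtained through the `datasets` library), `layers_aligned` is a cheap way to check whether the equal-length assumption actually holds before building a matrix from the layers.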
Provider:
kuanhuggingface
Summary of original information
Dataset overview
Features
- file_id: string
- instruction: string
- transcription: string
- src_speech_tokenizer_0: sequence of int64
- src_speech_tokenizer_1: sequence of int64
- src_speech_tokenizer_2: sequence of int64
- src_speech_tokenizer_3: sequence of int64
- src_speech_tokenizer_4: sequence of int64
- src_speech_tokenizer_5: sequence of int64
- src_speech_tokenizer_6: sequence of int64
- src_speech_tokenizer_7: sequence of int64
- tgt_speech_tokenizer_0: sequence of int64
- tgt_speech_tokenizer_1: sequence of int64
- tgt_speech_tokenizer_2: sequence of int64
- tgt_speech_tokenizer_3: sequence of int64
- tgt_speech_tokenizer_4: sequence of int64
- tgt_speech_tokenizer_5: sequence of int64
- tgt_speech_tokenizer_6: sequence of int64
- tgt_speech_tokenizer_7: sequence of int64
Splits
- train:
  - bytes: 2475675704
  - examples: 90000
- validation:
  - bytes: 135727316
  - examples: 5000
- test:
  - bytes: 139731511
  - examples: 5000
Dataset size
- download size (bytes): 147517599
- dataset size (bytes): 2751134531
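
The size figures can be cross-checked with a little arithmetic. The sketch below uses only the numbers listed in the metadata; the variable names are illustrative. It assumes that the download size refers to the files as stored on the Hub and the dataset size to the data after preparation, which is the usual meaning of these fields but is not stated on the card.

```python
# Figures copied from the dataset metadata.
SPLITS = {
    "train": {"num_bytes": 2475675704, "num_examples": 90000},
    "validation": {"num_bytes": 135727316, "num_examples": 5000},
    "test": {"num_bytes": 139731511, "num_examples": 5000},
}
DOWNLOAD_SIZE = 147517599
DATASET_SIZE = 2751134531

# The per-split byte counts should add up to the reported dataset size.
total_bytes = sum(s["num_bytes"] for s in SPLITS.values())
total_examples = sum(s["num_examples"] for s in SPLITS.values())

# Share of examples in each split.
fractions = {
    name: s["num_examples"] / total_examples for name, s in SPLITS.items()
}

# Ratio between the prepared dataset and the downloaded files.
size_ratio = DATASET_SIZE / DOWNLOAD_SIZE

# Average size of one example, in bytes.
avg_bytes_per_example = total_bytes / total_examples

print(total_bytes == DATASET_SIZE)
print(total_examples, fractions)
print(round(size_ratio, 2), round(avg_bytes_per_example, 1))
```

The split byte counts sum exactly to the reported dataset size, the examples divide 90/5/5 across train, validation, and test, and the prepared data is roughly 18 to 19 times larger than the download.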



