amithm3/shrutilipi
Source: Hugging Face. Updated 2024-04-11, indexed 2024-06-11.
Download link:
https://hf-mirror.com/datasets/amithm3/shrutilipi
Resource description:
---
language:
- kn
- sa
- bn
- pa
- ml
- gu
- ta
- te
- hi
- mr
license: apache-2.0
size_categories:
- 1M<n<10M
task_categories:
- automatic-speech-recognition
pretty_name: AI4Bharat Shrutilipi ASR Dataset
dataset_info:
- config_name: bn
features:
- name: audio
dtype: audio
- name: transcription
dtype: string
splits:
- name: train
num_bytes: 59658532357.726
num_examples: 302349
- name: validation
num_bytes: 6723169844.11
num_examples: 37602
- name: test
num_bytes: 7660623563.6
num_examples: 38740
download_size: 74278694994
dataset_size: 74042325765.436
- config_name: gu
features:
- name: audio
dtype: audio
- name: transcription
dtype: string
splits:
- name: train
num_bytes: 55793674372.628
num_examples: 329931
- name: validation
num_bytes: 6293796356.189
num_examples: 40773
- name: test
num_bytes: 7165218289.408
num_examples: 40853
download_size: 78346523702
dataset_size: 69252689018.225
- config_name: hi
features:
- name: audio
dtype: audio
- name: transcription
dtype: string
splits:
- name: train
num_bytes: 213699256456.296
num_examples: 877604
- name: validation
num_bytes: 27583551082.248
num_examples: 110692
- name: test
num_bytes: 25110580660.236
num_examples: 108492
download_size: 269912939092
dataset_size: 266393388198.78
- config_name: kn
features:
- name: audio
dtype: audio
- name: transcription
dtype: string
splits:
- name: train
num_bytes: 54770494386.876
num_examples: 278766
- name: validation
num_bytes: 7864058142.98
num_examples: 34726
- name: test
num_bytes: 7572538417.28
num_examples: 35166
download_size: 74257809304
dataset_size: 70207090947.136
- config_name: ml
features:
- name: audio
dtype: audio
- name: transcription
dtype: string
splits:
- name: train
num_bytes: 71262913087.942
num_examples: 467414
- name: validation
num_bytes: 7751159979.48
num_examples: 58985
- name: test
num_bytes: 8930337765.4
num_examples: 59230
download_size: 99439381074
dataset_size: 87944410832.82199
- config_name: mr
features:
- name: audio
dtype: audio
- name: transcription
dtype: string
splits:
- name: train
num_bytes: 125894833883.753
num_examples: 505639
- name: validation
num_bytes: 14280421505.308
num_examples: 63407
- name: test
num_bytes: 15230198579.815
num_examples: 63397
download_size: 147608513634
dataset_size: 155405453968.876
- config_name: pa
features:
- name: audio
dtype: audio
- name: transcription
dtype: string
splits:
- name: train
num_bytes: 11549437955.164
num_examples: 41874
- name: validation
num_bytes: 1317876276.359
num_examples: 5311
- name: test
num_bytes: 1453641173.132
num_examples: 5139
download_size: 13966090670
dataset_size: 14320955404.654999
- config_name: sa
features:
- name: audio
dtype: audio
- name: transcription
dtype: string
splits:
- name: train
num_bytes: 6219394271.104
num_examples: 11532
- name: validation
num_bytes: 718650848.144
num_examples: 1408
- name: test
num_bytes: 752806235.026
num_examples: 1474
download_size: 7321556791
dataset_size: 7690851354.274
- config_name: ta
features:
- name: audio
dtype: audio
- name: transcription
dtype: string
splits:
- name: train
num_bytes: 101739123587.681
num_examples: 429417
- name: validation
num_bytes: 12903430948.456
num_examples: 54012
- name: test
num_bytes: 12724306851.984
num_examples: 53827
download_size: 126128595816
dataset_size: 127366861388.12099
- config_name: te
features:
- name: audio
dtype: audio
- name: transcription
dtype: string
splits:
- name: train
num_bytes: 33158344172.292
num_examples: 155322
- name: validation
num_bytes: 4085414503.579
num_examples: 19501
- name: test
num_bytes: 4173443926.076
num_examples: 19189
download_size: 43278403108
dataset_size: 41417202601.94701
configs:
- config_name: bn
data_files:
- split: train
path: data/bn/train-*
- split: validation
path: data/bn/validation-*
- split: test
path: data/bn/test-*
- config_name: gu
data_files:
- split: train
path: data/gu/train-*
- split: validation
path: data/gu/validation-*
- split: test
path: data/gu/test-*
- config_name: hi
data_files:
- split: train
path: data/hi/train-*
- split: validation
path: data/hi/validation-*
- split: test
path: data/hi/test-*
- config_name: kn
data_files:
- split: train
path: data/kn/train-*
- split: validation
path: data/kn/validation-*
- split: test
path: data/kn/test-*
- config_name: ml
data_files:
- split: train
path: data/ml/train-*
- split: validation
path: data/ml/validation-*
- split: test
path: data/ml/test-*
- config_name: mr
data_files:
- split: train
path: data/mr/train-*
- split: validation
path: data/mr/validation-*
- split: test
path: data/mr/test-*
- config_name: pa
data_files:
- split: train
path: data/pa/train-*
- split: validation
path: data/pa/validation-*
- split: test
path: data/pa/test-*
- config_name: sa
data_files:
- split: train
path: data/sa/train-*
- split: validation
path: data/sa/validation-*
- split: test
path: data/sa/test-*
- config_name: ta
data_files:
- split: train
path: data/ta/train-*
- split: validation
path: data/ta/validation-*
- split: test
path: data/ta/test-*
- config_name: te
data_files:
- split: train
path: data/te/train-*
- split: validation
path: data/te/validation-*
- split: test
path: data/te/test-*
tags:
- audio
- transcription
- AI4Bharat
- shrutilipi
---
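The `configs` block in the metadata above maps each language and split to a glob pattern such as `data/bn/train-*`. The following is a minimal sketch of how such patterns select shard files. It uses Python's `fnmatch` as an approximation of the Hub's own pattern resolution, and the shard file names are hypothetical, chosen only for illustration.

```python
from fnmatch import fnmatch

# Glob patterns copied from the `configs` metadata for the `bn` config.
BN_DATA_FILES = {
    "train": "data/bn/train-*",
    "validation": "data/bn/validation-*",
    "test": "data/bn/test-*",
}

# Hypothetical shard names, for illustration only; the real file names
# in the repository may differ.
EXAMPLE_FILES = [
    "data/bn/train-00000-of-00002.parquet",
    "data/bn/train-00001-of-00002.parquet",
    "data/bn/validation-00000-of-00001.parquet",
    "data/bn/test-00000-of-00001.parquet",
    "data/gu/train-00000-of-00001.parquet",
]


def files_for_split(pattern: str, files: list[str]) -> list[str]:
    """Return the files whose path matches one split's glob pattern."""
    return [f for f in files if fnmatch(f, pattern)]


def group_by_split(patterns: dict[str, str], files: list[str]) -> dict[str, list[str]]:
    """Group files by split name using the per-split glob patterns."""
    return {split: files_for_split(p, files) for split, p in patterns.items()}


if __name__ == "__main__":
    for split, matched in group_by_split(BN_DATA_FILES, EXAMPLE_FILES).items():
        print(split, matched)
```

Because every pattern is prefixed with the language directory, files from another config (here the `gu` shard) never leak into a `bn` split.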
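The AI4Bharat Shrutilipi ASR dataset is an automatic speech recognition (ASR) dataset covering ten Indian languages: Kannada, Sanskrit, Bengali, Punjabi, Malayalam, Gujarati, Tamil, Telugu, Hindi, and Marathi. Each example pairs an audio clip with its transcription, and every language is split into training, validation, and test sets, with byte and example counts listed per split. The dataset is released under the Apache 2.0 license and falls in the 1M to 10M example size category.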
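As a rough sketch of how one language configuration might be read, the snippet below builds arguments for `load_dataset` from the Hugging Face `datasets` library. It assumes the repository is reachable on the Hub under the id `amithm3/shrutilipi` (taken from the download link) and that `datasets` plus an audio decoding backend are installed; the helper names `load_kwargs` and `first_example` are illustrative, not part of any library.

```python
REPO_ID = "amithm3/shrutilipi"  # assumed Hub id, taken from the download link
LANGUAGE_CONFIGS = ("bn", "gu", "hi", "kn", "ml", "mr", "pa", "sa", "ta", "te")
SPLITS = ("train", "validation", "test")


def load_kwargs(language: str, split: str = "validation", streaming: bool = True) -> dict:
    """Build keyword arguments for datasets.load_dataset, validating the names."""
    if language not in LANGUAGE_CONFIGS:
        raise ValueError(f"unknown language config: {language!r}")
    if split not in SPLITS:
        raise ValueError(f"unknown split: {split!r}")
    return {"path": REPO_ID, "name": language, "split": split, "streaming": streaming}


def first_example(language: str, split: str = "validation") -> dict:
    """Stream a single example; needs network access and the `datasets` package."""
    from datasets import load_dataset  # third-party; imported lazily on purpose

    dataset = load_dataset(**load_kwargs(language, split))
    return next(iter(dataset))


if __name__ == "__main__":
    example = first_example("sa")
    # Per the card metadata, each example has `audio` and `transcription` fields.
    print(example["transcription"])
```

Streaming is the default here because individual configurations are large (the `hi` download alone is roughly 270 GB), so fetching a full configuration just to inspect a few rows is rarely worthwhile.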
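Provider:
amithm3
Summary of original information
Dataset overview
Basic information
- Name: AI4Bharat Shrutilipi ASR Dataset
- Languages: kn, sa, bn, pa, ml, gu, ta, te, hi, mr
- License: Apache-2.0
- Size: between 1M and 10M examples across all languages combined (size category 1M<n<10M; it applies to the whole dataset, not to each language)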
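The size category can be checked against the per-split example counts in the card metadata. The sketch below sums those counts and maps the total to a bucket label; `size_category` is a simplified, hypothetical helper covering only a few buckets, not the Hub's own implementation.

```python
# (train, validation, test) example counts copied from the card metadata.
SPLIT_EXAMPLES = {
    "bn": (302349, 37602, 38740),
    "gu": (329931, 40773, 40853),
    "hi": (877604, 110692, 108492),
    "kn": (278766, 34726, 35166),
    "ml": (467414, 58985, 59230),
    "mr": (505639, 63407, 63397),
    "pa": (41874, 5311, 5139),
    "sa": (11532, 1408, 1474),
    "ta": (429417, 54012, 53827),
    "te": (155322, 19501, 19189),
}

# Upper bounds and labels for a few size buckets (simplified).
BUCKETS = [
    (1_000, "n<1K"),
    (10_000, "1K<n<10K"),
    (100_000, "10K<n<100K"),
    (1_000_000, "100K<n<1M"),
    (10_000_000, "1M<n<10M"),
]


def total_examples(counts: dict[str, tuple[int, int, int]]) -> int:
    """Sum the example counts over every language and split."""
    return sum(sum(per_split) for per_split in counts.values())


def size_category(n: int) -> str:
    """Return the first bucket whose upper bound exceeds n."""
    for upper, label in BUCKETS:
        if n < upper:
            return label
    return "n>=10M"  # placeholder label for anything beyond the listed buckets


if __name__ == "__main__":
    total = total_examples(SPLIT_EXAMPLES)
    print(total, size_category(total))
    for lang, per_split in SPLIT_EXAMPLES.items():
        print(lang, sum(per_split), size_category(sum(per_split)))
```

Run per language, the same check shows that only `hi` exceeds one million examples on its own, which is why the category is best read as describing the dataset as a whole.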
Task category
- Task: Automatic Speech Recognition
Detailed dataset configurations
1. Bengali (bn)
- Features:
  - Audio (audio)
  - Transcription (transcription)
- Splits:
  - Train: 302349 examples, 59658532357.726 bytes
  - Validation: 37602 examples, 6723169844.11 bytes
  - Test: 38740 examples, 7660623563.6 bytes
- Download size: 74278694994 bytes
- Dataset size: 74042325765.436 bytes
2. Gujarati (gu)
- Features:
  - Audio (audio)
  - Transcription (transcription)
- Splits:
  - Train: 329931 examples, 55793674372.628 bytes
  - Validation: 40773 examples, 6293796356.189 bytes
  - Test: 40853 examples, 7165218289.408 bytes
- Download size: 78346523702 bytes
- Dataset size: 69252689018.225 bytes
3. Hindi (hi)
- Features:
  - Audio (audio)
  - Transcription (transcription)
- Splits:
  - Train: 877604 examples, 213699256456.296 bytes
  - Validation: 110692 examples, 27583551082.248 bytes
  - Test: 108492 examples, 25110580660.236 bytes
- Download size: 269912939092 bytes
- Dataset size: 266393388198.78 bytes
4. Kannada (kn)
- Features:
  - Audio (audio)
  - Transcription (transcription)
- Splits:
  - Train: 278766 examples, 54770494386.876 bytes
  - Validation: 34726 examples, 7864058142.98 bytes
  - Test: 35166 examples, 7572538417.28 bytes
- Download size: 74257809304 bytes
- Dataset size: 70207090947.136 bytes
5. Malayalam (ml)
- Features:
  - Audio (audio)
  - Transcription (transcription)
- Splits:
  - Train: 467414 examples, 71262913087.942 bytes
  - Validation: 58985 examples, 7751159979.48 bytes
  - Test: 59230 examples, 8930337765.4 bytes
- Download size: 99439381074 bytes
- Dataset size: 87944410832.82199 bytes
6. Marathi (mr)
- Features:
  - Audio (audio)
  - Transcription (transcription)
- Splits:
  - Train: 505639 examples, 125894833883.753 bytes
  - Validation: 63407 examples, 14280421505.308 bytes
  - Test: 63397 examples, 15230198579.815 bytes
- Download size: 147608513634 bytes
- Dataset size: 155405453968.876 bytes
7. Punjabi (pa)
- Features:
  - Audio (audio)
  - Transcription (transcription)
- Splits:
  - Train: 41874 examples, 11549437955.164 bytes
  - Validation: 5311 examples, 1317876276.359 bytes
  - Test: 5139 examples, 1453641173.132 bytes
- Download size: 13966090670 bytes
- Dataset size: 14320955404.654999 bytes
8. Sanskrit (sa)
- Features:
  - Audio (audio)
  - Transcription (transcription)
- Splits:
  - Train: 11532 examples, 6219394271.104 bytes
  - Validation: 1408 examples, 718650848.144 bytes
  - Test: 1474 examples, 752806235.026 bytes
- Download size: 7321556791 bytes
- Dataset size: 7690851354.274 bytes
9. Tamil (ta)
- Features:
  - Audio (audio)
  - Transcription (transcription)
- Splits:
  - Train: 429417 examples, 101739123587.681 bytes
  - Validation: 54012 examples, 12903430948.456 bytes
  - Test: 53827 examples, 12724306851.984 bytes
- Download size: 126128595816 bytes
- Dataset size: 127366861388.12099 bytes
10. Telugu (te)
- Features:
  - Audio (audio)
  - Transcription (transcription)
- Splits:
  - Train: 155322 examples, 33158344172.292 bytes
  - Validation: 19501 examples, 4085414503.579 bytes
  - Test: 19189 examples, 4173443926.076 bytes
- Download size: 43278403108 bytes
- Dataset size: 41417202601.94701 bytes
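Since the download sizes above are given in raw bytes, a small helper can make them easier to compare when planning storage. The sketch below converts the per-language download sizes to binary units and totals them; `human_size` is an illustrative helper, not part of any library.

```python
# Download sizes in bytes, copied from the per-language configurations above.
DOWNLOAD_SIZE_BYTES = {
    "bn": 74278694994,
    "gu": 78346523702,
    "hi": 269912939092,
    "kn": 74257809304,
    "ml": 99439381074,
    "mr": 147608513634,
    "pa": 13966090670,
    "sa": 7321556791,
    "ta": 126128595816,
    "te": 43278403108,
}

UNITS = ["B", "KiB", "MiB", "GiB", "TiB"]


def human_size(num_bytes: float) -> str:
    """Format a byte count using binary (1024-based) units."""
    size = float(num_bytes)
    for unit in UNITS:
        # Stop at the first unit where the value drops below 1024,
        # or at the largest unit we know about.
        if size < 1024 or unit == UNITS[-1]:
            return f"{size:.1f} {unit}"
        size /= 1024
    raise AssertionError("unreachable: loop always returns on the last unit")


if __name__ == "__main__":
    # Largest configurations first.
    for lang, size in sorted(DOWNLOAD_SIZE_BYTES.items(), key=lambda kv: -kv[1]):
        print(f"{lang}: {human_size(size)}")
    print("total:", human_size(sum(DOWNLOAD_SIZE_BYTES.values())))
```

The totals make clear that downloading every configuration requires close to a terabyte of disk space, so selecting individual languages is usually the practical choice.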
Tags
- audio
- transcription
- AI4Bharat
- Shrutilipi




