imvladikon/hebrew_speech_kan
Source: Hugging Face. Updated: 2023-05-05. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/imvladikon/hebrew_speech_kan
Resource overview:
---
task_categories:
- automatic-speech-recognition
language:
- he
size_categories:
- 1K<n<10K
dataset_info:
  features:
  - name: audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: sentence
    dtype: string
  splits:
  - name: train
    num_bytes: 1569850175.0
    num_examples: 8000
  - name: validation
    num_bytes: 394275049.0
    num_examples: 2000
  download_size: 1989406585
  dataset_size: 1964125224.0
---
# Dataset Card for hebrew_speech_kan
## Dataset Description
- **Homepage:**
- **Repository:**
- **Paper:**
- **Leaderboard:**
- **Point of Contact:**
### Dataset Summary
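A Hebrew dataset for automatic speech recognition (ASR). It contains 10,000 audio clips sampled at 16 kHz, each paired with a transcribed sentence.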
### Supported Tasks and Leaderboards
[More Information Needed]
### Languages
Hebrew (`he`).
## Dataset Structure
### Data Instances
```python
{'audio': {'path': '/root/.cache/huggingface/datasets/downloads/extracted/8ce7402f6482c6053251d7f3000eec88668c994beb48b7ca7352e77ef810a0b6/train/e429593fede945c185897e378a5839f4198.wav',
           'array': array([-0.00265503, -0.0018158 , -0.00149536, ..., -0.00135803,
                           -0.00231934, -0.00190735]),
           'sampling_rate': 16000},
 'sentence': 'היא מבינה אותי יותר מכל אחד אחר'}
```
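As a rough illustration of how an instance like the one above might be consumed, the sketch below loads a split with the `datasets` library and computes a clip's duration from its `array` and `sampling_rate`. The helper names `load_split` and `clip_duration_seconds` and the synthetic example are assumptions made for illustration; they are not part of the dataset. The sketch also assumes the decoded audio is exposed as a dict shaped like the instance above, which may differ across `datasets` versions.

```python
def load_split(split="train"):
    """Load one split from the Hugging Face Hub (needs `datasets` and network access)."""
    # Imported lazily so the helper below also works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("imvladikon/hebrew_speech_kan", split=split)


def clip_duration_seconds(example):
    """Return the clip length in seconds for an example shaped like the instance above."""
    audio = example["audio"]
    return len(audio["array"]) / audio["sampling_rate"]


# Synthetic stand-in with the same keys as a real instance (not real data).
fake_example = {
    "audio": {"path": "clip.wav", "array": [0.0] * 32000, "sampling_rate": 16000},
    "sentence": "placeholder transcript",
}
print(clip_duration_seconds(fake_example))  # 2.0
```

With network access, `clip_duration_seconds(load_split("validation")[0])` would apply the same calculation to a real clip.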
### Data Fields
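- `audio`: the decoded audio clip, with its file `path`, the waveform `array`, and the `sampling_rate` (16000 Hz).
- `sentence`: the transcription of the clip, as a string.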
### Data Splits
| | train | validation |
| ---- | ----- | ---------- |
| number of samples | 8000 | 2000 |
| hours | 6.92 | 1.73 |
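The figures in the table imply an 80/20 split and an average clip length of roughly 3.1 seconds in both splits. The short sketch below reproduces that arithmetic; the names `SPLITS` and `split_stats` are illustrative only.

```python
# Sample counts and hours copied from the Data Splits table.
SPLITS = {
    "train": {"samples": 8000, "hours": 6.92},
    "validation": {"samples": 2000, "hours": 1.73},
}


def split_stats(splits):
    """Compute each split's share of all samples and its mean clip length in seconds."""
    total = sum(s["samples"] for s in splits.values())
    return {
        name: {
            "share": s["samples"] / total,
            "avg_clip_seconds": s["hours"] * 3600 / s["samples"],
        }
        for name, s in splits.items()
    }


for name, stats in split_stats(SPLITS).items():
    print(f"{name}: {stats['share']:.0%} of samples, "
          f"{stats['avg_clip_seconds']:.2f} s per clip on average")
```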
## Dataset Creation
### Curation Rationale
The data was scraped from YouTube (channel כאן). Outliers were removed based on clip length and on the ratio between the length of the audio and the length of the sentence.
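The card does not give the thresholds or the exact ratio used, so the following is only a minimal sketch of what such an outlier filter might look like. The function name `keep_example`, the characters-per-second ratio, and every threshold value are assumptions, not the author's actual procedure.

```python
def keep_example(duration_s, sentence,
                 min_s=1.0, max_s=15.0,       # hypothetical duration bounds, in seconds
                 min_cps=2.0, max_cps=30.0):  # hypothetical characters-per-second bounds
    """Return True if a clip passes both the length check and the ratio check."""
    # Length check: drop clips that are too short or too long.
    if not (min_s <= duration_s <= max_s):
        return False
    n_chars = len(sentence.strip())
    if n_chars == 0:
        return False
    # Ratio check: text length relative to audio length (one possible definition).
    chars_per_second = n_chars / duration_s
    return min_cps <= chars_per_second <= max_cps


# A 3-second clip paired with the sentence from the Data Instances example.
print(keep_example(3.0, "היא מבינה אותי יותר מכל אחד אחר"))  # True
```

A ratio check of this kind catches clips whose transcript is far too long or too short for the audio, which often signals misaligned subtitles.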
### Source Data
#### Initial Data Collection and Normalization
#### Who are the source language producers?
[More Information Needed]
### Annotations
#### Annotation process
[More Information Needed]
#### Who are the annotators?
[More Information Needed]
### Personal and Sensitive Information
[More Information Needed]
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed]
### Discussion of Biases
[More Information Needed]
### Other Known Limitations
[More Information Needed]
## Additional Information
### Dataset Curators
[More Information Needed]
### Licensing Information
[More Information Needed]
### Citation Information
```
@misc{imvladikon2022hebrew_speech_kan,
  author = {Gurevich, Vladimir},
  title = {Hebrew Speech Recognition Dataset: Kan},
  year = {2022},
  howpublished = {\url{https://huggingface.co/datasets/imvladikon/hebrew_speech_kan}},
}
```
### Contributions
[More Information Needed]
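Provider:
imvladikon
Summary of original information
Dataset overview
- Task category: automatic speech recognition
- Language: Hebrew
- Dataset size: 1,000 < n < 10,000
Dataset features
- Audio feature:
  - Name: audio
  - Data type: audio
  - Sampling rate: 16000
- Text feature:
  - Name: sentence
  - Data type: string
Dataset splits
| | Training set | Validation set |
| --- | --- | --- |
| Number of samples | 8000 | 2000 |
| Bytes | 1569850175.0 | 394275049.0 |

- Download size: 1989406585 bytes
- Dataset size: 1964125224.0 bytes

Dataset creation
- Data source: scraped from the YouTube channel "כאן"; clips with outlying length or an outlying ratio between audio length and sentence length were removed.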



