imvladikon/hebrew_speech_campus
Source: Hugging Face. Updated 2023-11-20; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/imvladikon/hebrew_speech_campus
Description:
---
language:
- he
size_categories:
- 10K<n<100K
task_categories:
- automatic-speech-recognition
dataset_info:
  features:
  - name: uid
    dtype: string
  - name: file_id
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: sentence
    dtype: string
  - name: n_segment
    dtype: int32
  - name: duration_ms
    dtype: float32
  - name: language
    dtype: string
  - name: sample_rate
    dtype: int32
  - name: course
    dtype: string
  - name: sentence_length
    dtype: int32
  - name: n_tokens
    dtype: int32
  splits:
  - name: train
    num_bytes: 17559119499.576
    num_examples: 75924
  download_size: 17274739665
  dataset_size: 17559119499.576
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---
## Data Description
Hebrew Speech Recognition dataset from [Campus IL](https://campus.gov.il/).
Data was scraped from the Campus website, which contains video lectures from various courses in Hebrew.
Then subtitles were extracted from the videos and aligned with the audio.
Subtitles that are not in Hebrew were removed (WIP: non-Hebrew audio still needs to be removed as well, e.g. using a simple classifier).
Samples shorter than 3 seconds were removed.
Total duration of the dataset is 152 hours.
Outliers in the duration-to-character ratio were not removed, so some sentences may be suspiciously long or short relative to their audio duration.
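
The filtering rules described above can be approximated on the user side. The following is a minimal sketch, not the author's actual pipeline: the 3-second minimum comes from this card, while `MIN_HEBREW_RATIO`, `CHARS_PER_SEC_RANGE`, and the helper names are hypothetical values chosen for illustration and should be tuned on real data.

```python
# Sketch of per-sample filtering; thresholds marked "assumption" are illustrative only.
MIN_DURATION_MS = 3000.0            # from the card: clips under 3 seconds were dropped
MIN_HEBREW_RATIO = 0.5              # assumption: share of letters that must be Hebrew
CHARS_PER_SEC_RANGE = (2.0, 30.0)   # assumption: plausible characters-per-second bounds


def hebrew_ratio(text: str) -> float:
    """Fraction of alphabetic characters that are Hebrew letters (U+05D0..U+05EA)."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    hebrew = [c for c in letters if "\u05d0" <= c <= "\u05ea"]
    return len(hebrew) / len(letters)


def chars_per_second(sample: dict) -> float:
    """Characters of transcript per second of audio, using the card's field names."""
    return len(sample["sentence"]) / (sample["duration_ms"] / 1000.0)


def keep_sample(sample: dict) -> bool:
    """Apply the duration, language, and duration/char-ratio checks in turn."""
    if sample["duration_ms"] < MIN_DURATION_MS:
        return False
    if hebrew_ratio(sample["sentence"]) < MIN_HEBREW_RATIO:
        return False
    low, high = CHARS_PER_SEC_RANGE
    return low <= chars_per_second(sample) <= high
```

A predicate like `keep_sample` can be passed to the `filter` method of a loaded dataset. Note that the text-based check only inspects the transcript; it does not detect non-Hebrew audio.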
Note: if loading is slow, clone the repository:
`git clone https://huggingface.co/datasets/imvladikon/hebrew_speech_campus && cd hebrew_speech_campus && git lfs pull`
and load it from the local folder: `load_dataset("./hebrew_speech_campus")`
## Data Format
Audio files are in WAV format: 16 kHz sampling rate, 16-bit, mono. Ignore the `path` field and use the `audio.array` value instead.
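
As a small illustration of working with `audio.array` rather than `path`, the sketch below derives a clip's duration from the decoded waveform and compares it with the stored `duration_ms` field. The helper names and the 50 ms tolerance are assumptions made for this example, not part of the dataset.

```python
def clip_duration_ms(audio: dict) -> float:
    """Duration in milliseconds, computed from the decoded waveform length."""
    return 1000.0 * len(audio["array"]) / audio["sampling_rate"]


def duration_matches(sample: dict, tolerance_ms: float = 50.0) -> bool:
    """Check the stored `duration_ms` against the decoded audio length.

    `tolerance_ms` is an illustrative default, not a value from the dataset card.
    """
    return abs(clip_duration_ms(sample["audio"]) - sample["duration_ms"]) <= tolerance_ms
```

This works on any sample dictionary shaped like the one in the Data Sample section, since `array` only needs to support `len()`.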
## Data Usage
```python
from datasets import load_dataset
ds = load_dataset("imvladikon/hebrew_speech_campus", split="train", streaming=True)
print(next(iter(ds)))
```
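
Because a streaming dataset is consumed as an iterator of sample dictionaries, simple statistics can be gathered without downloading everything first. The following is a hedged sketch: `hours_per_course` and its `limit` parameter are hypothetical names introduced here, and the function relies only on the `course` and `duration_ms` fields listed in this card.

```python
from collections import defaultdict
from itertools import islice
from typing import Dict, Iterable, Optional


def hours_per_course(samples: Iterable[dict], limit: Optional[int] = None) -> Dict[str, float]:
    """Sum audio duration per course, in hours, over at most `limit` samples.

    With `limit=None`, the whole iterable is consumed.
    """
    totals_ms = defaultdict(float)
    for sample in islice(samples, limit):
        totals_ms[sample["course"]] += sample["duration_ms"]
    # 3,600,000 milliseconds per hour
    return {course: ms / 3_600_000 for course, ms in totals_ms.items()}
```

With `ds` from the snippet above, `hours_per_course(ds, limit=1000)` would summarize the first 1000 streamed samples.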
## Data Sample
```
{'uid': '10c3eda27cf173ab25bde755d0023abed301fcfd',
'file_id': '10c3eda27cf173ab25bde755d0023abed301fcfd_13',
'audio': {'path': '/content/hebrew_speech_campus/data/from_another_angle-_mathematics_teaching_practices/10c3eda27cf173ab25bde755d0023abed301fcfd_13.wav',
'array': array([ 5.54326562e-07, 3.60812592e-05, -2.35188054e-04, ...,
2.34067178e-04, 1.55649337e-04, 6.32447700e-05]),
'sampling_rate': 16000},
'sentence': 'הדוברים צריכים לקחת עליו אחריות, ולהיות מחויבים לו כלומר, השיח צריך להיות מחויב',
'n_segment': 13,
'duration_ms': 6607.98193359375,
'language': 'he',
'sample_rate': 16000,
'course': 'from_another_angle-_mathematics_teaching_practices',
'sentence_length': 79,
'n_tokens': 13}
```
## Data Splits and Stats
Split: train
Number of samples: 75924
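
For a rough sense of scale, the figures stated in this card can be combined as below. This is back-of-the-envelope arithmetic only: the 152-hour total is presumably rounded, so the results are approximate (about 7.2 seconds per clip on average and a download of roughly 16 GiB).

```python
# Figures taken from this card.
TOTAL_HOURS = 152
NUM_SAMPLES = 75924
DOWNLOAD_SIZE_BYTES = 17274739665

# Average clip length in seconds.
mean_clip_seconds = TOTAL_HOURS * 3600 / NUM_SAMPLES

# Download size in GiB (1 GiB = 2**30 bytes).
download_size_gib = DOWNLOAD_SIZE_BYTES / 2**30

print(f"mean clip length: {mean_clip_seconds:.1f} s")
print(f"download size: {download_size_gib:.1f} GiB")
```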
## Citation
Please cite the following if you use this dataset in your work:
```
@misc{imvladikon2023hebrew_speech_campus,
author = {Gurevich, Vladimir},
title = {Hebrew Speech Recognition Dataset: Campus},
year = {2023},
howpublished = \url{https://huggingface.co/datasets/imvladikon/hebrew_speech_campus},
}
```
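Provider:
imvladikon
Summary of original information
Dataset overview
Language
- Hebrew (he)
Data size
- Size category: 10K<n<100K
Task category
- Automatic speech recognition (automatic-speech-recognition)
Dataset information
Feature fields
- uid: string
- file_id: string
- audio: audio, sampling rate 16000
- sentence: string
- n_segment: integer
- duration_ms: float
- language: string
- sample_rate: integer
- course: string
- sentence_length: integer
- n_tokens: integer
Data splits
- train: training set, 75924 samples, 17559119499.576 bytes in total
Data format
- Audio file format: WAV
- Sampling rate: 16 kHz
- Bit depth: 16-bit
- Channels: mono
Data sample
- Each sample contains the following fields:
- uid: unique sample identifier
- file_id: file identifier
- audio: audio data, including path, array, and sampling rate
- sentence: the corresponding sentence text
- n_segment: segment number
- duration_ms: duration in milliseconds
- language: language identifier
- sample_rate: sampling rate
- course: course name
- sentence_length: sentence length
- n_tokens: number of words in the sentence
Data splits and statistics
- Training set (train): 75924 samples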



