
imvladikon/hebrew_speech_campus

Source: Hugging Face. Updated 2023-11-20. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/imvladikon/hebrew_speech_campus
Description:
---
language:
- he
size_categories:
- 10K<n<100K
task_categories:
- automatic-speech-recognition
dataset_info:
  features:
  - name: uid
    dtype: string
  - name: file_id
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: sentence
    dtype: string
  - name: n_segment
    dtype: int32
  - name: duration_ms
    dtype: float32
  - name: language
    dtype: string
  - name: sample_rate
    dtype: int32
  - name: course
    dtype: string
  - name: sentence_length
    dtype: int32
  - name: n_tokens
    dtype: int32
  splits:
  - name: train
    num_bytes: 17559119499.576
    num_examples: 75924
  download_size: 17274739665
  dataset_size: 17559119499.576
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

## Data Description

Hebrew Speech Recognition dataset from [Campus IL](https://campus.gov.il/).

Data was scraped from the Campus website, which contains video lectures from various courses in Hebrew. Subtitles were then extracted from the videos and aligned with the audio.

Subtitles that are not in Hebrew were removed (WIP: non-Hebrew audio still needs to be removed as well, e.g. using a simple classifier). Samples shorter than 3 seconds were removed.

Total duration of the dataset is 152 hours.

Outliers in terms of the duration/char ratio were not removed, so some sentences may be suspiciously long or short relative to their audio duration.

Note: if loading is slow, clone the repository with `git clone hebrew_speech_campus && cd hebrew_speech_campus && git lfs pull` and load it from the local folder with `load_dataset("./hebrew_speech_campus")`.

## Data Format

Audio files are in WAV format: 16 kHz sampling rate, 16-bit, mono. Ignore the `path` field and use the `audio.array` field value instead.
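
As a minimal sketch of working with `audio.array` rather than `path`, the helper below derives a clip's duration from the decoded samples. The function name `clip_duration_ms` is hypothetical (it is not part of the dataset or of the `datasets` library), and the sketch assumes each example carries an `audio` dict with `array` and `sampling_rate` keys.

```python
from typing import Any, Mapping


def clip_duration_ms(example: Mapping[str, Any]) -> float:
    """Return the clip duration in milliseconds from the decoded samples.

    Hypothetical helper: it reads only `audio["array"]` and
    `audio["sampling_rate"]` and never touches `audio["path"]`.
    """
    audio = example["audio"]
    n_samples = len(audio["array"])  # mono audio: one value per sample
    return 1000.0 * n_samples / audio["sampling_rate"]
```

The result can be compared against the `duration_ms` field of the same example as a quick sanity check that the audio decoded as expected.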
## Data Usage

```python
from datasets import load_dataset

# Stream the train split so the ~17 GB download is not required up front.
ds = load_dataset("imvladikon/hebrew_speech_campus", split="train", streaming=True)
print(next(iter(ds)))
```

## Data Sample

```
{'uid': '10c3eda27cf173ab25bde755d0023abed301fcfd',
 'file_id': '10c3eda27cf173ab25bde755d0023abed301fcfd_13',
 'audio': {'path': '/content/hebrew_speech_campus/data/from_another_angle-_mathematics_teaching_practices/10c3eda27cf173ab25bde755d0023abed301fcfd_13.wav',
           'array': array([ 5.54326562e-07, 3.60812592e-05, -2.35188054e-04, ..., 2.34067178e-04, 1.55649337e-04, 6.32447700e-05]),
           'sampling_rate': 16000},
 'sentence': 'הדוברים צריכים לקחת עליו אחריות, ולהיות מחויבים לו כלומר, השיח צריך להיות מחויב',
 'n_segment': 13,
 'duration_ms': 6607.98193359375,
 'language': 'he',
 'sample_rate': 16000,
 'course': 'from_another_angle-_mathematics_teaching_practices',
 'sentence_length': 79,
 'n_tokens': 13}
```

## Data Splits and Stats

Split: train

Number of samples: 75924

## Citation

Please cite the following if you use this dataset in your work:

```
@misc{imvladikon2023hebrew_speech_campus,
  author = {Gurevich, Vladimir},
  title = {Hebrew Speech Recognition Dataset: Campus},
  year = {2023},
  howpublished = {\url{https://huggingface.co/datasets/imvladikon/hebrew_speech_campus}},
}
```
Provider:
imvladikon
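Summary of original information

Dataset overview

Language

  • Hebrew (he)

Dataset size

  • Number of examples: 10K<n<100K

Task category

  • Automatic speech recognition (automatic-speech-recognition)

Dataset information

  • Feature fields

    • uid: string
    • file_id: string
    • audio: audio, sampling rate 16000 Hz
    • sentence: string
    • n_segment: integer (int32)
    • duration_ms: floating point (float32)
    • language: string
    • sample_rate: integer (int32)
    • course: string
    • sentence_length: integer (int32)
    • n_tokens: integer (int32)
  • Splits

    • train: training set with 75924 examples, 17559119499.576 bytes in total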

Data format

  • Audio file format: WAV
  • Sampling rate: 16 kHz
  • Bit depth: 16-bit
  • Channels: mono

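Data sample

  • Each example contains the following fields:
    • uid: unique identifier of the sample
    • file_id: file identifier
    • audio: audio data, containing the path, the sample array, and the sampling rate
    • sentence: the corresponding sentence text
    • n_segment: segment number
    • duration_ms: duration in milliseconds
    • language: language code
    • sample_rate: sampling rate
    • course: course name
    • sentence_length: sentence length
    • n_tokens: number of tokens in the sentence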
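
Because outliers in the duration/char ratio were not removed from the dataset, the `duration_ms` and `sentence_length` fields can be combined to screen them out. The sketch below is one possible approach, not the author's method: the helper names and the thresholds of 2 and 30 characters per second are assumptions chosen for illustration and should be tuned on the actual data.

```python
from typing import Any, Iterable, List, Mapping


def chars_per_second(example: Mapping[str, Any]) -> float:
    """Characters of transcript per second of audio (hypothetical helper)."""
    seconds = example["duration_ms"] / 1000.0
    if seconds <= 0:
        # Treat a non-positive duration as implausible so it is filtered out.
        return float("inf")
    return example["sentence_length"] / seconds


def keep_plausible(
    examples: Iterable[Mapping[str, Any]],
    min_cps: float = 2.0,   # assumed lower bound, not from the dataset card
    max_cps: float = 30.0,  # assumed upper bound, not from the dataset card
) -> List[Mapping[str, Any]]:
    """Keep examples whose speaking rate falls inside [min_cps, max_cps]."""
    return [ex for ex in examples if min_cps <= chars_per_second(ex) <= max_cps]
```

The example record from the dataset card (79 characters over about 6.6 seconds) gives roughly 12 characters per second and passes this filter. The same predicate could also be passed to the `filter` method of a dataset loaded with the `datasets` library.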

Data splits and stats

  • Training set (train): 75924 examples

Citation

  • Please cite the following if you use this dataset:

    @misc{imvladikon2023hebrew_speech_campus,
      author = {Gurevich, Vladimir},
      title = {Hebrew Speech Recognition Dataset: Campus},
      year = {2023},
      howpublished = {\url{https://huggingface.co/datasets/imvladikon/hebrew_speech_campus}},
    }
