
imvladikon/hebrew_speech_kan

Source: Hugging Face | Updated: 2023-05-05 | Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/imvladikon/hebrew_speech_kan
Resource description:
---
task_categories:
- automatic-speech-recognition
language:
- he
size_categories:
- 1K<n<10K
dataset_info:
  features:
  - name: audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: sentence
    dtype: string
  splits:
  - name: train
    num_bytes: 1569850175.0
    num_examples: 8000
  - name: validation
    num_bytes: 394275049.0
    num_examples: 2000
  download_size: 1989406585
  dataset_size: 1964125224.0
---

# Dataset Card for Dataset Name

## Dataset Description

- **Homepage:**
- **Repository:**
- **Paper:**
- **Leaderboard:**
- **Point of Contact:**

### Dataset Summary

Hebrew dataset for automatic speech recognition (ASR).

### Supported Tasks and Leaderboards

[More Information Needed]

### Languages

[More Information Needed]

## Dataset Structure

### Data Instances

```json
{'audio': {'path': '/root/.cache/huggingface/datasets/downloads/extracted/8ce7402f6482c6053251d7f3000eec88668c994beb48b7ca7352e77ef810a0b6/train/e429593fede945c185897e378a5839f4198.wav',
           'array': array([-0.00265503, -0.0018158 , -0.00149536, ..., -0.00135803, -0.00231934, -0.00190735]),
           'sampling_rate': 16000},
 'sentence': 'היא מבינה אותי יותר מכל אחד אחר'}
```

### Data Fields

[More Information Needed]

### Data Splits

|                   | train | validation |
| ----------------- | ----- | ---------- |
| number of samples | 8000  | 2000       |
| hours             | 6.92  | 1.73       |

## Dataset Creation

### Curation Rationale

Data scraped from YouTube (channel כאן), with outliers removed (by clip length and by the ratio between the length of the audio and the length of the sentence).

### Source Data

#### Initial Data Collection and Normalization

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

[More Information Needed]

### Citation Information

```
@misc{imvladikon2022hebrew_speech_kan,
  author = {Gurevich, Vladimir},
  title = {Hebrew Speech Recognition Dataset: Kan},
  year = {2022},
  howpublished = {\url{https://huggingface.co/datasets/imvladikon/hebrew_speech_kan}},
}
```

### Contributions

[More Information Needed]
Provided by:
imvladikon
Summary of original information

Dataset overview

  • Task category: automatic speech recognition
  • Language: Hebrew
  • Dataset size: 1,000 < n < 10,000 examples

Dataset features

  • Audio feature:
    • Name: audio
    • Data type: audio
      • Sampling rate: 16000 Hz
  • Text feature:
    • Name: sentence
    • Data type: string
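
The two features listed above can be read with the Hugging Face `datasets` library. The following is a minimal sketch, assuming that package (with audio decoding support) is installed and that the repository is still reachable. `load_split` and `summarize_example` are illustrative helper names, not part of the dataset, and the record layout is assumed to match the data instance shown in the card; newer `datasets` releases may return a different decoded audio object.

```python
from typing import Any, Dict


def load_split(split: str = "train"):
    """Load one split of the dataset from the Hugging Face Hub.

    Assumption: the `datasets` package is installed. The import is kept
    inside the function so that summarize_example works without it.
    """
    from datasets import load_dataset

    return load_dataset("imvladikon/hebrew_speech_kan", split=split)


def summarize_example(example: Dict[str, Any]) -> Dict[str, Any]:
    """Summarize one record shaped like the card's data instance."""
    audio = example["audio"]
    n_samples = len(audio["array"])  # works for lists and NumPy arrays
    return {
        "sentence": example["sentence"],
        "sampling_rate": audio["sampling_rate"],
        # Duration follows from sample count divided by sampling rate.
        "duration_s": n_samples / audio["sampling_rate"],
    }
```

Calling `summarize_example(load_split("validation")[0])` would fetch the data on first use (the card lists a download size of about 2 GB) and return the transcript, the sampling rate, and the clip duration in seconds.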

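Dataset splits

|                   | Train      | Validation |
| ----------------- | ---------- | ---------- |
| Number of samples | 8000       | 2000       |
| Size (bytes)      | 1569850175 | 394275049  |

Download size: 1989406585 bytes
Dataset size: 1964125224 bytes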
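
As a quick consistency check on these figures, the sketch below recomputes two derived quantities: the average clip length per split, using the hours reported in the dataset card's split table (6.92 and 1.73), and the sum of the per-split byte counts compared against the card's `dataset_size`. The function names are illustrative only.

```python
# Figures copied from the dataset card's metadata and split table.
SPLITS = {
    "train": {"num_examples": 8000, "num_bytes": 1569850175, "hours": 6.92},
    "validation": {"num_examples": 2000, "num_bytes": 394275049, "hours": 1.73},
}
DATASET_SIZE = 1964125224  # dataset_size from the card, in bytes


def mean_clip_seconds(split: dict) -> float:
    """Average clip duration in seconds for one split."""
    return split["hours"] * 3600 / split["num_examples"]


def total_bytes(splits: dict) -> int:
    """Sum of the per-split byte counts."""
    return sum(s["num_bytes"] for s in splits.values())


for name, split in SPLITS.items():
    print(f"{name}: {mean_clip_seconds(split):.2f} s per clip on average")

print("per-split bytes add up to dataset_size:", total_bytes(SPLITS) == DATASET_SIZE)
```

Both splits work out to roughly 3.1 seconds per clip, and the two `num_bytes` values sum to the reported dataset size, which suggests the 1989406585 figure is the separate download size rather than a per-split value.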

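Dataset creation

  • Data source: scraped from the YouTube channel "כאן" (Kan), with outliers removed based on clip length and on the ratio between audio length and sentence length.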
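
The card does not say how the outlier filter was implemented or which cut-offs were used. The sketch below shows one plausible way to filter by clip length and by the text-to-audio ratio; every threshold and function name in it is a hypothetical placeholder chosen for illustration, not the curator's actual setting.

```python
from typing import Iterable, List, Tuple

# Hypothetical thresholds: the card does not state the values actually used.
MIN_SECONDS = 1.0
MAX_SECONDS = 15.0
MIN_CHARS_PER_SECOND = 3.0
MAX_CHARS_PER_SECOND = 25.0


def is_outlier(duration_s: float, sentence: str) -> bool:
    """Flag a clip whose length or text-to-audio ratio looks implausible."""
    # Length check first; it also rules out a zero duration before dividing.
    if not (MIN_SECONDS <= duration_s <= MAX_SECONDS):
        return True
    chars_per_second = len(sentence.strip()) / duration_s
    return not (MIN_CHARS_PER_SECOND <= chars_per_second <= MAX_CHARS_PER_SECOND)


def drop_outliers(pairs: Iterable[Tuple[float, str]]) -> List[Tuple[float, str]]:
    """Keep only the (duration, sentence) pairs that pass both checks."""
    return [(d, s) for d, s in pairs if not is_outlier(d, s)]
```

A ratio check of this kind is useful for scraped subtitles because it catches clips where the transcript is far too long or too short for the audio, which often indicates misaligned timestamps.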