johnatanebonilla/coser

Name: johnatanebonilla/coser
Creator: johnatanebonilla
Published: 2024-01-03 12:15:06
License: No description available

Hugging Face. Updated 2024-01-03; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/johnatanebonilla/coser

Resource description:

---
dataset_info:
  features:
  - name: audio
    dtype: audio
  - name: filename
    dtype: string
  - name: turno_id
    dtype: int64
  - name: turno_time
    dtype: string
  - name: sentence
    dtype: string
  - name: sentence_fono
    dtype: string
  - name: sentence_fono_sin_marcas
    dtype: string
  - name: sentence_orto
    dtype: string
  - name: sentence_orto_sin_marcas
    dtype: string
  - name: Provincia
    dtype: string
  - name: Enclave
    dtype: string
  - name: Fecha
    dtype: string
  - name: Duración
    dtype: string
  - name: Informantes
    dtype: string
  splits:
  - name: train
    num_bytes: 4600923777.433
    num_examples: 53971
  - name: validation
    num_bytes: 503026194.46
    num_examples: 6689
  - name: test
    num_bytes: 486076659.954
    num_examples: 6726
  download_size: 4707509912
  dataset_size: 5590026631.847
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
task_categories:
- automatic-speech-recognition
- conversational
language:
- es
pretty_name: COSER-ASR Subset
size_categories:
- 10K<n<100K
---

# Introduction

The "COSER-ASR" Subset is a specialized extract from the "Corpus Oral y Sonoro del Español Rural" (COSER; Fernández-Ordóñez 2005-present), meaning the "Audible Corpus of Spoken Rural Spanish". This dataset has been specifically curated to facilitate the fine-tuning of Whisper, an automatic speech recognition system. For this purpose, audio and text segments ranging from 3 to 30 seconds have been automatically extracted from the COSER corpus. These segments provide concise and diverse samples of spoken rural Spanish, ideal for training and refining speech recognition models. To ensure manageability and efficient processing, a maximum of 1024 tokens were used in the dataset, striking a balance between comprehensive coverage and computational efficiency.
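The 3 to 30 second window mentioned in the introduction can be illustrated with a small duration check. This is only a sketch: the helper names `duration_seconds` and `within_window` are hypothetical, and it assumes a clip's length is derived from its sample count and sampling rate. It does not reproduce the extraction procedure actually used to build the subset.

```python
# Illustrative only: the bounds come from the dataset card; the helpers are hypothetical.
MIN_SECONDS = 3.0   # lower bound stated in the introduction
MAX_SECONDS = 30.0  # upper bound stated in the introduction


def duration_seconds(num_samples: int, sampling_rate: int) -> float:
    """Return the clip length in seconds for a mono waveform."""
    if sampling_rate <= 0:
        raise ValueError("sampling_rate must be positive")
    return num_samples / sampling_rate


def within_window(num_samples: int, sampling_rate: int,
                  lo: float = MIN_SECONDS, hi: float = MAX_SECONDS) -> bool:
    """Check whether a clip falls inside the inclusive [lo, hi] window."""
    return lo <= duration_seconds(num_samples, sampling_rate) <= hi


# A 5-second clip at 16 kHz passes; a 2-second clip does not.
keep = within_window(16000 * 5, 16000)
drop = within_window(16000 * 2, 16000)
```

A check like this could be applied to the decoded `audio` column to confirm that downloaded examples respect the stated range.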

# Content and Demographic Focus

The original COSER dataset includes 218 transcriptions of semi-structured interviews, primarily with elderly, less-educated individuals from rural Spain. These interviews, each averaging around 54 minutes, are rich in dialectal variations and linguistic nuances, offering valuable insights into traditional Spanish dialects.

# Transcription Approach

The "coser" dataset provides multiple layers of transcription to cater to different linguistic and computational needs:

### Original Transcription (sentence):

This is the direct, complete original transcription of the audio segments, preserving the original speech as closely as possible.

### Phonological Approximation (sentence_fono):

Here, the transcription is modified to reflect the phonological characteristics of the dialectal pronunciation. This version is crucial for understanding the phonetic nuances of rural Spanish dialects.

### Phonological Transcription without Discourse Markers (sentence_fono_sin_marcas):

This transcription removes discourse markers such as laughter, assent, etc., that are typically enclosed in square brackets. It offers a cleaner version focusing solely on the spoken words.

### Orthographic Correspondence (sentence_orto):

This layer provides the standard orthographic equivalent of the words transcribed phonologically. It bridges the gap between dialectal speech and standard Spanish orthography.

### Orthographic Transcription without Discourse Markers (sentence_orto_sin_marcas):

Similar to the phonological version without markers, this transcription provides a standard orthographic text devoid of any discourse markers. This is particularly useful for applications requiring clean text data.

# Limitations

One limitation of this dataset is that the time intervals in the COSER corpus are not systematically aligned, so there may not be a perfect one-to-one correspondence between the audio and text data.

# Additional Information and Resources

To explore more about the COSER corpus, its methodologies, and the full range of transcriptions, visit http://coser.lllf.uam.es/ and http://coser.lllf.uam.es/transcripcion.php. These resources provide an in-depth look at the COSER project, detailing its comprehensive approach to capturing the linguistic diversity of rural Spanish.

# References

Fernández-Ordóñez, I. (Ed.). (2005-present). Corpus Oral y Sonoro del Español Rural. Retrieved April 15, 2022, from http://www.corpusrural.es/

Provider:

johnatanebonilla

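Summary of original information

Dataset overview

Dataset information

Features:
- audio: audio data
- filename: file name
- turno_id: turn identifier
- turno_time: turn time
- sentence: original transcription
- sentence_fono: phonological approximation
- sentence_fono_sin_marcas: phonological transcription without discourse markers
- sentence_orto: orthographic correspondence
- sentence_orto_sin_marcas: orthographic transcription without discourse markers
- Provincia: province
- Enclave: locality
- Fecha: date
- Duración: duration
- Informantes: informants
Splits:
- train: training set, 53971 examples, 4600923777.433 bytes
- validation: validation set, 6689 examples, 503026194.46 bytes
- test: test set, 6726 examples, 486076659.954 bytes
Dataset size:
- Download size: 4707509912 bytes
- Dataset size: 5590026631.847 bytes
Configuration:
- default:
  - train path: data/train-*
  - validation path: data/validation-*
  - test path: data/test-*
Task categories:
- automatic speech recognition
- conversational
Language:
- Spanish
Dataset name:
- COSER-ASR Subset
Size category:
- 10K<n<100K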
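
The split figures listed above can be cross-checked with a few lines of arithmetic. The sketch below copies the example counts and byte sizes from the metadata; the helper name `split_shares` is hypothetical and serves only this illustration.

```python
# Example counts and byte sizes copied from the dataset metadata above.
SPLIT_EXAMPLES = {"train": 53971, "validation": 6689, "test": 6726}
SPLIT_BYTES = {
    "train": 4600923777.433,
    "validation": 503026194.46,
    "test": 486076659.954,
}
REPORTED_DATASET_SIZE = 5590026631.847  # dataset_size from the metadata


def split_shares(counts: dict) -> dict:
    """Return each split's fraction of the total number of examples."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


total_examples = sum(SPLIT_EXAMPLES.values())
shares = split_shares(SPLIT_EXAMPLES)
total_bytes = sum(SPLIT_BYTES.values())

for name, share in shares.items():
    print(f"{name}: {share:.1%} of {total_examples} examples")
```

The three splits add up to 67386 examples in roughly an 80/10/10 proportion, and the per-split byte sizes sum to the reported dataset size.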

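Content and demographic focus

The original COSER dataset includes 218 semi-structured interviews, primarily with elderly, less-educated individuals from rural Spain. The interviews average around 54 minutes each and are rich in dialectal variation and linguistic detail, offering valuable insight into traditional Spanish dialects.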

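Transcription approach

Original transcription (sentence): the direct, complete original transcription of the audio segments, preserving the original speech as closely as possible.
Phonological approximation (sentence_fono): the transcription is modified to reflect the phonological characteristics of the dialectal pronunciation, which is crucial for understanding the phonetic nuances of rural Spanish dialects.
Phonological transcription without discourse markers (sentence_fono_sin_marcas): removes discourse markers (such as laughter and assent), giving a cleaner version that focuses only on the spoken words.
Orthographic correspondence (sentence_orto): provides the standard orthographic equivalent of the phonological transcription, bridging the gap between dialectal speech and standard Spanish orthography.
Orthographic transcription without discourse markers (sentence_orto_sin_marcas): like the marker-free phonological transcription, provides standard orthographic text without discourse markers, which is particularly useful for applications that need clean text data.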
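
The relationship between a transcription layer and its marker-free counterpart can be approximated with a regular expression. This is a minimal sketch under stated assumptions: it treats anything inside square brackets as a discourse marker, the function name `strip_markers` and the marker labels in the example are hypothetical, and the published `*_sin_marcas` fields may have been produced differently.

```python
import re

# Assumption: a discourse marker is any non-nested span in square brackets.
_BRACKETED = re.compile(r"\[[^\[\]]*\]")


def strip_markers(text: str) -> str:
    """Remove bracketed markers and collapse the whitespace left behind."""
    without_markers = _BRACKETED.sub(" ", text)
    return re.sub(r"\s+", " ", without_markers).strip()


# Hypothetical example sentence; the marker labels are invented for illustration.
cleaned = strip_markers("sí, [RISAS] claro [Asentimiento] que sí")
```

Note that this simple approach can leave a space before punctuation when a marker directly precedes it, so real preprocessing would likely need extra clean-up rules.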

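Limitations

One limitation of this dataset is that the time intervals in the COSER corpus are not systematically aligned, so there may not be a perfect one-to-one correspondence between the audio and text data.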
