johnatanebonilla/coser
Hugging Face | Updated 2024-01-03 | Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/johnatanebonilla/coser
Description:
---
dataset_info:
  features:
  - name: audio
    dtype: audio
  - name: filename
    dtype: string
  - name: turno_id
    dtype: int64
  - name: turno_time
    dtype: string
  - name: sentence
    dtype: string
  - name: sentence_fono
    dtype: string
  - name: sentence_fono_sin_marcas
    dtype: string
  - name: sentence_orto
    dtype: string
  - name: sentence_orto_sin_marcas
    dtype: string
  - name: Provincia
    dtype: string
  - name: Enclave
    dtype: string
  - name: Fecha
    dtype: string
  - name: Duración
    dtype: string
  - name: Informantes
    dtype: string
  splits:
  - name: train
    num_bytes: 4600923777.433
    num_examples: 53971
  - name: validation
    num_bytes: 503026194.46
    num_examples: 6689
  - name: test
    num_bytes: 486076659.954
    num_examples: 6726
  download_size: 4707509912
  dataset_size: 5590026631.847
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
task_categories:
- automatic-speech-recognition
- conversational
language:
- es
pretty_name: COSER-ASR Subset
size_categories:
- 10K<n<100K
---
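The metadata above names three splits and a single `default` config, which is enough to sketch how the data might be loaded. The example below is a minimal sketch, not an official loader: it assumes the third-party `datasets` library is installed and that the repository `johnatanebonilla/coser` is still reachable on the Hugging Face Hub. The helper names `load_coser_split` and `split_fractions` are hypothetical and introduced only for illustration.

```python
REPO_ID = "johnatanebonilla/coser"

# Example counts per split, copied from the metadata above.
SPLIT_SIZES = {"train": 53971, "validation": 6689, "test": 6726}


def split_fractions(sizes: dict) -> dict:
    """Return each split's share of the total number of examples."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


def load_coser_split(split: str, streaming: bool = True):
    """Load one split (hypothetical helper; needs network access).

    Streaming is assumed as the default here so that the full download
    (about 4.7 GB according to the metadata) is not fetched up front.
    """
    if split not in SPLIT_SIZES:
        raise ValueError(f"unknown split: {split!r}")
    # Imported lazily so the rest of the sketch runs without the library.
    from datasets import load_dataset  # third-party: pip install datasets
    return load_dataset(REPO_ID, split=split, streaming=streaming)


if __name__ == "__main__":
    for name, fraction in split_fractions(SPLIT_SIZES).items():
        print(f"{name}: {fraction:.1%}")
```

With network access, `next(iter(load_coser_split("validation")))` should yield one record whose keys match the feature names listed in the metadata. Decoding the `audio` column may require additional audio dependencies.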
# Introduction
The "COSER-ASR" Subset is a specialized extract from the "Corpus Oral y Sonoro del Español Rural" (COSER; Fernández-Ordóñez 2005-present), the "Audible Corpus of Spoken Rural Spanish". The dataset was curated to facilitate the fine-tuning of Whisper, an automatic speech recognition system. For this purpose, audio and text segments ranging from 3 to 30 seconds were automatically extracted from the COSER corpus. These segments provide concise and diverse samples of spoken rural Spanish, suited to training and refining speech recognition models. To keep processing manageable, a limit of 1024 tokens was applied in the dataset, balancing coverage against computational cost.
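The 3 to 30 second window described above can be expressed as a simple filter. The following is a minimal sketch, not the extraction code used to build the dataset: it assumes segment boundaries are available as start and end times in seconds and that both bounds are inclusive. The function name and the sample boundaries are hypothetical.

```python
MIN_SECONDS = 3.0
MAX_SECONDS = 30.0


def within_duration_window(start_s: float, end_s: float) -> bool:
    """Return True when the segment length lies in the 3 to 30 second window.

    Inclusive bounds are an assumption; the dataset card does not say
    whether the limits themselves are included.
    """
    duration = end_s - start_s
    return MIN_SECONDS <= duration <= MAX_SECONDS


# Hypothetical (start, end) pairs in seconds, for illustration only.
segments = [(0.0, 2.5), (2.5, 10.0), (10.0, 45.0), (45.0, 75.0)]
kept = [segment for segment in segments if within_duration_window(*segment)]
print(kept)  # [(2.5, 10.0), (45.0, 75.0)]
```

The upper bound matches the 30 second input window that Whisper processes at a time.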
# Content and Demographic Focus
The original COSER dataset includes 218 transcriptions of semi-structured interviews primarily with elderly, less-educated individuals from rural Spain. These interviews, each averaging around 54 minutes, are rich in dialectal variations and linguistic nuances, offering valuable insights into traditional Spanish dialects.
# Transcription Approach
The "coser" dataset provides multiple layers of transcription to cater to different linguistic and computational needs:
### Original Transcription (sentence):
This is the direct transcription of the audio segment. It keeps the complete original transcription and preserves the original speech as closely as possible.
### Phonological Approximation (sentence_fono):
Here, the transcription is modified to reflect the phonological characteristics of the dialectal pronunciation. This version is crucial for understanding the phonetic nuances of rural Spanish dialects.
### Phonological Transcription without Discourse Markers (sentence_fono_sin_marcas):
This transcription removes discourse markers such as laughter, assent, etc., that are typically enclosed in square brackets. It offers a cleaner version focusing solely on the spoken words.
### Orthographic Correspondence (sentence_orto):
This layer provides the standard orthographic equivalent of the words transcribed phonologically. It bridges the gap between dialectal speech and standard Spanish orthography.
### Orthographic Transcription without Discourse Markers (sentence_orto_sin_marcas):
Similar to the phonological version without markers, this transcription provides a standard orthographic text devoid of any discourse markers. This is particularly useful for applications requiring clean text data.
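The relationship between a marked layer and its `sin_marcas` counterpart can be approximated with a regular expression. This is a minimal sketch, not the procedure used to build the dataset: it assumes every discourse marker is a non-nested span in square brackets, and the labels `[RISA]` and `[ASENT]` in the sample are hypothetical placeholders.

```python
import re

# Matches one non-nested bracketed span such as "[RISA]".
BRACKETED = re.compile(r"\[[^\[\]]*\]")


def strip_bracketed_markers(text: str) -> str:
    """Remove bracketed markers and collapse the whitespace left behind."""
    without_markers = BRACKETED.sub(" ", text)
    return re.sub(r"\s+", " ", without_markers).strip()


# Hypothetical sample sentence, for illustration only.
sample = "sí, [RISA] claro [ASENT] que sí"
print(strip_bracketed_markers(sample))  # sí, claro que sí
```

When the result is used as an ASR training target, the dataset's own `sentence_fono_sin_marcas` and `sentence_orto_sin_marcas` columns are the safer choice, since they may apply rules this sketch does not cover.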
# Limitations
A limitation of this dataset is that the time intervals in the COSER corpus are not systematically aligned, so there may not be a perfect one-to-one correspondence between the audio and text data.
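One way to screen for the alignment problem described above is to compare transcript length with audio duration. The heuristic below is an assumption added for illustration and is not part of the COSER project: the function names are hypothetical, and the thresholds of 2 and 30 characters per second are arbitrary starting points that would need tuning on real data.

```python
def chars_per_second(text: str, duration_s: float) -> float:
    """Return the number of transcript characters per second of audio."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    return len(text.strip()) / duration_s


def looks_misaligned(text: str, duration_s: float,
                     low: float = 2.0, high: float = 30.0) -> bool:
    """Flag a pair whose speaking rate falls outside [low, high].

    A very low rate suggests audio with little matching text; a very
    high rate suggests text that overflows the audio span.
    """
    rate = chars_per_second(text, duration_s)
    return rate < low or rate > high


# Hypothetical examples, for illustration only.
print(looks_misaligned("hola", 20.0))                # True
print(looks_misaligned("buenos días a todos", 2.0))  # False
```

Flagged pairs are candidates for manual review rather than automatic removal, because slow speech and long pauses are expected in interview recordings.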
# Additional Information and Resources
To explore more about the COSER corpus, its methodologies, and the full range of transcriptions, visit http://coser.lllf.uam.es/ and http://coser.lllf.uam.es/transcripcion.php. These resources provide an in-depth look at the COSER project, detailing its comprehensive approach to capturing the linguistic diversity of rural Spanish.
# References
Fernández-Ordóñez, I. (Ed.). (2005-present). Corpus Oral y Sonoro del Español Rural. Retrieved April 15, 2022, from http://www.corpusrural.es/
Provider:
johnatanebonilla
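Summary of original information
Dataset overview
Dataset information
- Features:
  - audio: audio data
  - filename: file name
  - turno_id: identifier
  - turno_time: time
  - sentence: original transcription
  - sentence_fono: phonological approximation
  - sentence_fono_sin_marcas: phonological transcription without discourse markers
  - sentence_orto: orthographic correspondence
  - sentence_orto_sin_marcas: orthographic transcription without discourse markers
  - Provincia: province
  - Enclave: enclave
  - Fecha: date
  - Duración: duration
  - Informantes: informants
- Splits:
  - train: training set, 53971 examples, 4600923777.433 bytes
  - validation: validation set, 6689 examples, 503026194.46 bytes
  - test: test set, 6726 examples, 486076659.954 bytes
- Dataset size:
  - Download size: 4707509912 bytes
  - Dataset size: 5590026631.847 bytes
- Configs:
  - default:
    - train path: data/train-*
    - validation path: data/validation-*
    - test path: data/test-*
- Task categories:
  - automatic speech recognition
  - conversational
- Language:
  - Spanish
- Dataset name:
  - COSER-ASR Subset
- Size category:
  - 10K<n<100K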



