jimregan/clarinpl_studio

Name: jimregan/clarinpl_studio
Creator: jimregan
Published: 2023-01-21 12:27:08
License: No description provided

Hugging Face, updated 2023-01-21, indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/jimregan/clarinpl_studio

Resource description:

---
annotations_creators:
- expert-generated
language:
- pl
license:
- other
multilinguality:
- monolingual
size_categories:
- 10K<n<100K
source_datasets:
- original
task_categories:
- other
- automatic-speech-recognition
task_ids: []
---

# Dataset Card for ClarinPL Studio Speech Corpus

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [CLARIN-PL mowa](https://mowa.clarin-pl.eu/)
- **Repository:** [Kaldi Baseline](https://github.com/danijel3/ClarinStudioKaldi)
- **Paper:** [Polish Read Speech Corpus for Speech Tools and Services](https://arxiv.org/abs/1706.00245)
- **Leaderboard:** [Needs More Information]
- **Point of Contact:** [Danijel Koržinek](https://github.com/danijel3/)

### Dataset Summary

The corpus consists of 317 speakers recorded in 554 sessions, where each session consists of 20 read sentences and 10 phonetically rich words. The audio portion of the corpus amounts to around 56 hours, with transcriptions containing 356674 words from a vocabulary of 46361 words.

### Supported Tasks and Leaderboards

[Needs More Information]

### Languages

The audio is in Polish.

## Dataset Structure

### Data Instances

A typical data point comprises the path to the audio file, usually called `file`, and its transcription, called `text`. An example from the dataset is:

```
{'file': '/root/.cache/huggingface/datasets/downloads/extracted/333ddc746f2df1e1d19b44986992d4cbe28710fde81d533a220e755ee6c5c519/audio/SES0001/rich001.wav',
 'id': 'SES0001_rich001',
 'speaker_id': 'SPK0001',
 'text': 'drożdże dżip gwożdżenie ozimina wędzarz rdzeń wędzonka ingerować kładzenie jutrzenka'}
```

### Data Fields

- file: A path to the downloaded audio file in .wav format.
- id: The identifier of the utterance, as shown in the example above (`SES0001_rich001`).
- text: The transcription of the audio file.
- speaker_id: The ID of the speaker of the audio.

### Data Splits

|         | Train | Test | Valid |
| ------- | ----- | ---- | ----- |
| dataset | 11222 | 1362 | 1229  |

## Dataset Creation

### Curation Rationale

The purpose of this segment of the project was to develop specific tools that would allow for automatic and semi-automatic processing of large quantities of acoustic speech data. Another purpose of the corpus was to serve as a reference for studies in phonetics and pronunciation.

### Source Data

#### Initial Data Collection and Normalization

The corpus was recorded in a studio environment using two microphones: a high-quality studio microphone and a typical consumer audio headset.

#### Who are the source language producers?

[Needs More Information]

### Annotations

#### Annotation process

[Needs More Information]

#### Who are the annotators?

[Needs More Information]

### Personal and Sensitive Information

[Needs More Information]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[Needs More Information]

## Additional Information

### Dataset Curators

[Needs More Information]

### Licensing Information

[CLARIN PUB+BY+INF+NORED](https://mowa.clarin-pl.eu/korpusy/LICENSE)

### Citation Information

```
@article{korvzinek2017polish,
  title={Polish read speech corpus for speech tools and services},
  author={Kor{\v{z}}inek, Danijel and Marasek, Krzysztof and Brocki, {\L}ukasz and Wo{\l}k, Krzysztof},
  journal={arXiv preprint arXiv:1706.00245},
  year={2017}
}
```

### Contributions

[Needs More Information]
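
As a rough sketch of how a record like the example instance in the card could be inspected, the helper below splits the `id` field into a session part and an utterance part. The `SESxxxx_name` pattern is inferred from the single example in the card and is an assumption, as are the helper names `split_utterance_id` and `load_split`. `load_split` relies on `load_dataset` from the Hugging Face `datasets` library with the repository name from this page; it is not called here because it needs network access, and the split name `"train"` and any extra arguments required by newer library versions are untested assumptions.

```python
def split_utterance_id(utt_id: str) -> tuple[str, str]:
    """Split an id such as 'SES0001_rich001' into (session, utterance)."""
    session, _, utterance = utt_id.partition("_")
    return session, utterance


def load_split(split: str = "train"):
    """Load one split from the Hub (needs the `datasets` package and network access)."""
    # Imported lazily so the helper above works without the library installed.
    from datasets import load_dataset
    return load_dataset("jimregan/clarinpl_studio", split=split)


# The example instance from the card; the local cache prefix of the path is
# shortened here to its final components.
example = {
    "file": "audio/SES0001/rich001.wav",
    "id": "SES0001_rich001",
    "speaker_id": "SPK0001",
    "text": "drożdże dżip gwożdżenie ozimina wędzarz rdzeń wędzonka ingerować kładzenie jutrzenka",
}

session, utterance = split_utterance_id(example["id"])
n_words = len(example["text"].split())
print(session, utterance, n_words)
```

The word count of this record matches the "10 phonetically rich words" per session mentioned in the dataset summary.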

Provider:

jimregan

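Summary of original information

Dataset overview

Dataset description

Name: ClarinPL Studio Speech Corpus
Language: Polish
License: other
Dataset size: 10K<n<100K
Task category: automatic speech recognition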

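Dataset summary

Content: recordings of 317 speakers in 554 sessions; each session consists of 20 read sentences and 10 phonetically rich words.
Audio duration: about 56 hours
Vocabulary size: 46361
Transcribed words: 356674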
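
A few derived ratios follow from these figures. The sketch below is back-of-the-envelope arithmetic only: the 56-hour duration is approximate, so the per-hour rate is indicative, and the variable names are illustrative.

```python
speakers = 317
sessions = 554
hours = 56            # approximate, per the dataset summary
tokens = 356_674      # transcribed words
vocabulary = 46_361   # vocabulary size

sessions_per_speaker = sessions / speakers
words_per_hour = tokens / hours
type_token_ratio = vocabulary / tokens

print(f"sessions per speaker: {sessions_per_speaker:.2f}")
print(f"words per hour:       {words_per_hour:.0f}")
print(f"type/token ratio:     {type_token_ratio:.3f}")
```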

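Dataset structure

Data instances: each data point includes the path to the audio file and its transcription.
Data fields:
- file: path to the audio file (.wav format)
- text: transcription of the audio file
- speaker_id: ID of the speaker
Data splits:
- Train: 11222 instances
- Test: 1362 instances
- Valid: 1229 instances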
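
To see how the instances are distributed, the split sizes above can be turned into proportions. This is a minimal sketch; the dictionary keys are labels taken from the table and are not guaranteed to match the split names used on the Hub.

```python
# Split sizes as listed above.
splits = {"train": 11222, "test": 1362, "valid": 1229}

total = sum(splits.values())
shares = {name: count / total for name, count in splits.items()}

for name, count in splits.items():
    print(f"{name:5s} {count:6d} {shares[name]:6.1%}")
print(f"total {total:6d}")
```

The split is roughly 81% train, 10% test and 9% validation.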

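Dataset creation

Purpose: to develop tools for processing large quantities of acoustic speech data, and to serve as a reference for studies in phonetics and pronunciation.
Recording environment: a studio, using a high-quality studio microphone and a typical consumer audio headset.

Licensing information

License type: CLARIN PUB+BY+INF+NORED

Citation information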

@article{korvzinek2017polish, title={Polish read speech corpus for speech tools and services}, author={Kor{\v{z}}inek, Danijel and Marasek, Krzysztof and Brocki, {\L}ukasz and Wo{\l}k, Krzysztof}, journal={arXiv preprint arXiv:1706.00245}, year={2017} }
