Fhrozen/CABankSakuraCHJP
Hugging Face, updated 2022-12-03, indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/Fhrozen/CABankSakuraCHJP
Description:
---
annotations_creators:
- expert-generated
language_creators:
- crowdsourced
- expert-generated
language:
- ja
license:
- cc
multilinguality:
- monolingual
size_categories:
- 100K<n<1M
source_datasets:
- found
task_categories:
- audio-classification
- automatic-speech-recognition
task_ids:
- speaker-identification
pretty_name: banksakura
tags:
- speech-recognition
---
# CABank Japanese CallHome Corpus
- Participants: 120
- Type of Study: phone call
- Location: United States
- Media type: audio
- DOI: doi:10.21415/T5H59V
- Web: https://ca.talkbank.org/access/CallHome/jpn.html
## Citation information
Some citation here.
In accordance with TalkBank rules, any use of data from this corpus must be accompanied by at least one of the above references.
## Project Description
This is the Japanese portion of CallHome.
Speakers were solicited by the LDC to participate in this telephone speech collection effort via the internet, publications (advertisements), and personal contacts. A total of 200 call originators were found, each of whom placed a telephone call via a toll-free robot operator maintained by the LDC. Access to the robot operator was possible via a unique Personal Identification Number (PIN) issued by the recruiting staff at the LDC when the caller enrolled in the project. The participants were made aware that their telephone call would be recorded, as were the call recipients. The call was allowed only if both parties agreed to being recorded. Each caller was allowed to talk up to 30 minutes. Upon successful completion of the call, the caller was paid $20 (in addition to making a free long-distance telephone call). Each caller was allowed to place only one telephone call.
Although the goal of the call collection effort was to have unique speakers in all calls, a handful of repeat speakers are included in the corpus. In all, 200 calls were transcribed. Of these, 80 have been designated as training calls, 20 as development test calls, and 100 as evaluation test calls. For each of the training and development test calls, a contiguous 10-minute region was selected for transcription; for the evaluation test calls, a 5-minute region was transcribed. For the present publication, only 20 of the evaluation test calls are being released; the remaining 80 test calls are being held in reserve for future LVCSR benchmark tests.
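The split arithmetic in the paragraph above can be checked with a short sketch. The call counts and region lengths are taken from the description; the dictionary layout and the `totals` helper are hypothetical names introduced here for illustration and are not part of the corpus or of any loader. The minute figures are nominal, since they assume every released call contributes exactly the stated region length.

```python
# Figures as stated in the project description; the structure itself is illustrative.
SPLITS = {
    "train": {"transcribed_calls": 80, "released_calls": 80, "minutes_per_call": 10},
    "dev_test": {"transcribed_calls": 20, "released_calls": 20, "minutes_per_call": 10},
    "eval_test": {"transcribed_calls": 100, "released_calls": 20, "minutes_per_call": 5},
}


def totals(splits):
    """Aggregate call counts and nominal transcribed minutes across splits."""
    transcribed = sum(s["transcribed_calls"] for s in splits.values())
    released = sum(s["released_calls"] for s in splits.values())
    # Nominal duration: released calls times the stated region length per call.
    released_minutes = sum(
        s["released_calls"] * s["minutes_per_call"] for s in splits.values()
    )
    return {
        "transcribed_calls": transcribed,
        "released_calls": released,
        "held_in_reserve": transcribed - released,
        "released_minutes": released_minutes,
    }


summary = totals(SPLITS)
print(summary)
```

Under these assumptions the sketch gives 200 transcribed calls, 120 released calls, 80 held in reserve, and a nominal 1,100 minutes of transcribed speech in the release. The 120 released calls coincide with the figure listed under Participants, although the card does not state how the two numbers are related.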
After a successful call was completed, a human audit of each telephone call was conducted to verify that the proper language was spoken, to check the quality of the recording, and to select and describe the region to be transcribed. The description of the transcribed region provides information about channel quality, number of speakers, their gender, and other attributes.
## Acknowledgements
Andrew Yankes reformatted this corpus to conform to current versions of CHAT.
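Provider:
Fhrozen
Summary of original information
Dataset overview
Basic information
- Name: CABank Japanese CallHome Corpus
- Language: Japanese (ja)
- License: Creative Commons (cc)
- Multilinguality: monolingual
- Size: 100K<n<1M
Data source and tasks
- Source: found
- Task categories:
  - Audio classification
  - Automatic speech recognition
- Task IDs: speaker identification (speaker-identification)
Dataset creation
- Annotation creators: expert-generated
- Language creators:
  - Crowdsourced
  - Expert-generated
Dataset details
- Participants: 120
- Type of study: phone call
- Media type: audio
- Recording method: calls were placed through a toll-free robot operator maintained by the LDC, accessed with a unique Personal Identification Number (PIN).
- Call duration: up to 30 minutes
- Payment: $20 on successful completion of the call
- Call limit: one call per caller
- Data splits:
  - Training: 80 calls
  - Development test: 20 calls
  - Evaluation test: 100 calls (20 currently released, the remaining 80 held in reserve)
Data processing
- Recording audit: after each call, a human audit verified that the proper language was spoken, checked the recording quality, and selected the region to be transcribed.
- Transcription description: provides channel quality, number of speakers, their gender, and other attributes.
Contributors
- Formatting: Andrew Yankes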



