sophiayk20/covoswitch
Source: Hugging Face. Updated 2024-07-18; indexed 2024-07-22.
Download link:
https://hf-mirror.com/datasets/sophiayk20/covoswitch
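Description:
CoVoSwitch is a code-switched text dataset. It was created by detecting and replacing intonation units in the CoVoST 2 speech-to-text translation dataset, using PSST, a pretrained speech segmentation model fine-tuned from Whisper. The accompanying paper will appear at the ACL 2024 Student Research Workshop.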
The dataset consists of multiple configurations, one for each language paired with English. Each configuration has two features, id and translation, where translation holds three string fields: the text in the paired language, the code-switched text (csw), and the English text (en). Each configuration is divided into train, validation, and test splits, with the example count and byte size of each split listed below. The dataset is intended for translation tasks between these languages and English.
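As a minimal sketch of how one configuration might be accessed, the snippet below assumes the Hugging Face `datasets` library is installed and that the dataset is reachable under the repository ID shown in the download link. The helper names `source_language`, `load_pair`, and `describe` are hypothetical and exist only for illustration.

```python
REPO_ID = "sophiayk20/covoswitch"  # repository ID taken from the download link


def source_language(config: str) -> str:
    """Return the non-English language code of a config name such as 'ar_en'."""
    lang, sep, target = config.partition("_")
    if not lang or sep != "_" or target != "en":
        raise ValueError(f"expected a '<lang>_en' config name, got {config!r}")
    return lang


def load_pair(config: str):
    """Load all splits of one language-pair configuration (needs network access)."""
    # Imported lazily so the other helpers work without the library installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config)


def describe(example: dict, config: str) -> str:
    """Format one example, assuming the id/translation layout described above."""
    lang = source_language(config)
    t = example["translation"]
    return f"[{example['id']}] {lang}: {t[lang]} | csw: {t['csw']} | en: {t['en']}"
```

Under these assumptions, `describe(load_pair("ar_en")["train"][0], "ar_en")` would print one Arabic, code-switched, and English triple. The call is not executed here because it downloads data.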
Provider:
sophiayk20
Summary of original information
Dataset overview
Dataset configurations
ar_en
- Features:
  - id: int64
  - translation:
    - ar: string
    - csw: string
    - en: string
- Splits:
  - train: 39873587 bytes, 145115 examples
  - validation: 1857210 bytes, 6784 examples
  - test: 1319939 bytes, 5176 examples
- Download size: 21799684 bytes
- Dataset size: 43050736 bytes
ca_en
- Features:
  - id: int64
  - translation:
    - ca: string
    - csw: string
    - en: string
- Splits:
  - train: 33611880 bytes, 143880 examples
  - validation: 1571889 bytes, 6717 examples
  - test: 1114497 bytes, 5137 examples
- Download size: 19688583 bytes
- Dataset size: 36298266 bytes
cy_en
- Features:
  - id: int64
  - translation:
    - csw: string
    - cy: string
    - en: string
- Splits:
  - train: 32651343 bytes, 143473 examples
  - validation: 1523599 bytes, 6684 examples
  - test: 1105227 bytes, 5150 examples
- Download size: 19373396 bytes
- Dataset size: 35280169 bytes
de_en
- Features:
  - id: int64
  - translation:
    - csw: string
    - de: string
    - en: string
- Splits:
  - train: 34695308 bytes, 143851 examples
  - validation: 1621323 bytes, 6711 examples
  - test: 1164556 bytes, 5138 examples
- Download size: 20392347 bytes
- Dataset size: 37481187 bytes
et_en
- Features:
  - id: int64
  - translation:
    - csw: string
    - en: string
    - et: string
- Splits:
  - train: 32303652 bytes, 144239 examples
  - validation: 1513275 bytes, 6735 examples
  - test: 1081292 bytes, 5153 examples
- Download size: 19640808 bytes
- Dataset size: 34898219 bytes
fa_en
- Features:
  - id: int64
  - translation:
    - csw: string
    - en: string
    - fa: string
- Splits:
  - train: 41689266 bytes, 145605 examples
  - validation: 1926004 bytes, 6786 examples
  - test: 1391495 bytes, 5174 examples
- Download size: 21504177 bytes
- Dataset size: 45006765 bytes
id_en
- Features:
  - id: int64
  - translation:
    - csw: string
    - en: string
    - id: string
- Splits:
  - train: 33148671 bytes, 143277 examples
  - validation: 1539978 bytes, 6659 examples
  - test: 1120224 bytes, 5128 examples
- Download size: 19067544 bytes
- Dataset size: 35808873 bytes
lv_en
- Features:
  - id: int64
  - translation:
    - csw: string
    - en: string
    - lv: string
- Splits:
  - train: 33883903 bytes, 145320 examples
  - validation: 1580406 bytes, 6774 examples
  - test: 1132431 bytes, 5176 examples
- Download size: 20373539 bytes
- Dataset size: 36596740 bytes
mn_en
- Features:
  - id: int64
  - translation:
    - csw: string
    - en: string
    - mn: string
- Splits:
  - train: 45451036 bytes, 145154 examples
  - validation: 2127066 bytes, 6772 examples
  - test: 1498064 bytes, 5152 examples
- Download size: 22854954 bytes
- Dataset size: 49076166 bytes
sl_en
- Features:
  - id: int64
  - translation:
    - csw: string
    - en: string
    - sl: string
- Splits:
  - train: 32208205 bytes, 144361 examples
  - validation: 1515338 bytes, 6737 examples
  - test: 1071546 bytes, 5158 examples
- Download size: 19634212 bytes
- Dataset size: 34795089 bytes
sv_en
- Features:
  - id: int64
  - translation:
    - csw: string
    - en: string
    - sv: string
- Splits:
  - train: 32549473 bytes, 143235 examples
  - validation: 1513931 bytes, 6670 examples
  - test: 1029075 bytes, 4813 examples
- Download size: 19247807 bytes
- Dataset size: 35092479 bytes
ta_en
- Features:
  - id: int64
  - translation:
    - csw: string
    - en: string
    - ta: string
- Splits:
  - train: 67154406 bytes, 145227 examples
  - validation: 3173694 bytes, 6790 examples
  - test: 2243718 bytes, 5161 examples
- Download size: 26478753 bytes
- Dataset size: 72571818 bytes
tr_en
- Features:
  - id: int64
  - translation:
    - csw: string
    - en: string
    - tr: string
- Splits:
  - train: 33853623 bytes, 144543 examples
  - validation: 1586279 bytes, 6739 examples
  - test: 1127637 bytes, 5154 examples
- Download size: 19987244 bytes
- Dataset size: 36567539 bytes
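The per-split figures in the configuration listing can be cross-checked against each configuration's reported dataset size, and summed to get example totals. The sketch below copies the numbers from the listing; the `STATS` table and the helper names are hypothetical and are not part of the dataset itself.

```python
# (bytes, examples) for train / validation / test, followed by the reported
# dataset size in bytes. All figures are copied from the configuration listing.
STATS = {
    "ar_en": ((39873587, 145115), (1857210, 6784), (1319939, 5176), 43050736),
    "ca_en": ((33611880, 143880), (1571889, 6717), (1114497, 5137), 36298266),
    "cy_en": ((32651343, 143473), (1523599, 6684), (1105227, 5150), 35280169),
    "de_en": ((34695308, 143851), (1621323, 6711), (1164556, 5138), 37481187),
    "et_en": ((32303652, 144239), (1513275, 6735), (1081292, 5153), 34898219),
    "fa_en": ((41689266, 145605), (1926004, 6786), (1391495, 5174), 45006765),
    "id_en": ((33148671, 143277), (1539978, 6659), (1120224, 5128), 35808873),
    "lv_en": ((33883903, 145320), (1580406, 6774), (1132431, 5176), 36596740),
    "mn_en": ((45451036, 145154), (2127066, 6772), (1498064, 5152), 49076166),
    "sl_en": ((32208205, 144361), (1515338, 6737), (1071546, 5158), 34795089),
    "sv_en": ((32549473, 143235), (1513931, 6670), (1029075, 4813), 35092479),
    "ta_en": ((67154406, 145227), (3173694, 6790), (2243718, 5161), 72571818),
    "tr_en": ((33853623, 144543), (1586279, 6739), (1127637, 5154), 36567539),
}
SPLIT_NAMES = ("train", "validation", "test")


def total_examples(config: str) -> int:
    """Sum the example counts of the three splits."""
    return sum(count for _, count in STATS[config][:3])


def bytes_consistent(config: str) -> bool:
    """Check that the split byte sizes add up to the reported dataset size."""
    *splits, reported = STATS[config]
    return sum(size for size, _ in splits) == reported


def split_fractions(config: str) -> dict:
    """Return each split's share of the configuration's examples."""
    total = total_examples(config)
    return {
        name: count / total
        for name, (_, count) in zip(SPLIT_NAMES, STATS[config][:3])
    }
```

With these figures, the split byte sizes sum exactly to the reported dataset size for every configuration, and `total_examples("ar_en")` gives 157075.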
Data file paths
ar_en
- train: ar_en/train-*
- validation: ar_en/validation-*
- test: ar_en/test-*
ca_en
- train: ca_en/train-*
- validation: ca_en/validation-*
- test: ca_en/test-*
cy_en
- train: cy_en/train-*
- validation: cy_en/validation-*
- test: cy_en/test-*
de_en
- train: de_en/train-*
- validation: de_en/validation-*
- test: de_en/test-*
et_en
- train: et_en/train-*
- validation: et_en/validation-*
- test: et_en/test-*
fa_en
- train: fa_en/train-*
- validation: fa_en/validation-*
- test: fa_en/test-*
id_en
- train: id_en/train-*
- validation: id_en/validation-*
- test: id_en/test-*
lv_en
- train: lv_en/train-*
- validation: lv_en/validation-*
- test: lv_en/test-*
mn_en
- train: mn_en/train-*
- validation: mn_en/validation-*
- test: mn_en/test-*
sl_en
- train: sl_en/train-*
- validation: sl_en/validation-*
- test: sl_en/test-*
sv_en
- train: sv_en/train-*
- validation: sv_en/validation-*
- test: sv_en/test-*
ta_en
- train: ta_en/train-*
- validation: ta_en/validation-*
- test: ta_en/test-*
tr_en
- train: tr_en/train-*
- validation: tr_en/validation-*
- test: tr_en/test-*
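Every configuration follows the same path pattern, with one glob per split under a directory named after the configuration. A minimal sketch of that convention follows. The helper names are hypothetical, and the shard filename in the usage note is an assumed example, not a file confirmed to exist in the repository.

```python
from fnmatch import fnmatchcase

SPLITS = ("train", "validation", "test")


def data_files(config: str) -> dict:
    """Build the split -> glob pattern mapping for one configuration."""
    return {split: f"{config}/{split}-*" for split in SPLITS}


def split_of(path: str, config: str):
    """Return the split whose pattern matches the path, or None if none does."""
    for split, pattern in data_files(config).items():
        # fnmatchcase keeps matching case-sensitive on every platform.
        if fnmatchcase(path, pattern):
            return split
    return None
```

For instance, `data_files("ar_en")["test"]` yields `ar_en/test-*`, and a hypothetical shard path such as `ar_en/train-00000-of-00001.parquet` would be assigned to the train split by `split_of`.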
Languages
- ar
- ca
- cy
- de
- et
- fa
- id
- lv
- mn
- sl
- sv
- ta
- tr
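The language codes above map directly onto configuration names and onto the fields inside translation. The sketch below makes that mapping explicit. The English language names are the standard ISO 639-1 names and do not appear in the original listing. The helper names are hypothetical, and the alphabetical field order is an observation from the feature lists rather than a documented guarantee.

```python
# ISO 639-1 codes from the language list, with their standard English names
# (the names are added here for readability and are not part of the listing).
LANGUAGE_NAMES = {
    "ar": "Arabic", "ca": "Catalan", "cy": "Welsh", "de": "German",
    "et": "Estonian", "fa": "Persian", "id": "Indonesian", "lv": "Latvian",
    "mn": "Mongolian", "sl": "Slovenian", "sv": "Swedish", "ta": "Tamil",
    "tr": "Turkish",
}


def config_name(lang: str) -> str:
    """Each language is paired with English, giving names such as 'ar_en'."""
    if lang not in LANGUAGE_NAMES:
        raise KeyError(f"unknown language code: {lang!r}")
    return f"{lang}_en"


def translation_keys(lang: str) -> list:
    """Fields expected inside 'translation', in the listing's alphabetical order."""
    return sorted([lang, "csw", "en"])


ALL_CONFIGS = [config_name(lang) for lang in LANGUAGE_NAMES]
```

Here `translation_keys("cy")` returns `["csw", "cy", "en"]`, which matches the field order shown for cy_en.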



