yodas2
ModelScope community, updated 2025-12-27, indexed 2025-03-22
Download link:
https://modelscope.cn/datasets/pengzhendong/yodas2
Overview:
YODAS2 is the long-form version of the YODAS dataset.
It provides the same data as [espnet/yodas](https://huggingface.co/datasets/espnet/yodas), with the following new features:
- Data is stored in long form (video level): audio is not segmented.
- Audio is encoded at a higher sampling rate (24 kHz).
For detailed information about YODAS dataset, please refer to [our paper](https://arxiv.org/abs/2406.00899) and the [espnet/yodas repo](https://huggingface.co/datasets/espnet/yodas).
## Usage
Each data point corresponds to an entire YouTube video and contains the following fields:
- video_id: unique id of this video (note that this is not the YouTube video id)
- duration: total duration of this video in seconds
- audio
  - path: local path to the wav file in standard mode; empty in streaming mode
  - sampling_rate: fixed at 24 kHz (note that the sampling rate in `espnet/yodas` is 16 kHz)
  - array: wav samples as floats
- utterances
  - utt_id: unique id of this utterance
  - text: transcription of this utterance
  - start: start timestamp of this utterance in seconds
  - end: end timestamp of this utterance in seconds
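Because the audio is not segmented, per-utterance clips have to be cut out of the full array using the `start` and `end` timestamps and the sampling rate. The following is a minimal sketch of that arithmetic, not part of the dataset's own tooling. The helper names `iter_utterances` and `slice_utterance_audio` and the `toy` sample are hypothetical. It also assumes that `utterances` may arrive either as a list of dicts or as a dict of parallel lists, depending on how the loader declares the field, so it handles both.

```python
def iter_utterances(utterances):
    """Yield one dict per utterance from either layout (assumption: the field
    may be a list of dicts or a dict of parallel lists)."""
    if isinstance(utterances, dict):
        keys = list(utterances)
        for values in zip(*(utterances[k] for k in keys)):
            yield dict(zip(keys, values))
    else:
        yield from utterances


def slice_utterance_audio(sample):
    """Return (utt_id, text, samples) for every utterance in one video."""
    sr = sample["audio"]["sampling_rate"]
    samples = sample["audio"]["array"]
    clips = []
    for utt in iter_utterances(sample["utterances"]):
        # Convert timestamps in seconds to sample indices.
        begin = int(round(utt["start"] * sr))
        end = int(round(utt["end"] * sr))
        clips.append((utt["utt_id"], utt["text"], samples[begin:end]))
    return clips


# Synthetic stand-in for one data point: 2 seconds of silence at 24 kHz.
toy = {
    "video_id": "toy-video",
    "duration": 2.0,
    "audio": {"path": "", "sampling_rate": 24000, "array": [0.0] * 48000},
    "utterances": [
        {"utt_id": "toy-utt-0", "text": "hello", "start": 0.0, "end": 0.5},
        {"utt_id": "toy-utt-1", "text": "world", "start": 1.0, "end": 2.0},
    ],
}

for utt_id, text, clip in slice_utterance_audio(toy):
    print(utt_id, text, len(clip))
```

The same slicing works when `array` is a NumPy array, since only basic indexing is used.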
YODAS2 also supports two modes:
**standard mode**: each subset is downloaded to the local disk before the first iteration.
```python
from datasets import load_dataset
# Note: downloading and preprocessing can take a very long time;
# try a small subset first for testing purposes.
ds = load_dataset('espnet/yodas2', 'en000')
print(next(iter(ds['train'])))
```
**streaming mode**: most of the files are streamed instead of being downloaded to your local device. Use it to inspect the dataset quickly.
```python
from datasets import load_dataset
# Streaming returns quickly because files are fetched lazily during iteration
ds = load_dataset('espnet/yodas2', 'en000', streaming=True)
print(next(iter(ds['train'])))
```
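When inspecting the dataset in streaming mode, it can help to look at only the first few videos and print a compact summary instead of the full sample array. The sketch below shows one way to do this with `itertools.islice`. The helpers `summarize` and `peek` and the generator `toy_stream` are hypothetical; `toy_stream` stands in for the streamed split so the example runs without network access.

```python
from itertools import islice


def summarize(sample):
    """Reduce one data point to a few scalar fields for quick inspection."""
    audio = sample["audio"]
    return {
        "video_id": sample["video_id"],
        "duration": sample["duration"],
        # Length of the decoded audio, derived from sample count and rate.
        "audio_seconds": len(audio["array"]) / audio["sampling_rate"],
    }


def peek(stream, n=3):
    """Summarize the first n items of any iterable of data points."""
    return [summarize(sample) for sample in islice(stream, n)]


def toy_stream():
    """Synthetic stand-in for a streamed split (1 second of silence each)."""
    for i in range(10):
        yield {
            "video_id": f"toy-video-{i}",
            "duration": 1.0,
            "audio": {"path": "", "sampling_rate": 24000, "array": [0.0] * 24000},
            "utterances": [],
        }


for row in peek(toy_stream(), n=2):
    print(row)
```

With the real dataset, the streamed `ds['train']` from the snippet above could be passed to `peek` in place of `toy_stream()`, since it only needs to be iterable.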
## Reference
```
@inproceedings{li2023yodas,
title={Yodas: Youtube-Oriented Dataset for Audio and Speech},
author={Li, Xinjian and Takamichi, Shinnosuke and Saeki, Takaaki and Chen, William and Shiota, Sayaka and Watanabe, Shinji},
booktitle={2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},
pages={1--8},
year={2023},
organization={IEEE}
}
```
## Contact
If you have any questions, feel free to contact us at the email address below.
When downloading, we made sure that the dataset contained only videos with CC licenses. If you find that your video was unintentionally included and would like it removed, you can send a deletion request to the same address.
Remove the parentheses `()` from the following email address:
`(lixinjian)(1217)@gmail.com`
Provider: maas
Created: 2025-03-12



