duplexio/emilia-yodas-en-aligned

Source: Hugging Face. Updated 2026-04-22. Indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/duplexio/emilia-yodas-en-aligned
Resource overview:
Emilia-YODAS EN Word-Aligned is a dataset of word-level forced-alignment timestamps for the English subset of amphion/Emilia-Dataset (Emilia-YODAS split), produced with Qwen/Qwen3-ForcedAligner-0.6B. No audio is redistributed: the dataset contains only metadata (IDs, transcripts already present in Emilia-YODAS, and per-word [start, end] timestamps). It covers 4,516,833 utterances totaling 11,572.7 hours of audio (mean duration 9.22 seconds) and 114,960,350 words (mean 25.5 words per utterance). Each entry is a JSON object with the fields id, language, text, duration, and words. To use the dataset, pair it with the original Emilia-YODAS audio by matching on the id field.
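
As a rough illustration of pairing entries with audio by id, the Python sketch below indexes alignment entries by their id field and converts each word's [start, end] timestamps from seconds to sample indices. The sample entry, its id value, the layout of each item inside words (a mapping with word, start, and end keys), and the 16000 Hz sample rate are all assumptions made for illustration; the description above names only the top-level fields, so check the real files before relying on this layout.

```python
import json

# Hypothetical entry using the documented top-level fields. The values and
# the shape of each item in "words" are assumptions, not taken from the data.
SAMPLE_LINE = json.dumps({
    "id": "EN_EXAMPLE_000001",
    "language": "en",
    "text": "hello there world",
    "duration": 1.5,
    "words": [
        {"word": "hello", "start": 0.10, "end": 0.42},
        {"word": "there", "start": 0.48, "end": 0.80},
        {"word": "world", "start": 0.85, "end": 1.30},
    ],
})


def index_by_id(lines):
    """Parse JSON lines into a dict keyed by the `id` field."""
    entries = {}
    for line in lines:
        line = line.strip()
        if line:
            entry = json.loads(line)
            entries[entry["id"]] = entry
    return entries


def word_sample_ranges(entry, sample_rate):
    """Convert each word's [start, end] in seconds to sample indices."""
    ranges = []
    for w in entry["words"]:
        start, end = w["start"], w["end"]
        # Reject timestamps that fall outside the utterance.
        if not (0.0 <= start <= end <= entry["duration"]):
            raise ValueError(f"timestamps out of range for {w['word']!r}")
        ranges.append(
            (w["word"], round(start * sample_rate), round(end * sample_rate))
        )
    return ranges


alignments = index_by_id([SAMPLE_LINE])
# `audio_id` stands in for the id of a clip from the original Emilia-YODAS audio.
audio_id = "EN_EXAMPLE_000001"
# 16000 is a placeholder sample rate; use the rate of the decoded audio.
ranges = word_sample_ranges(alignments[audio_id], sample_rate=16000)
print(ranges)
```

Each resulting (word, start_sample, end_sample) tuple could then be used to slice a decoded waveform for the clip with the same id. Keeping the lookup keyed on id mirrors the pairing requirement stated above, since the alignment files carry no audio of their own.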
Provider:
duplexio