duplexio/emilia-yodas-en-aligned

Name: duplexio/emilia-yodas-en-aligned
Creator: duplexio
Published: 2026-04-22 19:07:09
License: no description available

Source: Hugging Face. Updated 2026-04-22. Indexed 2026-04-26.

Download link:

https://hf-mirror.com/datasets/duplexio/emilia-yodas-en-aligned

Description:

Emilia-YODAS EN Word-Aligned is a dataset of word-level forced-alignment timestamps for the English subset of amphion/Emilia-Dataset (the Emilia-YODAS split), produced with Qwen/Qwen3-ForcedAligner-0.6B. No audio is redistributed: the dataset contains only metadata (IDs, the transcripts already present in Emilia-YODAS, and per-word [start, end] timestamps). It covers 4,516,833 utterances totaling 11,572.7 hours of audio (mean duration 9.22 seconds) and 114,960,350 words (mean 25.5 words per utterance). Each entry is a JSON object with the fields id, language, text, duration, and words. To use the dataset, pair each entry with the original Emilia-YODAS audio by matching on the id field.
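
The pairing by id described above might look roughly like the following sketch. The field names (id, language, text, duration, words) come from the description, but everything else is an assumption made for illustration: the sample record, the layout of each words item as a [word, start, end] triple, and the `audio_paths` mapping from id to a local audio file are all hypothetical and should be checked against the actual files.

```python
import json

# Hypothetical sample record. The field names follow the description;
# the [word, start, end] layout inside "words" is an ASSUMPTION.
SAMPLE_LINE = json.dumps({
    "id": "EN_EXAMPLE_000001",
    "language": "en",
    "text": "hello world",
    "duration": 1.2,
    "words": [["hello", 0.10, 0.45], ["world", 0.50, 1.05]],
})


def parse_entry(line):
    """Parse one JSON entry and normalize its word timestamps to dicts."""
    entry = json.loads(line)
    words = [
        {"word": w, "start": float(start), "end": float(end)}
        for w, start, end in entry["words"]
    ]
    return {**entry, "words": words}


def index_by_id(lines):
    """Build an id -> entry lookup from an iterable of JSON lines."""
    return {entry["id"]: entry for entry in map(parse_entry, lines)}


def pair_with_audio(index, audio_paths):
    """Pair alignment entries with audio files that share the same id.

    audio_paths is a hypothetical mapping of id -> path to the matching
    Emilia-YODAS clip, built separately from the original audio release.
    """
    shared_ids = index.keys() & audio_paths.keys()
    return {uid: (audio_paths[uid], index[uid]) for uid in shared_ids}
```

Entries whose id has no matching audio file are silently dropped here; depending on the use case, logging or raising on unmatched ids may be preferable.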

Provider:

duplexio
