LEMAS-Project/LEMAS-Dataset-train
Source: Hugging Face. Updated: 2026-03-31. Indexed: 2026-02-07.
Download link:
https://hf-mirror.com/datasets/LEMAS-Project/LEMAS-Dataset-train
Description:
This dataset is part of the LEMAS-Project, containing a large-scale training set (150k+ hours) and a curated evaluation set (500 utterances per language) covering 10 languages, all with word-level alignment. The training set is constructed by filtering large-scale aligned audio–text pairs with language- and dataset-specific constraints, while the eval set is built by filtering, trimming, and ranking aligned audio–text pairs. The supported languages include Italian, Portuguese, Spanish, French, German, Vietnamese, Indonesian, Russian, English, and Chinese.
Provider:
LEMAS-Project
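
The description says the training set is built by filtering aligned audio-text pairs with language- and dataset-specific constraints, but it does not list those constraints. The sketch below only illustrates the general shape such a filter might take. The record layout (`Utterance`, `Word`), the `CONSTRAINTS` table, and every threshold in it are hypothetical and are not taken from the LEMAS-Project.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    start: float  # word start time in seconds
    end: float    # word end time in seconds


@dataclass
class Utterance:
    lang: str
    duration_s: float
    words: List[Word]


# Hypothetical per-language limits; the real constraints are not
# given in the dataset description.
CONSTRAINTS = {
    "en": {"min_s": 1.0, "max_s": 30.0, "max_words_per_s": 5.0},
    "zh": {"min_s": 1.0, "max_s": 30.0, "max_words_per_s": 8.0},
}


def keep(utt: Utterance) -> bool:
    """Return True if the utterance passes the illustrative checks."""
    limits = CONSTRAINTS.get(utt.lang)
    if limits is None or not utt.words:
        return False
    if not (limits["min_s"] <= utt.duration_s <= limits["max_s"]):
        return False
    # Word timestamps must be ordered, non-overlapping, and inside the clip.
    prev_end = 0.0
    for w in utt.words:
        if w.start < prev_end or w.end < w.start or w.end > utt.duration_s:
            return False
        prev_end = w.end
    # Reject implausibly dense transcripts for the clip length.
    return len(utt.words) / utt.duration_s <= limits["max_words_per_s"]


def filter_pairs(utts: List[Utterance]) -> List[Utterance]:
    """Keep only the utterances that satisfy their language's limits."""
    return [u for u in utts if keep(u)]
```

Keeping the limits in a per-language table means a language can be tuned without changing the filtering logic; a dataset-specific rule could be added the same way by keying on a source name.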



