demeleww/Amharic_tokens

Source: Hugging Face. Updated 2026-04-24. Indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/demeleww/Amharic_tokens
Resource overview:
OmniVoice Amharic v4: pre-extracted audio tokens for training Amharic text-to-speech (TTS) models. The speech data was collected from multiple sources, then preprocessed and filtered. The dataset contains 81,731 samples totaling approximately 331 hours of audio, an average of about 14.6 seconds per sample. It is split into 164 audio and text shards of 500 samples each, except the last shard, which holds 231. Audio tokens were extracted with the HiggsAudioV2 tokenizer, and the total size is approximately 648MB. The applied filters are duration (1-25 seconds), text length (at most 300 characters), script (Ge'ez script only), and dialect (Addis Ababa dialect only).
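
The figures and filters in the description can be sanity-checked with a short sketch. The Python below is illustrative only and is not the dataset author's pipeline: the function names are hypothetical, "Ge'ez script only" is assumed to mean that every alphabetic character falls in a Unicode Ethiopic block, text length is assumed to be counted in code points, and the dialect filter is omitted because it cannot be inferred from text alone.

```python
# Figures stated in the dataset description.
TOTAL_SAMPLES = 81_731
TOTAL_HOURS = 331
SHARD_SIZE = 500

# Unicode blocks for the Ethiopic (Ge'ez) script.
ETHIOPIC_RANGES = (
    (0x1200, 0x137F),  # Ethiopic
    (0x1380, 0x139F),  # Ethiopic Supplement
    (0x2D80, 0x2DDF),  # Ethiopic Extended
    (0xAB00, 0xAB2F),  # Ethiopic Extended-A
)


def shard_layout(total_samples, shard_size):
    """Return (number of shards, size of the last shard)."""
    full, rest = divmod(total_samples, shard_size)
    if rest == 0:
        return full, (shard_size if full else 0)
    return full + 1, rest


def mean_duration_s(total_hours, total_samples):
    """Average sample duration in seconds."""
    return total_hours * 3600 / total_samples


def is_geez_only(text):
    """Assumed rule: at least one letter, and every letter is Ethiopic.

    Whitespace, digits, and punctuation are ignored.
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    return all(
        any(lo <= ord(ch) <= hi for lo, hi in ETHIOPIC_RANGES)
        for ch in letters
    )


def passes_filters(duration_s, text):
    """Apply the duration, text-length, and script filters from the description."""
    return (
        1.0 <= duration_s <= 25.0
        and len(text) <= 300
        and is_geez_only(text)
    )


if __name__ == "__main__":
    print(shard_layout(TOTAL_SAMPLES, SHARD_SIZE))
    print(round(mean_duration_s(TOTAL_HOURS, TOTAL_SAMPLES), 1))
    print(passes_filters(14.6, "ሰላም ለዓለም።"))
```

With the stated totals, the shard layout comes out to 164 shards with 231 samples in the last one, and the mean duration rounds to 14.6 seconds, so the numbers in the description are mutually consistent.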
Provided by:
demeleww