
emilia-yodas-english-neucodec

ModelScope Community, updated 2025-12-05, indexed 2025-11-08
Download link:
https://modelscope.cn/datasets/neuphonic/emilia-yodas-english-neucodec
Description:
# Dataset Card for NeuCodec Emilia-YODAS

## Dataset Description

- **Repository:** [NeuCodec repository](https://github.com/neuphonic/neucodec)
- **Paper:** Coming Soon

### Dataset Summary

The NeuCodec Emilia-YODAS dataset is an English-language dataset containing more than 30 million audio samples (more than 78,000 hours), taken from the English-language subset of Emilia-YODAS and compressed with [NeuCodec](https://huggingface.co/neuphonic/neucodec).

## Usage

```python
import torch
from datasets import load_dataset
from neucodec import NeuCodec

# Load the dataset in streaming mode, and load the pretrained codec
dataset = load_dataset("neuphonic/emilia-yodas-english-neucodec", split="train", streaming=True)
model = NeuCodec.from_pretrained("neuphonic/neucodec")
model.eval()

# Reconstruct the first sample; [None, None, :] adds two leading dimensions
fsq_codes = torch.tensor(next(iter(dataset))["codes"])[None, None, :]
print(f"FSQ codes shape: {fsq_codes.shape}")
recon = model.decode_code(fsq_codes)
print(f"Recon shape: {recon.shape}")
```

## Dataset Structure

### Data Instances

Each instance preserves the corresponding information from the Emilia-YODAS dataset. The columns are `id`, `dnsmos`, `duration`, `phone_count`, `speaker`, `text`, and `codes`.

```
{'id': 'EN_9eylmAUb-SQ_W000139',
 'dnsmos': 3.0064,
 'duration': 5.67,
 'phone_count': 102,
 'speaker': 'EN_9eylmAUb-SQ_SPEAKER_00',
 'text': 'In the Soviet Union during the 1920s, Yiddish...',
 'codes': [3254, 49895, 26484, 869, 23077, 27555, 20391,...]}
```

The dataset consists of 241 parquet files of approximately 200 MB each. Each code sequence is meant to be used with our NeuCodec decoder, which currently supports an output sampling rate of 24 kHz.
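As a minimal sketch of working with these instances, the `dnsmos` and `duration` columns can be used to select a subset of samples before decoding. The function names and both thresholds below are hypothetical, chosen only for illustration; the card does not recommend any particular cutoff. The example record is the one shown above, with `codes` cut to the values the card prints.

```python
from typing import Any, Dict, Iterable, Iterator

# Hypothetical thresholds for illustration; they are not taken from the dataset card.
MIN_DNSMOS = 3.0
MAX_DURATION_S = 10.0


def keep_sample(sample: Dict[str, Any],
                min_dnsmos: float = MIN_DNSMOS,
                max_duration_s: float = MAX_DURATION_S) -> bool:
    """Return True if the sample meets both the quality and the length threshold."""
    return sample["dnsmos"] >= min_dnsmos and sample["duration"] <= max_duration_s


def filter_samples(samples: Iterable[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Lazily yield only the samples that pass keep_sample with default thresholds."""
    return (s for s in samples if keep_sample(s))


# The record from the card; the text and codes are truncated there, so they are here too.
example = {
    "id": "EN_9eylmAUb-SQ_W000139",
    "dnsmos": 3.0064,
    "duration": 5.67,
    "phone_count": 102,
    "speaker": "EN_9eylmAUb-SQ_SPEAKER_00",
    "text": "In the Soviet Union during the 1920s, Yiddish...",
    "codes": [3254, 49895, 26484, 869, 23077, 27555, 20391],
}

kept = list(filter_samples([example]))
print(f"Kept {len(kept)} of 1 samples")
```

Because the filter is a plain function over one record, it should also be usable with the streaming dataset from the Usage section, for example via `dataset.filter(keep_sample)`, assuming the installed `datasets` version supports `filter` on streaming datasets.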
### Data Fields

- `id`: a string containing the corresponding id from Emilia-YODAS
- `dnsmos`: a float containing the DNSMOS score from Emilia-YODAS
- `duration`: a float containing the sample duration from Emilia-YODAS
- `phone_count`: an integer giving the number of phones in the sample, from Emilia-YODAS
- `speaker`: a string containing the corresponding speaker from Emilia-YODAS
- `text`: a string containing the utterance text
- `codes`: a list containing the compressed audio segment as NeuCodec codes

## Dataset Creation

### Source Data

The data was sourced from [Emilia-YODAS](https://huggingface.co/datasets/amphion/Emilia-Dataset).

## Additional Information

### Licensing Information

The NeuCodec Emilia-YODAS dataset is released under the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/deed.en) license.

### Citation Information

Coming soon

Provider:
maas
Created:
2025-10-04
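Summary
This dataset is the NeuCodec-compressed version of the English subset of Emilia-YODAS. It contains more than 30 million audio samples (about 78,000 hours). Its fields are the sample id, quality score, duration, phone count, speaker, text, and compressed codes. The dataset is released under the CC-BY-4.0 license.
Summary collected and generated by 遇见数据集.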