emilia-yodas-english-neucodec

Name: emilia-yodas-english-neucodec
Creator: maas
Published: 2025-12-05 11:38:04
License: No description provided

ModelScope community: updated 2025-12-05, indexed 2025-11-08

Download link:

https://modelscope.cn/datasets/neuphonic/emilia-yodas-english-neucodec

Resource description:

# Dataset Card for NeuCodec Emilia-YODAS

## Dataset Description

- **Repository:** [NeuCodec repository](https://github.com/neuphonic/neucodec)
- **Paper:** Coming Soon

### Dataset Summary

The NeuCodec Emilia-YODAS dataset is an English-language dataset containing >30M audio samples (>78k hours), taken from the English-language subset of Emilia-YODAS and compressed with [NeuCodec](https://huggingface.co/neuphonic/neucodec).

## Usage

```python
import torch
from datasets import load_dataset
from neucodec import NeuCodec

# Load the dataset in streaming mode (no full download) and the codec model
dataset = load_dataset("neuphonic/emilia-yodas-english-neucodec", split="train", streaming=True)
model = NeuCodec.from_pretrained("neuphonic/neucodec")
model.eval()

# Reconstruct a sample: take the first record's codes and add two leading axes
fsq_codes = torch.tensor(next(iter(dataset))["codes"])[None, None, :]
print(f"FSQ codes shape: {fsq_codes.shape}")
recon = model.decode_code(fsq_codes)
print(f"Recon shape: {recon.shape}")
```

## Dataset Structure

### Data Instances

For each instance, the corresponding information from the Emilia-YODAS dataset is preserved. There are columns for `id`, `dnsmos`, `duration`, `phone_count`, `speaker`, `text`, and `codes`.

```
{'id': 'EN_9eylmAUb-SQ_W000139',
 'dnsmos': 3.0064,
 'duration': 5.67,
 'phone_count': 102,
 'speaker': 'EN_9eylmAUb-SQ_SPEAKER_00',
 'text': 'In the Soviet Union during the 1920s, Yiddish...',
 'codes': [3254, 49895, 26484, 869, 23077, 27555, 20391,...]}
```

Each parquet file is approximately 200 MB, and there are 241 parquet files. Each code sequence is meant to be used with our NeuCodec decoder, which currently supports an output sampling rate of 24 kHz.
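Since the decoder output is stated to be 24 kHz audio, a natural next step after reconstruction is writing it to disk. The following is a minimal sketch using only the Python standard library. It assumes the decoded waveform has already been converted to a flat sequence of floats in [-1, 1]; the card does not specify the value range or layout of `recon`, so that is an assumption. The helper name `write_wav` and the synthetic tone standing in for decoded audio are hypothetical and not part of NeuCodec.

```python
import math
import os
import struct
import tempfile
import wave

SAMPLE_RATE = 24_000  # output sampling rate stated in the dataset card


def write_wav(samples, path, sample_rate=SAMPLE_RATE):
    """Write mono float samples (assumed to lie in [-1, 1]) as 16-bit PCM WAV."""
    # Clip each sample to [-1, 1], then scale to the signed 16-bit range.
    pcm = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)            # mono
        wav_file.setsampwidth(2)            # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
    return len(pcm)


# Stand-in for decoded audio (assumption): one second of a 440 Hz tone.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]
wav_path = os.path.join(tempfile.gettempdir(), "recon_demo.wav")
n_written = write_wav(tone, wav_path)
```

If `recon` turns out to be a float tensor with a single batch item and a single channel, something like `recon[0, 0].tolist()` could be passed as `samples`, but that shape should be checked against the printed `recon.shape` first.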
### Data Fields

- `id`: a string containing the corresponding id from Emilia-YODAS
- `dnsmos`: a float containing the DNSMOS score from Emilia-YODAS
- `duration`: a float containing the sample duration from Emilia-YODAS
- `phone_count`: an integer containing the number of phones in a sample from Emilia-YODAS
- `speaker`: a string containing the corresponding speaker from Emilia-YODAS
- `text`: a string containing the utterance text
- `codes`: a list containing the compressed audio segment as NeuCodec codes

## Dataset Creation

### Source Data

The data was sourced from [Emilia-YODAS](https://huggingface.co/datasets/amphion/Emilia-Dataset).

## Additional Information

### Licensing Information

The NeuCodec Emilia-YODAS dataset is released under the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/deed.en) license.

### Citation Information

Coming soon
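The metadata fields listed under Data Fields (`dnsmos`, `duration`, `phone_count`) make it possible to select a subset before decoding any codes. The sketch below shows one way this might look with a plain generator over record dictionaries, which is the form a streaming dataset yields when iterated. The helper name `select_records`, the threshold values, and the two `demo_*` records are hypothetical and chosen only for illustration; only the first record mirrors the card's example.

```python
def select_records(records, min_dnsmos=3.0, max_duration=10.0):
    """Yield records with dnsmos >= min_dnsmos and duration <= max_duration seconds."""
    for record in records:
        if record["dnsmos"] >= min_dnsmos and record["duration"] <= max_duration:
            yield record


# Illustrative records (assumption): the first copies values from the card's
# example instance, the other two are made up to exercise the filter.
sample_records = [
    {"id": "EN_9eylmAUb-SQ_W000139", "dnsmos": 3.0064, "duration": 5.67, "phone_count": 102},
    {"id": "demo_low_quality", "dnsmos": 2.1, "duration": 4.0, "phone_count": 60},
    {"id": "demo_too_long", "dnsmos": 3.4, "duration": 18.2, "phone_count": 310},
]

kept = list(select_records(sample_records))
kept_ids = [record["id"] for record in kept]

# Phones per second: a rough speaking-rate measure derived from two fields.
phones_per_second = kept[0]["phone_count"] / kept[0]["duration"]
```

Because the generator consumes any iterable lazily, the same function could be applied to the streaming `dataset` object from the usage snippet without loading all 241 parquet files, although suitable thresholds depend on the downstream task.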

Provider:

maas

Created:

2025-10-04
