deepdml/commonvoice-neucodec
Source: Hugging Face. Updated 2026-03-19; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/deepdml/commonvoice-neucodec
Description:
---
dataset_info:
- config_name: ar
  features:
  - name: audio_path
    dtype: string
  - name: duration
    dtype: float32
  - name: codes
    sequence: int32
  - name: sentence
    dtype: string
  - name: language
    dtype: string
  - name: client_id
    dtype: string
  splits:
  - name: train
    num_bytes: 30460953
    num_examples: 28369
  - name: validation
    num_bytes: 11730231
    num_examples: 10470
  - name: test
    num_bytes: 11562939
    num_examples: 10480
  download_size: 34381352
  dataset_size: 53754123
- config_name: de
  features:
  - name: audio_path
    dtype: string
  - name: duration
    dtype: float32
  - name: codes
    sequence: int32
  - name: sentence
    dtype: string
  - name: language
    dtype: string
  - name: client_id
    dtype: string
  splits:
  - name: validation
    num_bytes: 23713417
    num_examples: 16183
  - name: test
    num_bytes: 23727723
    num_examples: 16183
  download_size: 33059330
  dataset_size: 47441140
- config_name: gl
  features:
  - name: audio_path
    dtype: string
  - name: duration
    dtype: float32
  - name: codes
    sequence: int32
  - name: sentence
    dtype: string
  - name: language
    dtype: string
  - name: client_id
    dtype: string
  splits:
  - name: train
    num_bytes: 30191921
    num_examples: 25159
  - name: validation
    num_bytes: 12349179
    num_examples: 9982
  - name: test
    num_bytes: 12741592
    num_examples: 9990
  download_size: 34316752
  dataset_size: 55282692
configs:
- config_name: ar
  data_files:
  - split: train
    path: ar/train-*
  - split: validation
    path: ar/validation-*
  - split: test
    path: ar/test-*
- config_name: de
  data_files:
  - split: validation
    path: de/validation-*
  - split: test
    path: de/test-*
- config_name: gl
  data_files:
  - split: train
    path: gl/train-*
  - split: validation
    path: gl/validation-*
  - split: test
    path: gl/test-*
---
# Dataset
## Dataset Overview
This dataset contains Common Voice speech data encoded as neural codec token sequences, stored in the `codes` field.
Each sample includes:
- `audio_path`
- `duration`
- `codes`
- `sentence`
- `language`
- `client_id`
The dataset is organized by language configuration and split into train, validation, and test sets when available.
## Dataset Statistics
The following table summarizes the number of examples for each `config_name` and split.
| config_name | train_examples | validation_examples | test_examples |
|---|---:|---:|---:|
| ar | 28,369 | 10,470 | 10,480 |
| de | n/a | 16,183 | 16,183 |
| gl | 25,159 | 9,982 | 9,990 |
### Notes
- `ar` and `gl` include `train`, `validation`, and `test` splits.
- `de` currently includes only `validation` and `test` splits in the dataset metadata.
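
Because `de` has no `train` split, code that loops over configurations may want to check which splits exist before requesting one. The following is a minimal sketch of that check, with the counts copied from the dataset metadata. The names `SPLIT_COUNTS`, `available_splits`, `total_examples`, and `pick_split` are hypothetical helpers introduced here for illustration; they are not part of the dataset or of the `datasets` library.

```python
# Example counts copied from the dataset metadata for each config and split.
SPLIT_COUNTS = {
    "ar": {"train": 28369, "validation": 10470, "test": 10480},
    "de": {"validation": 16183, "test": 16183},
    "gl": {"train": 25159, "validation": 9982, "test": 9990},
}


def available_splits(config_name):
    """Return the split names listed for a configuration (hypothetical helper)."""
    return list(SPLIT_COUNTS[config_name])


def total_examples(config_name):
    """Sum the example counts over every split of a configuration."""
    return sum(SPLIT_COUNTS[config_name].values())


def pick_split(config_name, preferred="train", fallback="validation"):
    """Return `preferred` if the configuration has it, otherwise `fallback`."""
    splits = SPLIT_COUNTS[config_name]
    return preferred if preferred in splits else fallback
```

With this table, `pick_split("de")` falls back to `validation`, while `pick_split("ar")` and `pick_split("gl")` return `train`. If the repository later gains a `train` split for `de`, the table would need updating by hand.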
## Features
- `audio_path` (`string`): path to the audio sample
- `duration` (`float32`): audio duration in seconds
- `codes` (`sequence[int32]`): neural codec token sequence
- `sentence` (`string`): transcription text
- `language` (`string`): language code
- `client_id` (`string`): speaker/client identifier
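
The card does not state how many codec tokens correspond to one second of audio. One way to estimate it is to divide the length of `codes` by `duration` for a record. The sketch below does this on a made-up record that only mirrors the documented field names; the values, including the file path, are illustrative assumptions and say nothing about the real token rate.

```python
def codes_per_second(sample):
    """Estimate codec tokens per second of audio for one record."""
    duration = sample["duration"]
    if duration <= 0:
        raise ValueError("duration must be positive")
    return len(sample["codes"]) / duration


# Made-up record using the documented field names; values are illustrative only.
example = {
    "audio_path": "clips/example.mp3",
    "duration": 2.0,
    "codes": [0] * 100,
    "sentence": "example sentence",
    "language": "gl",
    "client_id": "speaker-0",
}

rate = codes_per_second(example)
print(f"{rate:.1f} codes per second")
```

Running the same function over a few real records should give a more trustworthy estimate than any single sample.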
## Usage
```python
from datasets import load_dataset
dataset = load_dataset("deepdml/commonvoice-neucodec", "ar")
print(dataset)
```
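
Once a split is loaded, iterating over it yields dict-like records with the fields listed above. A small sketch of a per-split summary follows; `summarize` is a hypothetical helper, and the rows passed to it here are made-up stand-ins so the example runs without downloading anything.

```python
def summarize(rows):
    """Count records, total audio seconds, and distinct speakers in an iterable of records."""
    count = 0
    total_seconds = 0.0
    speakers = set()
    for row in rows:
        count += 1
        total_seconds += row["duration"]
        speakers.add(row["client_id"])
    return {
        "examples": count,
        "total_seconds": total_seconds,
        "speakers": len(speakers),
    }


# Made-up rows carrying only the fields the summary needs.
fake_rows = [
    {"duration": 1.5, "client_id": "a"},
    {"duration": 2.5, "client_id": "b"},
    {"duration": 4.0, "client_id": "a"},
]

summary = summarize(fake_rows)
print(summary)
```

With the real data, the same call would be `summarize(dataset["validation"])`, assuming `dataset` was loaded as in the usage snippet.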
Provider:
deepdml



