aigc-x/Pronunciation-boldvoice
Hugging Face | Updated 2026-04-07 | Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/aigc-x/Pronunciation-boldvoice
Description:
---
license: apache-2.0
task_categories:
- audio-classification
language:
- en
size_categories:
- 10K<n<100K
tags:
- pronunciation-assessment
- phoneme
- speech
dataset_info:
  features:
  - name: audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: reference_text
    dtype: string
  - name: response
    dtype: string
  - name: source
    dtype: string
  - name: duration
    dtype: float64
  - name: score
    dtype: int64
  splits:
  - name: train
    num_bytes: 36605701980
    num_examples: 43182
  download_size: 36476677163
  dataset_size: 36605701980
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---
# Pronunciation Assessment Dataset (BoldVoice + speechocean762)
Dataset for fine-tuning multimodal models on English pronunciation assessment.
## Overview
| Source | Samples | Audio Duration | Description |
|--------|---------|---------------|-------------|
| BoldVoice | 38,182 | 10-20s | Non-native English learners, BoldVoice API annotations |
| speechocean762 | 5,000 | 1.6-20s | Public dataset, 5-expert scored, Mandarin speakers |
| **Total** | **43,182** | | |
## Schema
| Column | Type | Description |
|--------|------|-------------|
| `audio` | Audio (16kHz mono) | Speech recording |
| `reference_text` | string | Text the speaker intended to read |
| `response` | string | JSON annotation (see below) |
| `source` | string | `boldvoice` or `speechocean762` |
| `duration` | float | Audio duration in seconds |
| `score` | int | Overall pronunciation score (0-100) |
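The schema above can be exercised with a short loading sketch. It assumes the Hugging Face `datasets` library is installed and that the repository id matches the card title. `stream_rows` and `select_rows` are hypothetical helper names, not part of this dataset. Streaming is used so that the roughly 36 GB split does not have to be downloaded up front.

```python
from typing import Any, Dict, Iterable, Iterator, Optional

# Assumption: the repository id is the one shown in the card title.
REPO_ID = "aigc-x/Pronunciation-boldvoice"


def stream_rows(repo_id: str = REPO_ID) -> Iterator[Dict[str, Any]]:
    """Yield rows of the train split lazily instead of downloading it all."""
    # Imported here so the filter below can be used without the library.
    from datasets import load_dataset

    yield from load_dataset(repo_id, split="train", streaming=True)


def select_rows(
    rows: Iterable[Dict[str, Any]],
    source: Optional[str] = None,
    min_score: int = 0,
    max_duration: Optional[float] = None,
) -> Iterator[Dict[str, Any]]:
    """Keep rows matching a source, a minimum score and a duration cap."""
    for row in rows:
        if source is not None and row["source"] != source:
            continue
        if row["score"] < min_score:
            continue
        if max_duration is not None and row["duration"] > max_duration:
            continue
        yield row


# In-memory rows with the same columns (audio omitted), for illustration only.
demo_rows = [
    {"source": "boldvoice", "duration": 12.5, "score": 67},
    {"source": "speechocean762", "duration": 3.2, "score": 90},
    {"source": "speechocean762", "duration": 18.0, "score": 40},
]
short_clips = list(select_rows(demo_rows, source="speechocean762", max_duration=10.0))
```

With network access, `select_rows(stream_rows(), source="speechocean762")` would apply the same filter to the real split. Iterating the real rows also decodes the `audio` column, which may require extra audio dependencies depending on the installed `datasets` version.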
## Annotation Format (response JSON)
```json
{
  "words": [
    {
      "word": "bear",
      "expected": ["B", "EH", "R"],
      "actual": ["B", "AH", "R"],
      "is_correct": false,
      "errors": [{"index": 1, "expected": "EH", "actual": "AH", "type": "substitution"}]
    }
  ],
  "summary": {
    "total_phonemes": 3,
    "correct_phonemes": 2,
    "error_count": 1,
    "score": 67
  }
}
```
- Phonemes in **ARPAbet** notation (no stress markers)
- Error types: `substitution`, `deletion`, `insertion`, `mispronounced`
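Because `response` is stored as a JSON string, it has to be parsed before use. The sketch below parses the example annotation above, counts errors by type, and recomputes the summary score. The formula `round(100 * correct_phonemes / total_phonemes)` is an assumption inferred from this single example (2 of 3 gives 67) and is not documented by the card. `inspect_response` is a hypothetical helper name.

```python
import json
from collections import Counter
from typing import Any, Dict


def inspect_response(response: str) -> Dict[str, Any]:
    """Parse one `response` string and report its errors and summary."""
    annotation = json.loads(response)
    words = annotation["words"]
    summary = annotation["summary"]

    # Count errors by type across all words; tolerate a missing "errors" key.
    error_types = Counter(
        err["type"] for word in words for err in word.get("errors", [])
    )

    # Assumed scoring rule, inferred from the single example in the card.
    total = summary["total_phonemes"]
    recomputed = round(100 * summary["correct_phonemes"] / total) if total else 0

    return {
        "incorrect_words": [w["word"] for w in words if not w["is_correct"]],
        "error_types": dict(error_types),
        "recomputed_score": recomputed,
        "score_matches": recomputed == summary["score"],
    }


example = json.dumps({
    "words": [{
        "word": "bear",
        "expected": ["B", "EH", "R"],
        "actual": ["B", "AH", "R"],
        "is_correct": False,
        "errors": [{"index": 1, "expected": "EH", "actual": "AH",
                    "type": "substitution"}],
    }],
    "summary": {"total_phonemes": 3, "correct_phonemes": 2,
                "error_count": 1, "score": 67},
})
report = inspect_response(example)
```

If `score_matches` turns out to be false on real rows, the assumed formula does not hold for them and the stored `summary.score` should be treated as authoritative.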
## Fine-tuning
```bash
pip install -r requirements.txt
# Fine-tune Gemma 4 E2B-it with LoRA
python finetune_gemma4_e2b.py --model google/gemma-4-E2B-it
# Custom settings
python finetune_gemma4_e2b.py --model /path/to/local/model --lr 1e-4 --epochs 2 --batch-size 2
```
## Token Budget
| Metric | Value |
|--------|-------|
| Median tokens/sample | 1,197 |
| p95 tokens/sample | 2,674 |
| Max tokens/sample | 6,143 |
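These figures can guide the choice of a maximum sequence length: a cap near the p95 value would fit roughly 95% of samples without truncation. The sketch below shows how such statistics could be computed from per-sample token counts. The counts are made-up illustration values, the nearest-rank percentile is one of several common definitions, and the card does not state which tokenizer or percentile method produced the table.

```python
import math
import statistics
from typing import Dict, Sequence


def nearest_rank_percentile(values: Sequence[int], pct: float) -> int:
    """Return the nearest-rank percentile (pct in 0..100) of the values."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def budget_report(token_counts: Sequence[int], max_len: int) -> Dict[str, float]:
    """Summarise token counts and the share of samples fitting in max_len."""
    fitting = sum(1 for n in token_counts if n <= max_len)
    return {
        "median": statistics.median(token_counts),
        "p95": nearest_rank_percentile(token_counts, 95),
        "max": max(token_counts),
        "share_fitting": fitting / len(token_counts),
    }


# Hypothetical token counts, not taken from the dataset.
demo_counts = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
demo_report = budget_report(demo_counts, max_len=900)
```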
Provider:
aigc-x



