FluidInference/cv-corpus-25.0-ja

Name: FluidInference/cv-corpus-25.0-ja
Creator: FluidInference
Published: 2026-04-03 17:19:09
License: No description available

Hugging Face | Updated 2026-04-03 | Indexed 2026-04-05

Download link:

https://hf-mirror.com/datasets/FluidInference/cv-corpus-25.0-ja


Resource overview:

---
language:
- ja
license: cc0-1.0
task_categories:
- automatic-speech-recognition
pretty_name: Mozilla Common Voice 25.0 - Japanese Test Set
size_categories:
- 1K<n<10K
tags:
- speech
- audio
- japanese
- asr
- common-voice
- test-set
---

# Mozilla Common Voice 25.0 - Japanese Test Set (Complete)

## Dataset Description

Complete Japanese test set from Mozilla Common Voice Corpus 25.0. This dataset contains **all 9,019 validated test samples**, compared to the partial 2,334-sample version previously available on HuggingFace.

### Key Features

- **Size**: 9,019 validated test utterances
- **Coverage**: 100% of official Common Voice 25.0 Japanese test split
- **Multi-speaker**: Diverse set of speakers with demographic metadata
- **Quality**: Community-validated recordings
- **Format**: MP3 audio files with full metadata
- **Use case**: Standard test set for Japanese ASR evaluation

## Why This Dataset?

The previous HuggingFace repository (FluidInference/cv-corpus-25.0-ja) only contained **2,334 test files (26%)** due to incomplete uploads.
This dataset provides:

- ✅ **All 9,019 test files (100%)**
- ✅ Complete metadata matching official Mozilla release
- ✅ Ready-to-use format for ASR benchmarking
- ✅ No missing files or metadata mismatches

## Dataset Structure

### Files

```
cv-corpus-25.0-ja-test-only/
├── manifest.json     # Dataset manifest with split information
├── load_dataset.py   # Helper script to load all splits
├── ja_00/
│   ├── clips/        # 3,000 MP3 files
│   └── test.jsonl    # Metadata for this split
├── ja_01/
│   ├── clips/        # 3,000 MP3 files
│   └── test.jsonl    # Metadata for this split
├── ja_02/
│   ├── clips/        # 3,000 MP3 files
│   └── test.jsonl    # Metadata for this split
└── ja_03/
    ├── clips/        # 19 MP3 files
    └── test.jsonl    # Metadata for this split
```

### Metadata Format

Each `test.jsonl` file contains entries like:

```json
{
  "file_name": "common_voice_ja_12345.mp3",
  "path": "clips/common_voice_ja_12345.mp3",
  "text": "Japanese transcription text",
  "client_id": "anonymous_speaker_id",
  "up_votes": 2,
  "down_votes": 0,
  "age": "thirties",
  "gender": "male",
  "accent": "",
  "locale": "ja"
}
```

**Note**: Files are split across 4 directories (ja_00, ja_01, ja_02, ja_03) to comply with HuggingFace's 10,000 files per directory limit. Each directory contains a subset of the full test set.
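After downloading, it can be useful to confirm that a local copy matches the layout described above. The following is a minimal sketch, not part of the dataset's own tooling: the helper names `read_entries`, `check_splits`, and `count_mismatches` are hypothetical, and the expected counts are copied from the directory listing. It assumes, as the card's own loader does, that each entry's `path` is relative to its split directory.

```python
import json
from pathlib import Path

# Per-split counts as given in the directory listing of the dataset card.
EXPECTED_COUNTS = {"ja_00": 3000, "ja_01": 3000, "ja_02": 3000, "ja_03": 19}


def read_entries(metadata_file):
    """Yield one dict per non-empty line of a test.jsonl file."""
    with open(metadata_file, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


def check_splits(dataset_dir="."):
    """Return (counts, missing): entries per split and clip paths absent on disk."""
    counts = {}
    missing = []
    for split_dir in sorted(Path(dataset_dir).glob("ja_[0-9][0-9]")):
        metadata_file = split_dir / "test.jsonl"
        if not metadata_file.is_file():
            continue
        entries = list(read_entries(metadata_file))
        counts[split_dir.name] = len(entries)
        # Assumption: "path" is relative to the split directory (clips/<file>.mp3).
        for entry in entries:
            clip = split_dir / entry["path"]
            if not clip.is_file():
                missing.append(str(clip))
    return counts, missing


def count_mismatches(counts, expected=EXPECTED_COUNTS):
    """Return {split: (expected, actual)} for every split whose count differs."""
    mismatches = {}
    for name, want in expected.items():
        have = counts.get(name, 0)
        if have != want:
            mismatches[name] = (want, have)
    return mismatches
```

Calling `check_splits(".")` from the dataset root and passing the resulting counts to `count_mismatches` should yield an empty dictionary and an empty `missing` list if the copy is complete.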
## Dataset Manifest

The `manifest.json` file provides complete information about the dataset structure:

```json
{
  "dataset_name": "Mozilla Common Voice 25.0 - Japanese Test Set",
  "total_samples": 9019,
  "total_size_mb": 247.16,
  "num_splits": 4,
  "splits": [
    {"name": "ja_00", "num_files": 3000, ...},
    {"name": "ja_01", "num_files": 3000, ...},
    {"name": "ja_02", "num_files": 3000, ...},
    {"name": "ja_03", "num_files": 19, ...}
  ]
}
```

## Usage

### Loading with Python

```python
import json
from pathlib import Path

def load_cv_test_set(dataset_dir="."):
    dataset_dir = Path(dataset_dir)
    samples = []

    # Load from all splits (ja_00, ja_01, ja_02, ja_03)
    for split_dir in sorted(dataset_dir.glob("ja_[0-9][0-9]")):
        metadata_file = split_dir / "test.jsonl"
        with open(metadata_file, 'r', encoding='utf-8') as f:
            for line in f:
                entry = json.loads(line)
                entry['audio_path'] = str(split_dir / entry['path'])
                samples.append(entry)

    return samples

# Load complete test set
test_samples = load_cv_test_set()
print(f"Loaded {len(test_samples)} test samples")
```

### ASR Benchmarking

```python
# Evaluate your ASR model
for sample in test_samples:
    audio_path = sample['audio_path']
    reference = sample['text']

    # Your ASR inference here
    hypothesis = your_asr_model(audio_path)

    # Calculate CER/WER
    cer = calculate_cer(reference, hypothesis)
```

## Dataset Statistics

- **Total samples**: 9,019
- **Language**: Japanese (ja)
- **Format**: MP3 (various bitrates)
- **License**: CC0 1.0 (Public Domain)
- **Source**: Mozilla Common Voice 25.0
- **Split**: Test only

## Comparison with Other Datasets

| Dataset | Samples | Completeness |
|---------|---------|--------------|
| FluidInference/cv-corpus-25.0-ja | 2,334 | 26% |
| **This dataset** | **9,019** | **100%** |
| Original Mozilla CV 25.0 test | 9,019 | 100% |

## Citation

If you use this dataset, please cite the original Common Voice project:

```bibtex
@inproceedings{commonvoice:2020,
  author = {Ardila, R. and Branson, M. and Davis, K. and Henretty, M. and Kohler, M. and Meyer, J. and Morais, R. and Saunders, L. and Tyers, F. M. and Weber, G.},
  title = {Common Voice: A Massively-Multilingual Speech Corpus},
  booktitle = {Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020)},
  pages = {4211--4215},
  year = {2020}
}
```

## License

CC0 1.0 Universal (Public Domain)

The Common Voice dataset is released under CC0, meaning you can:

- Use for any purpose (commercial or non-commercial)
- Modify and redistribute
- No attribution required (though appreciated)

## Original Source

- **Project**: [Mozilla Common Voice](https://commonvoice.mozilla.org/)
- **Version**: 25.0 (released 2026-03-09)
- **Language**: Japanese (ja)
- **Original download**: https://commonvoice.mozilla.org/ja/datasets

## Dataset Quality

All samples in this dataset have been:

- ✅ Validated by community members
- ✅ Checked for audio quality
- ✅ Verified for transcription accuracy
- ✅ Filtered from invalidated/reported samples

## Use Cases

- Japanese ASR model evaluation
- Benchmarking speech recognition systems
- Speaker diversity analysis
- Accent and demographic studies
- Standard test set for research papers

## Acknowledgments

Thanks to the Mozilla Common Voice community and all the contributors who recorded and validated these audio samples.

Provider:

FluidInference

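Collected summary

Dataset introduction

Construction method

In speech recognition research, a high-quality, representative test set is essential for evaluating model performance objectively. This dataset comes from version 25.0 of the Mozilla Common Voice project and was built through crowdsourced collaboration. The project recruited a broad range of native Japanese speakers through its public platform to record audio; community volunteers then cross-validated audio quality and transcription accuracy over multiple rounds. The 9,019 samples that passed validation make up the complete official test set. The audio is stored as MP3 and is accompanied by speaker metadata such as age and gender, which keeps the provenance of the data transparent and traceable.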

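Usage

In practice, the dataset offers a standardized way to evaluate Japanese speech recognition systems. Researchers can use the bundled Python script to read all audio files and their metadata from the four subdirectories. Each sample includes an audio file path and a reference transcription, so core metrics such as character error rate (CER) or word error rate (WER) can be computed directly. The dataset is released under CC0, which permits free use for commercial or non-commercial model training, evaluation, and academic research. Because it covers the full official test split, results are statistically more reliable and comparable across systems, which makes it a suitable benchmark for research papers and system comparisons.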
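The dataset card's benchmarking loop calls a `calculate_cer` function without defining it. The sketch below is one plain-Python way to fill that gap, using Levenshtein distance over characters. It is an illustration under stated assumptions, not the dataset authors' implementation: no text normalization (punctuation, whitespace, full-width versus half-width forms) is applied, and `corpus_cer` is a hypothetical helper that pools edits over all references rather than averaging per-utterance rates.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def calculate_cer(reference, hypothesis):
    """Character error rate for one utterance: edits / reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def corpus_cer(pairs):
    """Pooled CER over (reference, hypothesis) pairs."""
    pairs = list(pairs)
    total_chars = sum(len(ref) for ref, _ in pairs)
    if total_chars == 0:
        raise ValueError("references must contain at least one character")
    total_edits = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    return total_edits / total_chars
```

CER is usually the more natural metric for Japanese because written Japanese has no spaces between words; WER would first require a tokenizer, and the choice of tokenizer changes the score.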

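Background and challenges

Background overview

The Mozilla Common Voice project was launched by the Mozilla Foundation in 2017 to build an open-source, multilingual, crowdsourced speech dataset and to democratize automatic speech recognition technology. Its Japanese subset, cv-corpus-25.0-ja, is part of version 25.0, released in March 2026, and was built by a global volunteer community through recording and validation. Its core aim is to give Japanese ASR models a high-quality, diverse standard test set that covers speakers of different ages, genders, and accents, addressing data scarcity and under-representation in speech technology. The dataset gives academia and industry a reliable benchmark and supports fairness and inclusiveness in speech technology across the world's languages.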

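Current challenges

Japanese ASR faces distinctive challenges, including a complex phonological system, varied dialects, and differences between written and spoken language, all of which demand a high degree of linguistic adaptability from models. As a test set, cv-corpus-25.0-ja needs samples that cover these characteristics so that it can measure model robustness in realistic conditions. Building the dataset raises the usual quality-control problems of crowdsourced data, such as consistent checks on audio clarity, background noise, and transcription accuracy. Maintaining speaker diversity and balancing the demographic distribution also require careful design, so that data bias does not distort the evaluation. On the technical side, the large set of audio files has to be split across directories and merged with its metadata while staying consistent with the original release.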
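The file-splitting step mentioned above can be sketched as a simple chunking routine. This is a hypothetical illustration only: the source does not say how clips were assigned to directories, so the sorted order, the `shard_files` name, and the default shard size of 3,000 (taken from the directory listing in the dataset card) are assumptions.

```python
def shard_files(file_names, shard_size=3000, prefix="ja"):
    """Group file names into numbered shards of at most shard_size items.

    Returns a dict such as {"ja_00": [...], "ja_01": [...]}.
    Assumption: files are assigned in sorted order; the real dataset may differ.
    """
    if shard_size <= 0:
        raise ValueError("shard_size must be positive")
    ordered = sorted(file_names)
    shards = {}
    for start in range(0, len(ordered), shard_size):
        shard_name = f"{prefix}_{start // shard_size:02d}"
        shards[shard_name] = ordered[start:start + shard_size]
    return shards
```

With 9,019 file names and the default shard size, this produces four shards, the last holding the 19 remaining files, which matches the counts in the dataset card.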

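Common scenarios

Academic problems addressed

The dataset addresses the lack of a large, high-quality, public test set for Japanese speech recognition research. By providing the complete set of validated samples, it supports work on model generalization, adaptation across dialects or accents, and data bias analysis. It also helps standardize evaluation in Japanese ASR, making results from different studies easier to compare and supporting further work on algorithmic fairness and inclusiveness.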

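Practical applications

In practice, the dataset can be used to test and tune commercial speech recognition systems. Companies can use evaluation results on it to improve the recognition accuracy of Japanese voice assistants, real-time transcription services, and interactive voice response systems. In education technology, it can support pronunciation assessment in language-learning tools; in assistive technology, it helps lay the groundwork for more reliable speech-to-text services for visually impaired users.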

Recent research on the dataset