Libriheavy-HQ

Name: Libriheavy-HQ
Creator: maas
Published: 2025-09-16 14:23:14
License: No description provided

ModelScope community. Updated 2025-09-16. Indexed 2025-09-20.

Download link:

https://modelscope.cn/datasets/ForFuture/Libriheavy-HQ

Resource description:

# Dataset Card for Libriheavy-HQ

[Libriheavy](https://huggingface.co/datasets/pkufool/libriheavy): a 50,000 hours ASR corpus with punctuation casing and context.

Libriheavy is a labeled version of Libri-Light.

Libriheavy-HQ replaces the default Libri-Light audio files with the highest-quality versions available from LibriVox, without re-encoding them. In most cases, this is an upgrade of the source audio from a 64 kbps .mp3 to a 128 kbps .mp3.

## Overview

This is the Libriheavy-HQ dataset, adapted for the `datasets` library. About 500 hours of audio are currently available in the "small" subset. Additional subsets will be added in the future.

## Usage

### Subsets

Currently, only the "small" subset of [Libriheavy](https://huggingface.co/datasets/pkufool/libriheavy) is available. In the future, all listed subsets will be available. The default configuration is "small".

- "small": 509 hours of speech. 417 speakers averaging 1.22 hours per speaker. About 28 GB.
- "medium": 5042 hours of speech. 1531 speakers averaging 3.29 hours per speaker.
- "large": 50794 hours of speech. 6736 speakers averaging 7.54 hours per speaker.
- "dev": 22.3 hours of speech. 141 speakers averaging 0.16 hours per speaker.
- "test.clean": 10.5 hours of speech. 70 speakers averaging 0.15 hours per speaker.
- "test.other": 11.5 hours of speech. 72 speakers averaging 0.16 hours per speaker.
- "test.clean.large": 107.5 hours of speech. 72 speakers averaging 1.49 hours per speaker.
- "test.other.large": 100.3 hours of speech. 73 speakers averaging 1.37 hours per speaker.

### Example

Loading the `small` config with only the `train` split.

```
from datasets import load_dataset
ds = load_dataset("mythicinfinity/libriheavy-hq", "small", split="train")
```

Streaming is also supported.

```
from datasets import load_dataset
ds = load_dataset("mythicinfinity/libriheavy-hq", streaming=True)
```

### Columns

```
{
    "id": datasets.Value("string"),
    "speaker_id": datasets.Value("string"),
    "audio": datasets.Audio(sampling_rate=44_100, mono=True),
    "audio_duration": datasets.Value("float32"),
    "text_original": datasets.Value("string"),
    "text_transcription": datasets.Value("string"),
    "librivox_book_id": datasets.Value("string"),
}
```

## Dataset Details

### Dataset Description

- **Libriheavy License:** Apache 2.0

### Dataset Sources

- **Libriheavy Homepage:** https://github.com/k2-fsa/libriheavy
- **Libriheavy Paper:** https://arxiv.org/abs/2309.08105

## Citations

```
@misc{Thornbury2024LibriheavyHQ,
  author = {{Thornbury, Bryan and Mythic Infinity Labs}},
  title = {{Libriheavy-HQ}},
  year = {2024},
  url = {https://huggingface.co/datasets/mythicinfinity/libriheavy-hq},
}

@misc{kang2023libriheavy,
  title={Libriheavy: a 50,000 hours ASR corpus with punctuation casing and context},
  author={Wei Kang and Xiaoyu Yang and Zengwei Yao and Fangjun Kuang and Yifan Yang and Liyong Guo and Long Lin and Daniel Povey},
  year={2023},
  eprint={2309.08105},
  archivePrefix={arXiv},
  primaryClass={eess.AS}
}
```
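As a follow-up to the loading and streaming calls in the Usage section, here is a minimal sketch that combines the `small` config, the `train` split, and streaming, then keeps only a few short utterances. The helper names `take_short_utterances` and `stream_small_train` are illustrative and not part of the dataset or the `datasets` library. It also assumes that `audio_duration` is measured in seconds, which the card does not state, and it has not been verified against the hosted dataset.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator


def take_short_utterances(
    examples: Iterable[Dict[str, Any]],
    max_seconds: float,
    limit: int,
) -> Iterator[Dict[str, Any]]:
    """Yield at most `limit` examples whose `audio_duration` is <= `max_seconds`.

    Assumption: `audio_duration` is in seconds (not confirmed by the card).
    """
    short = (ex for ex in examples if ex["audio_duration"] <= max_seconds)
    return islice(short, limit)


def stream_small_train():
    """Open the "small" config, "train" split, as a stream.

    Needs network access and the `datasets` package; the import is done
    lazily so the filtering helper above has no third-party dependency.
    """
    from datasets import load_dataset

    return load_dataset(
        "mythicinfinity/libriheavy-hq",
        "small",
        split="train",
        streaming=True,
    )
```

Iterating over `take_short_utterances(stream_small_train(), max_seconds=10.0, limit=3)` should yield dictionaries with the keys listed under Columns, such as `id`, `speaker_id`, and `text_transcription`. Because the filter is lazy, iteration stops as soon as `limit` matching examples have been seen, so the full subset is not downloaded.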

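The per-speaker averages quoted in the subset list are the total hours divided by the number of speakers. The sketch below recomputes them from the listed figures as a sanity check. The names `SUBSETS`, `average_hours_per_speaker`, and `mismatched_subsets` are illustrative, and the 0.005 tolerance assumes the card rounds averages to two decimal places.

```python
from typing import Dict, List, Tuple

# (total hours, speakers, stated average hours per speaker), copied from the subset list
SUBSETS: Dict[str, Tuple[float, int, float]] = {
    "small": (509.0, 417, 1.22),
    "medium": (5042.0, 1531, 3.29),
    "large": (50794.0, 6736, 7.54),
    "dev": (22.3, 141, 0.16),
    "test.clean": (10.5, 70, 0.15),
    "test.other": (11.5, 72, 0.16),
    "test.clean.large": (107.5, 72, 1.49),
    "test.other.large": (100.3, 73, 1.37),
}


def average_hours_per_speaker(hours: float, speakers: int) -> float:
    """Mean hours of speech per speaker."""
    return hours / speakers


def mismatched_subsets(
    subsets: Dict[str, Tuple[float, int, float]],
    tolerance: float = 0.005,  # assumption: stated averages are rounded to 2 decimals
) -> List[str]:
    """Return the names of subsets whose stated average disagrees with hours / speakers."""
    return [
        name
        for name, (hours, speakers, stated) in subsets.items()
        if abs(average_hours_per_speaker(hours, speakers) - stated) > tolerance
    ]


if __name__ == "__main__":
    for name, (hours, speakers, _) in SUBSETS.items():
        print(f"{name}: {average_hours_per_speaker(hours, speakers):.2f} h/speaker")
```

With the figures as listed, `mismatched_subsets(SUBSETS)` returns an empty list, so the stated averages are consistent with the hour and speaker counts.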
