
Libriheavy-HQ

ModelScope community. Updated: 2025-09-16. Indexed: 2025-09-20.
Download link:
https://modelscope.cn/datasets/ForFuture/Libriheavy-HQ
Resource description:
# Dataset Card for Libriheavy-HQ

[Libriheavy](https://huggingface.co/datasets/pkufool/libriheavy): a 50,000 hours ASR corpus with punctuation casing and context.

Libriheavy is a labeled version of Libri-Light.

Libriheavy-HQ replaces the default Libri-Light audio files with the highest-quality versions available from LibriVox, without re-encoding them. In most cases, this is an upgrade of the source audio from a 64 kbps .mp3 to a 128 kbps .mp3.

## Overview

This is the Libriheavy-HQ dataset, adapted for the `datasets` library.

About 500 hours of audio are currently available in the "small" subset. Additional subsets will be added in the future.

## Usage

### Subsets

Currently, only the "small" subset of [Libriheavy](https://huggingface.co/datasets/pkufool/libriheavy) is available. In the future, all listed subsets will be available.

The default configuration is "small".

- "small": 509 hours of speech. 417 speakers averaging 1.22 hours per speaker. About 28 GB.
- "medium": 5042 hours of speech. 1531 speakers averaging 3.29 hours per speaker.
- "large": 50794 hours of speech. 6736 speakers averaging 7.54 hours per speaker.
- "dev": 22.3 hours of speech. 141 speakers averaging 0.16 hours per speaker.
- "test.clean": 10.5 hours of speech. 70 speakers averaging 0.15 hours per speaker.
- "test.other": 11.5 hours of speech. 72 speakers averaging 0.16 hours per speaker.
- "test.clean.large": 107.5 hours of speech. 72 speakers averaging 1.49 hours per speaker.
- "test.other.large": 100.3 hours of speech. 73 speakers averaging 1.37 hours per speaker.

### Example

Loading the `small` config with only the `train` split:

```
from datasets import load_dataset
load_dataset("mythicinfinity/libriheavy-hq", "small", split="train")
```

Streaming is also supported:

```
from datasets import load_dataset
load_dataset("mythicinfinity/libriheavy-hq", streaming=True)
```

### Columns

```
{
    "id": datasets.Value("string"),
    "speaker_id": datasets.Value("string"),
    "audio": datasets.Audio(sampling_rate=44_100, mono=True),
    "audio_duration": datasets.Value("float32"),
    "text_original": datasets.Value("string"),
    "text_transcription": datasets.Value("string"),
    "librivox_book_id": datasets.Value("string"),
}
```

## Dataset Details

### Dataset Description

- **Libriheavy License:** Apache 2.0

### Dataset Sources

- **Libriheavy Homepage:** https://github.com/k2-fsa/libriheavy
- **Libriheavy Paper:** https://arxiv.org/abs/2309.08105

## Citations

```
@misc{Thornbury2024LibriheavyHQ,
  author = {{Thornbury, Bryan and Mythic Infinity Labs}},
  title = {{Libriheavy-HQ}},
  year = {2024},
  url = {https://huggingface.co/datasets/mythicinfinity/libriheavy-hq},
}

@misc{kang2023libriheavy,
  title={Libriheavy: a 50,000 hours ASR corpus with punctuation casing and context},
  author={Wei Kang and Xiaoyu Yang and Zengwei Yao and Fangjun Kuang and Yifan Yang and Liyong Guo and Long Lin and Daniel Povey},
  year={2023},
  eprint={2309.08105},
  archivePrefix={arXiv},
  primaryClass={eess.AS}
}
```
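As a rough illustration of how the streaming mode and the column schema from the Usage section might be combined, the sketch below filters examples by `audio_duration`. The helper names `filter_by_duration` and `preview_small_train` are hypothetical, the duration thresholds are arbitrary, and the card does not state the unit of `audio_duration`; seconds is assumed here.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator


def filter_by_duration(
    examples: Iterable[Dict[str, Any]],
    min_duration: float,
    max_duration: float,
) -> Iterator[Dict[str, Any]]:
    """Yield examples whose `audio_duration` lies in [min_duration, max_duration].

    Assumption: `audio_duration` is expressed in seconds (not stated in the card).
    """
    for example in examples:
        if min_duration <= example["audio_duration"] <= max_duration:
            yield example


def preview_small_train(limit: int = 3) -> None:
    """Stream the "small" config and print a few mid-length utterances.

    Requires the `datasets` package and network access, so the import is
    kept local and this function is not called at import time.
    """
    from datasets import load_dataset

    stream = load_dataset(
        "mythicinfinity/libriheavy-hq", "small", split="train", streaming=True
    )
    # The 2.0 and 20.0 bounds are illustrative values, not recommendations.
    for example in islice(filter_by_duration(stream, 2.0, 20.0), limit):
        print(example["id"], example["speaker_id"], example["text_transcription"])
```

Because the filter is a plain generator over dictionaries, it works lazily on a streamed dataset and can be checked on hand-made records before touching the real data. Calling `preview_small_train()` would start the actual stream.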

Provider:
maas
Created:
2025-09-16