Libriheavy-HQ
Source: ModelScope community. Updated 2025-09-16; indexed 2025-09-20.
Download link: https://modelscope.cn/datasets/ForFuture/Libriheavy-HQ
# Dataset Card for Libriheavy-HQ
[Libriheavy](https://huggingface.co/datasets/pkufool/libriheavy) is a 50,000-hour ASR corpus with punctuation, casing, and context. It is a labeled version of Libri-Light.
Libriheavy-HQ replaces the default Libri-Light audio files with the highest-quality versions available from LibriVox, without re-encoding them.
In most cases, this means upgrading the source audio from a 64 kbps .mp3 to a 128 kbps .mp3.
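As a rough sense of what the bitrate change means for storage, the sketch below converts a bitrate into megabytes per hour of audio. It assumes constant-bitrate encoding and ignores container overhead and metadata; the function name `mp3_megabytes_per_hour` is illustrative and not part of the dataset.

```python
def mp3_megabytes_per_hour(bitrate_kbps: float) -> float:
    """Approximate size of one hour of constant-bitrate audio, in MB (10^6 bytes).

    Assumption: constant bitrate, no container or tag overhead.
    """
    bits_per_hour = bitrate_kbps * 1000 * 3600  # kbps -> bits per hour
    return bits_per_hour / 8 / 1_000_000        # bits -> bytes -> MB


for kbps in (64, 128):
    print(f"{kbps} kbps: about {mp3_megabytes_per_hour(kbps):.1f} MB per hour")
```

Under these assumptions, doubling the bitrate doubles the storage: about 28.8 MB per hour at 64 kbps versus about 57.6 MB per hour at 128 kbps.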
## Overview
This is the Libriheavy-HQ dataset, adapted for the `datasets` library.
About 500 hours of audio are currently available in the "small" subset. Additional subsets will be added in the future.
## Usage
### Subsets
Currently, only the "small" subset of [Libriheavy](https://huggingface.co/datasets/pkufool/libriheavy) is available.
In the future, all listed subsets will be available.
The default configuration is "small".
- "small": 509 hours of speech. 417 speakers averaging 1.22 hours per speaker. About 28 GB.
- "medium": 5042 hours of speech. 1531 speakers averaging 3.29 hours per speaker.
- "large": 50794 hours of speech. 6736 speakers averaging 7.54 hours per speaker.
- "dev": 22.3 hours of speech. 141 speakers averaging 0.16 hours per speaker.
- "test.clean": 10.5 hours of speech. 70 speakers averaging 0.15 hours per speaker.
- "test.other": 11.5 hours of speech. 72 speakers averaging 0.16 hours per speaker.
- "test.clean.large": 107.5 hours of speech. 72 speakers averaging 1.49 hours per speaker.
- "test.other.large": 100.3 hours of speech. 73 speakers averaging 1.37 hours per speaker.
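The per-speaker averages in the list above are simply total hours divided by speaker count. The following sketch recomputes them from the listed figures as a sanity check; the `SUBSETS` table and the `hours_per_speaker` helper are illustrative names, and the numbers are copied from the list rather than read from the dataset.

```python
# (total hours, speakers, listed average hours per speaker), copied from the subset list
SUBSETS = {
    "small": (509, 417, 1.22),
    "medium": (5042, 1531, 3.29),
    "large": (50794, 6736, 7.54),
    "dev": (22.3, 141, 0.16),
    "test.clean": (10.5, 70, 0.15),
    "test.other": (11.5, 72, 0.16),
    "test.clean.large": (107.5, 72, 1.49),
    "test.other.large": (100.3, 73, 1.37),
}


def hours_per_speaker(hours: float, speakers: int) -> float:
    """Average hours of speech per speaker."""
    return hours / speakers


for name, (hours, speakers, listed) in SUBSETS.items():
    computed = hours_per_speaker(hours, speakers)
    print(f"{name}: {computed:.2f} h/speaker (listed: {listed})")
```

Each recomputed value agrees with the listed average to within rounding at two decimal places.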
### Example
Load the `small` config with only the `train` split:
```
from datasets import load_dataset
dataset = load_dataset("mythicinfinity/libriheavy-hq", "small", split="train")
```
Streaming is also supported.
```
from datasets import load_dataset
dataset = load_dataset("mythicinfinity/libriheavy-hq", streaming=True)
```
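With streaming, examples arrive lazily, so a quick look at the first few records does not need the full download. The sketch below is one way to do that; `preview` and `open_small_train_stream` are hypothetical helper names, and the field names are taken from the Columns section of this card.

```python
from itertools import islice


def preview(examples, n=3):
    """Return (id, speaker_id, audio_duration) for the first n examples of any iterable."""
    return [
        (ex["id"], ex["speaker_id"], ex["audio_duration"])
        for ex in islice(examples, n)  # islice stops after n items without consuming more
    ]


def open_small_train_stream():
    """Open the "small" config's train split in streaming mode (requires network access)."""
    # Imported lazily so `preview` can be used without the `datasets` library installed.
    from datasets import load_dataset

    return load_dataset(
        "mythicinfinity/libriheavy-hq", "small", split="train", streaming=True
    )
```

Calling `preview(open_small_train_stream())` should then read only as many examples as requested from the stream.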
### Columns
```
{
"id": datasets.Value("string"),
"speaker_id": datasets.Value("string"),
"audio": datasets.Audio(sampling_rate=44_100, mono=True),
"audio_duration": datasets.Value("float32"),
"text_original": datasets.Value("string"),
"text_transcription": datasets.Value("string"),
"librivox_book_id": datasets.Value("string"),
}
```
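Because each record carries `speaker_id` and `audio_duration`, per-speaker totals can be computed from metadata alone. The sketch below assumes `audio_duration` is expressed in seconds, which this card does not state; `hours_by_speaker` is an illustrative helper and the sample records are made up.

```python
from collections import defaultdict


def hours_by_speaker(examples):
    """Sum audio_duration per speaker_id and convert to hours.

    Assumption: audio_duration is in seconds (not confirmed by the dataset card).
    """
    totals = defaultdict(float)
    for ex in examples:
        totals[ex["speaker_id"]] += ex["audio_duration"]
    return {speaker: seconds / 3600 for speaker, seconds in totals.items()}


# Hypothetical records shaped like the schema above (audio omitted for brevity).
records = [
    {"id": "a-0", "speaker_id": "a", "audio_duration": 1800.0},
    {"id": "a-1", "speaker_id": "a", "audio_duration": 1800.0},
    {"id": "b-0", "speaker_id": "b", "audio_duration": 900.0},
]
print(hours_by_speaker(records))
```

The same function accepts a streamed split, since it only needs an iterable of dict-like examples.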
## Dataset Details
### Dataset Description
- **Libriheavy License:** Apache 2.0
### Dataset Sources
- **Libriheavy Homepage:** https://github.com/k2-fsa/libriheavy
- **Libriheavy Paper:** https://arxiv.org/abs/2309.08105
## Citations
```
@misc{Thornbury2024LibriheavyHQ,
author = {{Thornbury, Bryan and Mythic Infinity Labs}},
title = {{Libriheavy-HQ}},
year = {2024},
url = {https://huggingface.co/datasets/mythicinfinity/libriheavy-hq},
}
@misc{kang2023libriheavy,
title={Libriheavy: a 50,000 hours ASR corpus with punctuation casing and context},
author={Wei Kang and Xiaoyu Yang and Zengwei Yao and Fangjun Kuang and Yifan Yang and Liyong Guo and Long Lin and Daniel Povey},
year={2023},
eprint={2309.08105},
archivePrefix={arXiv},
primaryClass={eess.AS}
}
```
Provider: maas
Created: 2025-09-16



