Menlo/instruction-speech-encodec-v1
Source: Hugging Face. Updated: 2024-08-19. Indexed: 2025-04-12.
Download link:
https://hf-mirror.com/datasets/Menlo/instruction-speech-encodec-v1
Resource description:
---
license: mit
language:
- en
tags:
- general
- audio2text
- multimodal model
size_categories:
- 100K<n<1M
configs:
- config_name: default
  data_files:
  - split: train
    path: data-*
---
# Dataset Card for "Instruction Speech"
> The largest open-source English speech instruction to text answer dataset
## Dataset Overview
This dataset contains nearly 450,000 English `speech instruction to text answer` samples. It was built using:
- A subset of [OpenHermes 2.5](https://huggingface.co/datasets/teknium/OpenHermes-2.5), keeping only samples whose user prompt length is less than 64.
- [WhisperSpeech](https://github.com/collabora/whisperspeech) for audio generation.
- [Encodec](https://github.com/facebookresearch/encodec) for tokenization.
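The subset selection described in the first bullet can be sketched as a simple filter. This is a minimal illustration, not the authors' pipeline: the card does not say how prompt length is measured, so the `measure` parameter (defaulting to character count) and the function name `select_short_prompts` are assumptions, and the toy records are made up.

```python
from typing import Callable, Dict, List


def select_short_prompts(
    samples: List[Dict[str, str]],
    max_length: int = 64,
    measure: Callable[[str], int] = len,
) -> List[Dict[str, str]]:
    """Keep samples whose prompt length is strictly below max_length.

    `measure` is an assumption: it defaults to character count, but a
    word- or tokenizer-based length could be passed in instead.
    """
    return [s for s in samples if measure(s["prompt"]) < max_length]


# Hypothetical toy records, not taken from the dataset.
toy_samples = [
    {"prompt": "What is the capital of France?", "answer": "Paris."},
    {"prompt": "Explain " + "in great detail " * 10, "answer": "..."},
]
short_samples = select_short_prompts(toy_samples)
print(len(short_samples))
```

Passing a different `measure`, for example `lambda p: len(p.split())`, switches the threshold from characters to words without changing the filter itself.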
## Usage
```python
from datasets import load_dataset, Audio
# Load Instruction Speech dataset
dataset = load_dataset("homebrewltd/instruction-speech-encodec-v1", split="train")
```
## Dataset Fields
| Field | Type | Description |
|------------------|------------|--------------------------------------------------|
| `prompt` | string | User's query |
| `answer` | string | Assistant's answer |
| `length` | int | Length of user's query |
| `audio` | audio | Audio of the spoken query |
| `tokens` | sequence | Audio tokenized using Encodec |
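To make the schema concrete, the sketch below checks that a record carries the five fields listed in the table. The helper `missing_or_mistyped` and the toy record are hypothetical. The card does not specify the in-memory layout of `audio` or the nesting of `tokens`, so the check only verifies that those two fields are present.

```python
from typing import Any, Dict, List

# Expected Python types for the scalar fields in the table above.
# `audio` and `tokens` are only checked for presence (their layout is not assumed).
EXPECTED_TYPES = {"prompt": str, "answer": str, "length": int}
PRESENCE_ONLY = ("audio", "tokens")


def missing_or_mistyped(record: Dict[str, Any]) -> List[str]:
    """Return the names of fields that are absent or have an unexpected type."""
    problems = []
    for name, expected in EXPECTED_TYPES.items():
        if not isinstance(record.get(name), expected):
            problems.append(name)
    for name in PRESENCE_ONLY:
        if name not in record:
            problems.append(name)
    return problems


# Hypothetical record with placeholder values, not real dataset content.
toy_record = {
    "prompt": "What is the capital of France?",
    "answer": "Paris.",
    "length": 30,
    "audio": None,
    "tokens": [0, 1, 2],
}
print(missing_or_mistyped(toy_record))  # an empty list means the record passes
```

The same function could be applied to a row of the loaded dataset, for instance `missing_or_mistyped(dataset[0])`, assuming the rows are exposed as dictionaries keyed by field name.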
## Bias, Risks, and Limitations
- Dataset may reflect biases inherent in its source.
- Current version lacks quality control for prompts and responses.
- Using Encodec may compromise the quality of the sound tokens.
- Users should consider these limitations when applying the dataset.
## Licensing Information
The dataset is released under the [MIT license](https://opensource.org/license/MIT).
## Citation Information
```
@article{InstructionSpeech2024,
  title={Instruction Speech},
  author={JanAI},
  year={2024},
  month={June},
  url={https://huggingface.co/datasets/jan-hq/instruction-speech}
}
```
Provider:
Menlo



