AIDC-AI/Marco_Longspeech

Source: Hugging Face | Updated: 2026-04-21 | Indexed: 2026-04-05
Download link:
https://hf-mirror.com/datasets/AIDC-AI/Marco_Longspeech
Description:
---
language:
- en
- zh
tags:
- audio
- speech
- asr
- speech-recognition
- question-answering
- summarization
- translation
- emotion-recognition
- speaker-diarization
license: apache-2.0
task_categories:
- automatic-speech-recognition
- audio-classification
- text-generation
size_categories:
- 10K<n<100K
---

***

# Marco-LongSpeech Dataset

<div align="center">

[![arXiv](https://img.shields.io/badge/arXiv-2601.13539-b31b1b.svg)](https://arxiv.org/abs/2601.13539)
[![GitHub](https://img.shields.io/badge/GitHub-Repo-121013.svg)](https://github.com/AIDC-AI/Marco-Longspeech)

</div>

Marco-LongSpeech is a multi-task long speech understanding dataset containing 8 different speech understanding tasks designed to benchmark Large Language Models on lengthy audio inputs.

## 📊 Dataset Statistics

### Task Statistics

| Task | Train | Val | Test | Total | Unique Audios |
|------|-------|-----|------|-------|---------------|
| ASR | 71,275 | 15,273 | 15,274 | 101,822 | 101,822 |
| Temporal_Relative_QA | 5,886 | 1,261 | 1,262 | 8,409 | 8,409 |
| summary | 4,366 | 935 | 937 | 6,238 | 6,238 |
| content_separation | 5,887 | 1,261 | 1,263 | 8,411 | 8,411 |
| emotionQA | 5,887 | 1,261 | 1,263 | 8,411 | 8,411 |
| speaker_count | 5,887 | 1,261 | 1,263 | 8,411 | 8,411 |
| translation | 29,435 | 6,307 | 6,309 | 42,051 | 8,411 |
| language_detection | 14,789 | 3,169 | 3,170 | 21,128 | 21,128 |
| **Total** | **143,412** | **30,728** | **30,741** | **204,881** | - |

### Audio Subset Statistics

| Subset | WAV Files | all_audios.jsonl | metadata.json |
|--------|-----------|------------------|----------------|
| LongSpeech_p1 | 29,539 | ✓ | ✓ |
| LongSpeech_p2 | 22,107 | ✓ | ✓ |
| LongSpeech_p3 | 50,176 | ✓ | ✓ |
| **Total** | **101,822** | - | - |

## 📁 Dataset Structure

```text
LongSpeech-Dataset/
├── LongSpeechQA/                 # QA data for 8 tasks
│   ├── ASR/                      # Automatic Speech Recognition
│   │   ├── train.jsonl
│   │   ├── val.jsonl
│   │   └── test.jsonl
│   ├── Temporal_Relative_QA/     # Temporal Relative QA
│   ├── summary/                  # Summarization
│   ├── content_separation/       # Content Separation
│   ├── emotionQA/                # Emotion QA
│   ├── speaker_count/            # Speaker Count
│   ├── translation/              # Translation
│   └── language_detection/       # Language Detection
├── LongSpeech_p1/
│   ├── wavs/
│   ├── all_audios.jsonl
│   └── metadata.json
├── LongSpeech_p2/
│   ├── wavs/
│   ├── all_audios.jsonl
│   └── metadata.json
├── LongSpeech_p3/
│   ├── wavs/
│   ├── all_audios.jsonl
│   └── metadata.json
└── README.md
```

## 🎯 Task Descriptions

The dataset covers a comprehensive range of capabilities required for long speech understanding:

* **ASR & S2T Translation**: Core transcription and translation of full-length audio.
* **Summarization**: Generating concise summaries from lengthy recordings.
* **Speaker Count & Language Detection**: Identifying speaker and language attributes.
* **Content Separation**: Detecting unrelated concatenated content to test coherence.
* **QA & Temporal Localization**: Evaluating comprehension, reasoning, and temporal tracking.
* **Emotion Analysis**: Determining the overall emotional tone of the speech.

## 📝 Data Format

Each task's `jsonl` file follows the format below:

```json
{
  "language": "en",
  "task": "ASR",
  "messages": [
    {
      "role": "user",
      "audio": "LongSpeech_p1/wavs/013429.wav",
      "content": "Detect the language and recognize the speech: <|en|>"
    },
    {
      "role": "assistant",
      "content": "We wont feel compelled in any way to pay at the top end or...."
    }
  ]
}
```

### Field Explanations

- `language`: Speech language code (e.g., en, zh).
- `task`: The type of task (e.g., ASR, summary).
- `messages`: A list of dialogue messages.
  - `role`: The role of the speaker (`user` or `assistant`).
  - `audio`: The relative path to the audio file.
  - `content`: Text content (user instructions or assistant responses).

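As a rough illustration of the record layout described above, the following sketch pulls the audio path, instruction, and reference text out of one JSONL line using only the standard library. The helper name `parse_record` and the returned keys `instruction` and `reference` are illustrative choices, not part of the dataset. It also assumes each record has exactly one `user` and one `assistant` message, as in the sample.

```python
import json


def parse_record(line: str) -> dict:
    """Flatten one JSONL line into the fields listed under Field Explanations.

    Assumption: one user message and one assistant message per record.
    """
    record = json.loads(line)
    user = next(m for m in record["messages"] if m["role"] == "user")
    assistant = next(m for m in record["messages"] if m["role"] == "assistant")
    return {
        "language": record["language"],
        "task": record["task"],
        "audio": user.get("audio"),          # relative path; may be absent
        "instruction": user["content"],      # hypothetical key name
        "reference": assistant["content"],   # hypothetical key name
    }


# A line mirroring the sample record from the Data Format section.
sample_line = json.dumps({
    "language": "en",
    "task": "ASR",
    "messages": [
        {
            "role": "user",
            "audio": "LongSpeech_p1/wavs/013429.wav",
            "content": "Detect the language and recognize the speech: <|en|>",
        },
        {
            "role": "assistant",
            "content": "We wont feel compelled in any way to pay at the top end or....",
        },
    ],
})

parsed = parse_record(sample_line)
print(parsed["task"], parsed["audio"])
```

Keeping the audio path relative, as the dataset does, means the same record works regardless of where the `LongSpeech_p*` folders are stored; the caller joins it with a local root at load time.
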
## 🚀 Usage

### Loading with Hugging Face Datasets

```python
from datasets import load_dataset

# Load data for a specific task (e.g., ASR).
# "AIDC-AI/Marco_Longspeech" is the repository ID under which this dataset is listed.
dataset = load_dataset(
    "AIDC-AI/Marco_Longspeech",
    data_files={
        "train": "LongSpeechQA/ASR/train.jsonl",
        "val": "LongSpeechQA/ASR/val.jsonl",
        "test": "LongSpeechQA/ASR/test.jsonl",
    },
)
print(dataset)
```

### Loading Audio Files

```python
import os

from datasets import load_dataset

# Assuming the dataset has been downloaded locally
dataset = load_dataset("json", data_files="LongSpeechQA/ASR/train.jsonl")

# Retrieve audio paths
for example in dataset["train"]:
    audio_path = example["messages"][0].get("audio")
    if audio_path:
        # Adjust 'your_download_path' to where you stored the LongSpeech_p* folders
        full_path = os.path.join("your_download_path", audio_path)
        print(f"Audio: {full_path}")
```

## 📚 Citation

If you find this dataset useful, please cite our paper:

```bibtex
@article{yang2026longspeech,
  title={LongSpeech: A Scalable Benchmark for Transcription, Translation and Understanding in Long Speech},
  author={Yang, Fei and Ni, Xuanfan and Yang, Renyi and Geng, Jiahui and Li, Qing and Lyu, Chenyang and Du, Yichao and Wang, Longyue and Luo, Weihua and Zhang, Kaifu},
  journal={arXiv preprint arXiv:2601.13539},
  year={2026}
}
```

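Before training or evaluation, it can be useful to check that a downloaded split is intact. The sketch below, which complements the usage snippets above, counts the records in a JSONL file and lists audio paths that do not resolve under a local root. The function name `audit_split`, the temporary directory, and the two demo file names are hypothetical and exist only to make the example self-contained.

```python
import json
import tempfile
from pathlib import Path


def audit_split(jsonl_path, audio_root):
    """Return (record_count, missing_audio_paths) for one JSONL split."""
    count = 0
    missing = []
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            count += 1
            for message in record["messages"]:
                audio = message.get("audio")
                # Audio paths are relative, so resolve them against audio_root.
                if audio and not (Path(audio_root) / audio).is_file():
                    missing.append(audio)
    return count, missing


def make_row(audio):
    """Build a minimal record in the dataset's message format (demo only)."""
    return {
        "language": "en",
        "task": "ASR",
        "messages": [
            {"role": "user", "audio": audio, "content": "Recognize the speech."},
            {"role": "assistant", "content": "placeholder transcript"},
        ],
    }


# Demo on a throwaway directory: one audio file exists, one does not.
with tempfile.TemporaryDirectory() as root:
    wav_dir = Path(root) / "LongSpeech_p1" / "wavs"
    wav_dir.mkdir(parents=True)
    (wav_dir / "000001.wav").touch()  # hypothetical file name

    rows = [
        make_row("LongSpeech_p1/wavs/000001.wav"),
        make_row("LongSpeech_p1/wavs/000002.wav"),  # intentionally missing
    ]
    split_path = Path(root) / "train.jsonl"
    split_path.write_text(
        "\n".join(json.dumps(r) for r in rows), encoding="utf-8"
    )

    count, missing = audit_split(split_path, root)

print(count, missing)
```

On a real download, the count returned for each split can be compared against the Task Statistics table, and a non-empty `missing` list usually points to a `LongSpeech_p*` folder that was not fully fetched.
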
Provider:
AIDC-AI