GenSEC-LLM/SLT-Task2-Post-ASR-Speaker-Tagging

Name: GenSEC-LLM/SLT-Task2-Post-ASR-Speaker-Tagging
Creator: GenSEC-LLM
Published: 2024-06-11 16:39:39
License: apache-2.0 (as declared in the dataset card)

Hugging Face · Updated 2024-06-11 · Indexed 2024-06-12

Download link:

https://hf-mirror.com/datasets/GenSEC-LLM/SLT-Task2-Post-ASR-Speaker-Tagging


Resource overview:

---
license: apache-2.0
---

# Dataset Name: Dataset for ASR Speaker-Tagging Corrections (Speaker Diarization)

## Description

- This dataset consists of pairs of erroneous ASR output and speaker tagging, generated by an ASR system and a speaker diarization system. Each erroneous source transcription is paired with a human-annotated transcription that contains the correct words and speaker tags.
- The files use [SEGment-wise Long-form Speech Transcription annotation](#segment-wise-long-form-speech-transcription-annotation-seglst) (`SegLST`), the file format used in the [CHiME challenges](https://www.chimechallenge.org).

Example) `session_ge1nse2c.seglst.json`

```
[
  ...
  {
    "session_id": "session_ge1nse2c",
    "words": "well that is the problem we have erroneous transcript and speaker tagging we want to correct it using large language models",
    "start_time": 181.88,
    "end_time": 193.3,
    "speaker": "speaker1"
  },
  {
    "session_id": "session_ge1nse2c",
    "words": "it seems like a really interesting problem I feel that we can start with very simple methods",
    "start_time": 194.48,
    "end_time": 205.03,
    "speaker": "speaker2"
  },
  ...
]
```

## Structure

### Data Split

The dataset is divided into training, development, and evaluation splits:

- Training Data: 222 entries
  - 2 to 4 speakers in each session
  - Approximately 10 to 40 minutes of recordings
- Development Data: 13 entries
  - 2 speakers in each session
  - Approximately 10 minutes of recordings
- Evaluation Data: 11 entries
  - 2 speakers in each session
  - Approximately 10 minutes of recordings

### Keys (items)

- `session_id`: Session identifier, e.g. "session_ge1nse2c".
- `words`: Transcription corresponding to the time stamp (start, end).
- `start_time`: Start time in seconds.
- `end_time`: End time in seconds.
- `speaker`: Speaker tag, a string of the form "speaker\<N\>".

### Source Datasets

`err_source_text`: The erroneous ASR-diarization results to be fixed. Contains `dev` and `eval` folders.

`ref_annotated_text`: The human-annotated ground truth used for evaluation. Only the `dev` split is included.

- **Training Sources**:
  - `dev`: 222 sessions
- **Development Sources**:
  - `dev`: 13 sessions
- **Evaluation Sources**:
  - `eval`: 11 sessions

## Access

The dataset can be accessed and downloaded through the Hugging Face Datasets library (i.e., this repository).

## Evaluation

Corrected output for this dataset can be evaluated with the [MeetEval Software](https://github.com/fgnt/meeteval).

### From PyPI

```
pip install meeteval
```

### From source

```
git clone https://github.com/fgnt/meeteval
pip install -e ./meeteval
```

### Evaluate the corrected SegLST files:

```
python -m meeteval.wer cpwer -h err_source_text/dev/session_ge1nse2c.json -r ref_annotated_text/dev/session_ge1nse2c.json
```

Alternatively, after installation you can use the following command:

```
meeteval-wer cpwer -h err_source_text/dev/session_ge1nse2c.json -r ref_annotated_text/dev/session_ge1nse2c.json
```

### References

```bib
@inproceedings{park2024enhancing,
  title={Enhancing speaker diarization with large language models: A contextual beam search approach},
  author={Park, Tae Jin and Dhawan, Kunal and Koluguri, Nithin and Balam, Jagadeesh},
  booktitle={ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={10861--10865},
  year={2024},
  organization={IEEE}
}
```

```bib
@InProceedings{MeetEval23,
  title={MeetEval: A Toolkit for Computation of Word Error Rates for Meeting Transcription Systems},
  author={von Neumann, Thilo and Boeddeker, Christoph and Delcroix, Marc and Haeb-Umbach, Reinhold},
  booktitle={CHiME-2023 Workshop, Dublin, England},
  year={2023}
}
```

Provided by:

GenSEC-LLM

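Summary of Original Information

Dataset overview: ASR Speaker-Tagging Corrections (Speaker Diarization)

Dataset description

This dataset contains erroneous automatic speech recognition (ASR) output and speaker tags, produced by an ASR system and a speaker diarization system.
Each erroneous transcript is paired with a human-annotated transcript that contains the correct words and speaker tags.
The files use the SegLST format, which is also used in the CHiME challenges.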

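Data structure

Data splits

Training data: 222 entries, 2 to 4 speakers per session, roughly 10 to 40 minutes of recording each.
Development data: 13 entries, 2 speakers per session, roughly 10 minutes of recording each.
Evaluation data: 11 entries, 2 speakers per session, roughly 10 minutes of recording each.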

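Key fields

session_id: Session identifier.
words: Transcribed text corresponding to the time stamp (start, end).
start_time: Start time, in seconds.
end_time: End time, in seconds.
speaker: Speaker tag, in the form "speaker<N>".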
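To make the field layout concrete, here is a minimal sketch that parses SegLST segments and groups the words by speaker. The two segments are copied from the dataset card's example; the helper name `words_per_speaker` is hypothetical and not part of any library.

```python
import json
from collections import defaultdict

# Two segments copied from the example in the dataset card.
SEGLST_TEXT = """
[
  {"session_id": "session_ge1nse2c",
   "words": "well that is the problem we have erroneous transcript and speaker tagging we want to correct it using large language models",
   "start_time": 181.88, "end_time": 193.3, "speaker": "speaker1"},
  {"session_id": "session_ge1nse2c",
   "words": "it seems like a really interesting problem I feel that we can start with very simple methods",
   "start_time": 194.48, "end_time": 205.03, "speaker": "speaker2"}
]
"""

def words_per_speaker(segments):
    """Concatenate each speaker's words in start-time order (hypothetical helper)."""
    by_speaker = defaultdict(list)
    for seg in sorted(segments, key=lambda s: s["start_time"]):
        by_speaker[seg["speaker"]].extend(seg["words"].split())
    return dict(by_speaker)

segments = json.loads(SEGLST_TEXT)
streams = words_per_speaker(segments)
for speaker, words in streams.items():
    print(speaker, len(words), "words")
```

Grouping by speaker in time order is also the first step of speaker-attributed scoring, since a word moved to the wrong speaker changes two of these streams at once.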

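Source datasets

err_source_text: The erroneous ASR-diarization results to be corrected; contains dev and eval folders.
ref_annotated_text: The human-annotated reference text used for evaluation; only the dev split is included.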
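As an illustration of how the two folders relate, the sketch below pairs each erroneous file with its reference by file name. It assumes a local copy laid out as `<root>/err_source_text/<split>/` and `<root>/ref_annotated_text/<split>/` with matching file names, which is inferred from the folder descriptions rather than confirmed. The function `pair_sessions` and the demo file names are hypothetical.

```python
from pathlib import Path
import tempfile

def pair_sessions(root, split="dev"):
    """Return (hypothesis_path, reference_path) pairs that share a file name.

    Assumed layout: <root>/err_source_text/<split>/ and
    <root>/ref_annotated_text/<split>/.
    """
    hyp_dir = Path(root) / "err_source_text" / split
    ref_dir = Path(root) / "ref_annotated_text" / split
    pairs = []
    for hyp in sorted(hyp_dir.glob("*.json")):
        ref = ref_dir / hyp.name
        if ref.exists():  # a split without references yields no pairs
            pairs.append((hyp, ref))
    return pairs

# Demo on a throwaway directory that mimics the assumed layout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for folder in ("err_source_text/dev", "err_source_text/eval", "ref_annotated_text/dev"):
        (root / folder).mkdir(parents=True)
    for rel in ("err_source_text/dev/session_a.json",
                "ref_annotated_text/dev/session_a.json",
                "err_source_text/eval/session_b.json"):
        (root / rel).write_text("[]")
    dev_pairs = [(h.name, r.name) for h, r in pair_sessions(root, "dev")]
    eval_pairs = pair_sessions(root, "eval")

print(dev_pairs, eval_pairs)
```

Because only the dev split ships with references, the eval call returns an empty list, so local scoring is limited to dev.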

Evaluation method

Evaluation uses the MeetEval software.
Corrected SegLST files can be scored with the command-line tool meeteval-wer cpwer.
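For intuition about what a cpWER score measures, here is a simplified, self-contained sketch of the concatenated minimum-permutation word error rate. It is not MeetEval's implementation: it tries every speaker permutation by brute force, which is only practical for the 2 to 4 speakers per session found in this dataset. All function names and the toy segments are hypothetical.

```python
from itertools import permutations

def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # match or substitution
        prev = cur
    return prev[-1]

def speaker_streams(segments):
    """Concatenate each speaker's words in start-time order."""
    streams = {}
    for seg in sorted(segments, key=lambda s: s["start_time"]):
        streams.setdefault(seg["speaker"], []).extend(seg["words"].split())
    return list(streams.values())

def cpwer(ref_segments, hyp_segments):
    """Minimum over speaker permutations of total errors / reference words."""
    ref = speaker_streams(ref_segments)
    hyp = speaker_streams(hyp_segments)
    # Pad with empty streams so both sides have the same number of speakers.
    n = max(len(ref), len(hyp))
    ref = ref + [[] for _ in range(n - len(ref))]
    hyp = hyp + [[] for _ in range(n - len(hyp))]
    total_ref_words = sum(len(r) for r in ref)
    if total_ref_words == 0:
        raise ValueError("reference contains no words")
    best = min(sum(edit_distance(r, h) for r, h in zip(ref, perm))
               for perm in permutations(hyp))
    return best / total_ref_words

# Toy data: hypothesis speaker labels differ and one word is wrong.
ref_segments = [
    {"speaker": "speaker1", "start_time": 0.0, "end_time": 1.0, "words": "hello how are you"},
    {"speaker": "speaker2", "start_time": 1.0, "end_time": 2.0, "words": "fine thank you"},
]
hyp_segments = [
    {"speaker": "speakerA", "start_time": 1.0, "end_time": 2.0, "words": "fine thanks you"},
    {"speaker": "speakerB", "start_time": 0.0, "end_time": 1.0, "words": "hello how are you"},
]
score = cpwer(ref_segments, hyp_segments)
print(f"cpWER = {score:.4f}")
```

Because the score takes the best speaker mapping, differing label names cost nothing; only word errors and words attributed to the wrong speaker stream raise it. For reported results, use MeetEval itself.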
