GenSEC-LLM/SLT-Task2-Post-ASR-Speaker-Tagging
Source: Hugging Face. Updated 2024-06-11, indexed 2024-06-12.
Download link:
https://hf-mirror.com/datasets/GenSEC-LLM/SLT-Task2-Post-ASR-Speaker-Tagging
Resource description:
---
license: apache-2.0
---
# Dataset Name: Dataset for ASR Speaker-Tagging Corrections (Speaker Diarization)
## Description
- This dataset consists of pairs of erroneous ASR output and speaker tagging, generated by an ASR system and a speaker diarization system.
Each erroneous source transcription is paired with a human-annotated transcription that has the correct transcription and speaker tagging.
- Files use the [SEGment-wise Long-form Speech Transcription annotation](#segment-wise-long-form-speech-transcription-annotation-seglst) (`SegLST`) format, the file format used in the [CHiME challenges](https://www.chimechallenge.org).
Example: `session_ge1nse2c.seglst.json`
```
[
  ...
  {
    "session_id": "session_ge1nse2c",
    "words": "well that is the problem we have erroneous transcript and speaker tagging we want to correct it using large language models",
    "start_time": 181.88,
    "end_time": 193.3,
    "speaker": "speaker1"
  },
  {
    "session_id": "session_ge1nse2c",
    "words": "it seems like a really interesting problem I feel that we can start with very simple methods",
    "start_time": 194.48,
    "end_time": 205.03,
    "speaker": "speaker2"
  },
  ...
]
```
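As a minimal sketch of how such a file might be consumed, the snippet below parses SegLST segments with Python's standard `json` module and concatenates the words of each speaker. The helper name `words_by_speaker` and the inline `SAMPLE` string are illustrative assumptions, not part of the dataset or of any official tooling.

```python
import json
from collections import defaultdict

# Two segments copied from the example above; a real file holds the full session.
SAMPLE = """
[
  {"session_id": "session_ge1nse2c",
   "words": "well that is the problem we have erroneous transcript and speaker tagging we want to correct it using large language models",
   "start_time": 181.88, "end_time": 193.3, "speaker": "speaker1"},
  {"session_id": "session_ge1nse2c",
   "words": "it seems like a really interesting problem I feel that we can start with very simple methods",
   "start_time": 194.48, "end_time": 205.03, "speaker": "speaker2"}
]
"""


def words_by_speaker(segments):
    """Concatenate each speaker's words in order of segment start time."""
    grouped = defaultdict(list)
    for seg in sorted(segments, key=lambda s: s["start_time"]):
        grouped[seg["speaker"]].append(seg["words"])
    return {spk: " ".join(chunks) for spk, chunks in grouped.items()}


# For a file on disk, json.load() on the opened file would replace json.loads().
segments = json.loads(SAMPLE)
transcripts = words_by_speaker(segments)
for speaker, text in transcripts.items():
    print(speaker, len(text.split()), "words")
```

Grouping by speaker is also the form that speaker-attributed metrics such as cpWER operate on, so this per-speaker view is a convenient starting point for inspecting tagging errors.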
## Structure
### Data Split
The dataset is divided into training, development, and evaluation splits:
- Training Data: 222 entries
  - 2 to 4 speakers in each session
  - Approximately 10 to 40 minutes of recordings
- Development Data: 13 entries
  - 2 speakers in each session
  - Approximately 10 minutes of recordings
- Evaluation Data: 11 entries
  - 2 speakers in each session
  - Approximately 10 minutes of recordings
### Keys (items)
- `session_id`: Session identifier, e.g. "session_ge1nse2c".
- `words`: Transcription corresponding to the time stamp (start, end).
- `start_time`: Start time in seconds.
- `end_time`: End time in seconds.
- `speaker`: Speaker tag as a string of the form "speaker\<N\>".
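The keys above can be checked mechanically before a corrected file is submitted for scoring. The following is a rough sketch only: the function name `segment_problems` is hypothetical, and the regular expression assumes that `<N>` is a non-negative integer, which the description does not state explicitly.

```python
import re

REQUIRED_KEYS = {"session_id", "words", "start_time", "end_time", "speaker"}
# Assumption: <N> in "speaker<N>" is a non-negative integer.
SPEAKER_PATTERN = re.compile(r"speaker\d+")


def segment_problems(segment):
    """Return a list of human-readable problems found in one SegLST segment."""
    problems = []
    missing = REQUIRED_KEYS - segment.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
        return problems
    if not isinstance(segment["words"], str):
        problems.append("words must be a string")
    start, end = segment["start_time"], segment["end_time"]
    if not all(isinstance(t, (int, float)) for t in (start, end)):
        problems.append("start_time and end_time must be numbers (seconds)")
    elif start > end:
        problems.append("start_time is after end_time")
    speaker = segment["speaker"]
    if not (isinstance(speaker, str) and SPEAKER_PATTERN.fullmatch(speaker)):
        problems.append("speaker does not look like 'speaker<N>'")
    return problems
```

An empty list means the segment passed every check; running the function over all segments of a session gives a quick sanity check on an LLM-corrected output.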
### Source Datasets
- `err_source_text`: The erroneous ASR-diarization results to be fixed. Contains `dev` and `eval` folders.
- `ref_annotated_text`: The human-annotated ground truth for evaluation. Only the `dev` split is included.
- **Training Sources**:
  - `dev`: 222 sessions
- **Development Sources**:
  - `dev`: 13 sessions
- **Evaluation Sources**:
  - `eval`: 11 sessions
## Access
The dataset can be accessed and downloaded through the Hugging Face Datasets library (i.e., this repository).
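One possible way to fetch the files programmatically is sketched below. It assumes the separate `huggingface_hub` package is installed and uses its `snapshot_download` function; the helper names `download_dataset` and `list_session_files` are hypothetical, and the folder layout is taken from the Source Datasets description rather than verified against the repository.

```python
from pathlib import Path

REPO_ID = "GenSEC-LLM/SLT-Task2-Post-ASR-Speaker-Tagging"


def download_dataset(local_dir):
    """Download the whole dataset repository (needs `pip install huggingface_hub`)."""
    # Imported lazily so the rest of this sketch works without the package.
    from huggingface_hub import snapshot_download
    return Path(snapshot_download(repo_id=REPO_ID, repo_type="dataset", local_dir=local_dir))


def list_session_files(root, folder, split):
    """List the JSON files under <root>/<folder>/<split>, sorted by name."""
    return sorted((Path(root) / folder / split).glob("*.json"))


# Example usage (requires network access, so it is left commented out):
# root = download_dataset("slt_task2_data")
# dev_hypotheses = list_session_files(root, "err_source_text", "dev")
```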
## Evaluation
This dataset can be evaluated with the [MeetEval software](https://github.com/fgnt/meeteval).
### From PyPI
```
pip install meeteval
```
### From source
```
git clone https://github.com/fgnt/meeteval
pip install -e ./meeteval
```
### Evaluate the corrected SegLST files
```
python -m meeteval.wer cpwer -h err_source_text/dev/session_ge1nse2c.json -r ref_annotated_text/dev/session_ge1nse2c.json
```
Alternatively, after installation you can use the following command:
```
meeteval-wer cpwer -h err_source_text/dev/session_ge1nse2c.json -r ref_annotated_text/dev/session_ge1nse2c.json
```
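To give an intuition for what the `cpwer` subcommand measures, here is a simplified, brute-force sketch of the concatenated minimum-permutation word error rate: each speaker's words are concatenated, every assignment of hypothesis speakers to reference speakers is tried, and the lowest total word edit distance is divided by the number of reference words. This is an illustration written for this card, not MeetEval's implementation; the names `edit_distance` and `simple_cpwer` are hypothetical, and MeetEval should be used for any reported score.

```python
from itertools import permutations


def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, start=1):
        curr = [i]
        for j, h in enumerate(hyp_words, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # match or substitution
        prev = curr
    return prev[-1]


def simple_cpwer(reference, hypothesis):
    """Both arguments map a speaker label to that speaker's concatenated transcript."""
    refs = [text.split() for text in reference.values()]
    hyps = [text.split() for text in hypothesis.values()]
    # Pad with empty streams so both sides have the same number of speakers.
    size = max(len(refs), len(hyps))
    refs += [[] for _ in range(size - len(refs))]
    hyps += [[] for _ in range(size - len(hyps))]
    total_ref_words = sum(len(r) for r in refs)
    if total_ref_words == 0:
        raise ValueError("reference contains no words")
    # Brute force over speaker assignments; fine for the 2 to 4 speakers here.
    best_errors = min(
        sum(edit_distance(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )
    return best_errors / total_ref_words
```

Because the best speaker assignment is searched for, merely swapping the labels `speaker1` and `speaker2` in a hypothesis costs nothing, whereas a word attributed to the wrong speaker is counted as an error on both speakers' streams.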
### References
```bib
@inproceedings{park2024enhancing,
title={Enhancing speaker diarization with large language models: A contextual beam search approach},
author={Park, Tae Jin and Dhawan, Kunal and Koluguri, Nithin and Balam, Jagadeesh},
booktitle={ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
pages={10861--10865},
year={2024},
organization={IEEE}
}
```
```bib
@InProceedings{MeetEval23,
title={MeetEval: A Toolkit for Computation of Word Error Rates for Meeting Transcription Systems},
author={von Neumann, Thilo and Boeddeker, Christoph and Delcroix, Marc and Haeb-Umbach, Reinhold},
booktitle={CHiME-2023 Workshop, Dublin, Ireland},
year={2023}
}
```
Provider:
GenSEC-LLM
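Summary of original information
Dataset overview: ASR Speaker-Tagging Corrections (Speaker Diarization)
Dataset description
- This dataset contains erroneous automatic speech recognition (ASR) output and speaker tags, generated by an ASR system and a speaker diarization system.
- Each erroneous transcription is paired with a human-annotated transcription that contains the correct transcription and speaker tags.
- It uses the SegLST file format, which is used in the CHiME challenges.
Data structure
Data split
- Training data: 222 entries, 2 to 4 speakers per session, approximately 10 to 40 minutes of recordings.
- Development data: 13 entries, 2 speakers per session, approximately 10 minutes of recordings.
- Evaluation data: 11 entries, 2 speakers per session, approximately 10 minutes of recordings.
Key fields
- session_id: Session identifier.
- words: Transcription corresponding to the time stamp (start, end).
- start_time: Start time in seconds.
- end_time: End time in seconds.
- speaker: Speaker tag in the format "speaker<N>".
Source datasets
- err_source_text: The erroneous ASR-diarization results to be fixed; contains dev and eval folders.
- ref_annotated_text: The human-annotated reference text for evaluation; only the dev split is included.
Evaluation method
- Evaluation uses the MeetEval software.
- Corrected SegLST files can be evaluated with the command-line tool meeteval-wer cpwer.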



