Extracting Biomedical Entities from Noisy Audio Transcripts--Dataset
Source: NIAID Data Ecosystem (indexed 2026-05-01)
Download link:
https://zenodo.org/record/10864062
Resource description:
SUMMARY:
This repo contains the CADEC and Synthetic BTACT datasets that were used for the paper titled "Extracting Biomedical Entities from Noisy Audio Transcripts."
The dataset includes two sets: i) CADEC (Karimi et al., 2015) and ii) Synthetic BTACT. CADEC is a well-known NER dataset for identifying adverse drug reactions in what patients have written about their experiences. Synthetic BTACT is a dataset that we created ourselves, based on questions similar to those in the Brief Test of Adult Cognition by Telephone (BTACT) (Tun & Lachman, 2006).
CADEC includes two sets of audio files: one is read from the original CADEC text, and the other is the same audio with additional noise. It also includes the original CADEC scripts, the annotations, and the transcripts of the noisy audio. The transcripts were generated using Whisper. The annotations cover the named entities, their types, and the string indices of their occurrences in the text. The annotations also include "AnnotatorNotes", which explain some of the annotations.
The synthetic BTACT data include two types: i) animals and ii) fruits. As with CADEC, there are two sets of audio files: one read from the original scripts and another with additional audio noise. The text files include the original scripts, the annotations, and the Whisper-generated transcripts of the noisy audio files. The annotations list each named entity together with its index, type, and start/end string indices.
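The annotation fields described above (index, type, character offsets, entity text, plus "AnnotatorNotes") resemble the brat standoff format. As a minimal sketch, assuming the `.ann` files are tab-separated in that style, a parser might look like the following. The `Entity` class, the `parse_ann` and `span_text` functions, and the sample script and annotation lines are made up for illustration and are not taken from the dataset.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Entity:
    ident: str                    # annotation index, e.g. "T1"
    etype: str                    # entity type, e.g. "ADR"
    spans: List[Tuple[int, int]]  # (start, end) character offsets, end exclusive
    text: str                     # entity string as stored in the annotation


def parse_ann(ann_text: str) -> Tuple[List[Entity], Dict[str, str]]:
    """Parse brat-style lines (assumed layout) into entities and annotator notes."""
    entities: List[Entity] = []
    notes: Dict[str, str] = {}
    for line in ann_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        if fields[0].startswith("T"):
            # Assumed form: "T1<TAB>ADR 9 19<TAB>bit drowsy"
            etype, _, offsets = fields[1].partition(" ")
            # Discontinuous entities are assumed to use "start end;start end"
            spans = [tuple(int(n) for n in frag.split()) for frag in offsets.split(";")]
            entities.append(Entity(fields[0], etype, spans, fields[2]))
        elif fields[0].startswith("#"):
            # Assumed form: "#1<TAB>AnnotatorNotes T1<TAB>free-text note"
            notes[fields[1].split()[-1]] = fields[2]
    return entities, notes


def span_text(script: str, entity: Entity) -> str:
    """Recover the entity string from the script using its offsets."""
    return " ".join(script[start:end] for start, end in entity.spans)


# Hypothetical script and annotation, written for this example only.
script = "I felt a bit drowsy after taking Lipitor."
ann = "T1\tADR 9 19\tbit drowsy\n#1\tAnnotatorNotes T1\tsedation\nT2\tDRUG 33 40\tLipitor\n"

entities, notes = parse_ann(ann)
for ent in entities:
    # The offsets should reproduce the stored entity text.
    print(ent.ident, ent.etype, ent.spans, span_text(script, ent) == ent.text)
```

An offset check of this kind only makes sense against the original scripts. The Whisper transcripts of the noisy audio may differ from the scripts, so the annotated offsets need not line up with them.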
REFERENCES:
Karimi, S., Metke-Jimenez, A., Kemp, M., & Wang, C. (2015). Cadec: A corpus of adverse drug event annotations. Journal of Biomedical Informatics, 55, 73-81.
Tun, P. A., & Lachman, M. E. (2006). Telephone assessment of cognitive function in adulthood: the Brief Test of Adult Cognition by Telephone. Age and Ageing, 35(6), 629-632.
DETAILS:
Data_1: CADEC (1250 TextFiles, 1000 Audio, types=5):
General Categories and Counts
  ADR (Adverse Drug Reactions): 5316
  DRUG: 1797
  FINDING: 397
  DISEASE: 280
  SYMPTOM: 255
Specific Items (Drugs) and Counts
  Arthrotec: 145
  cambia: 4
  cataflam: 10
  diclofenac-potassium: 3
  diclofenac-sodium: 7
  flector: 1
  Lipitor: 997
  Pennsaid: 4
  solarez: 3
  voltaren: 46
  voltaren-rx: 22
  zipsor: 5
Data_2: Synthetic BTACT (500 Fruits, 500 Animals, types=2)
>> Audios can be matched with annotations, scripts and transcripts using their filenames.
---audio [original & noisy]:
1. cadec
  1.1 cadec original
  1.2 cadec noisy
2. synthetic btact
  2.1 btact original
    2.1.1 fruits: fruit-script[0:500].mp3
    2.1.2 animals: script-[0:500].mp3
  2.2 btact noisy
    2.2.1 fruits: fruit-script[0:500].mp3
    2.2.2 animals: script-[0:500].mp3
---text [scripts, annotations, transcripts]:
1. cadec
  1.1 scripts [1,250]
  1.2 annotations [1,250] (index/AnnotatorsNote, type, indices, named-entities)
  1.3 transcripts [1,000]
2. synthetic btact
  2.1 animals
    2.1.1 scripts (original scripts): script-[0:500].txt
    2.1.2 annotations: script-[0:500].ann (index, type, start/end indices, named entity)
    2.1.3 transcripts: script-[0:500].txt
  2.2 fruits
    2.2.1 scripts (original scripts): script-[0:500].txt
    2.2.2 annotations: script-[0:500].ann (index, type, start/end indices, named entity)
    2.2.3 transcripts: fruit-script-[0:500].txt
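Since audio files are matched to annotations, scripts, and transcripts by filename, one way to assemble per-item records is to group paths by their filename stem. The sketch below is illustrative only: `match_files`, the role names, and the example paths are hypothetical, and the actual directory names in the archive may differ. Some fruit files in the listing carry a `fruit-` prefix, so a custom `key` function may be needed to align them.

```python
from collections import defaultdict
from pathlib import PurePosixPath
from typing import Callable, Dict, List


def default_key(path: str) -> str:
    """Use the filename without its extension as the matching key."""
    return PurePosixPath(path).stem


def match_files(
    files_by_role: Dict[str, List[str]],
    key: Callable[[str], str] = default_key,
) -> Dict[str, Dict[str, str]]:
    """Group paths by key and keep only items that have a file for every role."""
    grouped: Dict[str, Dict[str, str]] = defaultdict(dict)
    for role, paths in files_by_role.items():
        for path in paths:
            grouped[key(path)][role] = path
    roles = set(files_by_role)
    return {k: rec for k, rec in grouped.items() if set(rec) == roles}


# Hypothetical paths; the real archive layout may differ.
files = {
    "audio": ["audio/animals/script-0.mp3", "audio/animals/script-1.mp3"],
    "script": ["text/animals/scripts/script-0.txt", "text/animals/scripts/script-1.txt"],
    "annotation": ["text/animals/annotations/script-0.ann"],
    "transcript": ["text/animals/transcripts/script-0.txt", "text/animals/transcripts/script-1.txt"],
}

records = match_files(files)
for name, rec in sorted(records.items()):
    print(name, rec)
```

Dropping incomplete items is a design choice made here for simplicity. This matters for CADEC, where the listing reports 1,250 scripts and annotations but only 1,000 audio files and transcripts, so some scripts would have no audio counterpart.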
CITATION:
Ebadi, N., Morgan, K., Tan, A., Linares, B., Osborn, S., Majors, E., Davis, J., & Rios, A. (2024). Extracting biomedical entities from noisy audio transcripts. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024).
Created:
2024-03-25



