
Extracting Biomedical Entities from Noisy Audio Transcripts--Dataset

Source: NIAID Data Ecosystem (indexed 2026-05-01)
Download link:
https://zenodo.org/record/10864062
Resource description:
SUMMARY: This repository contains the CADEC and Synthetic BTACT datasets used in the paper "Extracting Biomedical Entities from Noisy Audio Transcripts." It includes two sets: i) CADEC (Karimi et al., 2015) and ii) Synthetic BTACT. CADEC is a well-known NER dataset for identifying adverse drug reactions in patients' written accounts of their experiences. Synthetic BTACT is data that we created, based on questions similar to those in the Brief Test of Adult Cognition by Telephone (BTACT) (Tun et al., 2006).

CADEC includes two sets of audio files: one read aloud from the original CADEC scripts, and the other with additional audio noise. It also includes the original CADEC scripts, the annotations, and the transcripts of the noisy audio. The transcripts are generated using Whisper. The annotations cover the named entities, their types, and the string indices of their occurrences in the text. They also include "AnnotatorNotes" entries, which explain some of the annotations.

The Synthetic BTACT data come in two types: i) animals and ii) fruits. As with CADEC, there are two sets of audio files: one read aloud from the original scripts and another with additional audio noise. The text files include the original scripts, the annotations, and the Whisper-generated transcripts of the noisy audio files. Each annotation records a named entity's index, its type, and its string indices.

REFERENCES:
Karimi, S., Metke-Jimenez, A., Kemp, M., & Wang, C. (2015). Cadec: A corpus of adverse drug event annotations. Journal of Biomedical Informatics, 55, 73-81.
Tun, P. A., & Lachman, M. E. (2006). Telephone assessment of cognitive function in adulthood: the Brief Test of Adult Cognition by Telephone. Age and Ageing, 35(6), 629-632.
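The annotation fields described above (index, type, string indices, entity text, plus "AnnotatorNotes") could be read with a small parser. The following is a minimal sketch that assumes the `.ann` files use a brat-style, tab-separated standoff layout; this listing does not confirm the exact layout, and the sample lines, the `Entity` class, and the `parse_ann` function are all hypothetical illustrations rather than part of the dataset.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Entity:
    ann_id: str                    # annotation index, e.g. "T1"
    etype: str                     # entity type, e.g. "ADR" or "DRUG"
    spans: List[Tuple[int, int]]   # (start, end) character offsets
    text: str                      # the named entity as it appears in the script
    note: str = ""                 # optional AnnotatorNotes text


def parse_ann(content: str) -> List[Entity]:
    """Parse brat-style standoff lines (ASSUMED layout, not confirmed by the source).

    Assumed entity line: "T1<TAB>ADR 9 19<TAB>bit drowsy"
    Assumed note line:   "#1<TAB>AnnotatorNotes T1<TAB>Drowsy"
    """
    entities: Dict[str, Entity] = {}
    notes: Dict[str, str] = {}
    for line in content.splitlines():
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        ann_id, meta, text = parts[0], parts[1], parts[2]
        if ann_id.startswith("T"):
            etype, _, offsets = meta.partition(" ")
            # Discontinuous entities are assumed to be separated by ";".
            spans = []
            for chunk in offsets.split(";"):
                start, end = chunk.split()
                spans.append((int(start), int(end)))
            entities[ann_id] = Entity(ann_id, etype, spans, text)
        elif ann_id.startswith("#"):
            # meta is assumed to look like "AnnotatorNotes T1"
            _, _, target = meta.partition(" ")
            notes[target] = text
    # Attach each note to the entity it refers to, if that entity exists.
    for target, note in notes.items():
        if target in entities:
            entities[target].note = note
    return list(entities.values())


# Hypothetical sample content, made up for illustration only.
sample = (
    "T1\tADR 9 19\tbit drowsy\n"
    "#1\tAnnotatorNotes T1\tDrowsy\n"
    "T2\tDRUG 30 37\tLipitor\n"
)
entities = parse_ann(sample)
for ent in entities:
    print(ent.ann_id, ent.etype, ent.spans, ent.text, ent.note)
```

If the actual files use a different delimiter or field order, only the line-splitting logic in `parse_ann` would need to change.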
DETAILS:

Data_1: CADEC (1,250 text files, 1,000 audio files, 5 entity types)

    General Categories and Counts
        ADR (Adverse Drug Reactions): 5316
        DRUG: 1797
        FINDING: 397
        DISEASE: 280
        SYMPTOM: 255

    Specific Items (Drugs) and Counts
        Arthrotec: 145
        cambia: 4
        cataflam: 10
        diclofenac-potassium: 3
        diclofenac-sodium: 7
        flector: 1
        Lipitor: 997
        Pennsaid: 4
        solarez: 3
        voltaren: 46
        voltaren-rx: 22
        zipsor: 5

Data_2: Synthetic BTACT (500 fruits, 500 animals, 2 entity types)

>> Audio files can be matched with annotations, scripts, and transcripts using their filenames.

audio [original & noisy]:
    1. cadec
        1.1 cadec original
        1.2 cadec noisy
    2. synthetic btact
        2.1 btact original
            2.1.1 fruits
                fruit-script[0:500].mp3
            2.1.2 animals
                script-[0:500].mp3
        2.2 btact noisy
            2.2.1 fruits
                fruit-script[0:500].mp3
            2.2.2 animals
                script-[0:500].mp3

text [scripts, annotations, transcripts]:
    1. cadec
        1.1 scripts [1,250]
        1.2 annotations [1,250] (index/AnnotatorNotes, type, indices, named entities)
        1.3 transcripts [1,000]
    2. synthetic btact
        2.1 animals
            2.1.1 scripts (original scripts)
                script-[0:500].txt
            2.1.2 annotations
                script-[0:500].ann (index, type, start/end indices, named entity)
            2.1.3 transcripts
                script-[0:500].txt
        2.2 fruits
            2.2.1 scripts (original scripts)
                script-[0:500].txt
            2.2.2 annotations
                script-[0:500].ann (index, type, start/end indices, named entity)
            2.2.3 transcripts
                fruit-script-[0:500].txt

CITATION: Ebadi, N., Morgan, K., Tan, A., Linares, B., Osborn, S., Majors, E., Davis, J., & Rios, A. (2024). Extracting biomedical entities from noisy audio transcripts. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024).
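Since the description says audio files can be matched with annotations, scripts, and transcripts by filename, one way to pair them is to group paths by their filename stem. This is only a sketch: the directory names in `paths` are hypothetical placeholders (the archive's real folder names may differ), and `group_by_stem` is an illustrative helper, not part of the dataset.

```python
from collections import defaultdict
from pathlib import PurePosixPath
from typing import Dict, Iterable


def group_by_stem(paths: Iterable[str]) -> Dict[str, Dict[str, str]]:
    """Group file paths that share a filename stem (name without extension).

    The result maps each stem to {parent directory: full path}, so the
    script, annotation, transcript, and audio for one item end up together.
    """
    groups: Dict[str, Dict[str, str]] = defaultdict(dict)
    for p in paths:
        path = PurePosixPath(p)
        groups[path.stem][str(path.parent)] = p
    return dict(groups)


# Hypothetical paths for the animals subset; folder names are assumptions.
paths = [
    "text/animals/scripts/script-12.txt",
    "text/animals/annotations/script-12.ann",
    "text/animals/transcripts/script-12.txt",
    "audio/noisy/animals/script-12.mp3",
    "audio/noisy/animals/script-13.mp3",
]
matched = group_by_stem(paths)
for stem, files in matched.items():
    print(stem, sorted(files.values()))
```

Two caveats follow from the listing itself. The fruits subset appears to mix `script-` and `fruit-script` prefixes, so stems there may need normalizing before grouping. The animals and fruits subsets also seem to reuse `script-` names, so grouping is safest within one subset at a time.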
Created: 2024-03-25