bastiendechamps/px-corpus

Name: bastiendechamps/px-corpus
Creator: bastiendechamps
Published: 2024-04-03 13:14:55
License: CC BY 4.0

Hugging Face · Updated 2024-04-03 · Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/bastiendechamps/px-corpus

Resource description:

---
language:
- fr
license: cc-by-4.0
size_categories:
- 1K<n<10K
task_categories:
- automatic-speech-recognition
pretty_name: PxCorpus
dataset_info:
  features:
  - name: audio
    dtype: audio
  - name: file_name
    dtype: string
  - name: transcription
    dtype: string
  - name: audio_name
    dtype: string
  - name: ner
    dtype: string
  - name: speaker_id
    dtype: int64
  - name: speaker_age_range
    dtype: string
  - name: speaker_gender
    dtype: string
  - name: speaker_category
    dtype: string
  - name: drug
    sequence: string
  - name: d_dos_val
    sequence: string
  - name: d_dos_up
    sequence: string
  - name: dur_val
    sequence: string
  - name: dur_ut
    sequence: string
  - name: dos_val
    sequence: string
  - name: dos_uf
    sequence: string
  - name: rhythm_tdte
    sequence: string
  - name: rhythm_perday
    sequence: string
  - name: inn
    sequence: string
  - name: d_dos_form
    sequence: string
  - name: freq_ut
    sequence: string
  - name: rhythm_hour
    sequence: string
  - name: dos_cond
    sequence: string
  - name: qsp_val
    sequence: string
  - name: qsp_ut
    sequence: string
  - name: cma_event
    sequence: string
  - name: roa
    sequence: string
  - name: A
    sequence: string
  - name: max_unit_val
    sequence: string
  - name: max_unit_ut
    sequence: string
  - name: max_unit_uf
    sequence: string
  - name: d_dos_form_ext
    sequence: string
  - name: rhythm_rec_ut
    sequence: string
  - name: fasting
    sequence: string
  - name: freq_int_v1
    sequence: string
  - name: freq_int_v1_ut
    sequence: string
  - name: re_val
    sequence: string
  - name: re_ut
    sequence: string
  - name: freq_val
    sequence: string
  - name: freq_int_v2
    sequence: string
  - name: rhythm_rec_val
    sequence: string
  - name: min_gap_ut
    sequence: string
  - name: freq_startday
    sequence: string
  - name: freq_int_v2_ut
    sequence: string
  - name: min_gap_val
    sequence: string
  - name: freq_days
    sequence: string
  - name: medical_terms
    sequence: string
  splits:
  - name: train
    num_bytes: 252725374.904
    num_examples: 1127
  - name: test
    num_bytes: 175599765.0
    num_examples: 570
  - name: dev
    num_bytes: 61546023.0
    num_examples: 283
  download_size: 465682214
  dataset_size: 489871162.90400004
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
  - split: dev
    path: data/dev-*
tags:
- medical
---

# PxCorpus: A Spoken Drug Prescription Dataset in French

PxCorpus is, to the best of our knowledge, the first spoken medical drug prescription corpus to be distributed. It contains 4 hours of transcribed and annotated dialogues of drug prescriptions in French, acquired through an experiment with 55 participants, both experts and non-experts in drug prescriptions. The automatic transcriptions were verified by human effort and aligned with semantic labels to allow training of NLP models. The data acquisition protocol was reviewed by medical experts and permits free distribution without breach of privacy and regulation.

## Overview of the Corpus

The experiment was performed in the wild with naive participants and medical experts. In total, the dataset includes 2067 recordings of 55 participants (38% non-experts, 25% doctors, 36% medical practitioners), manually transcribed and semantically annotated.

| Category        | Sessions | Recordings | Time (min) |
| --------------- | -------- | ---------- | ---------- |
| Medical experts | 258      | 434        | 94.83      |
| Doctors         | 230      | 570        | 105.21     |
| Non experts     | 415      | 977        | 62.13      |
| Total           | 903      | 1981       | 262.27     |

## License

We hope that the community will be able to benefit from the dataset, which is distributed under a Creative Commons Attribution 4.0 International (CC BY 4.0) licence.

## How to cite this corpus

If you use the corpus or need more details, please refer to the following paper: A spoken drug prescription dataset in French for spoken Language Understanding

    @InProceedings{Kocabiyikoglu2022,
      author = "Alican Kocabiyikoglu and Fran{\c c}ois Portet and Prudence Gibert and Hervé Blanchon and Jean-Marc Babouchkine and Gaëtan Gavazzi",
      title = "A spoken drug prescription dataset in French for spoken Language Understanding",
      booktitle = "13th Language Resources and Evaluation Conference (LREC 2022)",
      year = "2022",
      location = "Marseille, France"
    }

## Dataset features

* `path` -- Audio name
* `text` -- Audio utterance
* `ner` -- Semantic annotation from the original dataset
* `speaker_id` -- Speaker ID
* `speaker_age_range` -- Speaker age range
* `speaker_gender` -- Speaker gender
* `speaker_category` -- Speaker category (doctor, expert, non-expert)
* The other columns hold the occurrences of each NER tag, which could be useful for computing some metrics
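
As a starting point for working with the card above, here is a minimal sketch of loading one split with the Hugging Face `datasets` library. The repository ID comes from the download link and the split names from the card metadata. The `load_px_corpus` helper is hypothetical, and decoding the `audio` column may require extra audio dependencies that are not shown here.

```python
REPO_ID = "bastiendechamps/px-corpus"  # taken from the dataset URL
SPLITS = ("train", "test", "dev")      # split names listed in the card metadata


def load_px_corpus(split="train"):
    """Hypothetical helper: load one PxCorpus split from the Hugging Face Hub."""
    if split not in SPLITS:
        raise ValueError(f"unknown split {split!r}; expected one of {SPLITS}")
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, split=split)
```

Calling `load_px_corpus("dev")` downloads the data on first use. Note that the card names its validation split `dev`, so a request for `"validation"` is rejected by the check above.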

Provided by:

bastiendechamps

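Summary of original information

Dataset overview

Name: PxCorpus

Language: French (fr)

License: Creative Commons Attribution 4.0 International (CC BY 4.0)

Categories:

Size: 1K<n<10K
Task: automatic speech recognition

Description: PxCorpus is a speech dataset containing 4 hours of spoken drug prescription dialogues in French, recorded with 55 participants, both experts and non-experts. The recordings are manually transcribed and semantically annotated, making them suitable for training NLP models.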

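Dataset features

audio: audio data
file_name: file name, string
transcription: transcribed text, string
audio_name: audio name, string
ner: semantic annotation, string
speaker_id: speaker ID, integer
speaker_age_range: speaker age range, string
speaker_gender: speaker gender, string
speaker_category: speaker category (doctor, expert, non-expert), string
drug: drug information, sequence of strings
d_dos_val: drug dose value, sequence of strings
d_dos_up: drug dose upper limit, sequence of strings
dur_val: drug duration value, sequence of strings
dur_ut: drug duration unit, sequence of strings
dos_val: drug dose value, sequence of strings
dos_uf: drug dose unit, sequence of strings
rhythm_tdte: drug intake rhythm, sequence of strings
rhythm_perday: daily drug intake rhythm, sequence of strings
inn: international nonproprietary name, sequence of strings
d_dos_form: drug dosage form, sequence of strings
freq_ut: drug frequency unit, sequence of strings
rhythm_hour: drug intake hour, sequence of strings
dos_cond: drug dose condition, sequence of strings
qsp_val: drug quality specification value, sequence of strings
qsp_ut: drug quality specification unit, sequence of strings
cma_event: drug event, sequence of strings
roa: route of administration, sequence of strings
A: drug category A, sequence of strings
max_unit_val: maximum unit value, sequence of strings
max_unit_ut: maximum unit unit, sequence of strings
max_unit_uf: maximum unit format, sequence of strings
d_dos_form_ext: drug dosage form extension, sequence of strings
rhythm_rec_ut: recommended intake rhythm unit, sequence of strings
fasting: fasting status, sequence of strings
freq_int_v1: drug frequency interval (version 1), sequence of strings
freq_int_v1_ut: drug frequency interval (version 1) unit, sequence of strings
re_val: drug re-evaluation value, sequence of strings
re_ut: drug re-evaluation unit, sequence of strings
freq_val: drug frequency value, sequence of strings
freq_int_v2: drug frequency interval (version 2), sequence of strings
rhythm_rec_val: recommended intake rhythm value, sequence of strings
min_gap_ut: minimum gap unit, sequence of strings
freq_startday: drug start-day frequency, sequence of strings
freq_int_v2_ut: drug frequency interval (version 2) unit, sequence of strings
min_gap_val: minimum gap value, sequence of strings
freq_days: drug frequency days, sequence of strings
medical_terms: medical terms, sequence of strings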
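
The dataset card notes that the per-tag columns could be useful for computing metrics. One way that might look is a micro-averaged slot F1 over (tag, value) pairs, sketched below. It assumes each per-tag column holds a list of strings for the utterance and that predictions come in the same shape. The `slot_counts` and `slot_f1` helpers and the example rows are hypothetical, not part of the dataset.

```python
from collections import Counter


def slot_counts(row, tag_columns):
    """Count (tag, value) pairs across the per-tag sequence columns of one row."""
    counts = Counter()
    for tag in tag_columns:
        for value in row.get(tag) or []:
            counts[(tag, value)] += 1
    return counts


def slot_f1(reference_rows, predicted_rows, tag_columns):
    """Micro-averaged precision, recall and F1 over (tag, value) pairs."""
    true_pos = n_ref = n_pred = 0
    for ref, pred in zip(reference_rows, predicted_rows):
        ref_counts = slot_counts(ref, tag_columns)
        pred_counts = slot_counts(pred, tag_columns)
        # Multiset intersection keeps the minimum count of each shared pair.
        true_pos += sum((ref_counts & pred_counts).values())
        n_ref += sum(ref_counts.values())
        n_pred += sum(pred_counts.values())
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_ref if n_ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical rows for illustration only; not taken from the corpus.
TAGS = ["drug", "d_dos_val", "dur_val", "dur_ut"]
reference = [{"drug": ["doliprane"], "d_dos_val": ["500"],
              "dur_val": ["3"], "dur_ut": ["jours"]}]
predicted = [{"drug": ["doliprane"], "d_dos_val": ["1000"],
              "dur_val": ["3"], "dur_ut": []}]

precision, recall, f1 = slot_f1(reference, predicted, TAGS)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Exact string matching is a deliberately strict choice here; a real evaluation would likely normalise values (case, number formatting) before comparing.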

Dataset splits

Split	Bytes	Examples
train	252725374.904	1127
test	175599765.0	570
dev	61546023.0	283
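
To put the split table in proportion, the short sketch below derives each split's share of the examples and its average size per example. The numbers are copied from the table above; the `SPLIT_INFO` name is just an illustrative choice.

```python
# Figures copied from the split table above.
SPLIT_INFO = {
    "train": {"num_bytes": 252725374.904, "num_examples": 1127},
    "test": {"num_bytes": 175599765.0, "num_examples": 570},
    "dev": {"num_bytes": 61546023.0, "num_examples": 283},
}

total_examples = sum(s["num_examples"] for s in SPLIT_INFO.values())
total_bytes = sum(s["num_bytes"] for s in SPLIT_INFO.values())

for name, info in SPLIT_INFO.items():
    share = info["num_examples"] / total_examples
    kb_per_example = info["num_bytes"] / info["num_examples"] / 1000
    print(f"{name:5s} {share:6.1%} of examples, ~{kb_per_example:.0f} kB per example")

print(f"total: {total_examples} examples, {total_bytes / 1e6:.1f} MB")
```

The three splits add up to 1980 examples, and their byte counts sum to the listed dataset size.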

Download size: 465682214 bytes

Dataset size: 489871162.90400004 bytes
