bastiendechamps/px-corpus
Source: Hugging Face. Updated: 2024-04-03. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/bastiendechamps/px-corpus
Resource description:
---
language:
- fr
license: cc-by-4.0
size_categories:
- 1K<n<10K
task_categories:
- automatic-speech-recognition
pretty_name: PxCorpus
dataset_info:
  features:
  - name: audio
    dtype: audio
  - name: file_name
    dtype: string
  - name: transcription
    dtype: string
  - name: audio_name
    dtype: string
  - name: ner
    dtype: string
  - name: speaker_id
    dtype: int64
  - name: speaker_age_range
    dtype: string
  - name: speaker_gender
    dtype: string
  - name: speaker_category
    dtype: string
  - name: drug
    sequence: string
  - name: d_dos_val
    sequence: string
  - name: d_dos_up
    sequence: string
  - name: dur_val
    sequence: string
  - name: dur_ut
    sequence: string
  - name: dos_val
    sequence: string
  - name: dos_uf
    sequence: string
  - name: rhythm_tdte
    sequence: string
  - name: rhythm_perday
    sequence: string
  - name: inn
    sequence: string
  - name: d_dos_form
    sequence: string
  - name: freq_ut
    sequence: string
  - name: rhythm_hour
    sequence: string
  - name: dos_cond
    sequence: string
  - name: qsp_val
    sequence: string
  - name: qsp_ut
    sequence: string
  - name: cma_event
    sequence: string
  - name: roa
    sequence: string
  - name: A
    sequence: string
  - name: max_unit_val
    sequence: string
  - name: max_unit_ut
    sequence: string
  - name: max_unit_uf
    sequence: string
  - name: d_dos_form_ext
    sequence: string
  - name: rhythm_rec_ut
    sequence: string
  - name: fasting
    sequence: string
  - name: freq_int_v1
    sequence: string
  - name: freq_int_v1_ut
    sequence: string
  - name: re_val
    sequence: string
  - name: re_ut
    sequence: string
  - name: freq_val
    sequence: string
  - name: freq_int_v2
    sequence: string
  - name: rhythm_rec_val
    sequence: string
  - name: min_gap_ut
    sequence: string
  - name: freq_startday
    sequence: string
  - name: freq_int_v2_ut
    sequence: string
  - name: min_gap_val
    sequence: string
  - name: freq_days
    sequence: string
  - name: medical_terms
    sequence: string
  splits:
  - name: train
    num_bytes: 252725374.904
    num_examples: 1127
  - name: test
    num_bytes: 175599765.0
    num_examples: 570
  - name: dev
    num_bytes: 61546023.0
    num_examples: 283
  download_size: 465682214
  dataset_size: 489871162.90400004
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
  - split: dev
    path: data/dev-*
tags:
- medical
---
# PxCorpus: A Spoken Drug Prescription Dataset in French
PxCorpus is, to the best of our knowledge, the first corpus of spoken medical drug prescriptions to be distributed.
It contains 4 hours of transcribed and annotated dialogues of drug prescriptions in
French, acquired through an experiment with 55 participants, both experts and non-experts in drug prescription.
The automatic transcriptions were verified by humans and aligned with
semantic labels to allow training of NLP models. The data acquisition protocol
was reviewed by medical experts and permits free distribution without breach of
privacy or regulation.
## Overview of the Corpus
The experiment was performed in in-the-wild conditions with naive participants and medical experts.
In total, the dataset includes 2067 recordings of 55 participants (38% non-experts,
25% doctors, 36% medical practitioners), manually transcribed and semantically annotated.
| Category        | Sessions | Recordings | Time (min) |
|-----------------|----------|------------|------------|
| Medical experts | 258      | 434        | 94.83      |
| Doctors         | 230      | 570        | 105.21     |
| Non experts     | 415      | 977        | 62.13      |
| Total           | 903      | 1981       | 262.27     |
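The Recordings and Time columns make it possible to derive the average length of a recording per speaker category. The following is a minimal sketch that hard-codes the three category rows from the table above; the names `CATEGORIES` and `mean_recording_seconds` are illustrative and not part of the dataset.

```python
# Rows copied from the overview table: (sessions, recordings, minutes).
CATEGORIES = {
    "Medical experts": (258, 434, 94.83),
    "Doctors": (230, 570, 105.21),
    "Non experts": (415, 977, 62.13),
}


def mean_recording_seconds(recordings: int, minutes: float) -> float:
    """Average duration of one recording, in seconds."""
    return minutes * 60.0 / recordings


# Average recording length per category.
mean_seconds = {
    name: mean_recording_seconds(recordings, minutes)
    for name, (_sessions, recordings, minutes) in CATEGORIES.items()
}

# Column totals recomputed from the three category rows.
total_sessions = sum(s for s, _, _ in CATEGORIES.values())
total_recordings = sum(r for _, r, _ in CATEGORIES.values())

for name, seconds in mean_seconds.items():
    print(f"{name}: {seconds:.1f} s per recording")
```

Under these figures, recordings by non-experts average roughly 4 seconds, against roughly 11 to 13 seconds for doctors and medical experts.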
## License
We hope that the community will be able to benefit from the dataset,
which is distributed under a Creative Commons Attribution 4.0 International (CC BY 4.0) licence.
## How to cite this corpus
If you use the corpus or need more details, please refer to the following paper: A spoken drug prescription dataset in French for spoken Language Understanding
@InProceedings{Kocabiyikoglu2022,
  author = "Alican Kocabiyikoglu and Fran{\c c}ois Portet and Prudence Gibert and Hervé Blanchon and Jean-Marc Babouchkine and Gaëtan Gavazzi",
  title = "A spoken drug prescription dataset in French for spoken Language Understanding",
  booktitle = "13th Language Resources and Evaluation Conference (LREC 2022)",
  year = "2022",
  location = "Marseille, France"
}
## Dataset features
* `audio_name` -- Audio name
* `transcription` -- Audio utterance
* `ner` -- Semantic annotation from the original dataset
* `speaker_id` -- Speaker ID
* `speaker_age_range` -- Speaker age range
* `speaker_gender` -- Speaker gender
* `speaker_category` -- Speaker category (doctor, expert, non-expert)
* The remaining columns hold the occurrences of each NER tag, which can be useful for computing metrics
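The per-tag columns can be used, for example, to count how often each NER tag occurs. The sketch below is a hedged illustration, not the authors' evaluation code. The repository id and split names come from the metadata above and `load_dataset` is the standard Hugging Face `datasets` entry point, but the helper names (`load_px_corpus`, `count_tag_occurrences`), the chosen subset of tag columns, and the sample rows are assumptions made for illustration. The sample rows are made up and only mimic the schema.

```python
from collections import Counter

# Subset of the per-tag sequence columns listed in the dataset metadata.
TAG_COLUMNS = ["drug", "inn", "dos_val", "dos_uf", "dur_val", "dur_ut", "roa"]


def load_px_corpus(split: str = "train"):
    """Load one split from the Hugging Face Hub.

    Needs the `datasets` package and network access; imported lazily so the
    counting logic below can run offline.
    """
    from datasets import load_dataset

    return load_dataset("bastiendechamps/px-corpus", split=split)


def count_tag_occurrences(rows, tag_columns=TAG_COLUMNS) -> Counter:
    """Count annotated values per tag column over an iterable of row dicts."""
    counts = Counter()
    for row in rows:
        for column in tag_columns:
            # Assumption: a tag column is a list of strings, or missing/None.
            counts[column] += len(row.get(column) or [])
    return counts


# Made-up rows that only mimic the schema; they are NOT taken from the corpus.
example_rows = [
    {"drug": ["doliprane"], "dos_val": ["1"], "dos_uf": ["comprimé"],
     "dur_val": ["3"], "dur_ut": ["jours"]},
    {"inn": ["paracétamol"], "dos_val": ["2", "1"], "roa": []},
]

tag_counts = count_tag_occurrences(example_rows)
print(tag_counts.most_common())
```

With a real split, `count_tag_occurrences(load_px_corpus("dev"))` would give corpus-level counts, assuming each tag column holds a list of strings (possibly empty) for every row.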
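Provider:
bastiendechamps
Summary of original information
Dataset overview
Name: PxCorpus
Language: French (fr)
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Categories:
- Size: 1K<n<10K
- Task: automatic speech recognition
Description: PxCorpus is a speech dataset containing 4 hours of French drug prescription dialogues, recorded with 55 expert and non-expert participants. It consists of manually transcribed and semantically annotated recordings and is suitable for training NLP models.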
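Dataset features
- audio: audio data
- file_name: file name (string)
- transcription: transcribed text (string)
- audio_name: audio name (string)
- ner: semantic annotation (string)
- speaker_id: speaker ID (integer)
- speaker_age_range: speaker age range (string)
- speaker_gender: speaker gender (string)
- speaker_category: speaker category (doctor, expert, non-expert) (string)
- drug: drug information (sequence of strings)
- d_dos_val: drug dose value (sequence of strings)
- d_dos_up: drug dose upper limit (sequence of strings)
- dur_val: drug duration value (sequence of strings)
- dur_ut: drug duration unit (sequence of strings)
- dos_val: drug dose value (sequence of strings)
- dos_uf: drug dose unit (sequence of strings)
- rhythm_tdte: drug intake rhythm (sequence of strings)
- rhythm_perday: daily drug intake rhythm (sequence of strings)
- inn: international nonproprietary name (sequence of strings)
- d_dos_form: drug dosage form (sequence of strings)
- freq_ut: drug frequency unit (sequence of strings)
- rhythm_hour: drug intake hour (sequence of strings)
- dos_cond: drug dose condition (sequence of strings)
- qsp_val: drug quality specification value (sequence of strings)
- qsp_ut: drug quality specification unit (sequence of strings)
- cma_event: drug event (sequence of strings)
- roa: route of administration (sequence of strings)
- A: drug category A (sequence of strings)
- max_unit_val: maximum unit value (sequence of strings)
- max_unit_ut: maximum unit, unit (sequence of strings)
- max_unit_uf: maximum unit format (sequence of strings)
- d_dos_form_ext: extended drug dosage form (sequence of strings)
- rhythm_rec_ut: recommended intake rhythm unit (sequence of strings)
- fasting: fasting status (sequence of strings)
- freq_int_v1: drug frequency interval, version 1 (sequence of strings)
- freq_int_v1_ut: drug frequency interval, version 1, unit (sequence of strings)
- re_val: drug re-evaluation value (sequence of strings)
- re_ut: drug re-evaluation unit (sequence of strings)
- freq_val: drug frequency value (sequence of strings)
- freq_int_v2: drug frequency interval, version 2 (sequence of strings)
- rhythm_rec_val: recommended intake rhythm value (sequence of strings)
- min_gap_ut: minimum gap unit (sequence of strings)
- freq_startday: drug start-day frequency (sequence of strings)
- freq_int_v2_ut: drug frequency interval, version 2, unit (sequence of strings)
- min_gap_val: minimum gap value (sequence of strings)
- freq_days: drug frequency days (sequence of strings)
- medical_terms: medical terms (sequence of strings)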
Dataset splits
| Split | Bytes | Examples |
|---|---|---|
| train | 252725374.904 | 1127 |
| test | 175599765.0 | 570 |
| dev | 61546023.0 | 283 |
Download size: 465682214 bytes
Dataset size: 489871162.90400004 bytes
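To see how the examples are distributed across the three splits, the counts above can be turned into fractions. This is a small sketch that hard-codes the example counts from the split table; the variable names are illustrative.

```python
# Example counts per split, copied from the dataset metadata.
SPLIT_EXAMPLES = {"train": 1127, "test": 570, "dev": 283}

total_examples = sum(SPLIT_EXAMPLES.values())
split_fractions = {
    name: count / total_examples for name, count in SPLIT_EXAMPLES.items()
}

for name, fraction in split_fractions.items():
    print(f"{name}: {SPLIT_EXAMPLES[name]} examples ({fraction:.1%})")
```

The three splits add up to 1980 examples, with a bit more than half of them in the training split.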



