medical-transcription-instruct
ModelScope community · Updated 2025-12-04 · Indexed 2024-08-31
Download link:
https://modelscope.cn/datasets/AI-ModelScope/medical-transcription-instruct
<a href="https://www.datafog.ai">
<img src="https://www.datafog.ai/colorlogo.png" alt="DataFog logo" width="300">
</a>
## About
This dataset consists of 38,924 instruction-input-output samples, intended primarily for training instruction-following models tailored to the medical field.
### Dataset Summary
- **Source**: Original medical transcriptions with added instruction-output pairs
- **Size**: 38,924 instruction-output pairs
- **Format**: CSV file
- **Domain**: Medical / Healthcare
- **Language**: English
- **Last Updated**: 08-20-2024
## Dataset Structure
Each row in the dataset represents a unique instruction-output pair based on a medical transcription. The columns are organized as follows:
1. `instruction`: The task or question to be performed on the medical text
2. `task_output`: The expected output or answer for the given instruction
3. `transcription`: The original medical transcription text
4. `description`: A brief description or summary of the transcription
5. `medical_specialty`: The medical specialty associated with the transcription
6. `sample_name`: A name or identifier for the transcription sample
7. `keywords`: Original keywords associated with the transcription (if available)
8. `derived_keywords`: Automatically extracted keywords using TF-IDF
9. `transcription_length`: The character count of the transcription
10. `normalized_length`: The transcription length normalized to a 0-1 scale
11. `complexity_score`: A measure of the transcription's textual complexity
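As a minimal sketch of how the columns above might be checked after download, the snippet below validates the header against the documented schema and rescales `transcription_length` to a 0-1 range. Two assumptions are worth stating: the card does not say how `normalized_length` was computed, so min-max scaling is only one plausible reading, and the rows here are an invented in-memory sample rather than the real CSV file.

```python
import csv
import io

# Column names as documented in this card.
EXPECTED_COLUMNS = [
    "instruction", "task_output", "transcription", "description",
    "medical_specialty", "sample_name", "keywords", "derived_keywords",
    "transcription_length", "normalized_length", "complexity_score",
]


def read_rows(handle):
    """Parse CSV text and fail early if a documented column is missing."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [name for name in EXPECTED_COLUMNS if name not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


def min_max(values):
    """Scale numbers to [0, 1]; a constant column maps to 0.0.

    Assumption: the dataset's own normalization may differ from this.
    """
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]


def make_sample():
    """Build a tiny invented CSV in memory (hypothetical stand-in data)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=EXPECTED_COLUMNS)
    writer.writeheader()
    texts = [
        "short note",
        "a somewhat longer note",
        "the longest note in this toy sample",
    ]
    for text in texts:
        row = dict.fromkeys(EXPECTED_COLUMNS, "")
        row["transcription"] = text
        row["transcription_length"] = len(text)
        writer.writerow(row)
    buffer.seek(0)
    return buffer


rows = read_rows(make_sample())
lengths = [int(row["transcription_length"]) for row in rows]
scaled = min_max(lengths)
```

To run the same check on the downloaded file, pass an open file handle to `read_rows` instead of the in-memory sample.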
## Task Types
The dataset includes various instruction types, such as:
1. Identifying medical specialties
2. Summarizing transcriptions
3. Extracting keywords
4. Assessing text complexity
5. Determining relative transcription length
6. Suggesting follow-up questions
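The keyword-extraction task, like the `derived_keywords` column, rests on TF-IDF scoring. The sketch below shows the idea in plain Python under stated assumptions: the tokenizer, the unsmoothed IDF formula, and the example sentences are all illustrative choices, not the pipeline used to build this dataset.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphabetic tokens (a simplifying assumption)."""
    return re.findall(r"[a-z]+", text.lower())


def top_keywords(documents, k=3):
    """Return the k highest-scoring TF-IDF terms for each document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    results = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        scores = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        results.append(ranked[:k])
    return results


# Invented example sentences, not taken from the dataset.
docs = [
    "patient reports chest pain and chest tightness",
    "patient reports knee pain after a fall",
    "patient reports blurred vision",
]
keywords = top_keywords(docs, k=1)
```

Terms that occur in every document, such as "patient" here, get an IDF of zero and so never outrank a term specific to one transcription.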
## Intended Uses
This dataset is suitable for:
- Fine-tuning language models for medical text analysis
- Developing instruction-following models in the healthcare domain
- Research in medical natural language processing
- Exploring various aspects of medical transcriptions
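For the fine-tuning use cases, each row can be turned into a prompt/completion record. The following is a rough sketch: the prompt template, the `prompt` and `completion` field names, and the example row are assumptions for illustration, and the format your training framework expects may differ.

```python
import json

# Hypothetical prompt layout; adjust to your training framework.
PROMPT_TEMPLATE = (
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{transcription}\n\n"
    "### Response:\n"
)


def to_training_record(row):
    """Map one dataset row to a prompt/completion pair."""
    prompt = PROMPT_TEMPLATE.format(
        instruction=row["instruction"].strip(),
        transcription=row["transcription"].strip(),
    )
    return {"prompt": prompt, "completion": row["task_output"].strip()}


def to_jsonl(rows):
    """Serialize rows as JSON Lines, one record per line."""
    return "\n".join(
        json.dumps(to_training_record(row), ensure_ascii=False) for row in rows
    )


# Invented row using the documented column names.
example_row = {
    "instruction": "Identify the medical specialty for this transcription.",
    "transcription": "Patient presents with chest pain radiating to the left arm.",
    "task_output": "Cardiology",
}
record = to_training_record(example_row)
jsonl_text = to_jsonl([example_row])
```

Keeping the transcription in a separate input slot mirrors the instruction-input-output structure of the dataset, so the same instruction text can be paired with different transcriptions.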
## Ethical Considerations
- This dataset contains medical information. While it has been de-identified, users should be cautious about potential privacy concerns.
- The data should not be used for making real-world medical decisions without proper validation and expert oversight.
- Biases may exist in the original transcriptions or derived tasks. Users should be aware of potential biases in model outputs.
## Citation
If you use this dataset in your research, please cite it as follows:
```
[Author(s)], [Year of Publication]. [Dataset Title]. DataFog. Available at: [URL]. Accessed on [Date of Access].
```
## Contact
For questions or feedback about this dataset, please contact the author Sid Mohan: sid at datafog.ai.
## Acknowledgement
This dataset was created using data from [Medical Transcriptions](https://www.kaggle.com/datasets/tboyle10/medicaltranscriptions) available on Kaggle under a CC0 license. The original dataset has been modified and transformed by DataFog.
---
Dataset created by DataFog (Sid Mohan) and hosted on Hugging Face.
Provider: maas
Created: 2024-08-25



