medical-transcription-instruct

Name: medical-transcription-instruct
Creator: maas
Published: 2025-12-04 16:16:48
License: No description available

ModelScope community. Updated: 2025-12-04. Indexed: 2024-08-31.

Download link:

https://modelscope.cn/datasets/AI-ModelScope/medical-transcription-instruct

Resource description:

<a href="https://www.datafog.ai"> <img src="https://www.datafog.ai/colorlogo.png" alt="DataFog logo" width="300"> </a>

## About

This dataset consists of 38,924 samples of instruction-input-output data, most useful for training instruction-following models tailored to the medical field.

### Dataset Summary

- **Source**: Original medical transcriptions with added instruction-output pairs
- **Size**: 38,924 instruction-output pairs
- **Format**: CSV file
- **Domain**: Medical / Healthcare
- **Language**: English
- **Last Updated**: 08-20-2024

## Dataset Structure

Each row in the dataset represents a unique instruction-output pair based on a medical transcription. The columns are organized as follows:

1. `instruction`: The task or question to be performed on the medical text
2. `task_output`: The expected output or answer for the given instruction
3. `transcription`: The original medical transcription text
4. `description`: A brief description or summary of the transcription
5. `medical_specialty`: The medical specialty associated with the transcription
6. `sample_name`: A name or identifier for the transcription sample
7. `keywords`: Original keywords associated with the transcription (if available)
8. `derived_keywords`: Automatically extracted keywords using TF-IDF
9. `transcription_length`: The character count of the transcription
10. `normalized_length`: The transcription length normalized to a 0-1 scale
11. `complexity_score`: A measure of the transcription's textual complexity

## Task Types

The dataset includes various instruction types, such as:

1. Identifying medical specialties
2. Summarizing transcriptions
3. Extracting keywords
4. Assessing text complexity
5. Determining relative transcription length
6. Suggesting follow-up questions

## Intended Uses

This dataset is suitable for:

- Fine-tuning language models for medical text analysis
- Developing instruction-following models in the healthcare domain
- Research in medical natural language processing
- Exploring various aspects of medical transcriptions

## Ethical Considerations

- This dataset contains medical information. While it has been de-identified, users should be cautious about potential privacy concerns.
- The data should not be used for making real-world medical decisions without proper validation and expert oversight.
- Biases may exist in the original transcriptions or derived tasks. Users should be aware of potential biases in model outputs.

## Citation

If you use this dataset in your research, please cite it as follows:

```
[Author(s)], [Year of Publication]. [Dataset Title]. DataFog. Available at: [URL]. Accessed on [Date of Access].
```

## Contact

For questions or feedback about this dataset, please contact the author Sid Mohan: sid at datafog.ai.

## Acknowledgement

This dataset was created using data from [Medical Transcriptions](https://www.kaggle.com/datasets/tboyle10/medicaltranscriptions) available on Kaggle under a CC0 license. The original dataset has been modified and transformed by DataFog.

---

Dataset created by DataFog (Sid Mohan) and hosted on Hugging Face.
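
The column layout described under Dataset Structure suggests one way rows might be turned into fine-tuning records. The following is a minimal sketch, not the author's pipeline: it uses only the Python standard library, the two sample rows are invented for illustration (they are not taken from the dataset), and the instruction/input/output record shape is an assumed convention. The card only says `normalized_length` is on a 0-1 scale, so the min-max formula shown here is likewise an assumption about how such a value could be derived.

```python
import csv
import io


def to_training_record(row):
    """Map one CSV row to an instruction/input/output record (assumed shape)."""
    return {
        "instruction": row["instruction"].strip(),
        "input": row["transcription"].strip(),
        "output": row["task_output"].strip(),
    }


def load_records(csv_file):
    """Read rows from an open CSV file object and convert each to a record."""
    reader = csv.DictReader(csv_file)
    return [to_training_record(row) for row in reader]


def min_max_normalize(lengths):
    """Scale values to the 0-1 range. Assumption: the card does not state
    which normalization was actually used for `normalized_length`."""
    lo, hi = min(lengths), max(lengths)
    if hi == lo:
        # All values identical: avoid division by zero.
        return [0.0 for _ in lengths]
    return [(n - lo) / (hi - lo) for n in lengths]


# Hypothetical two-row sample using a subset of the documented column names.
SAMPLE_CSV = """instruction,task_output,transcription,medical_specialty
Identify the medical specialty.,Cardiology,Patient reports chest pain on exertion.,Cardiology
Summarize the transcription.,Follow-up for knee pain.,Patient returns for follow-up of right knee pain after physical therapy.,Orthopedic
"""

records = load_records(io.StringIO(SAMPLE_CSV))
lengths = [len(r["input"]) for r in records]  # character counts
normalized = min_max_normalize(lengths)

for record, score in zip(records, normalized):
    print(record["instruction"], "->", record["output"], f"(length score {score:.2f})")
```

To run this against the real file, replace `io.StringIO(SAMPLE_CSV)` with an open handle to the downloaded CSV, for example `open(path, newline="", encoding="utf-8")`, where `path` is wherever the file was saved locally.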


Provider:

maas

Created:

2024-08-25

Collected summary

Dataset introduction