
medical-transcription-instruct

ModelScope community | Updated: 2025-12-04 | Indexed: 2024-08-31
Download link:
https://modelscope.cn/datasets/AI-ModelScope/medical-transcription-instruct
Resource description:
<a href="https://www.datafog.ai"> <img src="https://www.datafog.ai/colorlogo.png" alt="DataFog logo" width="300"> </a>

## About

This dataset consists of 38,924 instruction-input-output samples, intended primarily for training instruction-following models tailored to the medical field.

### Dataset Summary

- **Source**: Original medical transcriptions with added instruction-output pairs
- **Size**: 38,924 instruction-output pairs
- **Format**: CSV file
- **Domain**: Medical / Healthcare
- **Language**: English
- **Last Updated**: 08-20-2024

## Dataset Structure

Each row in the dataset represents a unique instruction-output pair based on a medical transcription. The columns are organized as follows:

1. `instruction`: The task to perform on, or the question to answer about, the medical text
2. `task_output`: The expected output or answer for the given instruction
3. `transcription`: The original medical transcription text
4. `description`: A brief description or summary of the transcription
5. `medical_specialty`: The medical specialty associated with the transcription
6. `sample_name`: A name or identifier for the transcription sample
7. `keywords`: Original keywords associated with the transcription (if available)
8. `derived_keywords`: Keywords extracted automatically using TF-IDF
9. `transcription_length`: The character count of the transcription
10. `normalized_length`: The transcription length normalized to a 0-1 scale
11. `complexity_score`: A measure of the transcription's textual complexity

## Task Types

The dataset includes various instruction types, such as:

1. Identifying medical specialties
2. Summarizing transcriptions
3. Extracting keywords
4. Assessing text complexity
5. Determining relative transcription length
6. Suggesting follow-up questions

## Intended Uses

This dataset is suitable for:

- Fine-tuning language models for medical text analysis
- Developing instruction-following models in the healthcare domain
- Research in medical natural language processing
- Exploring various aspects of medical transcriptions

## Ethical Considerations

- This dataset contains medical information. Although it has been de-identified, users should remain cautious about potential privacy concerns.
- The data should not be used for making real-world medical decisions without proper validation and expert oversight.
- Biases may exist in the original transcriptions or derived tasks. Users should be aware of potential biases in model outputs.

## Citation

If you use this dataset in your research, please cite it as follows:

```
[Author(s)], [Year of Publication]. [Dataset Title]. DataFog. Available at: [URL]. Accessed on [Date of Access].
```

## Contact

For questions or feedback about this dataset, please contact the author Sid Mohan: sid at datafog.ai.

## Acknowledgement

This dataset was created using data from [Medical Transcriptions](https://www.kaggle.com/datasets/tboyle10/medicaltranscriptions) available on Kaggle under a CC0 license. The original dataset has been modified and transformed by DataFog.

---

Dataset created by DataFog (Sid Mohan) and hosted on Hugging Face.
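## Example: Building Prompt Pairs

The column layout described under Dataset Structure can be turned into training pairs along the following lines. This is a minimal sketch, not the authors' pipeline: the prompt template, the helper names `to_prompt_pair` and `read_pairs`, and the single sample row are all assumptions made for illustration. Only the column names come from the dataset card.

```python
import csv
import io

# Column names as listed in the dataset card.
EXPECTED_COLUMNS = [
    "instruction", "task_output", "transcription", "description",
    "medical_specialty", "sample_name", "keywords", "derived_keywords",
    "transcription_length", "normalized_length", "complexity_score",
]


def to_prompt_pair(row):
    """Turn one CSV row into a (prompt, completion) pair.

    The template below is an assumption; the dataset card does not
    prescribe any particular prompt format.
    """
    prompt = (
        f"### Instruction:\n{row['instruction']}\n\n"
        f"### Input:\n{row['transcription']}\n\n"
        "### Response:\n"
    )
    return prompt, row["task_output"]


def read_pairs(csv_file):
    """Yield prompt pairs from an open CSV file, checking the header first."""
    reader = csv.DictReader(csv_file)
    missing = set(EXPECTED_COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    for row in reader:
        yield to_prompt_pair(row)


# Invented single-row sample so the sketch runs without the real file.
# None of these values are taken from the dataset.
_sample = io.StringIO()
_writer = csv.DictWriter(_sample, fieldnames=EXPECTED_COLUMNS)
_writer.writeheader()
_writer.writerow({
    "instruction": "Identify the medical specialty for this transcription.",
    "task_output": "Cardiovascular / Pulmonary",
    "transcription": "2-D echocardiogram shows normal left ventricular function.",
    "description": "Echocardiogram report.",
    "medical_specialty": "Cardiovascular / Pulmonary",
    "sample_name": "Example Sample",
    "keywords": "echocardiogram",
    "derived_keywords": "ventricular, echocardiogram",
    "transcription_length": "59",
    "normalized_length": "0.01",
    "complexity_score": "0.5",
})
_sample.seek(0)

pairs = list(read_pairs(_sample))
prompt, completion = pairs[0]
print(prompt + completion)
```

With the downloaded file, the in-memory sample would be replaced by something like `open(path, newline="", encoding="utf-8")`, where `path` is wherever the CSV was saved locally. The header check fails early if the file does not have the columns the card describes.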

Provider: maas
Created: 2024-08-25