
Pile-PubMed_Abstracts

ModelScope community, updated 2025-11-09, indexed 2024-08-31
Download link:
https://modelscope.cn/datasets/OmniData/Pile-PubMed_Abstracts
Resource description:

---
displayName: Pile-PubMed_Abstracts
license:
- MIT
taskTypes:
- Natural Language Generation
- Language Modelling
mediaTypes:
- Text
labelTypes:
- English Corpus
tags: []
publisher:
- EleutherAI
publishDate: '2023-07-18'
publishUrl: https://pile.eleuther.ai/
paperUrl: ''
---

# Data Introduction

## Overview

The Pile-PubMed Abstracts dataset is part of The Pile project, a dataset for training language models. It consists of medical abstracts extracted from the PubMed database.

PubMed is a biomedical literature database provided by the U.S. National Library of Medicine (NLM). It contains a large number of abstracts and citations from medicine and the life sciences. Pile-PubMed Abstracts draws on these abstracts to provide a rich medical text resource.

The dataset contains abstracts of studies and papers from different medical fields and covers a wide range of topics, such as diseases, treatments, drug development, and biomedical engineering. An abstract typically gives a brief description of the study's purpose, methods, results, and conclusions.

Pile-PubMed Abstracts is intended to give researchers and developers a rich medical text resource for developing and training medical applications such as natural language processing, information extraction, and knowledge graphs.

## Data Content

### Data Description

Pile-PubMed Abstracts contains 22.4 GB of data.

### Data Example

```
{
  "id": "262320334",
  "source_id": "",
  "doc_id": "191081622",
  "data_type": "text",
  "data_source": "pile",
  "data_url": "enwiki-c4-pile-ccnews",
  "content": "Assessing the validity of the geographic practice cost indexes.\nData on physician practice inputs were used to test the degree to which the geographic practice cost indexes (GPCIs) of the Medicare physician payment schedule reflect geographic variation in input prices. For purposes of this study, input quantity information was collected through the American Medical Association's Socioeconomic Monitoring System survey in 1990 and 1991. These data, along with practice expense information, were used to construct unit input prices. The GPCIs were correlated with input prices; however, \"real\" or GPCI-adjusted prices varied significantly across locations. We conclude that the GPCIs are useful, but imperfect measures of geographic differences in physician practice input prices.\n",
  "remark": {
    "pile_set_name": "PubMed Abstracts"
  },
  "sub_path": "pubmed-abstracts/train"
}
```

## Citation

```
@misc{conghui2022opendatalab,
  title={OpenDataLab: Empowering General Artificial Intelligence with Open Datasets},
  author={Conghui He, Wei Li, Zhenjiang Jin, Bin Wang, Chao Xu, Dahua Lin},
  journal={https://opendatalab.com/},
  year={2022}
}
```

## Download dataset

:modelscope-code[]{type="git"}
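Once the files are downloaded, records shaped like the data example above can be read with a few lines of Python. The following is a minimal sketch, not an official loader. It assumes the files are stored as JSON Lines (one record per line), which the dataset card does not state, and it assumes the first line of `content` is the article title, as it is in the example record. The function name `iter_abstracts` and the abridged sample record are illustrative.

```python
import json
from typing import Iterable, Iterator, Tuple


def iter_abstracts(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (doc_id, title, body) for each PubMed Abstracts record.

    Assumption: each non-empty line is one JSON object with the fields
    shown in the data example ("doc_id", "content", "remark").
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        remark = record.get("remark") or {}
        if remark.get("pile_set_name") != "PubMed Abstracts":
            continue  # ignore records from other Pile subsets
        # Assumption: the title is the text before the first newline.
        title, _, body = record["content"].partition("\n")
        yield record["doc_id"], title, body.strip()


# Abridged version of the example record, serialized as a single line.
SAMPLE_LINE = json.dumps({
    "id": "262320334",
    "doc_id": "191081622",
    "content": (
        "Assessing the validity of the geographic practice cost indexes.\n"
        "We conclude that the GPCIs are useful, but imperfect measures of "
        "geographic differences in physician practice input prices.\n"
    ),
    "remark": {"pile_set_name": "PubMed Abstracts"},
    "sub_path": "pubmed-abstracts/train",
})

for doc_id, title, body in iter_abstracts([SAMPLE_LINE]):
    print(doc_id, "|", title)
```

To process a real file, pass an open file handle instead of the list, for example `iter_abstracts(open(path, encoding="utf-8"))`. Reading lazily matters here because the dataset is about 22.4 GB.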

Provider:
maas
Created:
2024-07-14