
Caduceus-Dataset

ModelScope community · Updated 2025-11-12 · Indexed 2024-08-31
Download link:
https://modelscope.cn/datasets/AI-ModelScope/Caduceus-Dataset
Description:
# Caduceus Project Dataset

## Creator: [Kquant03](https://github.com/Kquant03)

![Caduceus Project](https://cdn-uploads.huggingface.co/production/uploads/6589d7e6586088fd2784a12c/NckjqdBE-gOPt8r0L_Apr.png)

### About the Dataset:

The Caduceus Project Dataset is a curated collection of scientific and medical protocols sourced from [protocols.io](https://github.com/protocolsio/protocols) and converted from PDF to Markdown. It is intended to help models learn to read complex PDFs, either by applying computer vision to the PDF file or by processing the raw text directly. The repository for the conversion pipeline is at [https://github.com/Kquant03/caduceus](https://github.com/Kquant03/caduceus).

### Source Data:

- Protocols from [protocols.io](https://github.com/protocolsio/protocols)

### Key Features:

- Carefully selected, high-quality protocols
- Base64 encodings of the PDF files for potential vision training
- Output quality ensured by hand-processing the converted data

### Dataset Structure:

- `pdf_files/`: Contains the original PDF files of the selected protocols.
- `markdown_files/`: Contains the individual Markdown files converted from the selected PDF files.
- `Caduceus_Data.jsonl`: A JSONL file whose records include an input field, a Base64 encoding of the PDF file, the raw text from the PDF, the formatted Markdown output, and the name of the corresponding file.

### License:

The Caduceus Project Dataset is released under the [Creative Commons Attribution 4.0 International (CC BY 4.0) License](https://creativecommons.org/licenses/by/4.0/).

### Acknowledgments:

We would like to express our gratitude to the contributors of [protocols.io](https://github.com/protocolsio/protocols) for providing the open-source repository of scientific and medical protocols that served as the foundation for this dataset.
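The card describes `Caduceus_Data.jsonl` as holding a Base64 encoding of each PDF alongside its text and Markdown. A minimal sketch of reading such a file and recovering the PDF bytes follows. The card does not list the actual key names, so `pdf_base64` and `file_name` below are hypothetical placeholders; inspect the keys of a real record before relying on them.

```python
import base64
import json
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def decode_pdf(record: dict, field: str = "pdf_base64") -> bytes:
    """Decode the Base64-encoded PDF stored under `field`.

    The default key name is an assumption, not taken from the dataset card.
    """
    return base64.b64decode(record[field])


# Synthetic one-record example standing in for a line of Caduceus_Data.jsonl.
sample_line = json.dumps({
    "file_name": "example.pdf",  # hypothetical key name
    "pdf_base64": base64.b64encode(b"%PDF-1.4 demo").decode("ascii"),  # hypothetical key name
})

records = list(iter_records([sample_line, ""]))  # the blank line is skipped
pdf_bytes = decode_pdf(records[0])
```

With the real file, `iter_records` can be given an open file handle (for example `open("Caduceus_Data.jsonl", encoding="utf-8")`), since iterating over a text file yields its lines one at a time. Streaming this way avoids loading every Base64 payload into memory at once.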
Provider:
maas
Created:
2024-08-08