
IN-Scientific

ModelScope community · Updated 2025-09-29 · Indexed 2025-08-30
Download link:
https://modelscope.cn/datasets/m-a-p/IN-Scientific
Description:
# 📥 IN-Scientific

IN-Scientific: An Open Multimodal Interleaved Dataset for Scientific Knowledge Representation

This project is a subproject of the 📌PIN project, focusing on the development of the largest scientific document multimodal dataset, which integrates both text and images.

📑: https://arxiv.org/abs/2406.13923

🤗: https://huggingface.co/datasets/m-a-p/PIN-14M

## Dataset statistics

| Source | Content Images (#) | Content Images (Size GB) | Documents (#) | Documents (Size GB) | Llama3 Tokens (#) |
|--------|--------------------|--------------------------|---------------|---------------------|-------------------|
| IN-Arxiv | 3.947 M | 507.99 | 0.715 M | 37.38 | 11.752 B |
| IN-PMC | 19.711 M | 2154.95 | 5.677 M | 219.00 | 57.500 B |
| Total | 23.658 M | 2662.94 | 6.392 M | 256.38 | 69.252 B |

## Examples

### IN-Arxiv

```json
{
  "id": "1407.4558",
  "meta": {
    "language": "en",
    "source": "Arxiv",
    "date_download": "2024-12-03"
  },
  "quality_signals": {
    "doc_length": 55245,
    "num_imgs": 1,
    "llama3_tokens_count": 22007
  },
  "content_image": [
    "content_image/1407.4558/x1.png"
  ],
  "md": "# Div First-Order System LL* (FOSLL*) for Second-Order Elliptic Partial Differential Equations \u2020\n[FOOTNOTE:\u2020][ENDFOOTNOTE]\n\nZhiqiang Cai\n\n Department of Mathematics ..."
}
```

### IN-PMC

```json
{
  "id": "PMC3231073",
  "meta": {
    "date_download": "2024-12-14",
    "language": "en",
    "source": "PMC"
  },
  "quality_signals": {
    "doc_length": 20785,
    "llama3_tokens_count": 4364,
    "num_imgs": 7
  },
  "content_image": [
    "content_image/PMC3231073/sensors-10-10663-v2f1.jpg",
    "content_image/PMC3231073/sensors-10-10663-v2f2.jpg",
    "content_image/PMC3231073/sensors-10-10663-v2f3.jpg",
    "content_image/PMC3231073/sensors-10-10663-v2f4.jpg",
    "content_image/PMC3231073/sensors-10-10663-v2f5.jpg",
    "content_image/PMC3231073/sensors-10-10663-v2f6.jpg",
    "content_image/PMC3231073/sensors-10-10663-v2f7.jpg"
  ],
  "md": "# The Use of Helmholtz Resonance for Measuring the Volume of Liquids and Solids\n\n## Abstract\n\nAn experimental investigation was undertaken to ascertain the potential of using Helmholtz resonance for volume determination and the factors that may influence accuracy. The uses for a rapid non-interference volume measurement system range from agricultural produce and mineral sampling through to liquid fill measurements. By weighing the sample the density can also measured indirectly..."
}
```
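The record layout shown in the examples can be checked with a few lines of Python. The following is a minimal sketch, not part of the dataset's own tooling: the helper name `summarize_record` is hypothetical, the sample is an abbreviated copy of the IN-Arxiv example (its `md` text is shortened further here), and the assumption that `num_imgs` should equal the length of `content_image` is inferred from the two examples rather than stated by the dataset card.

```python
import json

# Abbreviated copy of the IN-Arxiv example record from the dataset card.
# The "md" field is shortened here; the card itself also truncates it.
SAMPLE_RECORD = """
{
  "id": "1407.4558",
  "meta": {"language": "en", "source": "Arxiv", "date_download": "2024-12-03"},
  "quality_signals": {"doc_length": 55245, "num_imgs": 1, "llama3_tokens_count": 22007},
  "content_image": ["content_image/1407.4558/x1.png"],
  "md": "# Div First-Order System LL* (FOSLL*) ..."
}
"""


def summarize_record(raw: str) -> dict:
    """Parse one JSON record and collect a few fields for inspection.

    Hypothetical helper: it only relies on the keys visible in the examples
    ("id", "meta", "quality_signals", "content_image").
    """
    record = json.loads(raw)
    signals = record["quality_signals"]
    images = record["content_image"]
    return {
        "id": record["id"],
        "source": record["meta"]["source"],
        "llama3_tokens_count": signals["llama3_tokens_count"],
        "image_paths": list(images),
        # Assumption: num_imgs counts the entries of content_image.
        "num_imgs_matches": signals["num_imgs"] == len(images),
    }


if __name__ == "__main__":
    print(summarize_record(SAMPLE_RECORD))
```

Running the same helper over the IN-PMC example would exercise the seven-image case; how records are packaged on disk is not described in the card, so file loading is deliberately left out.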

Provider:
maas
Created:
2025-08-27