IN-Scientific
Source: ModelScope community. Updated 2025-09-29. Indexed 2025-08-30.
Download link:
https://modelscope.cn/datasets/m-a-p/IN-Scientific
Description:
# 📥 IN-Scientific
IN-Scientific: An Open Multimodal Interleaved Dataset for Scientific Knowledge Representation
This project is a subproject of the 📌PIN project. It aims to build the largest multimodal dataset of scientific documents, integrating both text and images.
📑: https://arxiv.org/abs/2406.13923
🤗: https://huggingface.co/datasets/m-a-p/PIN-14M
## Dataset statistics
| Source | Content Images (#) | Content Images (Size GB) | Documents (#) | Documents (Size GB) | Llama3 Tokens (#) |
|--------|--------------------|--------------------------|---------------|---------------------|------------------------|
| IN-Arxiv | 3.947 M | 507.99 | 0.715 M | 37.38 | 11.752 B |
| IN-PMC | 19.711 M | 2154.95 | 5.677 M | 219.00 | 57.500 B |
| Total | 23.658 M | 2662.94 | 6.392 M | 256.38 | 69.252 B |
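The per-source figures can be cross-checked against the Total row and turned into rough per-document averages. The following is a minimal sketch: the numbers are copied from the table above, while the dictionary layout and the helper names `totals` and `per_document` are illustrative assumptions, not part of the dataset tooling.

```python
# Figures copied from the statistics table; the key names are hypothetical.
STATS = {
    "IN-Arxiv": {"images": 3.947e6, "image_gb": 507.99,
                 "docs": 0.715e6, "doc_gb": 37.38, "tokens": 11.752e9},
    "IN-PMC": {"images": 19.711e6, "image_gb": 2154.95,
               "docs": 5.677e6, "doc_gb": 219.00, "tokens": 57.500e9},
}


def totals(stats):
    """Sum every column across sources, for comparison with the Total row."""
    columns = next(iter(stats.values())).keys()
    return {col: sum(row[col] for row in stats.values()) for col in columns}


def per_document(row):
    """Derive average images and Llama3 tokens per document for one source."""
    return {
        "images_per_doc": row["images"] / row["docs"],
        "tokens_per_doc": row["tokens"] / row["docs"],
    }


if __name__ == "__main__":
    print("totals:", totals(STATS))
    for source, row in STATS.items():
        print(source, per_document(row))
```

These averages are derived only from the rounded table values, so they are approximate.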
## Examples
### IN-Arxiv
```json
{
"id": "1407.4558",
"meta": {
"language": "en",
"source": "Arxiv",
"date_download": "2024-12-03"
},
"quality_signals": {
"doc_length": 55245,
"num_imgs": 1,
"llama3_tokens_count": 22007
},
"content_image": [
"content_image/1407.4558/x1.png"
],
"md": "# Div First-Order System LL* (FOSLL*) for Second-Order Elliptic Partial Differential Equations \u2020\n[FOOTNOTE:\u2020][ENDFOOTNOTE]\n\nZhiqiang Cai\n\n Department of Mathematics ..."
}
```
### IN-PMC
```json
{
"id": "PMC3231073",
"meta": {
"date_download": "2024-12-14",
"language": "en",
"source": "PMC"
},
"quality_signals": {
"doc_length": 20785,
"llama3_tokens_count": 4364,
"num_imgs": 7
},
"content_image": [
"content_image/PMC3231073/sensors-10-10663-v2f1.jpg",
"content_image/PMC3231073/sensors-10-10663-v2f2.jpg",
"content_image/PMC3231073/sensors-10-10663-v2f3.jpg",
"content_image/PMC3231073/sensors-10-10663-v2f4.jpg",
"content_image/PMC3231073/sensors-10-10663-v2f5.jpg",
"content_image/PMC3231073/sensors-10-10663-v2f6.jpg",
"content_image/PMC3231073/sensors-10-10663-v2f7.jpg"
],
"md": "# The Use of Helmholtz Resonance for Measuring the Volume of Liquids and Solids\n\n## Abstract\n\nAn experimental investigation was undertaken to ascertain the potential of using Helmholtz resonance for volume determination and the factors that may influence accuracy. The uses for a rapid non-interference volume measurement system range from agricultural produce and mineral sampling through to liquid fill measurements. By weighing the sample the density can also measured indirectly..."
}
```
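Both examples share the same top-level fields: `id`, `meta`, `quality_signals`, `content_image`, and `md`. The sketch below shows one way such a record might be parsed and sanity-checked. It assumes that each record is serialized as one JSON object per line and that `content_image` paths are relative to a local dataset root; neither is stated above, and the helper names are hypothetical.

```python
import json
from pathlib import Path

# Fields present in both example records above.
REQUIRED_FIELDS = {"id", "meta", "quality_signals", "content_image", "md"}

# The IN-Arxiv example, with "md" kept truncated as in the original.
SAMPLE_LINE = json.dumps({
    "id": "1407.4558",
    "meta": {"language": "en", "source": "Arxiv", "date_download": "2024-12-03"},
    "quality_signals": {"doc_length": 55245, "num_imgs": 1,
                        "llama3_tokens_count": 22007},
    "content_image": ["content_image/1407.4558/x1.png"],
    "md": "# Div First-Order System LL* (FOSLL*) ...",
})


def load_record(line):
    """Parse one JSON line (assumed storage format) and check its fields."""
    record = json.loads(line)
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        raise ValueError(f"record is missing fields: {sorted(missing)}")
    return record


def image_paths(record, root):
    """Resolve image paths against a local root (assumed directory layout)."""
    return [Path(root) / rel for rel in record["content_image"]]


def image_count_matches(record):
    """Check that num_imgs agrees with the number of listed images."""
    return record["quality_signals"]["num_imgs"] == len(record["content_image"])


if __name__ == "__main__":
    rec = load_record(SAMPLE_LINE)
    print(rec["id"], rec["meta"]["source"], image_count_matches(rec))
    print(image_paths(rec, "IN-Scientific"))
```

In both examples above, `num_imgs` equals the length of `content_image`, which is what `image_count_matches` checks; whether this holds for every record in the dataset is not confirmed here.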
Provider:
maas
Created:
2025-08-27



