PIN-14M
Dataset Overview
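Basic information
- Dataset name: PIN-14M
- Version: mini
- Source: derived from "PIN: A Knowledge-Intensive Dataset for Paired and Interleaved Multimodal Documents"
- Number of samples: about 14 million
- Languages: English (en) and Chinese (zh)
- Tags: multimodal
- Size category: 10M<n<100M
- License: Apache-2.0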
Dataset configuration
- Config name: pin
- Data files:
  - Split: train
  - Path: data/DocLayNet/DocLayNet.jsonl
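The configuration above points at a single JSON Lines file. A minimal sketch of reading such a file with the Python standard library follows. It assumes the file has already been downloaded to the same relative path; the helper name `iter_jsonl` is hypothetical and not part of the dataset's tooling.

```python
import json
from pathlib import Path
from typing import Any, Dict, Iterator


def iter_jsonl(path: str) -> Iterator[Dict[str, Any]]:
    """Yield one parsed record per non-empty line of a JSON Lines file."""
    with Path(path).open("r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between records
                yield json.loads(line)


if __name__ == "__main__":
    # Assumption: a local copy exists at the path named in the configuration.
    jsonl_path = "data/DocLayNet/DocLayNet.jsonl"
    if Path(jsonl_path).exists():
        for record in iter_jsonl(jsonl_path):
            print(record["id"], record["meta"]["source_dataset"])
            break  # show only the first record
```

Reading lazily, one line at a time, keeps memory use flat, which matters because the larger subsets have multi-gigabyte document files.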
Dataset statistics
| Subset | Documents (#) | Overall images (#) | Content images (#) | Document size (GB) | Overall image size (GB) | Content image size (GB) |
|---|---|---|---|---|---|---|
| pg19 | 2,612,285 | 2,608,029 | 0 | 12.3 | 1,418.1 | 0.0 |
| OBELICS | 5,795,198 | 5,770,432 | 5,840,658 | 13.0 | 3,141.4 | 3,305.3 |
| mmc4-core-ff | 5,351,628 | 5,277,983 | 9,014,579 | 33.7 | 3,232.0 | 5,605.0 |
| chinese-markdown | 168,323 | 167,989 | 106,768 | 1.3 | 773.2 | 15.0 |
| leetcode | 2,360 | 2,360 | 0 | 0.016 | 1.3 | 0.0 |
| linux-cn | 9,564 | 9,564 | 38,960 | 0.082 | 11.9 | 1.8 |
| DocLayNet | 68,757 | 69,375 | 90,259 | 0.18 | 25.9 | 1.6 |
| PIN-PMC | 99,157 | 1,074,799 | 454,482 | 2.8 | 724.2 | 29.5 |
| Total | 14,107,272 | 14,980,531 | 15,545,706 | 63.4 | 9,328.0 | 8,958.3 |
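The count columns of the table can be cross-checked with a few lines of arithmetic. The sketch below copies the three count columns from the table and confirms that the per-subset rows add up to the reported totals; the names `ROWS`, `REPORTED_TOTAL`, and `column_totals` are illustrative only.

```python
from typing import List, Tuple

# (subset, documents, overall images, content images), copied from the table.
ROWS: List[Tuple[str, int, int, int]] = [
    ("pg19", 2_612_285, 2_608_029, 0),
    ("OBELICS", 5_795_198, 5_770_432, 5_840_658),
    ("mmc4-core-ff", 5_351_628, 5_277_983, 9_014_579),
    ("chinese-markdown", 168_323, 167_989, 106_768),
    ("leetcode", 2_360, 2_360, 0),
    ("linux-cn", 9_564, 9_564, 38_960),
    ("DocLayNet", 68_757, 69_375, 90_259),
    ("PIN-PMC", 99_157, 1_074_799, 454_482),
]

# The "Total" row as reported in the table.
REPORTED_TOTAL = (14_107_272, 14_980_531, 15_545_706)


def column_totals(rows: List[Tuple[str, int, int, int]]) -> Tuple[int, ...]:
    """Sum the document, overall-image, and content-image columns."""
    return tuple(sum(row[i] for row in rows) for i in (1, 2, 3))


if __name__ == "__main__":
    totals = column_totals(ROWS)
    print("computed:", totals)
    print("matches reported totals:", totals == REPORTED_TOTAL)
    # Average number of content images per document across all subsets.
    print("content images per document:", round(totals[2] / totals[0], 3))
```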
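Data structure
Subsets
- Contains 8 subsets: PIN-PMC, DocLayNet, linux-cn, chinese-markdown, OBELICS, mmc4-core-ff, leetcode, pg19
Folder structure
- Content images (content_image): all images referenced in the markdown files
- Overall images (overall_image): an image showing the overall visual representation of each markdown file
- JSONL file: the text content and related data details
Content image folder
- All images must be converted to PNG format, and file names must be unique
Overall image folder
- All images must be converted to PNG format, and file names must be unique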
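JSON Lines format
- Field descriptions:
  - id: unique identifier
  - meta: metadata
    - language: document language
    - source_dataset: name of the original dataset
    - doc_id: document identifier
    - page_id: page identifier
    - date_download: download date
    - ori_meta: original metadata
    - oi_exist: whether an overall image exists
    - oi_source: source of the overall image
  - quality_signals: quality indicators
    - doc_length: document length
  - content_image: list of content images
  - overall_image: path to the overall image
  - md: markdown content
  - license: license information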
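The field list above can be turned into a small structural check. The following is a sketch only: it assumes that the fields from `language` through `oi_source` are nested under `meta`, as they are in the sample records of this card, and the function name `missing_fields` is hypothetical. It only checks that keys are present, not that values have any particular type, since several fields are `null` in the samples.

```python
from typing import Any, Dict, List

# Top-level keys named in the field descriptions.
TOP_LEVEL_FIELDS = (
    "id", "meta", "quality_signals", "content_image",
    "overall_image", "md", "license",
)

# Keys assumed to be nested under "meta" (as in the sample records).
META_FIELDS = (
    "language", "source_dataset", "doc_id", "page_id",
    "date_download", "ori_meta", "oi_exist", "oi_source",
)


def missing_fields(record: Dict[str, Any]) -> List[str]:
    """Return the names of expected keys that are absent from a record."""
    missing = [name for name in TOP_LEVEL_FIELDS if name not in record]
    meta = record.get("meta")
    if isinstance(meta, dict):
        # Report nested keys with a "meta." prefix so they are easy to locate.
        missing += [f"meta.{name}" for name in META_FIELDS if name not in meta]
    return missing


if __name__ == "__main__":
    # Abbreviated record modeled on the leetcode sample in this card.
    sample = {
        "id": 1,
        "meta": {
            "language": "en", "doc_id": 1, "page_id": None,
            "oi_exist": True, "oi_source": "compiling",
            "source_dataset": "leetcode", "date_download": "2024-05-05",
            "ori_meta": {"slug": "two-sum", "difficulty": "Easy"},
        },
        "quality_signals": None,
        "license": "MIT",
        "content_image": None,
        "md": "# Two Sum",
        "overall_image": "overall_image/1.png",  # illustrative path
    }
    print(missing_fields(sample))
```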
Examples
DocLayNet example
json { "id": 0, "meta": { "language": "en", "oi_exist": true, "oi_source": "ori", "source_dataset": "DocLayNet", "ori_meta": null, "doc_id": "NYSE_F_2004.pdf", "page_id": "0", "date_download": "2024-3-24" }, "quality_signals": null, "license": "https://cdla.io/permissive-1-0/", "content_image": [ "content_image/34102.jpg" ], "overall_image": "overall_image/3562e47265520f7a72f3eac73aadfe19a78531698c3b50d7670b8ad9b214106b.png", "md": "<img src=content_image/34102.jpg>
Ford Motor Company / 2004 Annual Report
R W A R D F O R W A R D
" }
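In the record above, the `md` field embeds its image with an `<img src=...>` tag whose path also appears in `content_image`. A minimal sketch of checking that correspondence follows. It assumes image references always use this `<img src=...>` form, with or without quotes, which is what the samples show but is not stated as a rule; the names `referenced_images` and `unlisted_images` are hypothetical.

```python
import re
from typing import Any, Dict, List

# Matches <img src=path>, <img src="path"> and <img src='path'>.
IMG_SRC = re.compile(r"<img\s+src=[\"']?([^\"'>\s]+)")


def referenced_images(md: str) -> List[str]:
    """Return image paths referenced by <img> tags, in order of appearance."""
    return IMG_SRC.findall(md)


def unlisted_images(record: Dict[str, Any]) -> List[str]:
    """Return paths referenced in `md` that are absent from `content_image`."""
    listed = set(record.get("content_image") or [])  # content_image may be null
    md = record.get("md") or ""
    return [src for src in referenced_images(md) if src not in listed]


if __name__ == "__main__":
    # Abbreviated version of the DocLayNet sample record.
    record = {
        "content_image": ["content_image/34102.jpg"],
        "md": "<img src=content_image/34102.jpg>\nFord Motor Company / 2004 Annual Report\n",
    }
    print(referenced_images(record["md"]))
    print(unlisted_images(record))
```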
OBELICS example
json { "id": 466502, "meta": { "language": "en", "oi_exist": true, "oi_source": "compiling", "source_dataset": "OBELICS", "ori_meta": { "document_url": "https://www.donegaldaily.com/2022/02/21/watch-incredible-storm-surge-at-portsalon-golf-club/", "unformatted_src": "https://www.donegaldaily.com/wp-content/uploads/2022/02/Screenshot-2022-02-21-at-17.54.30.jpg", "src": "https://www.donegaldaily.com/wp-content/uploads/2022/02/Screenshot-2022-02-21-at-17.54.30.jpg", "formatted_filename": "Screenshot at", "rendered_width": 817, "rendered_height": 419, "original_width": 817, "original_height": 419, "format": "jpeg", "general_meta": { "url": "https://www.donegaldaily.com/2022/02/21/watch-incredible-storm-surge-at-portsalon-golf-club/", "warc_filename": "crawl-data/CC-MAIN-2022-27/segments/1656103271864.14/warc/CC-MAIN-20220626192142-20220626222142-00308.warc.gz", "warc_record_offset": 795020636, "warc_record_length": 31271 } }, "doc_id": 98496, "page_id": 0, "date_download": "2024-4-22" }, "md": "<img src=content_image/98496-0.png>
The golf course at Portsalon Golf Club took a battering today as a result of Storm Franklin.
Donegal had been left battered and bruised overnight after Storm Franklin ripped across the county.
There were trees down on the approach roads to Donegal Town and in Gartan.
There were also trees down in Inishowen while there is also heavy water reported along the sides of roads with motorists asked to slow down and not put themselves in danger.
Donegal’s coastline took a huge impact with massive waves reported along the coastline around the county.
The video, taken by Johnny Shields was taken from the tee box of the third hole.", "license": "CC-BY-4.0", "quality_signals": null, "content_image": [ "content_image/98496-0.png" ], "overall_image": "overall_image/98496-0.png" }
chinese-markdown example
json { "id": 7, "meta": { "language": "zh", "oi_exist": true, "oi_source": "compiling", "source_dataset": "chinese-markdown", "ori_meta": null, "doc_id": 7, "page_id": null, "date_download": "2024-04-30" }, "md": "--- title: 常见问题 QA category: 其它 order: 1
持续更新中... 如有问题可以到 https://github.com/alibaba/ice/issues/new 反馈
ICE 的浏览器兼容策略是什么
由于 ICE 优先使用 React 16+,其需要的最低 IE 版本为 11,如果您需要在以下的版本使用,您可能需要引入一些 polyfill 来支持 Map, Set 等特性。参考React 官网说明。
以下代码可以帮助你在低版本 IE 下自动跳转到我们提供的提示浏览器升级页面。当然您也可以使用自定义的浏览器升级页面。
<!--[if lt IE 11]> <script>location.href = "//www.taobao.com/markets/tbhome/ali-page-updater"; </script> <![endif]-->
添加如上代码后,如果使用 IE11 及以下浏览器访问页面,则会自动跳转到统一引导升级浏览器的页面。
WebStorm/IDEA 编辑器卡顿现象
由于项目在安装依赖后,产生文件夹 node_modules 含有较多的碎小文件,编辑器在索引文件引起的卡顿。
WebStorm 中尤为明显,可通过 exclude node_modules 目录,不需要检索该文件夹下的内容。
如何设置网页在浏览器 Tab 上面的 Icon (favicon)
细心的同学可能会看到页面在浏览器 Tab 上面会有自定义的 Icon:

如果你想要在自己站点上面加上这个 Icon 可以按照如下步骤添加:
- 准备一个 Icon,文件格式可以为
.png或者.ico,正方形,分辨率可以是 32x32px 或者 64x64px 文件体积要求尽可能小。 - 上传 CDN 拿到一个 url 或者在自己服务器配置静态资源服务
- 在 HTML 页面
<head>标签里面添加如下代码:<link rel="shortcut icon" href="your-icon-url">
这样就添加成功啦!
如何在页面显示原始的 HTML 内容
出于安全方面的考虑,React 默认会将节点中 html 代码进行转义,比如:
jsx class Demo extends Component { render() { const content = hello <span>world</span>; return <div>{content}</div>; } }
// 输出 hello <span>world</span>
如上,<span> 标签并不会在页面上被解析,而是被当成字符串输出了。React 提供了 dangerouslySetInnerHTML 属性帮助我们进行类似 innerHTML 的操作:
jsx class Demo extends Component { render() { const content = hello <span>world</span>; return <div dangerouslySetInnerHTML={{ __html: content }} />; } }
// 输出 hello world
更多内容请参考 Dangerously Set innerHTML
之前创建的项目,遇到如下报错怎么办

这是由于 ES6 Modules 的标准在物料中不兼容导致的。您可以把 src/navs.js 中最后一行修改为:
js export const headerNavs = transform([ ...autoGenHeaderNavs, ...customHeaderNavs, ]);
export const asideNavs = transform([...autoGenAsideNavs, ...customAsideNavs]); ", "license": "MIT", "quality_signals": null, "content_image": [ "content_image/7-0.png" ], "overall_image": "overall_image/7.png" }
leetcode example
json { "id": 1, "meta": { "language": "en", "doc_id": 1, "page_id": null, "oi_exist": true, "oi_source": "compiling", "source_dataset": "leetcode", "date_download": "2024-05-05", "ori_meta": { "slug": "two-sum", "difficulty": "Easy" } }, "quality_signals": null, "license": "MIT", "content_image": null, "md": "# Two Sum
- slug: two-sum
- difficulty: Easy
Given an array of integers nums and an integer target, return indices of the two numbers such that they add up to target.
You may assume that each input would have exactly one solution, and you may not use the same element twice.
You can return the answer in any order.
Example 1:
Input: nums = [2,7,11,15], target = 9 Output: [0,1] Explanation: Because nums[0] + nums[1] == 9, we return [0, 1].
Example 2:
Input: nums = [3,2,4], target = 6 Output: [1,2]
Example 3:
Input: nums = [3,3], target = 6 Output: [0,1]
Constraints:
- 2 <= nums.length <= 10^4
- -10^9 <= nums[i] <= 10^9
- -10^9 <= target <= 10^9
- Only one valid answer exists.
Follow-up: Can you come up with an algorithm that is less than O(n^2) time complexity?
A solution in Java
java import java.util.HashMap; import java.util.Map;
public int[] twoSum(int[] nums, int target) { Map<Integer, Integer> map = new HashMap<>(); for (int i = 0; i < nums.length; i++) { int complement = target - nums[i]; if (map.containsKey(complement)) { return new int[]{map.get(complement), i};

- PIN: A Knowledge-Intensive Dataset for Paired and Interleaved Multimodal Documents. Multimodal Art Projection, 2024.




