
Multimodal Dataset of Cultural Relics in the Palace Museum Collection

Alibaba Cloud Tianchi, updated 2026-06-09, indexed 2026-05-16
Download link:
https://tianchi.aliyun.com/dataset/226481
Resource description:
This dataset is a standardized academic dataset built for research on multimodal knowledge graph construction, entity relation extraction, and cross-modal image-text alignment in the cultural heritage domain. It was generated by targeted crawling of publicly accessible, authoritative resources on the official website of the Palace Museum, and covers all core types of cultural heritage. The dataset contains complete basic information on each relic, detailed academic descriptions, standardized entity relation annotations, and high-definition image download links.

The dataset follows the CIDOC-CRM conceptual reference model for cultural heritage of the International Council of Museums (ICOM), and aligns with statutory industry standards issued by the National Cultural Heritage Administration, such as the *Specification for Cultural Relic Collection Archives* and the *General Principles for Cultural Relic Metadata*. The annotations have passed strict consistency checks, making this one of the largest and most consistently annotated multimodal datasets of Palace Museum cultural heritage currently available in China. It can provide high-quality data for research in fields such as digital humanities, computer vision, and natural language processing.

The dataset contains two core data files, with formats and fields as follows:

(1) Raw crawl data file: pm_cultural_relics_raw_crawl_v1.0.txt
- Format: plain text file, UTF-8 encoding without BOM, one record per line
- Field separator: ASCII comma (,); commas inside text descriptions are escaped as (\,)
- Line format: relic name,relic ID,relic description

(2) Structured JSON file: palace_museum_cultural_relics_v1.0.json
- Format: JSON array, where each element holds the complete information for one relic

Field descriptions:
- text: original text description of the relic
- id: unique ID of the relic
- spo_list: list of entity relation triples, where each triple includes subject, predicate, object, subject_type, and object_type

The dataset does not include image files directly. It provides only official high-definition image download links, for the following reasons:
1. Storage and download efficiency: the image files total approximately 1.2 TB. Packaging and uploading them directly would make the dataset too large, severely affecting download speed and platform storage efficiency.
2. Flexibility: users differ in the image resolution and format they need. URL links let each user selectively download original images or thumbnails of different sizes.
3. Data timeliness: the images are served by the Palace Museum's official servers. Downloading through the URLs gives users the latest and clearest originals and avoids stale or corrupted local copies.
4. Copyright compliance: not distributing the image files avoids copyright issues. All image copyrights belong to the Palace Museum.

The dataset is divided into training, validation, and test sets at a ratio of 7:1:2 using stratified sampling by relic category. This keeps the distribution of cultural heritage types in each subset consistent with that of the whole dataset, so that distribution bias does not distort model evaluation results.
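The 7:1:2 stratified split described above can be illustrated with a minimal sketch. This is not the dataset authors' procedure. It only shows the general technique: shuffle within each category, then cut each category at the same proportions. The function name stratified_split, the "category" field, and the category names are hypothetical, since the listed JSON fields (text, id, spo_list) do not include a category label.

```python
import random
from collections import defaultdict

def stratified_split(records, key, ratios=(0.7, 0.1, 0.2), seed=0):
    """Split records into train/val/test, cutting each category at the same ratios."""
    rng = random.Random(seed)

    # Group records by their category label.
    by_category = defaultdict(list)
    for record in records:
        by_category[key(record)].append(record)

    train, val, test = [], [], []
    for category in sorted(by_category):  # sorted for reproducibility
        group = by_category[category]
        rng.shuffle(group)
        n_train = round(len(group) * ratios[0])
        n_val = round(len(group) * ratios[1])
        train.extend(group[:n_train])
        val.extend(group[n_train:n_train + n_val])
        test.extend(group[n_train + n_val:])  # remainder goes to the test set
    return train, val, test

# Toy records. The "category" field and its values are assumptions made
# for illustration; they are not documented fields of the dataset.
records = [
    {"id": f"{cat}-{i}", "category": cat}
    for cat in ("ceramics", "paintings")
    for i in range(10)
]

train, val, test = stratified_split(records, key=lambda r: r["category"])
print(len(train), len(val), len(test))  # 14 2 4
```

With very small categories, rounding can leave the validation slice empty, so a real split would need a policy for rare categories.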
Provider:
Alibaba Cloud Tianchi
Created:
2026-05-13
Collected summary
Dataset introduction
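Background and challenges
Background overview
This dataset is a standardized academic resource for multimodal cultural heritage research, such as knowledge graph construction, entity relation extraction, and image-text alignment. It is generated from authoritative information on the official Palace Museum website and covers basic relic information, descriptions, annotations, and image links. It follows international and industry standards, provides high-quality data support, and includes a raw text file and a structured JSON file. Images are provided only as links, to reduce storage and stay copyright compliant, and the data is split into training, validation, and test sets by stratified sampling over relic categories.
The content above was collected and summarized by 遇见数据集.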
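As a rough illustration of the two file formats in the resource description, the sketch below parses one line of the comma-separated raw file (where commas inside descriptions are escaped as \,) and reads triples out of one JSON record. The sample line, the sample record, and the helper name parse_raw_line are made up for illustration and are not taken from the dataset. The sketch also assumes that a backslash appears only as the escape before a comma.

```python
import json
import re

# A comma that is NOT preceded by a backslash separates fields.
UNESCAPED_COMMA = re.compile(r"(?<!\\),")

def parse_raw_line(line):
    """Split one raw-crawl line into (name, relic_id, description)."""
    parts = UNESCAPED_COMMA.split(line.rstrip("\n"), maxsplit=2)
    if len(parts) != 3:
        raise ValueError(f"expected 3 fields, got {len(parts)}")
    # Turn the escaped commas back into plain commas.
    name, relic_id, description = (p.replace("\\,", ",") for p in parts)
    return name, relic_id, description

# Hypothetical sample line in the "name,ID,description" layout.
sample_line = r"Sample vase,R-0001,Porcelain body\, cobalt decoration"
parsed = parse_raw_line(sample_line)
print(parsed)

# Hypothetical record using the documented field names; the predicate
# and type values are invented placeholders.
sample_json = """
[{"id": "R-0001",
  "text": "Sample vase with a porcelain body.",
  "spo_list": [{"subject": "Sample vase", "predicate": "material",
                "object": "porcelain",
                "subject_type": "Relic", "object_type": "Material"}]}]
"""

triples = []
for record in json.loads(sample_json):
    for spo in record["spo_list"]:
        triples.append((spo["subject"], spo["predicate"], spo["object"]))
print(triples)
```

For the real files, the same functions could be applied to lines read with open(path, encoding="utf-8") and to the result of json.load on the JSON file. The negative lookbehind keeps the parser short, but it would misread a literal backslash placed directly before a field separator.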