Mine Large Model Pre-training Corpus Dataset
Source: Guizhou Provincial Data Intellectual Property Registration Platform (贵州省数据知识产权登记平台) | Updated: 2025-11-28 | Indexed: 2025-11-29
Download link:
https://gzdipp.gzsis.cn:12020/noticeDetail?id=1727&type=1
Resource description:
These processing rules strictly follow the *Information Security Technology - Personal Information Security Specification* (GB/T 35273-2020) and the relevant data standards in the natural resources field. They are intended to govern all data processing activities within the "Mine Trustworthy Data Space". Data processing follows a technical route of "unified aggregation, intelligent governance, and tiered application". On the tooling side, the Mine Trustworthy Data Space serves as the core foundation: its built-in intelligent semantic extraction engine and multimodal large model automatically parse unstructured data such as geological reports and maps and extract knowledge from them. Numerical geological, geophysical, geochemical, and remote sensing data are aligned in space and time and standardized through a pipeline of "data conversion, coordinate reconstruction, layer correction".
Where personal information is involved, processing follows an end-to-end mechanism of "data classification, sensitive-information identification, dynamic masking". A knowledge-graph-based entity recognition algorithm automatically locates sensitive information such as personal names and organizations, and models such as k-anonymity and differential privacy are applied for strict de-identification and anonymization. This keeps the data "usable but not visible" while it serves the training and fine-tuning of large models. The result is a high-quality, compliant, and usable mineral prospecting dataset covering sixteen key mineral types.
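As a rough illustration of the two privacy models named above, the following Python sketch checks k-anonymity over a set of quasi-identifiers and adds Laplace noise to a count, which is the basic differential privacy mechanism for counting queries. The record fields (`region`, `mineral`), the helper names, and the sample rows are all hypothetical. The source does not describe how the platform actually implements either model.

```python
import random
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs at least k times."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())


def suppress(records, fields):
    """Return copies of the records with the given fields replaced by '*'."""
    return [{key: ("*" if key in fields else value) for key, value in r.items()}
            for r in records]


def dp_count(true_count, epsilon, rng):
    """Laplace mechanism for a counting query (sensitivity 1, noise scale 1/epsilon)."""
    # The difference of two i.i.d. exponentials with rate epsilon is Laplace(0, 1/epsilon).
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return true_count + noise


# Hypothetical rows, with names and organizations assumed to be removed already.
rows = [
    {"region": "Sichuan", "mineral": "lithium"},
    {"region": "Sichuan", "mineral": "lithium"},
    {"region": "Guizhou", "mineral": "phosphate"},
    {"region": "Guizhou", "mineral": "lithium"},
]
quasi = ("region", "mineral")

print(is_k_anonymous(rows, quasi, k=2))                         # False
print(is_k_anonymous(suppress(rows, {"mineral"}), quasi, k=2))  # True
print(dp_count(len(rows), epsilon=1.0, rng=random.Random(0)))   # noisy count near 4
```

In this toy example the two Guizhou rows each form a group of size one, so the raw table is not 2-anonymous. Suppressing the `mineral` column merges them into one group of two. A smaller `epsilon` gives stronger privacy at the cost of noisier counts.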
Provider:
Sichuan Natural Resources Digital Technology Co., Ltd. (四川省自然资源数字科技有限责任公司)
Created:
2025-11-25
Collected summary
Dataset introduction

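Background and challenges
Background overview
This is a pre-training corpus dataset for mine large models. It is 1 GB in size and is updated daily. Its main uses are pre-training mine large models and knowledge-enhanced retrieval for generative AI. It is built from publicly collected data, follows the personal information security specification and natural resources standards, and relies on intelligent semantic extraction and anonymization to keep the data compliant and usable. It includes structured fields such as input question and answer pairs.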
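The listing does not publish a schema for the question and answer fields. As a minimal sketch, assuming the pairs are stored as JSON Lines with hypothetical field names `question` and `answer`, records could be read and filtered like this in Python. The sample lines are placeholders, not data from the dataset.

```python
import json


def parse_qa_lines(lines):
    """Parse JSON Lines text into (question, answer) pairs, skipping incomplete records."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        record = json.loads(line)
        # 'question' and 'answer' are assumed field names, not confirmed by the source.
        question = str(record.get("question", "")).strip()
        answer = str(record.get("answer", "")).strip()
        if question and answer:
            pairs.append((question, answer))
    return pairs


# Placeholder records for illustration only.
sample = [
    '{"question": "example question 1", "answer": "example answer 1"}',
    '',
    '{"question": "example question 2", "answer": ""}',
]

print(parse_qa_lines(sample))  # [('example question 1', 'example answer 1')]
```

Dropping records with an empty question or answer is one possible cleaning rule. A real loader would need to follow whatever schema the provider documents.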
The content above was collected and summarized by 遇见数据集.



