Labelled data for fine tuning a geological Named Entity Recognition and Entity Relation Extraction model

Name: Labelled data for fine tuning a geological Named Entity Recognition and Entity Relation Extraction model
Creator: British Geological Survey (BGS)
Published: 2024-12-08 00:00:00
License: Not specified

Source: www.data.gov.uk (updated 2024-12-08; indexed 2025-01-09)

Download link:

https://www.data.gov.uk/dataset/f3734577-1483-4105-8e32-3adf2cb11724/labelled-data-for-fine-tuning-a-geological-named-entity-recognition-and-entity-relation-extraction-model

Description:

This dataset consists of sentences extracted from BGS memoirs, technical reports, onshore hydrocarbons well reports and Mineral Reconnaissance Programme reports. The sentences have been annotated so that the dataset can be used as labelled training data for a Named Entity Recognition model and an Entity Relation Extraction model, both of which are Natural Language Processing (NLP) techniques that assist with extracting structured data from unstructured text. The entities of interest are rock formations, geological ages, rock types, physical properties and locations, with inter-relations such as "overlies" and "observedIn". The data can be used to fine-tune a pre-trained large language model using transfer learning, producing a model that can be run in inference mode to create the labels automatically, thereby generating structured data useful for geological modelling and subsurface characterisation. The data is provided in two formats: "JSONL(Relation)", the export format of the doccano open source text annotation software (https://doccano.github.io/doccano/) that was used to create the labels, and an alternative jsonlines format used by the open source model training software PURE (https://github.com/princeton-nlp/PURE). The source documents are already publicly available, but some only as PDF images. The dataset is currently a proof of concept and is of insufficient volume to build a performant model. The development of this training data and the text processing scripts was supported by a grant from the UK Government Office for Technology Transfer (GOTT) Knowledge Asset Grant Fund, Project 10083604.
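To illustrate how the two formats mentioned above relate, here is a minimal Python sketch. It rests on assumptions that are not confirmed by this listing: the record fields (`id`, `text`, `entities`, `relations`, `start_offset`, `end_offset`, `from_id`, `to_id`, `type`) follow doccano's documented JSONL(Relation) export, the output uses the token-level, inclusive-span layout that PURE reads, and the sample sentence, the `FORMATION` label and the simple regex tokenizer are hypothetical. The files actually shipped with the dataset may differ.

```python
import json
import re

# Hypothetical sample sentence; offsets are computed rather than hand-typed.
TEXT = "The Mercia Mudstone Group overlies the Sherwood Sandstone Group."


def char_span(phrase):
    """Return (start, end) character offsets of phrase in TEXT (end exclusive)."""
    start = TEXT.index(phrase)
    return start, start + len(phrase)


a_start, a_end = char_span("Mercia Mudstone Group")
b_start, b_end = char_span("Sherwood Sandstone Group")

# One line of an assumed doccano JSONL(Relation) export.
SAMPLE_LINE = json.dumps({
    "id": 1,
    "text": TEXT,
    "entities": [
        {"id": 10, "label": "FORMATION", "start_offset": a_start, "end_offset": a_end},
        {"id": 11, "label": "FORMATION", "start_offset": b_start, "end_offset": b_end},
    ],
    "relations": [{"id": 100, "from_id": 10, "to_id": 11, "type": "overlies"}],
})


def read_triples(record):
    """Return (head text, relation type, tail text) triples for one record."""
    by_id = {e["id"]: e for e in record["entities"]}
    text = record["text"]
    triples = []
    for rel in record["relations"]:
        head, tail = by_id[rel["from_id"]], by_id[rel["to_id"]]
        triples.append((
            text[head["start_offset"]:head["end_offset"]],
            rel["type"],
            text[tail["start_offset"]:tail["end_offset"]],
        ))
    return triples


def to_token_spans(record):
    """Convert character offsets to inclusive token indices (PURE-style layout)."""
    # Simple tokenizer: runs of word characters, or single punctuation marks.
    tokens = [(m.group(), m.start(), m.end())
              for m in re.finditer(r"\w+|[^\w\s]", record["text"])]

    def token_span(entity):
        # Indices of all tokens that overlap the entity's character span.
        hits = [i for i, (_, s, e) in enumerate(tokens)
                if s < entity["end_offset"] and e > entity["start_offset"]]
        return hits[0], hits[-1]

    spans = {e["id"]: token_span(e) for e in record["entities"]}
    ner = [[*spans[e["id"]], e["label"]] for e in record["entities"]]
    relations = [[*spans[r["from_id"]], *spans[r["to_id"]], r["type"]]
                 for r in record["relations"]]
    return {
        "doc_key": str(record["id"]),
        "sentences": [[tok for tok, _, _ in tokens]],
        "ner": [ner],
        "relations": [relations],
    }


record = json.loads(SAMPLE_LINE)
print(read_triples(record))
print(json.dumps(to_token_spans(record)))
```

For a real file, each line would be passed through `json.loads` in the same way. The tokenizer matters: PURE-style spans index tokens rather than characters, so the same tokenization has to be used when the model is trained and when it is run in inference mode.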

Provider:

British Geological Survey (BGS)
