DehydratedWater42/semantic_relations_extraction
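Dataset Card for "Semantic Relations Extraction"
Dataset Description
Purpose
The "Semantic Relations Extraction" dataset is intended for fine-tuning a smaller LLama2 (7B) model so that extracting semantic relations between entities from text becomes faster and cheaper. The dataset is part of an effort to build a low-cost document preprocessing system that constructs knowledge graphs for question answering and automated alerting.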
Data Sources
The dataset was built from the following source:
- datasets/scientific_papers
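Dataset Files
The dataset contains the following files:
- extracted_relations.csv -> the dataset of generated relations between entities, with the columns [summary, article part, output json, database, abstract, list_of_contents].
- core_extracted_relations.csv -> the same as extracted_relations.csv but without the original abstracts and lists of contents, with the columns [summary, article part, output json].
- llama2_prompts.csv -> several prompt variants, each with its response, used for fine-tuning the model; generated by concatenating data from core_extracted_relations.csv.
- synthetic_data_12_02_24-full.dump -> a backup of the full PostgreSQL database used during data generation, exported by the airflow user in custom format with compression level 6 and UTF-8 encoding.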
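As an illustration of how core_extracted_relations.csv might be consumed, the sketch below reads rows with Python's standard csv module and parses the output json column. It assumes that the header names match the column list above exactly and that output json holds a JSON string; neither is confirmed by this card. `load_relations` is a hypothetical helper name, and the in-memory sample is made up.

```python
import csv
import io
import json


def load_relations(csv_file):
    """Yield (summary, article_part, parsed_output) for each usable row.

    Assumes the header names are exactly "summary", "article part" and
    "output json" (taken from the column list above, not verified).
    """
    reader = csv.DictReader(csv_file)
    for row in reader:
        try:
            parsed = json.loads(row["output json"])
        except json.JSONDecodeError:
            # Skip rows whose output is not valid JSON.
            continue
        yield row["summary"], row["article part"], parsed


# Tiny in-memory stand-in for the real file (made-up content).
sample = io.StringIO(
    'summary,article part,output json\n'
    'A summary,Some text,"{""list_of_entities"": [""current""], ""relations"": []}"\n'
    'Another summary,More text,not json\n'
)
rows = list(load_relations(sample))
```

With the real file, the same function could be called on a handle opened with `open("core_extracted_relations.csv", newline="", encoding="utf-8")`.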
Database Schema
The dataset includes a database schema diagram that shows how the data is organized in the database.
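Data Generation Process
The data was generated from the datasets/scientific_papers dataset. The process, in outline:
- Insert all abstracts and lists of contents into the database.
- Split the main body of each article into overlapping segments of 1k LLaMA tokens, with an overlap of 200 tokens.
- Summarize 10k abstracts, together with their lists of contents, using LLaMA 13b.
- Generate unprocessed JSON from the generated summaries and the split text segments.
- Validate and clean all generated JSON.
- Reformat the validated JSON into a dataset that can be used for fine-tuning.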
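The splitting step above can be sketched as a sliding window. This is a minimal illustration, not the authors' code: it assumes the text has already been tokenized into a list (the real pipeline would use a LLaMA tokenizer), the name `split_with_overlap` is hypothetical, and how the original pipeline treated the final, shorter segment is not stated in this card.

```python
def split_with_overlap(tokens, chunk_size=1000, overlap=200):
    """Split a token sequence into windows of chunk_size tokens.

    Consecutive windows share `overlap` tokens; the last window may be shorter.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the sequence.
        if start + chunk_size >= len(tokens):
            break
    return chunks


tokens = list(range(2500))  # stand-in for LLaMA token ids
chunks = split_with_overlap(tokens)
```

With 1k-token windows and a 200-token overlap, each new window starts 800 tokens after the previous one.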
Example of Output Data
```json
{
  "section_description": "The article discusses the current reversal phenomenon in a classical deterministic ratchet system. The authors investigate the relationship between current and bifurcation diagrams, focusing on the dynamics of an ensemble of particles. They challenge Mateos claim that current reversals occur only with bifurcations and present evidence for current reversals without bifurcations. Additionally, they show that bifurcations can occur without current reversals. The study highlights the importance of considering the characteristics of the ensemble in understanding the behavior of the system. The authors provide numerical evidence to support their claims and suggest that correlating abrupt changes in the current with bifurcations is more appropriate than focusing solely on current reversals.",
  "list_of_entities": [
    "reversals",
    "mateos",
    "figures",
    "rules",
    "current_reversal",
    "ensemble",
    "bifurcation",
    "jumps",
    "thumb",
    "spikes",
    "current",
    "particles",
    "open_question",
    "behavior",
    "heuristics",
    "direction",
    "chaotic",
    "parameter"
  ],
  "relations": [
    {
      "description": "bifurcations in single - trajectory behavior often corresponds to sudden spikes or jumps in the current for an ensemble in the same system",
      "source_entities": ["bifurcation"],
      "target_entities": ["current"]
    },
    {
      "description": "current reversals are a special case of this",
      "source_entities": ["current"],
      "target_entities": ["bifurcation"]
    },
    {
      "description": "not all spikes or jumps correspond to a bifurcation",
      "source_entities": ["spikes"],
      "target_entities": ["bifurcation"]
    },
    {
      "description": "the open question is clearly to figure out if the reason for when these rules are violated or are valid can be made more concrete",
      "source_entities": ["current"],
      "target_entities": ["open_question"]
    }
  ]
}
```
Expected Output JSON Schema
```json
{
  "$schema": "extraction_schema.json",
  "type": "object",
  "properties": {
    "section_description": { "type": "string" },
    "list_of_entities": {
      "type": "array",
      "items": { "type": "string" }
    },
    "relations": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "description": { "type": "string" },
          "source_entities": {
            "type": "array",
            "items": { "type": "string" }
          },
          "target_entities": {
            "type": "array",
            "items": { "type": "string" }
          },
          "strength": {
            "type": "string",
            "enum": ["strong", "moderate", "weak"]
          }
        },
        "required": ["description", "source_entities", "target_entities"]
      }
    }
  },
  "required": ["list_of_entities", "relations", "section_description"]
}
```
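A model response can be checked against this schema before it is used. The sketch below is a hand-written check using only the standard library, mirroring the required fields and types listed above; it is not the validation code used to build the dataset, and `validate_extraction` is a hypothetical name. A general-purpose JSON Schema validator would be a reasonable alternative.

```python
import json

ALLOWED_STRENGTHS = ("strong", "moderate", "weak")


def _is_string_list(value):
    """True if value is a list whose items are all strings."""
    return isinstance(value, list) and all(isinstance(v, str) for v in value)


def validate_extraction(raw):
    """Return the parsed object if raw matches the schema above, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    if not isinstance(data.get("section_description"), str):
        return None
    if not _is_string_list(data.get("list_of_entities")):
        return None
    relations = data.get("relations")
    if not isinstance(relations, list):
        return None
    for rel in relations:
        if not isinstance(rel, dict):
            return None
        if not isinstance(rel.get("description"), str):
            return None
        if not _is_string_list(rel.get("source_entities")):
            return None
        if not _is_string_list(rel.get("target_entities")):
            return None
        # "strength" is optional, but restricted to the enum when present.
        if "strength" in rel and rel["strength"] not in ALLOWED_STRENGTHS:
            return None
    return data


good = (
    '{"section_description": "s", "list_of_entities": ["a", "b"], '
    '"relations": [{"description": "d", "source_entities": ["a"], '
    '"target_entities": ["b"], "strength": "weak"}]}'
)
result = validate_extraction(good)
```

Returning `None` for any failure keeps the sketch short; a cleaning step would more likely record which check failed so that malformed outputs can be repaired or discarded.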
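Example of Preprocessed Fine-Tuning Data
This document details how the fine-tuning data in llama2_prompts.csv was preprocessed. It presents six prompt formats, designed to explore which structure works best for training a model on the semantic relation extraction task:
- prompt_with_summary_and_schema: includes a concise summary of the content and a structured schema of the expected JSON.
- prompt_with_summary: includes only the summary of the content, with no explicit schema.
- prompt_with_merged_text: merges the summary and the extracted text into a single block of text.
- prompt_with_merged_text_and_schema: combines the merged-text approach with the schema to guide extraction.
- prompt_no_summary_with_schema: omits the summary but includes the schema, emphasizing the JSON structure.
- prompt_no_summary: provides the raw data with no summary or schema, so the model has to learn the schema from the outputs.
These variants are built on the same underlying data and differ only in structural changes or in the omission of certain parts. Empirical testing is needed to determine how much structure and guidance the model requires to learn and perform the extraction task effectively. The approach aims to identify the data presentation format that best balances informational completeness against processing efficiency, thereby improving the model's learning on semantic relation extraction.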
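To make the difference between the six variants concrete, the sketch below assembles them from one summary, one text segment, and one schema string. The section labels, the ordering, and the function name `build_prompt_variants` are assumptions made for illustration; the actual templates stored in the CSV are not reproduced in this card.

```python
def build_prompt_variants(summary, article_part, schema):
    """Return six prompt strings keyed by the variant names listed above.

    The wording and layout of each template are hypothetical.
    """
    summary_block = f"Summary:\n{summary}"
    text_block = f"Text:\n{article_part}"
    merged_block = f"Text:\n{summary}\n\n{article_part}"
    schema_block = f"Expected JSON schema:\n{schema}"
    return {
        "prompt_with_summary_and_schema": "\n\n".join(
            [summary_block, text_block, schema_block]
        ),
        "prompt_with_summary": "\n\n".join([summary_block, text_block]),
        "prompt_with_merged_text": merged_block,
        "prompt_with_merged_text_and_schema": "\n\n".join(
            [merged_block, schema_block]
        ),
        "prompt_no_summary_with_schema": "\n\n".join([text_block, schema_block]),
        "prompt_no_summary": text_block,
    }


variants = build_prompt_variants(
    "A short summary.", "A text segment.", '{"type": "object"}'
)
```

Because every variant is derived from the same three inputs, any difference in fine-tuning results can be attributed to the prompt structure rather than to the underlying data.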



