Chinese Contract Element Extraction Dataset
Source: National Basic Science Data Center (国家基础学科公共科学数据中心), indexed 2026-03-14
Download link:
https://nbsdc.cn/general/dataDetail?id=69a5b444195d261dfe791b3d&type=1
Resource description:
To support smart contract applications, this study constructs a dedicated dataset for Chinese contract element extraction. The dataset draws on real contract texts collected from the portal websites of local governments, including Shanghai, Chongqing, and Hubei Province, and comprises 1,000 sales contracts, 303 lease contracts, and 302 insurance contracts (1,605 in total). Annotation was carried out at Xidian University from December 2022 to May 2023. Based on the requirements of smart contract conversion and the structural features of the three contract types, 11 major element categories and 30 minor element categories were defined, forming a fine-grained element system. Texts are automatically split into sentences at end-of-sentence punctuation and line breaks, and each annotation pairs a sentence fragment with an element category label, mirroring a real element extraction workflow. The data were processed with automated preprocessing and annotation tools and then manually calibrated to ensure annotation quality. The dataset provides a high-quality, scenario-specific resource for research on and application of natural language processing tasks such as Chinese contract information extraction and smart contract generation.
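The splitting and annotation structure described above can be sketched as follows. This is a minimal illustration, not the dataset's actual preprocessing code: the set of end-of-sentence punctuation marks, the `AnnotatedFragment` record, and the label names (`party`, `payment_term`, `breach_liability`) are all assumptions made for the example, since the listing does not publish the exact rules or the 30 category names.

```python
import re
from dataclasses import dataclass
from typing import List

# Assumed end-of-sentence marks; the dataset's real rule set is not given in the listing.
SENTENCE_END = "。！？!?"


@dataclass
class AnnotatedFragment:
    """One annotation unit: a sentence fragment plus an element category label."""
    text: str
    label: str  # hypothetical label name, for illustration only


def split_sentences(text: str) -> List[str]:
    """Split text at line breaks, then after each end-of-sentence mark."""
    sentences = []
    for line in text.splitlines():
        # The lookbehind splits *after* the punctuation, so each mark stays
        # attached to the sentence it terminates (requires Python 3.7+).
        for piece in re.split(f"(?<=[{re.escape(SENTENCE_END)}])", line):
            piece = piece.strip()
            if piece:
                sentences.append(piece)
    return sentences


sample = "甲方：张三\n乙方应于2023年1月1日前支付货款。逾期付款的，应支付违约金！"
sentences = split_sentences(sample)

# Pair each sentence with a made-up label to mimic the
# "sentence fragment + element category label" structure.
hypothetical_labels = ["party", "payment_term", "breach_liability"]
annotations = [
    AnnotatedFragment(text=s, label=l)
    for s, l in zip(sentences, hypothetical_labels)
]
for a in annotations:
    print(a.label, "|", a.text)
```

Splitting on line breaks first keeps header-like lines without terminal punctuation (such as party names) as separate units, which is one plausible reading of "end-of-sentence punctuation and line breaks".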
Provider:
Xidian University
Aggregated summary
Dataset introduction

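Background and challenges
Background overview
The dataset was built to support smart contract applications. It contains more than 1,600 real Chinese contracts collected from local government websites, covering sales, lease, and insurance contracts, with annotation completed between 2022 and 2023. It defines a fine-grained element system and annotates sentence fragments paired with labels, providing a high-quality, scenario-specific data resource for research on Chinese contract information extraction and smart contract generation.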
The content above was collected and summarized by 遇见数据集.



