Policy Large Model Data (政策大模型数据)

Name: Policy Large Model Data (政策大模型数据)
Creator: 浙数城市大脑（浙江）有限公司
Published: 2024-11-26 12:49:12
License: none specified

Zhejiang Province Data Intellectual Property Registration Platform (浙江省数据知识产权登记平台), updated 2024-11-26, indexed 2024-11-27

Download link:

https://www.zjip.org.cn/home/announce/trends/89289

Resource description:

The “政参谋” (Zhengcanmou) policy large model is aimed at government, enterprise, and research staff. Built on retrieval-augmented generation (RAG) and generative AI, it focuses on assisted drafting of official documents, a policy calculator, and generative question answering, and is offered as a set of policy-oriented intelligent products and solutions. It is intended to help government agencies, enterprises, and other organizations improve their policy services, and to provide a scientific basis and decision support for official document writing and related applications.

The dataset is produced in four steps:

1. Data collection: The latest policy documents are fetched at intervals from various types of government information disclosure websites, and the policy text they contain is saved.
2. Data processing: The collected policy text is cleaned so that only usable policy document content remains. URLs, IP addresses, email addresses, mobile phone numbers, landline telephone numbers, and ID card numbers that may appear in the text are removed.
3. Policy summary generation: Summaries are extracted from the cleaned text with an attention-based sequence model, giving one summary per original policy document. Each policy text is fed sequentially into the model's encoder. The decoder derives the hidden state at the current time step from the output of the previous time step and uses it to compute the attention coefficients (intermediate values of the computation, which are not stored). After a series of weighting operations, the result is passed to two linear layers used for classification, which yield the summary content of each policy text.
4. Summary vectorization: The generated summary texts are embedded with the bge-large-zh-v1.5 embedding model, and the summary texts and their vectors are stored in a database for downstream business use.

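As an illustration of the cleaning in step 2, the sketch below strips the six kinds of identifiers the description lists, using regular expressions. The patterns, the `PATTERNS` table, and the `clean_policy_text` function are assumptions made for this example (mainland-China number formats, IPv4 only); the source does not state which rules the dataset's producers actually apply.

```python
import re

# Illustrative patterns only (assumptions, not the producers' actual rules).
# URLs go first so that addresses embedded in a link are removed with it.
PATTERNS = [
    ("url", re.compile(r"(?:https?://|www\.)[^\s\u4e00-\u9fff，。；、）)]+", re.IGNORECASE)),
    ("email", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    ("ipv4", re.compile(r"(?<![\d.])(?:\d{1,3}\.){3}\d{1,3}(?![\d.])")),
    # 18-character resident ID number: 17 digits plus a digit or X.
    ("id_card", re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)")),
    # 11-digit mobile number starting with 13 to 19.
    ("mobile", re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)")),
    # Landline: area code starting with 0, optional hyphen, 7 or 8 digits.
    ("landline", re.compile(r"(?<!\d)0\d{2,3}-?\d{7,8}(?!\d)")),
]


def clean_policy_text(text: str) -> str:
    """Remove the identifier types listed in step 2 and tidy leftover spaces."""
    for _name, pattern in PATTERNS:
        text = pattern.sub("", text)
    # Collapse runs of spaces or tabs left behind by the removals.
    return re.sub(r"[ \t]{2,}", " ", text).strip()


if __name__ == "__main__":
    sample = (
        "Contact: zhang@example.gov.cn, tel 0571-87654321, mobile 13812345678. "
        "See https://www.example.gov.cn/policy/1.html for details. "
        "ID 330102199001011234 server 192.168.1.10"
    )
    print(clean_policy_text(sample))
```

The digit patterns use lookarounds so that they only match standalone numbers of the expected length. A production cleaner would likely need stricter validation (for example, the ID checksum) to avoid deleting legitimate long numbers that appear in policy documents.
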
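Step 4 can be sketched in the same spirit. The example stores each summary with its vector in a SQLite table and ranks stored summaries by cosine similarity to a query. Everything named here is hypothetical: the `policy_summary` table, `store_summaries`, `search`, and especially `toy_embed`, a stand-in that only counts character buckets so the example runs without downloading a model. The source names bge-large-zh-v1.5 as the embedding model but does not say which database or schema is used.

```python
import json
import math
import sqlite3
from typing import Callable, List, Sequence, Tuple

Embedder = Callable[[Sequence[str]], List[List[float]]]


def toy_embed(texts: Sequence[str]) -> List[List[float]]:
    """Stand-in embedder (assumption): L2-normalized character-bucket counts.

    This is NOT bge-large-zh-v1.5; it only keeps the sketch self-contained.
    """
    dim = 16
    vectors = []
    for text in texts:
        vec = [0.0] * dim
        for ch in text:
            vec[ord(ch) % dim] += 1.0
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vectors


def store_summaries(conn: sqlite3.Connection, summaries: Sequence[str],
                    embed: Embedder) -> None:
    """Embed each summary and store text and vector side by side."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS policy_summary ("
        "id INTEGER PRIMARY KEY, summary TEXT NOT NULL, vector TEXT NOT NULL)"
    )
    vectors = embed(summaries)
    conn.executemany(
        "INSERT INTO policy_summary (summary, vector) VALUES (?, ?)",
        [(s, json.dumps(v)) for s, v in zip(summaries, vectors)],
    )
    conn.commit()


def search(conn: sqlite3.Connection, query: str, embed: Embedder,
           top_k: int = 3) -> List[Tuple[float, str]]:
    """Return the top_k (score, summary) pairs by cosine similarity."""
    q = embed([query])[0]
    scored = []
    for summary, vector_json in conn.execute(
        "SELECT summary, vector FROM policy_summary"
    ):
        v = json.loads(vector_json)
        # Vectors are unit length, so the dot product is the cosine similarity.
        scored.append((sum(a * b for a, b in zip(q, v)), summary))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    store_summaries(conn, ["Subsidy policy for small enterprises",
                           "Rules on talent housing"], toy_embed)
    print(search(conn, "talent housing rules", toy_embed, top_k=1))
```

To use the real model, `toy_embed` could be replaced by a wrapper around an embedding library, for example sentence-transformers with `SentenceTransformer("BAAI/bge-large-zh-v1.5").encode(texts, normalize_embeddings=True).tolist()`, assuming that package and model checkpoint are available. Storing vectors as JSON text keeps the sketch dependency-free; a dedicated vector store would be the more usual choice at scale.
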
Provider:

浙数城市大脑（浙江）有限公司

Created:

2024-11-11
