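Kingfu Intelligent Data Query Training Dataset (金赋智能用数训练数据集)

Name: Kingfu Intelligent Data Query Training Dataset (金赋智能用数训练数据集)
Creator: Guangdong Kingfu Technology Co., Ltd. (广东金赋科技股份有限公司)
Published: 2024-08-19 00:00:00
License: not specified

Source: Guangdong Provincial Data Intellectual Property Deposit and Registration Platform (广东省数据知识产权存证登记平台). Updated 2024-08-19; indexed 2024-09-15.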

Download link:

https://data.gpic.gd.cn/dataStorage/credentialInfo.jhtml?no=20240844000004673

Resource description:

Many datasets already exist for general-purpose large models, but those models are plain language models trained only on textual knowledge. They can offer chatbot-style services but cannot carry out concrete tasks. Tasks in specific scenarios call for vertical (domain-specific) large models, and training datasets for such scenarios are relatively scarce. This dataset was produced in-house by Kingfu Technology while training its own vertical NL2SQL large model.

The dataset mainly covers data-query scenarios. Each record includes the instruction, labels, the question, the fields used, knowledge, few-shot content, the reference-answer SQL, the database name, the category, and the difficulty, among others. The standardized SQL is classified by keyword and, according to the different cases and combinations of SQL DML, divided into simple, general, and complex queries. The instruction is engineered differently depending on the case. A mechanism evaluates the generated SQL by comparing it with the reference-answer SQL; where the difference is large, a human feedback mechanism feeds the result back to the model for adjustment, continuously improving accuracy.

The dataset can be used to train vertical NL2SQL large models; training on high-quality data improves the accuracy of such models in converting natural language to SQL. It lets governments and enterprises obtain data efficiently through conversation, without involving technical staff, so that the data asked for is the data shown.
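The description does not publish the keyword rules or the comparison metric, so the following Python sketch only illustrates the general idea under assumed rules. The keyword groups, the function names `classify_difficulty`, `sql_similarity`, and `needs_human_review`, and the 0.8 review threshold are all hypothetical and are not the dataset's actual method.

```python
import re
from difflib import SequenceMatcher

# Hypothetical keyword groups; the dataset's real rules are not published.
COMPLEX_KEYWORDS = {"JOIN", "UNION", "HAVING", "WITH", "OVER"}
GENERAL_KEYWORDS = {"GROUP", "ORDER", "DISTINCT", "LIMIT"}
AGGREGATES = {"COUNT", "SUM", "AVG", "MIN", "MAX"}


def tokenize(sql):
    """Upper-case the SQL and split it into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", sql.upper())


def classify_difficulty(sql):
    """Assign 'simple', 'general', or 'complex' from the keywords present."""
    tokens = tokenize(sql)
    words = set(tokens)
    # Assumption: a second SELECT (a subquery) also makes a query complex.
    if words & COMPLEX_KEYWORDS or tokens.count("SELECT") > 1:
        return "complex"
    if words & (GENERAL_KEYWORDS | AGGREGATES):
        return "general"
    return "simple"


def sql_similarity(generated, reference):
    """Token-level similarity in [0, 1]; 1.0 means identical token sequences."""
    matcher = SequenceMatcher(None, tokenize(generated), tokenize(reference))
    return matcher.ratio()


def needs_human_review(generated, reference, threshold=0.8):
    """Flag a generated query whose similarity to the reference is too low."""
    return sql_similarity(generated, reference) < threshold


if __name__ == "__main__":
    reference = "SELECT dept, COUNT(*) FROM emp GROUP BY dept"
    generated = "select dept, count(*) from emp group by dept"
    print(classify_difficulty(reference))
    print(needs_human_review(generated, reference))
```

Token similarity is only a rough proxy: two queries can differ textually and still return the same rows, so a production check would more likely compare execution results against the target database.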

Provided by:

Guangdong Kingfu Technology Co., Ltd. (广东金赋科技股份有限公司)

Created:

2024-08-19

Collected summary

Dataset introduction

Features

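The Kingfu Intelligent Data Query Training Dataset is a training dataset designed specifically for vertical NL2SQL large models, intended to improve the accuracy of natural-language-to-SQL conversion through high-quality data. It contains a rich set of fields suited to a variety of business analysis scenarios. The data is in CSV format, is updated monthly, and totals 3,100 records.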
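Since the data is distributed as CSV, a loader might look like the sketch below. The real header is not shown on this page, so the column names in `EXPECTED_COLUMNS` are assumptions derived from the field list in the description, and the single sample row is made up purely to make the example runnable.

```python
import csv
import io
from collections import Counter

# Hypothetical column names based on the described fields;
# the actual CSV header is not published on this page.
EXPECTED_COLUMNS = [
    "instruction", "label", "question", "fields", "knowledge",
    "fewshot", "reference_sql", "database", "category", "difficulty",
]


def load_records(csv_file):
    """Read all rows from an open CSV file object and check the header."""
    reader = csv.DictReader(csv_file)
    missing = set(EXPECTED_COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return list(reader)


def difficulty_counts(records):
    """Count records per difficulty level."""
    return Counter(row["difficulty"] for row in records)


# Made-up row for illustration only; it is not taken from the dataset.
sample_row = {
    "instruction": "Translate the question into SQL.",
    "label": "sales",
    "question": "How many orders are there?",
    "fields": "orders.id",
    "knowledge": "",
    "fewshot": "",
    "reference_sql": "SELECT COUNT(*) FROM orders",
    "database": "shop",
    "category": "aggregation",
    "difficulty": "general",
}

# Write the sample to an in-memory buffer, then read it back as a file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=EXPECTED_COLUMNS)
writer.writeheader()
writer.writerow(sample_row)
buffer.seek(0)

records = load_records(buffer)
print(difficulty_counts(records))
```

For a file on disk, the same function can be called as `load_records(open(path, newline="", encoding="utf-8"))`, where the UTF-8 encoding is also an assumption.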

The content above was collected and summarized by 遇见数据集.