OpenDCAI/dataflow-demo-Text2SQL
Source: Hugging Face | Updated: 2026-02-04 | Indexed: 2026-01-03
Download link:
https://hf-mirror.com/datasets/OpenDCAI/dataflow-demo-Text2SQL
Description:
This dataset is part of the DataFlow project. It consists of several JSON splits that illustrate common Text-to-SQL training data formats, covering raw inputs, refined outputs, and augmented samples. It can be used to train large language models for Text-to-SQL generation and to improve their generalization on Text-to-SQL tasks.
The splits are input_example (400 records, showing the seed data format), output_example (1368 records, showing the augmented data format), sqlflow_bird (37521 records, derived from the Bird training dataset), sqlflow_ehrsql (14491 records, derived from the EHRSQL training dataset), and sqlflow_spider (37537 records, derived from the Spider training dataset).
The dataset's fields include the database identifier (db_id), natural language question (question), SQL query (sql), reasoning trace (cot), external knowledge (external_knowledge), full prompt context (prompt), question style tag (question_style), and difficulty annotations (sql_component_difficulty and sql_execution_difficulty), among others.
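To make the record layout described above concrete, here is a minimal sketch that parses one split and reports which of the listed fields a record lacks. It uses only the Python standard library and assumes the split has already been downloaded as text. The helper names (`load_records`, `missing_fields`, `count_by`) and the sample record are hypothetical and made up for illustration; the on-disk layout (JSON array versus JSON Lines) and whether every split carries every field are also assumptions, so the code tolerates both.

```python
import json
from collections import Counter

# Field names as listed in the description above. Whether every split
# carries every field is an assumption, so missing keys are tolerated.
EXPECTED_FIELDS = (
    "db_id", "question", "sql", "cot", "external_knowledge",
    "prompt", "question_style",
    "sql_component_difficulty", "sql_execution_difficulty",
)

# Record counts per split, copied from the description above.
SPLIT_SIZES = {
    "input_example": 400,
    "output_example": 1368,
    "sqlflow_bird": 37521,
    "sqlflow_ehrsql": 14491,
    "sqlflow_spider": 37537,
}


def load_records(text):
    """Parse a split given either as a JSON array or as JSON Lines."""
    text = text.strip()
    if text.startswith("["):
        return json.loads(text)
    # Otherwise treat each non-empty line as one JSON object.
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def missing_fields(record):
    """Return the expected field names that are absent from a record."""
    return [name for name in EXPECTED_FIELDS if name not in record]


def count_by(records, field):
    """Count records per value of a field, e.g. per db_id."""
    return Counter(r.get(field, "<missing>") for r in records)


# Hypothetical record for illustration only; it is NOT taken from the dataset.
sample = json.dumps([{
    "db_id": "demo_db",
    "question": "How many users are there?",
    "sql": "SELECT COUNT(*) FROM users;",
}])

records = load_records(sample)
print(missing_fields(records[0]))   # fields this seed-style record lacks
print(count_by(records, "db_id"))   # number of records per database
print(sum(SPLIT_SIZES.values()))    # total records across all five splits
```

In practice, `sample` would be replaced by the contents of a downloaded split file, for example `open(path, encoding="utf-8").read()`. Running `missing_fields` over a seed-style split and an augmented split is a quick way to see which fields the augmentation step adds.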
Provider:
OpenDCAI



