
synthetic_text_to_sql

ModelScope Community | updated 2026-05-15 | indexed 2024-05-15
Download link:
https://modelscope.cn/datasets/AI-ModelScope/synthetic_text_to_sql
Resource description:
<center>
<img src="https://cdn-uploads.huggingface.co/production/uploads/5e39c39bf55e2b62848a520f/r1h33ovUdfqsS_nh15hv1.webp" alt="gretelai/synthetic_text_to_sql v1" width="600px">
<p><em>Image generated by DALL-E. See <a href="https://huggingface.co/datasets/gretelai/synthetic_text_to_sql/blob/main/dalle_prompt.txt">prompt</a> for more details</em></p>
</center>

# synthetic_text_to_sql
<!-- Provide a quick summary of the dataset. -->

**gretelai/synthetic_text_to_sql** is a rich dataset of high-quality synthetic Text-to-SQL samples, designed and generated using [Gretel Navigator](https://gretel.ai/gretel-navigator), and released under Apache 2.0. Please see our [release blogpost](https://gretel.ai/blog/synthetic-text-to-sql-dataset) for more details.

The dataset includes:
<ul>
<li>105,851 records partitioned into 100,000 train and 5,851 test records</li>
<li>~23M total tokens, including ~12M SQL tokens</li>
<li>Coverage across 100 distinct domains/verticals</li>
<li>Comprehensive array of SQL tasks: data definition, retrieval, manipulation, analytics & reporting</li>
<li>Wide range of SQL complexity levels, including subqueries, single joins, multiple joins, aggregations, window functions, set operations</li>
<li>Database context, including table and view create statements</li>
<li>Natural language explanations of what the SQL query is doing</li>
<li>Contextual tags to optimize model training</li>
</ul>

As of April 2024, the gretelai/synthetic_text_to_sql dataset stands as the largest and most diverse synthetic Text-to-SQL dataset available to date. It is not just a milestone in the world of synthetic data; it's an invitation to the broader AI community. We invite developers, researchers, and data enthusiasts to take the dataset for a spin and build upon it. If you end up using this dataset, drop us a note in the [Synthetic Data Discord](https://gretel.ai/discord) community. We'd love to hear what you are building!

This release is also merely a glimpse into the capabilities of Gretel. The real value of synthetic data lies in the ability to design and iterate on data to address specific data gaps, incorporate unique business logic, and to infuse with use-case-specific context. We invite you to explore Gretel tools and capabilities to accelerate your journey towards [data-centric AI](https://datacentricai.org/).

## Dataset Details

### Schema

The dataset includes 11 fields shown below:

<img src="https://cdn-uploads.huggingface.co/production/uploads/5e39c39bf55e2b62848a520f/DrD6dqAOBuSr7xsXir9ku.png" width="600px">

### Example

```
{
  "id": 39325,
  "domain": "public health",
  "domain_description": "Community health statistics, infectious disease tracking data, healthcare access metrics, and public health policy analysis.",
  "sql_complexity": "aggregation",
  "sql_complexity_description": "aggregation functions (COUNT, SUM, AVG, MIN, MAX, etc.), and HAVING clause",
  "sql_task_type": "analytics and reporting",
  "sql_task_type_description": "generating reports, dashboards, and analytical insights",
  "sql_prompt": "What is the total number of hospital beds in each state?",
  "sql_context": "CREATE TABLE Beds (State VARCHAR(50), Beds INT); INSERT INTO Beds (State, Beds) VALUES ('California', 100000), ('Texas', 85000), ('New York', 70000);",
  "sql": "SELECT State, SUM(Beds) FROM Beds GROUP BY State;",
  "sql_explanation": "This query calculates the total number of hospital beds in each state in the Beds table. It does this by using the SUM function on the Beds column and grouping the results by the State column."
}
```

### Dataset Description

<center>
<img src="https://cdn-uploads.huggingface.co/production/uploads/5e39c39bf55e2b62848a520f/JhBjtBsy7TYSqUZkqsN2e.png" alt="dataset features" width="600px">
<p>Breakdown of text to SQL dataset features and corresponding data types and token counts</p>
</center>

<center>
<img src="https://cdn-uploads.huggingface.co/production/uploads/5e39c39bf55e2b62848a520f/-1W1Xn1zEcg-VXLsbz3od.png" alt="sql complexity breakdown" width="900px">
<p>Breakdown by SQL complexity</p>
</center>

<center>
<img src="https://cdn-uploads.huggingface.co/production/uploads/5e39c39bf55e2b62848a520f/f7mdpPHGCyT5z3Amr8OPk.png" alt="sql complexity breakdown" width="700px">
<p>Breakdown by SQL task type</p>
</center>

<center>
<img src="https://cdn-uploads.huggingface.co/production/uploads/5e39c39bf55e2b62848a520f/kdukRodUbleA-4DzOVHBf.png" alt="domain distribution" width="900px">
<p>Domain Distribution</p>
</center>

<center>
<img src="https://cdn-uploads.huggingface.co/production/uploads/5e39c39bf55e2b62848a520f/wVvE3Mbi_0nwwD90qCaFG.png" alt="token distributions" width="900px">
<p>Token Distributions</p>
</center>

<center>
<img src="https://cdn-uploads.huggingface.co/production/uploads/5e39c39bf55e2b62848a520f/hGnc5m0xehY2LZksnvrwS.png" alt="word clouds" width="900px">
<p>Word clouds for the natural language prompt, database context, SQL, and SQL explanation</p>
</center>

### Data Quality Assessment

In order to assess the quality of our Text-to-SQL data, we leveraged the [LLM-as-a-judge technique](https://arxiv.org/pdf/2306.05685.pdf) (see also our [blog](https://gretel.ai/blog/synthetic-text-to-sql-dataset) for more details). We holistically evaluate the quality of SQL across 1,000 randomly chosen samples of data.
We use GPT-4 to score samples from our Text-to-SQL dataset and compare results to 1,000 randomly chosen samples from the [b-mc2/sql-create-context](https://huggingface.co/datasets/b-mc2/sql-create-context) dataset, which is an extension of the [Spider](https://huggingface.co/datasets/spider) dataset, and includes database context for an apples-to-apples comparison.

We observe that our dataset consistently scores higher on:

- Compliance with SQL Standards: +54.6%
- SQL Correctness: +34.5%
- Adherence to Instructions: +8.5%

<center>
<img src="https://cdn-uploads.huggingface.co/production/uploads/5e39c39bf55e2b62848a520f/2MFedbL0cEqm12q6Wpzn8.png" alt="LLM-as-a-judge evaluation" width="900px">
<p>LLM-as-a-judge comparison of gretelai/synthetic_text_to_sql with b-mc2/sql-create-context dataset across five different criteria: (i) Adherence to Instructions, (ii) SQL Correctness, (iii) Readability and Maintainability, (iv) Scalability, and (v) Compliance with Standards</p>
</center>

See the [grading rubric](https://huggingface.co/datasets/gretelai/synthetic_text_to_sql/blob/main/llm_as_a_judge_rubric.txt) with explicit criteria used for the LLM-as-a-judge evaluation.
We also include two examples of LLM judgements for the b-mc2/sql-create-context dataset:

- [example 1](https://huggingface.co/datasets/gretelai/synthetic_text_to_sql/blob/main/bmc2_llm_judge_example_1.txt)
- [example 2](https://huggingface.co/datasets/gretelai/synthetic_text_to_sql/blob/main/bmc2_llm_judge_example_2.txt)

In addition to the above, the parsability and validity of SQL in both the sql_context and sql fields have been verified using a Python SQL parser/transpiler, [sqlglot](https://github.com/tobymao/sqlglot), and a SQL format/syntax/semantics validator, [sqlvalidator](https://github.com/David-Wobrock/sqlvalidator):

<center>
<img src="https://cdn-uploads.huggingface.co/production/uploads/5e39c39bf55e2b62848a520f/5yfffwTxZiIJ58fwwvopC.png" width="700px">
<p>Breakdown of SQL parsability and validity for gretelai/synthetic_text_to_sql and b-mc2/sql-create-context</p>
</center>

## Citation

```
@software{gretel-synthetic-text-to-sql-2024,
  author = {Meyer, Yev and Emadi, Marjan and Nathawani, Dhruv and Ramaswamy, Lipika and Boyd, Kendrick and Van Segbroeck, Maarten and Grossman, Matthew and Mlocek, Piotr and Newberry, Drew},
  title = {{Synthetic-Text-To-SQL}: A synthetic dataset for training language models to generate SQL queries from natural language prompts},
  month = {April},
  year = {2024},
  url = {https://huggingface.co/datasets/gretelai/synthetic-text-to-sql}
}
```
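
## Executing a record locally (illustrative sketch)

The parser-based checks described under Data Quality Assessment can be complemented by actually running a record. The sketch below is not part of the dataset's own tooling: it takes the `sql_context` and `sql` values from the record shown in the Example section and executes them against an in-memory SQLite database from Python's standard library. The helper name `run_record` is hypothetical, and SQLite is an assumption made here for convenience; the card does not state a target SQL dialect, so other records may need a different engine.

```python
import sqlite3

# Two fields copied from the record in the "Example" section.
record = {
    "sql_context": (
        "CREATE TABLE Beds (State VARCHAR(50), Beds INT); "
        "INSERT INTO Beds (State, Beds) VALUES "
        "('California', 100000), ('Texas', 85000), ('New York', 70000);"
    ),
    "sql": "SELECT State, SUM(Beds) FROM Beds GROUP BY State;",
}


def run_record(rec):
    """Hypothetical helper: build the schema from sql_context, then run sql.

    Uses a fresh in-memory SQLite database (an assumption; the dataset
    does not declare a dialect) and returns all result rows.
    """
    conn = sqlite3.connect(":memory:")
    try:
        # sql_context may hold several statements, so use executescript.
        conn.executescript(rec["sql_context"])
        return conn.execute(rec["sql"]).fetchall()
    finally:
        conn.close()


# Sort so the printed order does not depend on the engine's grouping order.
rows = sorted(run_record(record))
for state, total in rows:
    print(state, total)
```

A record that raises `sqlite3.Error` here is not necessarily invalid SQL; it may simply use syntax that SQLite does not support.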

Provider:
maas
Created:
2024-04-09