
DanielRegaladoCardoso/text-to-sql-mix-v2

Hugging Face · Updated 2026-04-14 · Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/DanielRegaladoCardoso/text-to-sql-mix-v2
Description:
---
language:
- en
- zh
license: apache-2.0
task_categories:
- text-generation
task_ids:
- language-modeling
tags:
- sql
- text-to-sql
- code-generation
- instruction-tuning
- database
- nl2sql
pretty_name: Text-to-SQL Training Mix v2
size_categories:
- 100K<n<1M
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
---

## 🔗 Part of the SQL Agent LLMOps project

This dataset is one of three purpose-built training mixes for the [**SQL Agent LLMOps**](https://github.com/DanielRegaladoUMiami/sql-agent-llmops) project, an end-to-end pipeline that converts natural-language questions into SQL, executes the query on user data, and renders a storytelling-grade visualization.

| Dataset | Model trained | Role |
|---------|---------------|------|
| 🤗 [`DanielRegaladoCardoso/text-to-sql-mix-v2`](https://huggingface.co/datasets/DanielRegaladoCardoso/text-to-sql-mix-v2) | Qwen 2.5 Coder 7B | NL question → SQL |
| 🤗 [`DanielRegaladoCardoso/chart-reasoning-mix-v1`](https://huggingface.co/datasets/DanielRegaladoCardoso/chart-reasoning-mix-v1) | Phi-3 Mini 3.8B | (question + result schema) → chart spec |
| 🤗 [`DanielRegaladoCardoso/svg-chart-render-v1`](https://huggingface.co/datasets/DanielRegaladoCardoso/svg-chart-render-v1) | DeepSeek Coder 1.3B | chart spec → inline SVG |

Full build pipelines (all open-source, UV-executable):

- [`training/data_pipelines/build_sql_mix.py`](https://github.com/DanielRegaladoUMiami/sql-agent-llmops/blob/main/training/data_pipelines/build_sql_mix.py)
- [`training/data_pipelines/build_chart_mix.py`](https://github.com/DanielRegaladoUMiami/sql-agent-llmops/blob/main/training/data_pipelines/build_chart_mix.py)
- [`training/data_pipelines/build_svg_mix.py`](https://github.com/DanielRegaladoUMiami/sql-agent-llmops/blob/main/training/data_pipelines/build_svg_mix.py)

Training notebooks (Unsloth + Colab, ready to run):

- [`train_sql_generator.ipynb`](https://github.com/DanielRegaladoUMiami/sql-agent-llmops/blob/main/training/notebooks/train_sql_generator.ipynb)
- [`train_chart_reasoner.ipynb`](https://github.com/DanielRegaladoUMiami/sql-agent-llmops/blob/main/training/notebooks/train_chart_reasoner.ipynb)
- [`train_svg_renderer.ipynb`](https://github.com/DanielRegaladoUMiami/sql-agent-llmops/blob/main/training/notebooks/train_svg_renderer.ipynb)

# 🗃️ Text-to-SQL Training Mix · v2

A curated, deduplicated and quality-filtered mix of **10 open text-to-SQL** datasets, designed for fine-tuning code LLMs (Qwen 2.5 Coder, DeepSeek Coder, Llama-3, etc.) on natural-language-to-SQL generation.

> Powers the SQL Generator in the
> [SQL Agent LLMOps](https://github.com/DanielRegaladoUMiami/sql-agent-llmops) project.

| 📊 Total | 🧹 Post-dedup kept | 🧪 Filter pass rate |
|-----------|--------------------|---------------------|
| **761,155** rows | ~72% of raw | ~99% sqlglot-parseable |

## ✨ What makes this mix different

- **10 sources combined**, spanning academic benchmarks (Spider, BIRD, WikiSQL, MIMIC-III, ATIS), synthetic generators (Gretel), community instruct sets, and a DuckDB-dialect collection from the MotherDuck team.
- **Single normalized schema**: all rows share `instruction`, `schema_context`, `sql`, plus `source`, `dialect`, `difficulty` metadata.
- **Aggressively deduplicated** with MD5 over `(instruction, sql)`, which removes ~28% of raw rows that overlap across sources.
- **SQL-validated** via [`sqlglot`](https://github.com/tobymao/sqlglot); unparseable rows are dropped.
- **Heuristic difficulty labels** (`easy` / `medium` / `hard`) based on joins, CTEs, window functions, nested SELECTs.
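The deduplication and difficulty-labelling bullets above can be illustrated with a small sketch. It is not the project's `build_sql_mix.py`: the function names, the field separator inside the hash, and the scoring thresholds are all assumptions made for illustration. The only details taken from this card are that the key is an MD5 over the lowercased `(instruction, sql)` pair and that the labels depend on joins, CTEs, window functions, and nested SELECTs.

```python
import hashlib
import re


def dedup_key(instruction: str, sql: str) -> str:
    """MD5 over the lowercased (instruction, sql) pair.

    The "\x1f" separator is an assumption; it only keeps the two fields apart.
    """
    payload = instruction.lower() + "\x1f" + sql.lower()
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


def deduplicate(rows):
    """Keep the first row seen for each key, preserving input order."""
    seen = set()
    kept = []
    for row in rows:
        key = dedup_key(row["instruction"], row["sql"])
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


def guess_difficulty(sql: str) -> str:
    """Hypothetical surface-feature scoring; the real thresholds may differ."""
    text = sql.lower()
    joins = len(re.findall(r"\bjoin\b", text))
    has_cte = bool(re.match(r"\s*with\b", text))
    has_window = bool(re.search(r"\bover\s*\(", text))
    # Every SELECT beyond the first is counted as one level of nesting.
    nested = max(len(re.findall(r"\bselect\b", text)) - 1, 0)
    score = joins + nested + 2 * has_cte + 2 * has_window
    if score == 0:
        return "easy"
    return "medium" if score == 1 else "hard"


rows = [
    {"instruction": "Count users", "sql": "SELECT COUNT(*) FROM users"},
    {"instruction": "count users", "sql": "select count(*) from users"},
    {"instruction": "List names", "sql": "SELECT name FROM users"},
]
unique_rows = deduplicate(rows)  # the second row differs only in case and is dropped
```

Because the key is an exact hash, two paraphrased questions with the same SQL still produce different keys, so only literal (case-insensitive) repeats are removed.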
## 📐 Schema

| Field | Type | Description |
|-------|------|-------------|
| `id` | `string` | Stable hash-based identifier (`<source>-<hash>`) |
| `instruction` | `string` | Natural-language question |
| `schema_context` | `string` | Database schema, typically one or more `CREATE TABLE` statements (may be empty for a small minority of rows) |
| `sql` | `string` | Target SQL query, parseable by `sqlglot` |
| `source` | `string` | Origin tag (see attribution table below) |
| `dialect` | `string` | `generic`, `postgres`, `sqlite`, `duckdb` |
| `difficulty` | `string` | `easy`, `medium`, `hard`, or `unknown` |

## 📦 Splits

| Split | Rows |
|-------|------|
| `train` | 723,097 |
| `validation` | 19,029 |
| `test` | 19,029 |

Ratios: **95 / 2.5 / 2.5** (train / validation / test). Random seed `42`.

## 🔬 Quick preview

### Example 1 · `nstext2sql-sql_create_context` · _easy_

**Question**

> -- What is the lowest silver that has 1 for the bronze, 1 as the total, 17 as the rank, with a gold less than 0?

**Schema**

```sql
CREATE TABLE table_name_15 (
    silver INTEGER,
    gold VARCHAR,
    rank VARCHAR,
    bronze VARCHAR,
    total VARCHAR
)

-- Using valid SQLite, answer the following questions for the tables provided above.
```

**SQL**

```sql
SELECT MIN(silver) FROM table_name_15 WHERE bronze = 1 AND total = 1 AND rank = "17" AND gold < 0
```

### Example 2 · `clinton-text2sql` · _medium_

**Question**

> What is the average age for each gender. Visualize by bar chart.

**Schema**

```sql
CREATE TABLE Person (
    name varchar(20),
    age INTEGER,
    city TEXT,
    gender TEXT,
    job TEXT
)

CREATE TABLE PersonFriend (
    name varchar(20),
    friend varchar(20),
    year INTEGER
)
```

**SQL**

```sql
SELECT gender, AVG(age) FROM Person GROUP BY gender
```

### Example 3 · `gretel-synthetic` · _easy_

**Question**

> Delete records of community health workers who do not have a valid work permit.
**Schema**

```sql
CREATE TABLE community_health_workers (id INT, name VARCHAR, work_permit BOOLEAN);
INSERT INTO community_health_workers (id, name, work_permit) VALUES (1, 'John Doe', TRUE), (2, 'Jane Smith', FALSE);
```

**SQL**

```sql
DELETE FROM community_health_workers WHERE work_permit = FALSE;
```

## 🌎 Language

Most rows are **English**. A meaningful portion of rows imported from NSText2SQL subsets (notably `nstext2sql-css`, `nstext2sql-mimicsql_data`, some medical subsets) are in **Chinese**, reflecting real-world multilingual SQL corpora. Filter by `source` if you need English-only training.

## 📊 Composition

### By source

| Source tag | Rows | Share |
|------------|------|-------|
| `clinton-text2sql` | 182,252 | 23.9% |
| `gretel-synthetic` | 99,927 | 13.1% |
| `kaxap-llama2` | 81,358 | 10.7% |
| `nstext2sql-wikisql` | 80,442 | 10.6% |
| `sql-create-context` | 78,387 | 10.3% |
| `nstext2sql-sql_create_context` | 75,774 | 10.0% |
| `nstext2sql-nvbench` | 22,929 | 3.0% |
| `motherduck-duckdb` | 22,472 | 3.0% |
| `nstext2sql-css` | 22,050 | 2.9% |
| `nstext2sql-mimicsql_data` | 19,986 | 2.6% |
| `pipable-spider-bird` | 14,018 | 1.8% |
| `nstext2sql-squall` | 10,595 | 1.4% |
| `nstext2sql-sede` | 10,027 | 1.3% |
| `bugdaryan-spider-wikisql` | 7,744 | 1.0% |
| `nstext2sql-eicu` | 7,726 | 1.0% |
| `nstext2sql-mimic_iii` | 7,639 | 1.0% |
| `nstext2sql-spider` | 4,948 | 0.7% |
| `nstext2sql-atis` | 4,894 | 0.6% |
| `nstext2sql-advising` | 4,381 | 0.6% |
| `nstext2sql-criteria2sql` | 1,997 | 0.3% |
| `nstext2sql-scholar` | 768 | 0.1% |
| `nstext2sql-academic` | 185 | 0.0% |
| `nstext2sql-imdb` | 131 | 0.0% |
| `nstext2sql-yelp` | 128 | 0.0% |
| `nstext2sql-restaurants` | 125 | 0.0% |
| `nstext2sql-pesticide` | 50 | 0.0% |
| `nstext2sql-whatcdhiphop` | 41 | 0.0% |
| `nstext2sql-thehistoryofbaseball` | 39 | 0.0% |
| `nstext2sql-uswildfires` | 37 | 0.0% |
| `nstext2sql-geonucleardata` | 32 | 0.0% |
| `nstext2sql-studentmathscore` | 28 | 0.0% |
| `nstext2sql-greatermanchestercrime` | 27 | 0.0% |
| `nstext2sql-worldsoccerdatabase` | 18 | 0.0% |

### By difficulty

| Difficulty | Rows | Share |
|------------|------|-------|
| `easy` | 544,047 | 71.5% |
| `hard` | 109,302 | 14.4% |
| `medium` | 107,806 | 14.2% |

### By dialect

| Dialect | Rows | Share |
|---------|------|-------|
| `generic` | 616,994 | 81.1% |
| `postgres` | 99,927 | 13.1% |
| `duckdb` | 22,472 | 3.0% |
| `sqlite` | 21,762 | 2.9% |

## 📜 Source attribution

This dataset is a derivative combining the sources below. **Please check each upstream license before commercial use**: the Apache-2.0 license on this mix covers the build pipeline and curation, not the underlying row content, which remains under its original license.

| Source | Tag | License | Notes |
|--------|-----|---------|-------|
| [b-mc2/sql-create-context](https://huggingface.co/datasets/b-mc2/sql-create-context) | `sql-create-context` | CC-BY-4.0 | Builds on Spider and WikiSQL; adds CREATE TABLE context. |
| [gretelai/synthetic_text_to_sql](https://huggingface.co/datasets/gretelai/synthetic_text_to_sql) | `gretel-synthetic` | Apache-2.0 | 105k synthetic examples across 11 domains (finance, healthcare, retail, …). |
| [knowrohit07/know_sql](https://huggingface.co/datasets/knowrohit07/know_sql) | `know_sql` | Apache-2.0 | Compact and clean. Fully deduplicated into sql-create-context in v2. |
| [Clinton/Text-to-sql-v1](https://huggingface.co/datasets/Clinton/Text-to-sql-v1) | `clinton-text2sql` | Apache-2.0 | Large instruction-tuned SQL dataset (262k → 173k post-dedup). |
| [NumbersStation/NSText2SQL](https://huggingface.co/datasets/NumbersStation/NSText2SQL) | `nstext2sql-*` | See source | 290k multi-dialect examples aggregated from 20+ upstream datasets (Spider, WikiSQL, MIMIC-III, ATIS, eICU, …). |
| [ChrisHayduk/Llama-2-SQL-Dataset](https://huggingface.co/datasets/ChrisHayduk/Llama-2-SQL-Dataset) | `hayduk-llama2-sql` | Apache-2.0 | Llama-2 instruct format. Fully deduplicated into overlapping sources in v2. |
| [motherduckdb/duckdb-text2sql-25k](https://huggingface.co/datasets/motherduckdb/duckdb-text2sql-25k) | `motherduck-duckdb` | CC-BY-4.0 | 25k DuckDB-dialect SQL by the MotherDuck team. |
| [PipableAI/pip-txt-to-sql-spider-bird-dataset](https://huggingface.co/datasets/PipableAI/pip-txt-to-sql-spider-bird-dataset) | `pipable-spider-bird` | Apache-2.0 | Spider + BIRD benchmarks with inline `CREATE TABLE` schemas. |
| [kaxap/llama2-sql-instruct](https://huggingface.co/datasets/kaxap/llama2-sql-instruct) | `kaxap-llama2` | Apache-2.0 | Llama-2 `[INST]` format; unpacked into schema + question + SQL. |
| [bugdaryan/spider-natsql-wikisql-instruct](https://huggingface.co/datasets/bugdaryan/spider-natsql-wikisql-instruct) | `bugdaryan-spider-wikisql` | Apache-2.0 | Spider + NatSQL + WikiSQL packed into Alpaca format. |

## 🛠️ Build pipeline

Every row in this dataset is produced by this deterministic pipeline:

1. **Download** each source from the Hugging Face Hub.
2. **Normalize** into the common schema shown above, including regex-based unpacking of packed Alpaca / Llama-2 formats.
3. **Filter**: drop rows where SQL is unparseable by `sqlglot`, or `len(instruction) > 2000` / `len(sql) > 4000`.
4. **Deduplicate** by MD5 over `(lower(instruction), lower(sql))`.
5. **Difficulty tagging**: heuristics on JOINs, CTEs, window funcs, nesting.
6. **Shuffle + 95 / 2.5 / 2.5 split** (seed 42).

Build script is open source: [`training/data_pipelines/build_sql_mix.py`](https://github.com/DanielRegaladoUMiami/sql-agent-llmops/blob/main/training/data_pipelines/build_sql_mix.py)

## 🚀 Usage

### Quick start

```python
from datasets import load_dataset

ds = load_dataset("DanielRegaladoCardoso/text-to-sql-mix-v2")
print(ds["train"][0])
```

### SFT prompt template (Qwen / Llama-3 style)

```python
SYSTEM = "You are a SQL expert. Generate correct SQL given a schema."


def to_sft(row):
    """Convert one dataset row into a chat-style SFT example."""
    user = (
        f"### Schema\n{row['schema_context']}\n\n"
        f"### Question\n{row['instruction']}\n\n"
        "### SQL"
    )
    return {
        "messages": [
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": user},
            {"role": "assistant", "content": row["sql"]},
        ]
    }


# `ds` is the DatasetDict loaded in the quick start; the original columns are
# dropped so that only `messages` remains in every split.
ds = ds.map(to_sft, remove_columns=ds['train'].column_names)
```

### Filter to a subset (e.g. only hard PostgreSQL)

```python
# Keep only training rows tagged both `hard` and `postgres`.
# This filter does not check the question language.
hard = ds['train'].filter(
    lambda r: r['difficulty'] == 'hard' and r['dialect'] == 'postgres'
)
# The full mix has 99,927 `postgres` rows across all splits,
# so expect roughly that count times the share tagged `hard`.
```

## ⚠️ Known limitations

- **Heuristic difficulty**: the `difficulty` field is derived from surface features of the SQL, not from human judgment. Use as a rough guide only.
- **Schema quality varies**: some `schema_context` rows are a single `CREATE TABLE`, others are a full multi-table schema with foreign keys; a small minority (≈ 0.3% of rows) have an empty schema.
- **Dedup is exact, not semantic**: paraphrased questions with identical SQL are still counted as distinct rows.
- **SQL dialects are mixed**: `sqlglot` is dialect-agnostic for parsing, but fine-tuning may benefit from separating dialects by task.

## 📝 Citation

If you use this dataset, please cite both this mix and the original sources:

```bibtex
@dataset{regalado2026textsqlmix,
  author = {Regalado Cardoso, Daniel},
  title  = {Text-to-SQL Training Mix v2},
  year   = {2026},
  url    = {https://huggingface.co/datasets/DanielRegaladoCardoso/text-to-sql-mix-v2}
}
```

## ⚖️ License

The **build pipeline, curation, and metadata** (this repository's contents) are released under **Apache-2.0**. The **row content** inherits the license of its upstream source. See the attribution table for per-source licenses; favour the strictest license of the sources you actually use (filter by `source`).

---

_Built with ❤️ in Miami by Daniel Regalado Cardoso · MSBA candidate at the University of Miami · April 2026._
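The license and language notes above both suggest filtering by `source`. A minimal sketch of such a filter follows. The tag-to-license mapping is copied from the attribution table, while the helper names, the `"see-upstream"` and `"unknown"` placeholders, and the character-range check for Chinese text are assumptions made for illustration and are not part of the dataset or its build script.

```python
# License per source tag, as listed in the attribution table of this card.
SOURCE_LICENSES = {
    "sql-create-context": "CC-BY-4.0",
    "gretel-synthetic": "Apache-2.0",
    "clinton-text2sql": "Apache-2.0",
    "motherduck-duckdb": "CC-BY-4.0",
    "pipable-spider-bird": "Apache-2.0",
    "kaxap-llama2": "Apache-2.0",
    "bugdaryan-spider-wikisql": "Apache-2.0",
}


def license_for(source: str) -> str:
    """Return the listed license for a source tag."""
    # NSText2SQL subsets are listed as "See source", so they get a placeholder
    # (an assumption) that never matches a concrete license name.
    if source.startswith("nstext2sql-"):
        return "see-upstream"
    return SOURCE_LICENSES.get(source, "unknown")


def contains_cjk(text: str) -> bool:
    """Rough check (an assumption): any character in the CJK Unified Ideographs block."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in text)


def keep_row(row, allowed_licenses=frozenset({"Apache-2.0"}), english_only=True) -> bool:
    """Predicate for rows whose source license is allowed and, optionally, non-Chinese."""
    if license_for(row["source"]) not in allowed_licenses:
        return False
    if english_only and contains_cjk(row["instruction"]):
        return False
    return True


sample = {"source": "gretel-synthetic", "instruction": "Delete records of inactive users."}
print(keep_row(sample))
```

With the `datasets` library, the predicate could be applied as `ds["train"].filter(keep_row)`. Rows from `nstext2sql-*` sources are excluded by default here because their licenses have to be checked upstream one subset at a time.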
Provided by:
DanielRegaladoCardoso