DanielRegaladoCardoso/text-to-sql-mix-v2
Source: Hugging Face · Updated 2026-04-14 · Indexed 2026-04-26
Download link: https://hf-mirror.com/datasets/DanielRegaladoCardoso/text-to-sql-mix-v2
Resource overview:
---
language:
- en
- zh
license: apache-2.0
task_categories:
- text-generation
task_ids:
- language-modeling
tags:
- sql
- text-to-sql
- code-generation
- instruction-tuning
- database
- nl2sql
pretty_name: Text-to-SQL Training Mix v2
size_categories:
- 100K<n<1M
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
---
## 🔗 Part of the SQL Agent LLMOps project
This dataset is one of three purpose-built training mixes for the
[**SQL Agent LLMOps**](https://github.com/DanielRegaladoUMiami/sql-agent-llmops) project — an end-to-end pipeline that
converts natural-language questions into SQL, executes the query on user
data, and renders a storytelling-grade visualization.
| Dataset | Model trained | Role |
|---------|---------------|------|
| 🤗 [`DanielRegaladoCardoso/text-to-sql-mix-v2`](https://huggingface.co/datasets/DanielRegaladoCardoso/text-to-sql-mix-v2) | Qwen 2.5 Coder 7B | NL question → SQL |
| 🤗 [`DanielRegaladoCardoso/chart-reasoning-mix-v1`](https://huggingface.co/datasets/DanielRegaladoCardoso/chart-reasoning-mix-v1) | Phi-3 Mini 3.8B | (question + result schema) → chart spec |
| 🤗 [`DanielRegaladoCardoso/svg-chart-render-v1`](https://huggingface.co/datasets/DanielRegaladoCardoso/svg-chart-render-v1) | DeepSeek Coder 1.3B | chart spec → inline SVG |
Full build pipelines (all open-source, UV-executable):
- [`training/data_pipelines/build_sql_mix.py`](https://github.com/DanielRegaladoUMiami/sql-agent-llmops/blob/main/training/data_pipelines/build_sql_mix.py)
- [`training/data_pipelines/build_chart_mix.py`](https://github.com/DanielRegaladoUMiami/sql-agent-llmops/blob/main/training/data_pipelines/build_chart_mix.py)
- [`training/data_pipelines/build_svg_mix.py`](https://github.com/DanielRegaladoUMiami/sql-agent-llmops/blob/main/training/data_pipelines/build_svg_mix.py)
Training notebooks (Unsloth + Colab, ready to run):
- [`train_sql_generator.ipynb`](https://github.com/DanielRegaladoUMiami/sql-agent-llmops/blob/main/training/notebooks/train_sql_generator.ipynb)
- [`train_chart_reasoner.ipynb`](https://github.com/DanielRegaladoUMiami/sql-agent-llmops/blob/main/training/notebooks/train_chart_reasoner.ipynb)
- [`train_svg_renderer.ipynb`](https://github.com/DanielRegaladoUMiami/sql-agent-llmops/blob/main/training/notebooks/train_svg_renderer.ipynb)
# 🗃️ Text-to-SQL Training Mix · v2
A curated, deduplicated and quality-filtered mix of **10 open text-to-SQL**
datasets, designed for fine-tuning code LLMs (Qwen 2.5 Coder, DeepSeek Coder,
Llama-3, etc.) on natural-language-to-SQL generation.
> Powers the SQL Generator in the
> [SQL Agent LLMOps](https://github.com/DanielRegaladoUMiami/sql-agent-llmops) project.
| 📊 Total | 🧹 Post-dedup kept | 🧪 Filter pass rate |
|-----------|--------------------|---------------------|
| **761,155** rows | ~72% of raw | ~99% sqlglot-parseable |
## ✨ What makes this mix different
- **10 sources combined**, spanning academic benchmarks (Spider, BIRD,
WikiSQL, MIMIC-III, ATIS), synthetic generators (Gretel), community
instruct sets, and a DuckDB-dialect collection from the MotherDuck team.
- **Single normalized schema** — all rows share `instruction`,
`schema_context`, `sql`, plus `source`, `dialect`, `difficulty` metadata.
- **Aggressively deduplicated** with MD5 over the lowercased `(instruction, sql)` pair, which removes
~28% of raw rows that overlap across sources.
- **SQL-validated** via [`sqlglot`](https://github.com/tobymao/sqlglot) —
unparseable rows are dropped.
- **Heuristic difficulty labels** (`easy` / `medium` / `hard`) based on
joins, CTEs, window functions, nested SELECTs.
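The difficulty heuristic can be pictured with a small, dependency-free sketch. The actual rules live in `build_sql_mix.py` and are not reproduced in this card, so the function name `estimate_difficulty`, the regular expressions, and every threshold below are assumptions. Treating `GROUP BY` as a `medium` signal is also an assumption, inferred only from the preview rows further down.

```python
import re

# Surface-feature patterns. All thresholds below are illustrative assumptions,
# not the rules used by build_sql_mix.py.
JOIN_RE = re.compile(r"\bJOIN\b", re.IGNORECASE)
CTE_RE = re.compile(r"^\s*WITH\b", re.IGNORECASE)
WINDOW_RE = re.compile(r"\bOVER\s*\(", re.IGNORECASE)
SELECT_RE = re.compile(r"\bSELECT\b", re.IGNORECASE)
GROUP_BY_RE = re.compile(r"\bGROUP\s+BY\b", re.IGNORECASE)


def estimate_difficulty(sql: str) -> str:
    """Return 'easy', 'medium', or 'hard' from surface features of the SQL."""
    joins = len(JOIN_RE.findall(sql))
    # Every SELECT beyond the first is counted as one level of nesting.
    nested_selects = max(len(SELECT_RE.findall(sql)) - 1, 0)
    if CTE_RE.search(sql) or WINDOW_RE.search(sql) or joins >= 2 or nested_selects >= 2:
        return "hard"
    if joins == 1 or nested_selects == 1 or GROUP_BY_RE.search(sql):
        return "medium"
    return "easy"
```

Because the patterns match keywords anywhere in the text, a keyword inside a string literal is counted too. That is one reason labels of this kind are only a rough guide.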
## 📐 Schema
| Field | Type | Description |
|-------|------|-------------|
| `id` | `string` | Stable hash-based identifier (`<source>-<hash>`) |
| `instruction` | `string` | Natural-language question |
| `schema_context` | `string` | Database schema, typically one or more `CREATE TABLE` statements (may be empty for a small minority of rows) |
| `sql` | `string` | Target SQL query — parseable by `sqlglot` |
| `source` | `string` | Origin tag (see attribution table below) |
| `dialect` | `string` | `generic`, `postgres`, `sqlite`, `duckdb` |
| `difficulty` | `string` | `easy`, `medium`, `hard`, or `unknown` |
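As a rough illustration of the schema table, the sketch below types one row and checks it against the documented value sets. `SqlMixRow` and `validate_row` are hypothetical names. The `id` check assumes the `<source>` prefix equals the row's `source` tag, and the hash suffix in the example row is made up.

```python
from typing import TypedDict

# Allowed values copied from the schema table.
DIALECTS = {"generic", "postgres", "sqlite", "duckdb"}
DIFFICULTIES = {"easy", "medium", "hard", "unknown"}


class SqlMixRow(TypedDict):
    id: str
    instruction: str
    schema_context: str
    sql: str
    source: str
    dialect: str
    difficulty: str


def validate_row(row: dict) -> list:
    """Return a list of problems; an empty list means the row fits the table."""
    problems = []
    for field in SqlMixRow.__annotations__:
        if not isinstance(row.get(field), str):
            problems.append(f"{field}: expected a string")
    if row.get("dialect") not in DIALECTS:
        problems.append(f"dialect: unexpected value {row.get('dialect')!r}")
    if row.get("difficulty") not in DIFFICULTIES:
        problems.append(f"difficulty: unexpected value {row.get('difficulty')!r}")
    row_id, source = row.get("id"), row.get("source")
    # Assumption: the `<source>` part of the id is the row's own source tag.
    if isinstance(row_id, str) and isinstance(source, str):
        if not row_id.startswith(source + "-"):
            problems.append("id: expected the `<source>-<hash>` form")
    return problems


example = {
    "id": "gretel-synthetic-0a1b2c3d",  # hypothetical hash suffix
    "instruction": "Delete records of community health workers who do not have a valid work permit.",
    "schema_context": "CREATE TABLE community_health_workers (id INT, name VARCHAR, work_permit BOOLEAN);",
    "sql": "DELETE FROM community_health_workers WHERE work_permit = FALSE;",
    "source": "gretel-synthetic",
    "dialect": "postgres",
    "difficulty": "easy",
}
print(validate_row(example))
```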
## 📦 Splits
| Split | Rows |
|-------|------|
| `train` | 723,097 |
| `validation` | 19,029 |
| `test` | 19,029 |
Ratios: **95 / 2.5 / 2.5** (train / validation / test). Random seed `42`.
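The split sizes follow from the ratios by simple arithmetic, which the sketch below reproduces. The helper names `split_sizes` and `shuffle_and_split` are hypothetical, and using Python's `random` module is an assumption. The real pipeline may shuffle differently, so only the sizes are expected to match, not the row assignment.

```python
import random


def split_sizes(n_rows, val_frac=0.025, test_frac=0.025):
    """Return (train, validation, test) sizes; train takes the remainder."""
    n_val = round(n_rows * val_frac)
    n_test = round(n_rows * test_frac)
    return n_rows - n_val - n_test, n_val, n_test


def shuffle_and_split(rows, seed=42):
    """Shuffle with a fixed seed, then cut into train / validation / test."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train, n_val, _ = split_sizes(len(rows))
    return rows[:n_train], rows[n_train:n_train + n_val], rows[n_train + n_val:]


# 761,155 rows * 0.025 = 19,028.875, which rounds to 19,029 per held-out split.
print(split_sizes(761_155))  # (723097, 19029, 19029)
```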
## 🔬 Quick preview
### Example 1 · `nstext2sql-sql_create_context` · _easy_
**Question**
> -- What is the lowest silver that has 1 for the bronze, 1 as the total, 17 as the rank, with a gold less than 0?
**Schema**
```sql
CREATE TABLE table_name_15 (
    silver INTEGER,
    gold VARCHAR,
    rank VARCHAR,
    bronze VARCHAR,
    total VARCHAR
)
-- Using valid SQLite, answer the following questions for the tables provided above.
```
**SQL**
```sql
SELECT MIN(silver) FROM table_name_15 WHERE bronze = 1 AND total = 1 AND rank = "17" AND gold < 0
```
### Example 2 · `clinton-text2sql` · _medium_
**Question**
> What is the average age for each gender. Visualize by bar chart.
**Schema**
```sql
CREATE TABLE Person (
    name varchar(20),
    age INTEGER,
    city TEXT,
    gender TEXT,
    job TEXT
)
CREATE TABLE PersonFriend (
    name varchar(20),
    friend varchar(20),
    year INTEGER
)
```
**SQL**
```sql
SELECT gender, AVG(age) FROM Person GROUP BY gender
```
### Example 3 · `gretel-synthetic` · _easy_
**Question**
> Delete records of community health workers who do not have a valid work permit.
**Schema**
```sql
CREATE TABLE community_health_workers (id INT, name VARCHAR, work_permit BOOLEAN); INSERT INTO community_health_workers (id, name, work_permit) VALUES (1, 'John Doe', TRUE), (2, 'Jane Smith', FALSE);
```
**SQL**
```sql
DELETE FROM community_health_workers WHERE work_permit = FALSE;
```
## 🌎 Language
Most rows are **English**. A meaningful portion of the rows imported from
some NSText2SQL subsets (notably `nstext2sql-css` and `nstext2sql-mimicsql_data`,
plus some other medical subsets) is in **Chinese**, reflecting real-world
multilingual SQL corpora. Filter by `source` if you need English-only training.
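Filtering by `source` drops whole subsets. A row-level check is a possible complement, sketched below under the assumption that any CJK ideograph in the question marks the row as non-English. The names `contains_cjk` and `is_probably_english` are hypothetical and not part of the build pipeline.

```python
import re

# CJK Unified Ideographs: Extension A plus the basic block.
# This covers common Chinese text but is not a full language detector.
CJK_RE = re.compile(r"[\u3400-\u4dbf\u4e00-\u9fff]")


def contains_cjk(text: str) -> bool:
    """True if the text contains at least one CJK ideograph."""
    return bool(CJK_RE.search(text))


def is_probably_english(row: dict) -> bool:
    """Row-level predicate: keep rows whose question has no CJK characters."""
    return not contains_cjk(row["instruction"])
```

The predicate takes a single row dictionary, so it can be passed to `ds["train"].filter(is_probably_english)`. It only inspects `instruction`, which means a row with Chinese identifiers in `schema_context` would still pass.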
## 📊 Composition
### By source
| Source tag | Rows | Share |
|------------|------|-------|
| `clinton-text2sql` | 182,252 | 23.9% |
| `gretel-synthetic` | 99,927 | 13.1% |
| `kaxap-llama2` | 81,358 | 10.7% |
| `nstext2sql-wikisql` | 80,442 | 10.6% |
| `sql-create-context` | 78,387 | 10.3% |
| `nstext2sql-sql_create_context` | 75,774 | 10.0% |
| `nstext2sql-nvbench` | 22,929 | 3.0% |
| `motherduck-duckdb` | 22,472 | 3.0% |
| `nstext2sql-css` | 22,050 | 2.9% |
| `nstext2sql-mimicsql_data` | 19,986 | 2.6% |
| `pipable-spider-bird` | 14,018 | 1.8% |
| `nstext2sql-squall` | 10,595 | 1.4% |
| `nstext2sql-sede` | 10,027 | 1.3% |
| `bugdaryan-spider-wikisql` | 7,744 | 1.0% |
| `nstext2sql-eicu` | 7,726 | 1.0% |
| `nstext2sql-mimic_iii` | 7,639 | 1.0% |
| `nstext2sql-spider` | 4,948 | 0.7% |
| `nstext2sql-atis` | 4,894 | 0.6% |
| `nstext2sql-advising` | 4,381 | 0.6% |
| `nstext2sql-criteria2sql` | 1,997 | 0.3% |
| `nstext2sql-scholar` | 768 | 0.1% |
| `nstext2sql-academic` | 185 | 0.0% |
| `nstext2sql-imdb` | 131 | 0.0% |
| `nstext2sql-yelp` | 128 | 0.0% |
| `nstext2sql-restaurants` | 125 | 0.0% |
| `nstext2sql-pesticide` | 50 | 0.0% |
| `nstext2sql-whatcdhiphop` | 41 | 0.0% |
| `nstext2sql-thehistoryofbaseball` | 39 | 0.0% |
| `nstext2sql-uswildfires` | 37 | 0.0% |
| `nstext2sql-geonucleardata` | 32 | 0.0% |
| `nstext2sql-studentmathscore` | 28 | 0.0% |
| `nstext2sql-greatermanchestercrime` | 27 | 0.0% |
| `nstext2sql-worldsoccerdatabase` | 18 | 0.0% |
### By difficulty
| Difficulty | Rows | Share |
|------------|------|-------|
| `easy` | 544,047 | 71.5% |
| `hard` | 109,302 | 14.4% |
| `medium` | 107,806 | 14.2% |
### By dialect
| Dialect | Rows | Share |
|---------|------|-------|
| `generic` | 616,994 | 81.1% |
| `postgres` | 99,927 | 13.1% |
| `duckdb` | 22,472 | 3.0% |
| `sqlite` | 21,762 | 2.9% |
## 📜 Source attribution
This dataset is a derivative combining the sources below. **Please check
each upstream license before commercial use** — the Apache-2.0 license on
this mix covers the build pipeline and curation, not the underlying row
content, which remains under its original license.
| Source | Tag | License | Notes |
|--------|-----|---------|-------|
| [b-mc2/sql-create-context](https://huggingface.co/datasets/b-mc2/sql-create-context) | `sql-create-context` | CC-BY-4.0 | Builds on Spider and WikiSQL; adds CREATE TABLE context. |
| [gretelai/synthetic_text_to_sql](https://huggingface.co/datasets/gretelai/synthetic_text_to_sql) | `gretel-synthetic` | Apache-2.0 | 105k synthetic examples across 100 domains (finance, healthcare, retail, …). |
| [knowrohit07/know_sql](https://huggingface.co/datasets/knowrohit07/know_sql) | `know_sql` | Apache-2.0 | Compact and clean. Fully deduplicated into sql-create-context in v2. |
| [Clinton/Text-to-sql-v1](https://huggingface.co/datasets/Clinton/Text-to-sql-v1) | `clinton-text2sql` | Apache-2.0 | Large instruction-tuned SQL dataset (262k raw rows; 182,252 kept in this mix, per the composition table). |
| [NumbersStation/NSText2SQL](https://huggingface.co/datasets/NumbersStation/NSText2SQL) | `nstext2sql-*` | See source | 290k multi-dialect examples aggregated from 20+ upstream datasets (Spider, WikiSQL, MIMIC-III, ATIS, eICU, …). |
| [ChrisHayduk/Llama-2-SQL-Dataset](https://huggingface.co/datasets/ChrisHayduk/Llama-2-SQL-Dataset) | `hayduk-llama2-sql` | Apache-2.0 | Llama-2 instruct format. Fully deduplicated into overlapping sources in v2. |
| [motherduckdb/duckdb-text2sql-25k](https://huggingface.co/datasets/motherduckdb/duckdb-text2sql-25k) | `motherduck-duckdb` | CC-BY-4.0 | 25k DuckDB-dialect SQL by the MotherDuck team. |
| [PipableAI/pip-txt-to-sql-spider-bird-dataset](https://huggingface.co/datasets/PipableAI/pip-txt-to-sql-spider-bird-dataset) | `pipable-spider-bird` | Apache-2.0 | Spider + BIRD benchmarks with inline `CREATE TABLE` schemas. |
| [kaxap/llama2-sql-instruct](https://huggingface.co/datasets/kaxap/llama2-sql-instruct) | `kaxap-llama2` | Apache-2.0 | Llama-2 `[INST]` format; unpacked into schema + question + SQL. |
| [bugdaryan/spider-natsql-wikisql-instruct](https://huggingface.co/datasets/bugdaryan/spider-natsql-wikisql-instruct) | `bugdaryan-spider-wikisql` | Apache-2.0 | Spider + NatSQL + WikiSQL packed into Alpaca format. |
## 🛠️ Build pipeline
Every row in this dataset is produced by this deterministic pipeline:
1. **Download** each source from the HuggingFace Hub.
2. **Normalize** into the common schema shown above, including regex-based
unpacking of packed Alpaca / Llama-2 formats.
3. **Filter**: drop rows where SQL is unparseable by `sqlglot`,
or `len(instruction) > 2000` / `len(sql) > 4000`.
4. **Deduplicate** by MD5 over `(lower(instruction), lower(sql))`.
5. **Difficulty tagging** — heuristics on JOINs, CTEs, window funcs, nesting.
6. **Shuffle + 95 / 2.5 / 2.5 split** (seed 42).
Build script is open source:
[`training/data_pipelines/build_sql_mix.py`](https://github.com/DanielRegaladoUMiami/sql-agent-llmops/blob/main/training/data_pipelines/build_sql_mix.py)
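Steps 3 and 4 can be sketched without external dependencies. This is an illustration under stated assumptions, not the code in `build_sql_mix.py`: the helper names are hypothetical, the separator used to join the two fields before hashing is a guess, and the `sqlglot` parse check is left out to keep the example self-contained.

```python
import hashlib

MAX_INSTRUCTION_CHARS = 2000  # limits quoted in step 3
MAX_SQL_CHARS = 4000


def passes_length_filter(row: dict) -> bool:
    """Keep rows whose question and SQL stay within the length limits."""
    return (
        len(row["instruction"]) <= MAX_INSTRUCTION_CHARS
        and len(row["sql"]) <= MAX_SQL_CHARS
    )


def dedup_key(row: dict) -> str:
    """MD5 over the lowercased (instruction, sql) pair."""
    # Assumption: a unit-separator character keeps the two fields apart.
    payload = row["instruction"].lower() + "\x1f" + row["sql"].lower()
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


def filter_and_dedup(rows):
    """Apply the length filter, then keep the first row seen for each key."""
    seen = set()
    kept = []
    for row in rows:
        if not passes_length_filter(row):
            continue
        key = dedup_key(row)
        if key in seen:
            continue
        seen.add(key)
        kept.append(row)
    return kept
```

Keeping the first occurrence makes the surviving `source` tag depend on the order in which sources are processed. That order is not stated in this card.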
## 🚀 Usage
### Quick start
```python
from datasets import load_dataset
ds = load_dataset("DanielRegaladoCardoso/text-to-sql-mix-v2")
print(ds["train"][0])
```
### SFT prompt template (Qwen / Llama-3 style)
```python
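SYSTEM = "You are a SQL expert. Generate correct SQL given a schema."


def to_sft(row):
    """Convert one normalized row into a chat-style SFT example."""
    user = (
        f"### Schema\n{row['schema_context']}\n\n"
        f"### Question\n{row['instruction']}\n\n"
        "### SQL"
    )
    return {
        "messages": [
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": user},
            {"role": "assistant", "content": row["sql"]},
        ]
    }


# Keep the raw `ds` intact so its metadata columns remain available for
# filtering; the mapped copy contains only the `messages` column.
sft_ds = ds.map(to_sft, remove_columns=ds['train'].column_names)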
```
### Filter to a subset (e.g. only hard PostgreSQL)
```python
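# `ds` must be the raw dataset here: once rows are mapped with `to_sft`,
# the `difficulty` and `dialect` columns are gone.
hard = ds['train'].filter(
    lambda r: r['difficulty'] == 'hard' and r['dialect'] == 'postgres'
)
# Upper bound: the `postgres` dialect has 99,927 rows across all splits,
# and only the `hard` share of its train rows is kept.
print(len(hard))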
```
## ⚠️ Known limitations
- **Heuristic difficulty**: the `difficulty` field is derived from surface
features of the SQL, not from human judgment. Use as a rough guide only.
- **Schema quality varies**: some `schema_context` rows are a single
`CREATE TABLE`, others are a full multi-table schema with foreign keys;
a small minority (≈ 0.3% of rows) have an empty schema.
- **Dedup is exact-match (case-insensitive), not semantic**: paraphrased questions with identical
SQL are still counted as distinct rows.
- **SQL dialects are mixed** — `sqlglot` is dialect-agnostic for parsing,
but fine-tuning may benefit from separating dialects by task.
## 📝 Citation
If you use this dataset, please cite both this mix and the original sources:
```bibtex
@dataset{regalado2026textsqlmix,
author = {Regalado Cardoso, Daniel},
title = {Text-to-SQL Training Mix v2},
year = {2026},
url = {https://huggingface.co/datasets/DanielRegaladoCardoso/text-to-sql-mix-v2}
}
```
## ⚖️ License
The **build pipeline, curation, and metadata** (this repository's contents)
are released under **Apache-2.0**.
The **row content** inherits the license of its upstream source. See the
attribution table for per-source licenses; favour the strictest license of
the sources you actually use (filter by `source`).
---
_Built with ❤️ in Miami by Daniel Regalado Cardoso · MSBA candidate at the
University of Miami · April 2026._
Provider:
DanielRegaladoCardoso



