DanielRegaladoCardoso/text-to-sql-mix-v1
Source: Hugging Face. Updated 2026-04-14; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/DanielRegaladoCardoso/text-to-sql-mix-v1
Resource description:
---
language:
- en
license: apache-2.0
task_categories:
- text-generation
- text2text-generation
tags:
- sql
- text-to-sql
- code-generation
- instruction-tuning
pretty_name: Text-to-SQL Training Mix v1
size_categories:
- 100K<n<1M
---
# Text-to-SQL Training Mix v1
A curated, deduplicated, and quality-filtered mix of six text-to-SQL
datasets from the Hugging Face Hub, designed for fine-tuning code LLMs
(Qwen 2.5 Coder, DeepSeek Coder, Llama-3, etc.) on SQL generation.
This dataset powers the SQL Generator in the [SQL Agent LLMOps](https://github.com/DanielRegaladoUMiami/sql-agent-llmops) project.
## Schema
| Field | Type | Description |
|-------|------|-------------|
| `id` | string | Stable hash-based identifier |
| `instruction` | string | Natural language question / instruction |
| `schema_context` | string | Database schema (CREATE TABLE or prose) — may be empty |
| `sql` | string | Target SQL query (parseable via sqlglot) |
| `source` | string | Original dataset tag |
| `dialect` | string | SQL dialect hint (generic, postgres, mysql, sqlite) |
| `difficulty` | string | easy / medium / hard (heuristic) |
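
The schema above can be written down as a typed record with a small row checker. This is a minimal sketch, not part of the published pipeline: the names `Example` and `validate_row` are hypothetical, and it assumes the dialect and difficulty values listed in the table are exhaustive.

```python
from typing import TypedDict


class Example(TypedDict):
    """One row of the mix, mirroring the field table (all fields are strings)."""
    id: str
    instruction: str
    schema_context: str  # may be an empty string
    sql: str
    source: str
    dialect: str
    difficulty: str


# Assumption: the values named in the field table are the only ones used.
DIALECTS = {"generic", "postgres", "mysql", "sqlite"}
DIFFICULTIES = {"easy", "medium", "hard"}


def validate_row(row: dict) -> list[str]:
    """Return a list of problems found in one row (empty list means OK)."""
    problems = []
    for field in Example.__annotations__:
        if not isinstance(row.get(field), str):
            problems.append(f"{field}: expected string")
    if row.get("dialect") not in DIALECTS:
        problems.append("dialect: unexpected value")
    if row.get("difficulty") not in DIFFICULTIES:
        problems.append("difficulty: unexpected value")
    return problems
```

A checker like this is mostly useful as a sanity pass after loading, before any rows reach a training job.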
## Splits
| Split | Examples |
|-------|----------|
| train | 603,784 |
| validation | 15,889 |
| test | 15,890 |

Split ratios: 95% train / 2.5% validation / 2.5% test.
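
The stated ratios can be checked directly from the split sizes. The short sketch below uses only the counts printed in this card; the variable names are illustrative.

```python
# Counts copied from the Splits and Difficulty distribution tables in this card.
SPLIT_SIZES = {"train": 603_784, "validation": 15_889, "test": 15_890}
DIFFICULTY_SIZES = {"easy": 442_345, "medium": 102_273, "hard": 90_945}

total = sum(SPLIT_SIZES.values())
fractions = {name: n / total for name, n in SPLIT_SIZES.items()}

for name, frac in fractions.items():
    print(f"{name:<10} {SPLIT_SIZES[name]:>7,}  {frac:.2%}")

# The difficulty table should account for exactly the same rows.
assert sum(DIFFICULTY_SIZES.values()) == total
```

Both tables sum to 635,563 rows, so the difficulty counts cover all three splits combined.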
## Source attribution
This dataset is a derivative work combining the following sources:
| Source | Tag in `source` | License | Link | Notes |
|--------|-----------------|---------|------|-------|
| `b-mc2/sql-create-context` | `sql-create-context` | CC-BY-4.0 | [link](https://huggingface.co/datasets/b-mc2/sql-create-context) | Builds on Spider and WikiSQL; adds CREATE TABLE schema context. |
| `gretelai/synthetic_text_to_sql` | `gretel-synthetic` | Apache-2.0 | [link](https://huggingface.co/datasets/gretelai/synthetic_text_to_sql) | 105k synthetic SQL examples across 11 domains (finance, healthcare, retail, etc.). |
| `knowrohit07/know_sql` | `know_sql` | Apache-2.0 | [link](https://huggingface.co/datasets/knowrohit07/know_sql) | Compact and clean text-to-SQL pairs. |
| `Clinton/Text-to-sql-v1` | `clinton-text2sql` | Apache-2.0 | [link](https://huggingface.co/datasets/Clinton/Text-to-sql-v1) | Large instruction-tuned SQL dataset. |
| `NumbersStation/NSText2SQL` | `nstext2sql-*` | See source | [link](https://huggingface.co/datasets/NumbersStation/NSText2SQL) | 290k examples from 20+ sources, multi-dialect. |
| `ChrisHayduk/Llama-2-SQL-Dataset` | `hayduk-llama2-sql` | Apache-2.0 | [link](https://huggingface.co/datasets/ChrisHayduk/Llama-2-SQL-Dataset) | Llama-2 instruction format text-to-SQL. |
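
For license review, the table above can be turned into a lookup from a row's `source` tag to its upstream repository and license. This is a sketch: `UPSTREAM` and `upstream_for` are hypothetical names, and the license strings are copied verbatim from the table rather than checked against the upstream repositories.

```python
# Tag -> (upstream repository, license as stated in the attribution table).
UPSTREAM = {
    "sql-create-context": ("b-mc2/sql-create-context", "CC-BY-4.0"),
    "gretel-synthetic": ("gretelai/synthetic_text_to_sql", "Apache-2.0"),
    "know_sql": ("knowrohit07/know_sql", "Apache-2.0"),
    "clinton-text2sql": ("Clinton/Text-to-sql-v1", "Apache-2.0"),
    "hayduk-llama2-sql": ("ChrisHayduk/Llama-2-SQL-Dataset", "Apache-2.0"),
}

# All `nstext2sql-*` tags share one upstream; its license is listed as "See source".
NSTEXT2SQL = ("NumbersStation/NSText2SQL", "See source")


def upstream_for(tag: str) -> tuple[str, str]:
    """Map a `source` tag to (repository, license); raises KeyError if unknown."""
    if tag.startswith("nstext2sql-"):
        return NSTEXT2SQL
    return UPSTREAM[tag]
```

Rows tagged `nstext2sql-*` deserve the closest look, since the table defers to the upstream repository for their license terms.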
## Per-source statistics
| Source | Examples |
|--------|----------|
| `clinton-text2sql` | 182,252 |
| `gretel-synthetic` | 99,927 |
| `nstext2sql-wikisql` | 80,442 |
| `sql-create-context` | 78,387 |
| `nstext2sql-sql_create_context` | 75,774 |
| `nstext2sql-nvbench` | 22,929 |
| `nstext2sql-css` | 22,050 |
| `nstext2sql-mimicsql_data` | 19,986 |
| `nstext2sql-squall` | 10,595 |
| `nstext2sql-sede` | 10,027 |
| `nstext2sql-eicu` | 7,726 |
| `nstext2sql-mimic_iii` | 7,639 |
| `nstext2sql-spider` | 4,948 |
| `nstext2sql-atis` | 4,894 |
| `nstext2sql-advising` | 4,381 |
| `nstext2sql-criteria2sql` | 1,997 |
| `nstext2sql-scholar` | 768 |
| `nstext2sql-academic` | 185 |
| `nstext2sql-imdb` | 131 |
| `nstext2sql-yelp` | 128 |
| `nstext2sql-restaurants` | 125 |
| `nstext2sql-pesticide` | 50 |
| `nstext2sql-whatcdhiphop` | 41 |
| `nstext2sql-thehistoryofbaseball` | 39 |
| `nstext2sql-uswildfires` | 37 |
| `nstext2sql-geonucleardata` | 32 |
| `nstext2sql-studentmathscore` | 28 |
| `nstext2sql-greatermanchestercrime` | 27 |
| `nstext2sql-worldsoccerdatabase` | 18 |
## Difficulty distribution
| Difficulty | Examples |
|------------|----------|
| `easy` | 442,345 |
| `medium` | 102,273 |
| `hard` | 90,945 |
## Pipeline
1. Download each source from the HuggingFace Hub.
2. Normalize into the common schema shown above.
3. Filter: reject rows where SQL is unparseable by `sqlglot`, or
where instruction/SQL exceed length thresholds.
4. Deduplicate by MD5 hash of `(lower(instruction), lower(sql))`.
5. Tag difficulty heuristically from SQL complexity signals.
6. Shuffle with stratification, then split 95 / 2.5 / 2.5.
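
Steps 4 and 6 can be sketched with the standard library alone. This is not the author's code and rests on assumptions the card does not settle: the two lowercased fields are joined with a NUL byte before hashing, stratification is by `source`, and the first occurrence of a duplicate is the one kept. The `sqlglot` filter from step 3 is left out so the sketch stays dependency-free.

```python
import hashlib
import random
from collections import defaultdict


def dedup_key(instruction: str, sql: str) -> str:
    """MD5 of the lowercased (instruction, sql) pair."""
    # Assumption: a NUL separator keeps ("ab", "c") distinct from ("a", "bc").
    payload = instruction.lower() + "\x00" + sql.lower()
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


def deduplicate(rows):
    """Keep the first row for each dedup key, preserving input order."""
    seen = set()
    unique = []
    for row in rows:
        key = dedup_key(row["instruction"], row["sql"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def stratified_split(rows, key="source", val_frac=0.025, test_frac=0.025, seed=0):
    """Shuffle within each stratum, then carve out validation and test slices."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    splits = {"train": [], "validation": [], "test": []}
    for _, group in sorted(groups.items()):  # sorted for reproducibility
        rng.shuffle(group)
        n_val = round(len(group) * val_frac)
        n_test = round(len(group) * test_frac)
        splits["validation"] += group[:n_val]
        splits["test"] += group[n_val:n_val + n_test]
        splits["train"] += group[n_val + n_test:]

    # Final shuffle so rows are not ordered by stratum inside each split.
    for part in splits.values():
        rng.shuffle(part)
    return splits
```

With per-stratum rounding, very small strata (fewer than 20 rows at these fractions) contribute nothing to validation or test, which is one reason a real pipeline might stratify differently.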
## Usage
```python
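from datasets import load_dataset

# Loads all three splits (train / validation / test) as a DatasetDict.
ds = load_dataset("DanielRegaladoCardoso/text-to-sql-mix-v1")


def format_example(ex):
    """Turn one row into a prompt/completion pair for fine-tuning."""
    ctx = ex["schema_context"]  # may be an empty string
    prompt = (
        "You are a SQL expert. Given the schema below, write a SQL query.\n\n"
        f"### Schema:\n{ctx}\n\n"
        f"### Question:\n{ex['instruction']}\n\n"
        "### SQL:\n"
    )
    return {"prompt": prompt, "completion": ex["sql"]}


# map() adds `prompt` and `completion` columns to every split.
ds = ds.map(format_example)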
```
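
Because `schema_context` may be empty, some prompts built this way carry a blank Schema section. One possible variant, shown here as a sketch with hypothetical helper names (`build_prompt`, `to_prompt_completion`) and a slightly reworded system line, omits that section when no schema is available:

```python
def build_prompt(instruction: str, schema_context: str = "") -> str:
    """Assemble the prompt, skipping the Schema section when it is empty."""
    parts = ["You are a SQL expert. Write a SQL query that answers the question."]
    if schema_context.strip():
        parts.append(f"### Schema:\n{schema_context.strip()}")
    parts.append(f"### Question:\n{instruction.strip()}")
    parts.append("### SQL:\n")
    return "\n\n".join(parts)


def to_prompt_completion(ex: dict) -> dict:
    """Row-level function suitable for passing to `Dataset.map`."""
    return {
        "prompt": build_prompt(ex["instruction"], ex["schema_context"] or ""),
        "completion": ex["sql"],
    }
```

It would be applied the same way as above, with `ds.map(to_prompt_completion)`. Whether dropping the empty section helps is an empirical question for the model being tuned.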
## Citation
If you use this dataset, please cite the original sources and this mix:
```bibtex
@dataset{regalado2026textsqlmix,
author = {Regalado Cardoso, Daniel},
title = {text-to-sql-mix-v1},
year = {2026},
url = {https://huggingface.co/datasets/DanielRegaladoCardoso/text-to-sql-mix-v1}
}
```
## License
This mix is released under **Apache-2.0**. Individual source licenses
remain attached to their respective rows — see the source attribution
table above. Please review the license of each upstream dataset before
using this mix in commercial applications.
Provider:
DanielRegaladoCardoso



