
DanielRegaladoCardoso/text-to-sql-mix-v1

Hugging Face, updated 2026-04-14, indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/DanielRegaladoCardoso/text-to-sql-mix-v1
Resource description:
---
language:
- en
license: apache-2.0
task_categories:
- text-generation
- text2text-generation
tags:
- sql
- text-to-sql
- code-generation
- instruction-tuning
pretty_name: Text-to-SQL Training Mix v1
size_categories:
- 100K<n<1M
---

# Text-to-SQL Training Mix v1

A curated, deduplicated and quality-filtered mix of six high-quality text-to-SQL datasets from HuggingFace, designed for fine-tuning code LLMs (Qwen 2.5 Coder, DeepSeek Coder, Llama-3, etc.) on SQL generation.

This dataset powers the SQL Generator in the [SQL Agent LLMOps](https://github.com/DanielRegaladoUMiami/sql-agent-llmops) project.

## Schema

| Field | Type | Description |
|-------|------|-------------|
| `id` | string | Stable hash-based identifier |
| `instruction` | string | Natural language question / instruction |
| `schema_context` | string | Database schema (CREATE TABLE or prose); may be empty |
| `sql` | string | Target SQL query (parseable via sqlglot) |
| `source` | string | Original dataset tag |
| `dialect` | string | SQL dialect hint (generic, postgres, mysql, sqlite) |
| `difficulty` | string | easy / medium / hard (heuristic) |

## Splits

| Split | Examples |
|-------|----------|
| train | 603,784 |
| validation | 15,889 |
| test | 15,890 |

Split ratios: 95% train / 2.5% validation / 2.5% test

## Source attribution

This dataset is a derivative work combining the following sources:

| Source | Tag in `source` | License | Link | Notes |
|--------|-----------------|---------|------|-------|
| `b-mc2/sql-create-context` | `sql-create-context` | CC-BY-4.0 | [link](https://huggingface.co/datasets/b-mc2/sql-create-context) | Builds on Spider and WikiSQL; adds CREATE TABLE schema context. |
| `gretelai/synthetic_text_to_sql` | `gretel-synthetic` | Apache-2.0 | [link](https://huggingface.co/datasets/gretelai/synthetic_text_to_sql) | 105k synthetic SQL examples across 11 domains (finance, healthcare, retail, etc.). |
| `knowrohit07/know_sql` | `know_sql` | Apache-2.0 | [link](https://huggingface.co/datasets/knowrohit07/know_sql) | Compact and clean text-to-SQL pairs. |
| `Clinton/Text-to-sql-v1` | `clinton-text2sql` | Apache-2.0 | [link](https://huggingface.co/datasets/Clinton/Text-to-sql-v1) | Large instruction-tuned SQL dataset. |
| `NumbersStation/NSText2SQL` | `nstext2sql-*` | See source | [link](https://huggingface.co/datasets/NumbersStation/NSText2SQL) | 290k examples from 20+ sources, multi-dialect. |
| `ChrisHayduk/Llama-2-SQL-Dataset` | `hayduk-llama2-sql` | Apache-2.0 | [link](https://huggingface.co/datasets/ChrisHayduk/Llama-2-SQL-Dataset) | Llama-2 instruction format text-to-SQL. |

## Per-source statistics

| Source | Examples |
|--------|----------|
| `clinton-text2sql` | 182,252 |
| `gretel-synthetic` | 99,927 |
| `nstext2sql-wikisql` | 80,442 |
| `sql-create-context` | 78,387 |
| `nstext2sql-sql_create_context` | 75,774 |
| `nstext2sql-nvbench` | 22,929 |
| `nstext2sql-css` | 22,050 |
| `nstext2sql-mimicsql_data` | 19,986 |
| `nstext2sql-squall` | 10,595 |
| `nstext2sql-sede` | 10,027 |
| `nstext2sql-eicu` | 7,726 |
| `nstext2sql-mimic_iii` | 7,639 |
| `nstext2sql-spider` | 4,948 |
| `nstext2sql-atis` | 4,894 |
| `nstext2sql-advising` | 4,381 |
| `nstext2sql-criteria2sql` | 1,997 |
| `nstext2sql-scholar` | 768 |
| `nstext2sql-academic` | 185 |
| `nstext2sql-imdb` | 131 |
| `nstext2sql-yelp` | 128 |
| `nstext2sql-restaurants` | 125 |
| `nstext2sql-pesticide` | 50 |
| `nstext2sql-whatcdhiphop` | 41 |
| `nstext2sql-thehistoryofbaseball` | 39 |
| `nstext2sql-uswildfires` | 37 |
| `nstext2sql-geonucleardata` | 32 |
| `nstext2sql-studentmathscore` | 28 |
| `nstext2sql-greatermanchestercrime` | 27 |
| `nstext2sql-worldsoccerdatabase` | 18 |

## Difficulty distribution

| Difficulty | Examples |
|------------|----------|
| `easy` | 442,345 |
| `medium` | 102,273 |
| `hard` | 90,945 |

## Pipeline

1. Download each source from the HuggingFace Hub.
2. Normalize into the common schema shown above.
3. Filter: reject rows where SQL is unparseable by `sqlglot`, or where instruction/SQL exceed length thresholds.
4. Deduplicate by MD5 hash of `(lower(instruction), lower(sql))`.
5. Heuristic difficulty tagging based on SQL complexity signals.
6. Stratified shuffle + 95/2.5/2.5 split.

## Usage

```python
from datasets import load_dataset

ds = load_dataset("DanielRegaladoCardoso/text-to-sql-mix-v1")

def format_example(ex):
    ctx = ex['schema_context']
    prompt = (
        f"You are a SQL expert. Given the schema below, write a SQL query.\n\n"
        f"### Schema:\n{ctx}\n\n"
        f"### Question:\n{ex['instruction']}\n\n"
        f"### SQL:\n"
    )
    return {'prompt': prompt, 'completion': ex['sql']}

ds = ds.map(format_example)
```

## Citation

If you use this dataset, please cite the original sources and this mix:

```bibtex
@dataset{regalado2026textsqlmix,
  author = {Regalado Cardoso, Daniel},
  title = {text-to-sql-mix-v1},
  year = {2026},
  url = {https://huggingface.co/datasets/DanielRegaladoCardoso/text-to-sql-mix-v1}
}
```

## License

This mix is released under **Apache-2.0**. Individual source licenses remain attached to their respective rows; see the source attribution table above. Please review the license of each upstream dataset before using this mix in commercial applications.
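
Step 4 of the pipeline above deduplicates rows by an MD5 hash of `(lower(instruction), lower(sql))`. The following is a minimal sketch of that idea, not the author's actual code. The card does not say how the pair is serialized before hashing, so the separator character, the helper names `dedup_key` and `deduplicate`, and the sample rows are all assumptions made for illustration.

```python
import hashlib


def dedup_key(instruction: str, sql: str) -> str:
    """Return an MD5 hex digest over the lowercased (instruction, sql) pair.

    Assumption: the two fields are joined with a unit-separator character
    so that ("ab", "c") and ("a", "bc") cannot collide; the dataset card
    does not specify the real serialization.
    """
    payload = instruction.lower() + "\x1f" + sql.lower()
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


def deduplicate(rows):
    """Keep the first row seen for each dedup key, preserving input order."""
    seen = set()
    unique = []
    for row in rows:
        key = dedup_key(row["instruction"], row["sql"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical sample rows: the first two differ only in letter case.
rows = [
    {"instruction": "List all users", "sql": "SELECT * FROM users"},
    {"instruction": "list all users", "sql": "select * from users"},
    {"instruction": "Count users", "sql": "SELECT COUNT(*) FROM users"},
]
unique_rows = deduplicate(rows)
```

Because both fields are lowercased before hashing, rows that differ only in letter case collapse into one, while whitespace or formatting differences in the SQL would still produce distinct keys under this sketch.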
Provider:
DanielRegaladoCardoso