mwaldrop/heavydb-text-to-sql

Hugging Face · Updated 2025-12-10 · Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/mwaldrop/heavydb-text-to-sql
Description:
---
license: apache-2.0
task_categories:
- text-generation
- text2text-generation
language:
- en
tags:
- text-to-sql
- sql
- heavydb
- geospatial
- code
- nlp
size_categories:
- 1K<n<10K
---

# HeavyDB Text-to-SQL Dataset

A dataset for training language models to convert natural language questions to **HeavyDB SQL queries**.

## Overview

[HeavyDB](https://www.heavy.ai/) is a GPU-accelerated SQL database with powerful geospatial support. This dataset contains question-SQL pairs specifically designed for HeavyDB syntax, including geospatial queries using ST_* functions.

## Dataset Statistics

| Split | Examples |
|-------|----------|
| Train | 8,217 |
| Validation | 965 |
| Test | 484 |
| **Total** | **9,666** |

### SQL Pattern Distribution

| Pattern | Percentage |
|---------|------------|
| SELECT | 99.5% |
| WHERE | 63.9% |
| JOIN | 44.3% |
| GROUP BY | 23.8% |
| ST_* (geospatial) | 8.0% |

## Usage

```python
from datasets import load_dataset

dataset = load_dataset("mwaldrop/heavydb-text-to-sql")

# Access training data
for example in dataset["train"]:
    print(f"Question: {example['question']}")
    print(f"SQL: {example['query']}")
    break
```

## Data Format

Each example contains:

| Field | Description |
|-------|-------------|
| `instruction` | Task description for instruction-tuning |
| `input` | The natural language question |
| `output` | The corresponding SQL query |
| `question` | Raw question text |
| `query` | Raw SQL query |
| `source` | Origin of the example |
| `db_id` | Database identifier |
| `dataset` | Source dataset name |

## Example

```
Question: How many heads of the departments are older than 56?
SQL: SELECT COUNT(*) AS num_heads FROM head WHERE age > 56;
```

## Recommended Models for Fine-tuning

This dataset works well with:

- [SQLCoder](https://huggingface.co/defog/sqlcoder-7b-2) - Purpose-built for SQL
- [CodeLlama](https://huggingface.co/codellama/CodeLlama-7b-Instruct-hf) - Strong code understanding
- [DeepSeek-Coder](https://huggingface.co/deepseek-ai/deepseek-coder-6.7b-instruct) - Excellent for code tasks

## Training Tips

1. Use QLoRA for efficient fine-tuning on consumer GPUs
2. Include the database schema in prompts for better accuracy
3. Validate generated SQL against HeavyDB before deployment

## License

Apache 2.0

## Citation

```bibtex
@dataset{heavydb_text_to_sql_2024,
  title={HeavyDB Text-to-SQL Dataset},
  author={mwaldrop},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/datasets/mwaldrop/heavydb-text-to-sql}
}
```
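
The Training Tips above recommend including the database schema in prompts. One way to do that might look like the sketch below. The `build_prompt` helper, the `###` section markers, the default instruction text, and the `head` table schema are all illustrative assumptions; none of them come from the dataset itself or from any model's required prompt format.

```python
def build_prompt(
    question: str,
    schema: str,
    instruction: str = "Convert the question into a HeavyDB SQL query.",
) -> str:
    """Assemble an instruction-style prompt that places the schema before the question."""
    return (
        f"### Instruction:\n{instruction}\n\n"
        f"### Schema:\n{schema.strip()}\n\n"
        f"### Question:\n{question.strip()}\n\n"
        "### SQL:\n"
    )


# Hypothetical schema for the `head` table referenced in the Example section;
# the real column list is not given in the dataset card.
schema = "CREATE TABLE head (head_id INTEGER, name TEXT, born_state TEXT, age DOUBLE);"

prompt = build_prompt(
    "How many heads of the departments are older than 56?",
    schema,
)
print(prompt)
```

The prompt ends at the `### SQL:` marker so that the model's completion is the query itself. Whatever layout is chosen, the same layout should be used at training and at inference time.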
Provider:
mwaldrop