neurondb/postgresql-llm

Name: neurondb/postgresql-llm
Creator: neurondb
Published: 2026-03-07 06:15:52
License: 暂无描述

Hugging Face2026-03-07 更新2026-03-29 收录

下载链接：

https://hf-mirror.com/datasets/neurondb/postgresql-llm

下载链接

链接失效反馈

官方服务：

资源简介：

--- language: - en license: apache-2.0 task_categories: - text-generation tags: - postgresql - sql - plpgsql - text-to-sql - code-generation - database - postgres - neurondb pretty_name: "postgresql-llm" size_categories: - 100K<n<1M dataset_info: features: - name: question dtype: string - name: schema dtype: string - name: sql dtype: string - name: explanation dtype: string - name: validation_query dtype: string - name: source dtype: string - name: difficulty dtype: string - name: category dtype: string - name: is_postgresql_specific dtype: bool - name: sql_length dtype: int32 - name: num_statements dtype: int32 splits: - name: train num_examples: 194398 - name: validation num_examples: 13693 - name: test num_examples: 3448 configs: - config_name: postgresql-llm data_files: - split: train path: train.jsonl - split: validation path: validation.jsonl - split: test path: test.jsonl --- # postgresql-llm A pure PostgreSQL dataset for training and evaluating LLMs on **PostgreSQL** SQL and PL/pgSQL. Every row is a (question, schema, SQL) triplet with rich metadata for filtering and analysis. ## Dataset Summary **postgresql-llm** is a pure PostgreSQL dataset: SQL and PL/pgSQL only, with metadata for difficulty, category, and source. | Metric | Value | |--------|-------| | **Total rows** | 211,539 | | **PostgreSQL-specific rows** | 11,998 (5.7%) | | **Schema fill rate** | 82.2% | | **Explanation fill rate** | 17.8% | | **SQL length (median)** | 83 chars | | **SQL length (max)** | 61,419 chars | ## Splits | Split | Rows | |-------|------| | `train` | 194,398 | | `validation` | 13,693 | | `test` | 3,448 | ## Schema Each row contains **11 fields**: | Field | Type | Description | |-------|------|-------------| | `question` | `string` | Natural language instruction or question | | `schema` | `string?` | DDL schema context (CREATE TABLE statements), null if not applicable | | `sql` | `string` | Ground truth PostgreSQL SQL or PL/pgSQL answer | | `explanation` | `string?` | Short explanation of what the SQL does | | `validation_query` | `string?` | Query to validate the answer produces correct results | | `source` | `string` | Origin of this instruction pair (see Sources below) | | `difficulty` | `string` | One of: `basic`, `intermediate`, `advanced` | | `category` | `string` | SQL category (see Categories below) | | `is_postgresql_specific` | `bool` | True if SQL uses PostgreSQL-specific syntax | | `sql_length` | `int32` | Character length of the SQL field | | `num_statements` | `int32` | Number of SQL statements (semicolon count) | ## Sources Data is aggregated from multiple high-quality sources, each tagged: | Source | Rows | |--------|------| | `community_sql_datasets` | 115,811 | | `sql_create_context` | 78,392 | | `postgresql_regression_tests` | 11,622 | | `pgtap_tests` | 4,181 | | `plpgsql_source` | 1,529 | | `synthetic_text_to_sql` | 4 | ### Source Descriptions - **`postgresql_regression_tests`** — SQL extracted from PostgreSQL's own regression test suite - **`postgresql_docs`** — Examples from official PostgreSQL SGML documentation - **`postgresql_contrib`** — SQL from contrib modules (pg_trgm, hstore, ltree, etc.) - **`pgtap_tests`** — pgTAP unit test SQL - **`plpgsql_source`** — PL/pgSQL functions from the PostgreSQL source tree - **`pgbench_scripts`** — pgbench benchmark scripts - **`handcrafted_advanced`** — Hand-written examples covering advanced patterns (window functions, CTEs, JSONB, RLS, triggers, partitioning, custom aggregates, etc.) - **`sql_create_context`** — WikiSQL/Spider-derived text-to-SQL pairs (b-mc2/sql-create-context) - **`synthetic_text_to_sql`** — Synthetically generated text-to-SQL pairs (gretelai, NumbersStation) - **`community_sql_datasets`** — Other community SQL datasets (Clinton/text-to-sql-v1, knowrohit07/know_sql) ## Difficulty Distribution | Difficulty | Rows | |------------|------| | `basic` | 147,920 | | `intermediate` | 56,469 | | `advanced` | 7,150 | ## Categories | Category | Rows | |----------|------| | `query_select` | 136,225 | | `query_aggregation` | 32,050 | | `query_join` | 10,597 | | `dml_insert` | 8,763 | | `other` | 4,093 | | `dml_update` | 3,664 | | `dml_delete` | 3,647 | | `ddl_table` | 3,430 | | `query_window_function` | 3,055 | | `plpgsql_function` | 1,912 | | `ddl_advanced` | 1,143 | | `ddl_index` | 806 | | `plpgsql` | 742 | | `ddl_view` | 541 | | `plpgsql_trigger` | 401 | | `ddl_alter` | 235 | | `admin_maintenance` | 125 | | `dcl_security` | 92 | | `query_recursive_cte` | 18 | ## Usage ```python from datasets import load_dataset ds = load_dataset("neurondb/postgresql-llm", "postgresql-llm") # Filter for advanced PostgreSQL-specific queries advanced_pg = ds["train"].filter( lambda x: x["difficulty"] == "advanced" and x["is_postgresql_specific"] ) # Filter by category window_fns = ds["train"].filter(lambda x: x["category"] == "query_window_function") # Filter by source gold = ds["train"].filter( lambda x: x["source"] in [ "postgresql_regression_tests", "postgresql_docs", "handcrafted_advanced", ] ) ``` ## Intended Use - **Fine-tuning** LLMs for PostgreSQL SQL and PL/pgSQL code generation - **Evaluating** text-to-SQL models on PostgreSQL-specific syntax - **Benchmarking** SQL generation quality across difficulty levels - **Building** PostgreSQL-aware coding assistants ## Data Quality - All rows have non-empty `question` and `sql` fields - MySQL-only and T-SQL-only syntax has been filtered out - Duplicate (question, SQL) pairs have been removed - Rows with trivially short SQL (< 10 chars) are excluded - Each row is tagged with source, difficulty, and category for easy filtering ## Examples #### Example 1 — basic / query_select **Source:** `sql_create_context` **Question:** Generate PostgreSQL SQL for: Which manufacturer made a locomotive with a type of 4-6-4t? **Schema:** ```sql CREATE TABLE table_name_40 (manufacturer VARCHAR, type VARCHAR) ``` **SQL:** ```sql SELECT manufacturer FROM table_name_40 WHERE type = '4-6-4t'; ``` #### Example 2 — intermediate / query_join **Source:** `community_sql_datasets` **Question:** What is the average account balance for customers who have a Shariah-compliant mortgage or a socially responsible loan? **Schema:** ```sql CREATE TABLE shariah_mortgages (mortgage_id INT, customer_id INT, account_balance DECIMAL); CREATE TABLE socially_responsible_loans (loan_id INT, customer_id INT, account_balance DECIMAL); CREATE TABLE shariah_loans (loan_id INT, mortgage_id INT); ``` **SQL:** ```sql SELECT AVG(CASE WHEN sm.customer_id IS NOT NULL THEN sm.account_balance ELSE srl.account_balance END) FROM shariah_mortgages sm RIGHT JOIN socially_responsible_loans srl ON sm.customer_id = srl.customer_id JOIN shariah_loans sl ON sm.mortgage_id = sl.mortgage_id OR srl.loan_id = sl.loan_id; ``` #### Example 3 — advanced / plpgsql_function **Source:** `community_sql_datasets` **Question:** Write the PL/pgSQL object from PostgreSQL regression test 'plpgsql' (example 352). **SQL:** ```sql $$ language plpgsql; select * from sc_test(); create or replace function sc_test() returns setof integer as $$ declare c refcursor; ``` **Explanation:** PL/pgSQL object from PostgreSQL core test for Plpgsql. #### Example 4 — advanced / query_window_function **Source:** `community_sql_datasets` **Question:** What is the difference in the number of attendees for each community education program between the first and last occurrence? **Schema:** ```sql CREATE TABLE community_education (program_name VARCHAR(255), location VARCHAR(255), date DATE, num_attendees INT); INSERT INTO community_education (program_name, location, date, num_attendees) VALUES ('Wildlife Awareness', 'New York', '2020-01-01', 50), ('Wildlife Awareness', 'Florida', '2020-03-10', 75), ('Nature Walk', 'California', '2019-05-15', 25), ('Nature Walk', 'California', '2020-05-15', 35); ``` **SQL:** ```sql SELECT program_name, num_attendees - FIRST_VALUE(num_attendees) OVER (PARTITION BY program_name ORDER BY date) as diff FROM community_education; ``` ## Citation If you use this dataset, please cite: ```bibtex @dataset{neurondb_postgresql_llm_2026, title={postgresql-llm: Pure PostgreSQL SQL \& PL/pgSQL Dataset}, author={NeuronDB Team}, year={2026}, url={https://huggingface.co/datasets/neurondb/postgresql-llm}, } ``` ## License Apache 2.0

提供机构：

neurondb

5,000+

优质数据集

54 个

任务类型

进入经典数据集