birdsql/livesqlbench-large-v1

Name: birdsql/livesqlbench-large-v1
Creator: birdsql
Published: 2026-03-02 12:49:55
License: No description available

Hugging Face, updated 2026-03-02, indexed 2026-04-05

Download link:

https://hf-mirror.com/datasets/birdsql/livesqlbench-large-v1

Resource description:

---
license: cc-by-4.0
configs:
- config_name: livesqlbench
  data_files:
  - path: livesqlbench_large_v1_data.jsonl
    split: dev
viewer: true
tags:
- text-to-sql
- database
---

# 🚀 LiveSQLBench-Large-v1

*A dynamic, **contamination-free** benchmark for evaluating LLMs on complex, real-world **text-to-SQL** tasks at **industrial scale**.*

[🌐 Website/Leaderboard](https://livesqlbench.ai) • [📄 Paper (coming soon)](https://arxiv.org) • [💻 GitHub](https://github.com/bird-bench/livesqlbench) • [🗄️ LiveSQLBench-Base-Lite](https://huggingface.co/datasets/birdsql/livesqlbench-base-lite) • [🗄️ LiveSQLBench-Base-Full-v1](https://huggingface.co/datasets/birdsql/livesqlbench-base-full-v1) • [🗄️ Bird-Interact](https://bird-interact.github.io) (ICLR 2026 Oral)

Maintained by the **🦜 [BIRD Team @ HKU](https://bird-bench.github.io)** & **☁️ [Google Cloud](https://cloud.google.com/)**

## 📊 LiveSQLBench Overview

**LiveSQLBench** (BIRD-SQL Pro v0.5) is a **contamination-free**, **continuously evolving** benchmark designed to evaluate LLMs on **complex, real-world text-to-SQL tasks**, featuring **diverse real-world user queries**, including **Business Intelligence (BI)**, **CRUD operations**, and more. Each release will include **around 20 new, fully open-source DBs** curated by the BIRD team through expert collaboration and continuous improvement. It will cover a **wide range of database sizes**, from **end-user level** (around 127 columns) to **industrial level** (~1000 columns). LiveSQLBench serves as the infrastructure for **[BIRD-Interact](https://bird-interact.github.io) (ICLR 2026 Oral)**, a dynamic, human-centric, interactive text-to-SQL benchmark.

Here are the features of the LiveSQLBench benchmark:

1. **🗄️ Live Databases:** Constructed dynamically from extensive and regularly updated CSV datasets, with both base (user-end level) and large (industrial level) versions (1340+ columns per DB) to test scalability.
2. **💬 Live User Queries and SQL:** Each task pairs an unambiguous user query with an annotated, gold-standard SQL statement. The user queries are grounded in an external knowledge base, and the solution SQL statements range from medium to hard complexity.
3. **🧠 Contextual Reasoning (HKB):** Every DB includes a hierarchical knowledge base (HKB) in which each knowledge entry may depend on others, so solving a task requires multi-hop reasoning. Two HKB formats are provided: (1) structured JSON format, and (2) unstructured document format.
4. **🔍 The First Full SQL Spectrum:** Supports not just SELECT (Business Intelligence) queries, but also CRUD queries (e.g., UPDATE, CREATE, and other database management operations).
5. **⚡ Automated Evaluation:** Supports fast evaluation via a PostgreSQL template and Docker. Each question includes verifiable test cases for accurate, reproducible scoring. The Soft EX metric is used to evaluate SELECT-only tasks; customized test cases are designed for DBA tasks, such as CRUD (CREATE, READ, UPDATE, DELETE).
6. **🔄 Truly Live & Hidden Test:** New databases and tasks are added over time. Each release features both open development and hidden test phases. The hidden test set from each release becomes the open development set for the next release, ensuring continuous evolution and fair evaluation.

## Previous Releases

- [LiveSQLBench-Base-Lite](https://huggingface.co/datasets/birdsql/livesqlbench-base-lite)
- [LiveSQLBench-Base-Lite-Sqlite](https://huggingface.co/datasets/birdsql/livesqlbench-base-lite-sqlite)
- [LiveSQLBench-Base-Full-v1](https://huggingface.co/datasets/birdsql/livesqlbench-base-full-v1)

## 🎯 Current Release: LiveSQLBench-Large-v1

We are pleased to release **LiveSQLBench-Large-v1**, the **industrial-scale** counterpart of [LiveSQLBench-Base-Full-v1](https://huggingface.co/datasets/birdsql/livesqlbench-base-full-v1). This release contains **18 industrial-level databases** with **480 tasks**, **HKB-JSON**, and **JSON operations in SQL**.

### What's Different from Base-Full-v1?

LiveSQLBench-Large-v1 massively scales up the database complexity to simulate **real-world industrial environments**:

- **~1K columns per database** (986 avg cols/db vs. 93 in Base-Full-v1), a **10x** increase in schema complexity.
- **~54 tables per database** (vs. 11 in Base-Full-v1), nearly **5x** more tables per database.
- **~2M total rows** across all databases (vs. 275K in Base-Full-v1).
- **~84K prompt tokens on average** (vs. 14K in Base-Full-v1), a **6x** increase that pushes the limits of long-context understanding.
- **Business Rule Drift ("Live" Context Learning)**: Versioned knowledge-base settings are introduced to capture evolving business definitions (e.g., a concept is "A" in v1.0 but "B" in v1.1), exposing temporal rule drift and cross-version inconsistencies.

These factors make LiveSQLBench-Large-v1 a rigorous stress test for live and long-context reasoning, schema navigation, and SQL generation at scale.

### Key Features

- **18 Industrial-Scale Databases**: Each database contains ~54 tables and ~986 columns on average, reflecting real-world enterprise schema complexity.
- **480 Expert-Curated Tasks**: Covering both SELECT (BI) and Management (CRUD) operations.
- **Long-Context Challenge**: With an average prompt length of ~84K tokens (min ~59K, max ~121K), models must effectively process and reason over massive schema and knowledge contexts.
- **Business Rule Drift ("Live" Context Learning)**: Business rules and definitions can evolve and conflict over time in real business scenarios, requiring context-aware reasoning aligned with user intent and demand. Thus, we will introduce versioned knowledge-base settings (e.g., a business concept is defined as "A" in v1.0 but "B" in v1.1) that explicitly expose temporal rule drift and cross-version inconsistencies for temporal context-learning evaluation.

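To make the rule-drift idea concrete, here is a toy sketch of looking up a definition under a given knowledge-base version. Everything in it is hypothetical: the `RuleVersion` record, the `definition_as_of` helper, and the "latest version at or before the requested one wins" policy are illustrative assumptions, not the dataset's actual versioned format, which this card does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuleVersion:
    """Hypothetical record: one definition of a business concept at one version."""
    version: tuple  # e.g. (1, 0) for "v1.0"
    definition: str


def parse_version(label):
    """Turn a label such as "v1.1" into (1, 1) so versions compare numerically."""
    return tuple(int(part) for part in label.lstrip("v").split("."))


def definition_as_of(history, label):
    """Return the definition in force at `label`.

    Assumed policy: the newest entry whose version is <= the requested one.
    """
    target = parse_version(label)
    applicable = [rule for rule in history if rule.version <= target]
    if not applicable:
        raise KeyError(f"no definition at or before {label}")
    return max(applicable, key=lambda rule: rule.version).definition


# The example from the text: a concept is "A" in v1.0 but "B" in v1.1.
history = [
    RuleVersion(parse_version("v1.0"), "A"),
    RuleVersion(parse_version("v1.1"), "B"),
]
```

Under these assumptions, the same question can map to different SQL depending on whether the user's context refers to v1.0 or v1.1, which is the kind of inconsistency the benchmark aims to expose.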
## 💻 How to Use the Dataset

### Get the Dataset

Download the dataset, which contains the data file `livesqlbench_large_v1_data.jsonl` and the DB metafiles (schema, HKB, and column-meaning files), with:

```bash
git clone https://huggingface.co/datasets/birdsql/livesqlbench-large-v1
```

### ❗️❗️❗️Get the Ground Truth & Test Case by Email❗️❗️❗️

To prevent data leakage through automated crawling, please request access to the ground truth and test cases by emailing **[📧 bird.bench25@gmail.com](mailto:bird.bench25@gmail.com)** with the subject line `[livesqlbench-large-v1 GT&Test Cases]`. An automated response will provide these data fields within 30 minutes.

### Get the Database DDL Dumps and Building Scripts

The complete PostgreSQL **database dumps** and **building scripts** (`init-databases_postgresql_large.sh`) can be downloaded from [the Google Drive](https://drive.google.com/file/d/1u1L-SvJtOZGfcIST-dINw8DnGEQDMu6C/view?usp=sharing).

### Evaluation

For usage and evaluation details, see the [livesqlbench repo](https://github.com/bird-bench/livesqlbench).

## 📁 Directory Structure

Each database has its own directory:

```
.
├── README.md
├── database_name_large
│   ├── database_name_large_column_meaning_base.json
│   ├── database_name_large_kb.jsonl
│   ├── database_name_large_schema.txt
...
├── livesqlbench_large_v1_data.jsonl
```

### 📂 Directory Contents:

* `*_schema.txt`: Database schema.
* `*_kb.jsonl`: Hierarchical knowledge base entries required to solve the user task.
  * `id`: The unique identifier for the knowledge.
  * `knowledge`: The name of the knowledge.
  * `description`: The description of the knowledge.
  * `definition`: The clear definition of the knowledge.
  * `type`: The type of the knowledge.
  * `children_knowledge`: A list of knowledge IDs that the current knowledge depends on. -1 means no children.
* `*_column_meaning_base.json`: Explanation of database columns.

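The `children_knowledge` links in a `*_kb.jsonl` file can be followed to find everything a knowledge entry depends on. The following is a minimal sketch, not part of the official tooling: `load_jsonl`, `children_of`, and `knowledge_closure` are hypothetical helper names, and the code assumes that "no children" may be stored either as a bare `-1` or as `[-1]`.

```python
import json


def load_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def children_of(entry):
    """Return the child knowledge IDs of one KB entry.

    Assumption: -1 (alone or inside a list) marks "no children".
    """
    raw = entry.get("children_knowledge", -1)
    ids = raw if isinstance(raw, list) else [raw]
    return [kid for kid in ids if kid != -1]


def knowledge_closure(kb_entries, start_id):
    """Collect start_id plus every ID it depends on, directly or transitively."""
    by_id = {entry["id"]: entry for entry in kb_entries}
    visited = set()
    order = []
    stack = [start_id]
    while stack:
        kid = stack.pop()
        # Skip IDs already handled (also guards against cycles) or missing from the KB.
        if kid in visited or kid not in by_id:
            continue
        visited.add(kid)
        order.append(kid)
        stack.extend(children_of(by_id[kid]))
    return order
```

After cloning, one might call `load_jsonl` on a file such as `database_name_large/database_name_large_kb.jsonl` (the placeholder name from the tree above) and pass the result to `knowledge_closure` to gather the multi-hop context for a single knowledge ID.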
## 📋 Dataset Fields (`livesqlbench_large_v1_data.jsonl`):

* **instance\_id**: Unique task identifier.
* **selected\_database**: Associated database name.
* **query**: More natural user query (which is used in evaluation and our leaderboard).
* **normal\_query**: The normal user query, which is more concise and direct. Just for reference.
* **sol\_sql** 🔒: Ground truth SQL solution.
* **external\_knowledge** 🔒: IDs of required external knowledge to solve the user task.
* **preprocess\_sql**: SQL setup queries.
* **clean\_up\_sql**: SQL queries to reset database state.
* **test\_cases** 🔒: Test cases to validate the predicted SQL.
* **category**: "Query" (SELECT-only) or "Management" (CRUD).
* **high\_level**: Boolean indicating whether the user query contains a high-level description.
* **conditions**: Indicates decimal/distinct conditions in the user query.
* **difficulty\_tier**: Task difficulty (Simple, Moderate, Challenging).

## 🔒 Accessing Complete Data

To avoid data leakage by auto-crawling, certain fields (e.g., `sol_sql`, `test_cases`, `external_knowledge`) are excluded from the public dataset. For the full dataset, please email **[📧 bird.bench25@gmail.com](mailto:bird.bench25@gmail.com)** with the subject tag `[livesqlbench-large-v1 GT&Test Cases]`; the data will be sent automatically within 30 minutes.

## 🏆 Model Performance on LiveSQLBench-Large-v1

Several models have been evaluated on LiveSQLBench-Large-v1; the results are listed in the table. For more details, please refer to our homepage: [🌐 LiveSQLBench](https://livesqlbench.ai)

| Model | Success Rate (%) |
|---|---:|
| claude-opus-4.6 | 33.96 |
| gpt-5.3-codex | 29.38 |

## 🔄 Stay Tuned!

Upcoming releases:

- [x] **🔄 LiveSQLBench-Base-Lite-Sqlite:** SQLite version of LiveSQLBench-Base-Lite. No Docker setup needed.
- [x] **🔄 LiveSQLBench-Base-Full:** 600 tasks with more natural user queries and real DBs.
- [x] **🔄 LiveSQLBench-Large-v1:** Industrial-scale databases with ~1000 columns.
- [ ] **🔄 Keep it Live!**: New DB envs and tasks will be added periodically with **hidden test sets**, keeping the benchmark truly live and contamination-free.
- [ ] **💻 LiveSQLBench-CLI**: CLI-interactive variant adapted for coding agents (e.g., Claude Code, OpenHands), where agents explore databases and iteratively solve SQL tasks through terminal workflows.

Want new dialects? Vote for new SQL dialects [🗳️ here](https://docs.google.com/forms/d/e/1FAIpQLSfEogmsA7LObI13KOoiojdnYfW28KEqvEVtC9hXaZJ8O9aCpQ/viewform?usp=header)!

## 📄 License: cc-by-sa-4.0

Provider:

birdsql
