wenyupapa/BIRD-Verified-CoT-2462-GPT5.4
Source: Hugging Face · Updated 2026-04-16 · Indexed 2026-04-26
Download link: https://hf-mirror.com/datasets/wenyupapa/BIRD-Verified-CoT-2462-GPT5.4
Resource summary:
---
language:
- en
license: cc-by-sa-4.0
size_categories:
- 1K<n<10K
task_categories:
- text-generation
- question-answering
task_ids:
- closed-domain-qa
pretty_name: BIRD-Verified-CoT-2462 (GPT-5.4 distilled)
tags:
- text-to-sql
- chain-of-thought
- bird-benchmark
- sqlite
- reasoning
- distillation
- gpt-5.4
- opensearch-sql
configs:
- config_name: default
  data_files:
  - split: train
    path: bird-verified-cot-2462.parquet
- config_name: valid_only
  data_files:
  - split: train
    path: bird-verified-cot-2462-valid.parquet
---
# BIRD-Verified-CoT-2462 (GPT-5.4 distilled)
> **Likely the first publicly available CoT-augmented Text-to-SQL dataset built on top of expert-verified BIRD data.**
This dataset combines two state-of-the-art ingredients:
- **ReViSQL's BIRD-Verified subset**: 2,462 examples verified by SQL experts (multi-round review by the UIUC team), which removes the ~52% gold-SQL error rate that ReViSQL's audit found in the original BIRD train set.
- **GPT-5.4 (via Codex CLI)**: used as the teacher model to generate structured 6-section Chain-of-Thought traces following OpenSearch-SQL's `prompts_fewshot_parse2` methodology.
## Dataset Summary
| Stat | Value |
|---|---|
| **Total examples** | 2,462 |
| **VALID_IDENTICAL rate** | **100.00%** (CoT's `#SQL` exactly matches gold after normalization) |
| **Unique databases** | 69 (subset of BIRD train DBs) |
| **Teacher model** | GPT-5.4 (`reasoning_effort=medium`) |
| **Methodology** | OpenSearch-SQL 5-field structured CoT |
| **Total cost** | ~$27 USD (≈ $0.011 / example) |
| **Generation time** | ~85 minutes (4-shard parallel) |
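The per-example figures follow from the totals in the table. The short calculation below reproduces them. The per-shard latency is only an estimate and assumes, hypothetically, that the four shards were equal in size and each processed its examples sequentially.

```python
total_examples = 2462
total_cost_usd = 27.0          # approximate total from the table
total_minutes = 85.0           # approximate wall-clock time from the table
num_shards = 4

# Average cost per example.
cost_per_example = total_cost_usd / total_examples

# Wall-clock seconds per example across all shards combined.
seconds_per_example = total_minutes * 60 / total_examples

# Assumption: equal-sized shards, one request at a time per shard.
seconds_per_example_per_shard = seconds_per_example * num_shards

print(f"cost per example: ${cost_per_example:.4f}")
print(f"aggregate seconds per example: {seconds_per_example:.2f}")
print(f"estimated per-shard latency: {seconds_per_example_per_shard:.2f} s")
```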
## Why This Dataset?
### Problem with raw BIRD train
The original BIRD train set (9,428 examples) has been shown to contain **~52% wrong gold SQL** by ReViSQL's expert audit ([arXiv:2603.20004](https://arxiv.org/abs/2603.20004)) and similar findings ([Wretblad 2024](https://arxiv.org/abs/2402.12243), [Jin 2026](https://arxiv.org/abs/2601.08778)). Training or few-shot retrieval on noisy gold SQL teaches models incorrect reasoning.
### Existing CoT datasets inherit the noise
Public BIRD-CoT datasets (e.g. `koookiy/BIRD-SQL-data-train-CoT`) generate CoT directly from the original BIRD SQL, so when the gold SQL is wrong, the CoT rationalizes an incorrect query. No filtering step is documented for them.
### Our approach
1. **Start from ReViSQL's 2,462 expert-verified examples** (each gold SQL went through 2-4 rounds of SQL expert review)
2. **Distill structured 6-section CoT** with GPT-5.4, using OpenSearch-SQL's `prompts_fewshot_parse2` template
3. **Ground every CoT in the verified gold** — string-match validation confirms 100% of generated `#SQL` matches the verified gold
## CoT Structure (6 Sections)
Each example's `reasoning` field follows this exact format:
```text
#reason: 1-3 sentences explaining the high-level query intent
#columns: comma-separated table.column references used in the SQL
#SELECT: natural language mention → column mapping
#values: WHERE filter values
#SQL-like: pseudo-SQL ignoring JOIN details
#SQL: the verified gold SQL (verbatim)
```
The `parsed` field contains these 6 sections as a structured dict.
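As an illustration of how the `reasoning` string relates to the `parsed` dict, here is a minimal parsing sketch. It is not the parser used to build the dataset. The function name `parse_cot` and the strict "all six tags, in order" check are assumptions made for this example; the dict keys follow the Data Fields table.

```python
import re

# Section tags in the order they appear in the `reasoning` field.
SECTION_TAGS = ["#reason", "#columns", "#SELECT", "#values", "#SQL-like", "#SQL"]

# Keys of the `parsed` dict, in the same order as SECTION_TAGS.
PARSED_KEYS = ["reason", "columns", "select", "values", "sql_like", "sql"]

# A tag counts only at the start of a line and must be followed by a colon,
# so "#SQL-like:" is never confused with "#SQL:".
_TAG_RE = re.compile(
    r"^(" + "|".join(re.escape(tag) for tag in SECTION_TAGS) + r"):",
    re.MULTILINE,
)


def parse_cot(reasoning: str) -> dict:
    """Split a 6-section CoT string into a dict keyed like `parsed`."""
    matches = list(_TAG_RE.finditer(reasoning))
    found = [m.group(1) for m in matches]
    if found != SECTION_TAGS:
        raise ValueError(f"expected sections {SECTION_TAGS}, found {found}")
    parsed = {}
    for i, match in enumerate(matches):
        # A section runs until the next tag, or to the end of the string.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(reasoning)
        parsed[PARSED_KEYS[i]] = reasoning[match.end():end].strip()
    return parsed


# Hypothetical trace, shortened for illustration.
sample = (
    "#reason: The question asks how many heads exist.\n"
    "#columns: head.head_id\n"
    '#SELECT: "heads" refer to head.head_id\n'
    "#values: none\n"
    "#SQL-like: Show COUNT(head.head_id)\n"
    "#SQL: SELECT COUNT(head_id) FROM head;"
)
parsed = parse_cot(sample)
```

Raising on a missing or out-of-order tag makes malformed traces fail loudly rather than yield a partially filled dict.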
### Example
**Question**: `Name the titles of the movies with the most ratings.`
**Evidence**: `movie with the most rating refers to MAX(COUNT(ratings));`
**Verified Gold SQL**: complex JOIN + GROUP BY + HAVING with subquery (replacing the noisy original `GROUP BY movie_title ORDER BY COUNT DESC LIMIT 1`)
**Generated CoT**:
```text
#reason: The question asks for the movie title(s) whose number of ratings is the highest. This can be decomposed into counting ratings per movie, finding the maximum such count, and returning the corresponding movies.movie_title.
#columns: movies.movie_title, movies.movie_id, ratings.movie_id
#SELECT: "titles of the movies" refer to movies.movie_title, "the movies with the most ratings" refer to COUNT(*) grouped by movies.movie_id and movies.movie_title and compared to the maximum count
#values: none
#SQL-like: Show movies.movie_title, group by movies.movie_id, movies.movie_title, having COUNT(*) = (show COUNT(*), group by movies.movie_id, movies.movie_title, order by COUNT(*) DESC, limit 1)
#SQL: SELECT m.movie_title FROM movies AS m JOIN ratings AS r ON m.movie_id = r.movie_id GROUP BY m.movie_id, m.movie_title HAVING COUNT(*) = (SELECT COUNT(*) FROM movies AS m JOIN ratings AS r ON m.movie_id = r.movie_id GROUP BY m.movie_id, m.movie_title ORDER BY COUNT(*) DESC LIMIT 1);
```
## Data Fields
| Field | Type | Description |
|---|---|---|
| `question_id` | int | Unique identifier (from ReViSQL) |
| `db_id` | str | Database name (matches `ReViSQL/data/ddls/{db_id}.sql`) |
| `question` | str | Natural language question (ReViSQL refined) |
| `evidence` | str | Hint / external knowledge (ReViSQL refined) |
| `SQL` | str | **Verified gold SQL** (after multi-round expert review) |
| `original_question` | str | Original BIRD question (may be noisy) |
| `original_evidence` | str | Original BIRD evidence (may be noisy) |
| `original_SQL` | str | Original BIRD gold SQL (~52% wrong per ReViSQL audit) |
| `grading_method` | str | ReViSQL's grading scheme tag |
| `schema` | str | DDL schema (CREATE TABLE statements) used at generation time |
| `reasoning` | str | Full 6-section CoT output from GPT-5.4 |
| `parsed` | dict | Parsed 6 sections (`reason`, `columns`, `select`, `values`, `sql_like`, `sql`) |
| `validation_status` | str | `VALID_IDENTICAL` (100%) — `#SQL` matches gold after normalization |
| `validation_method` | str | `string_match_normalized` |
| `generation_time_ms` | int | Wall-clock generation time per example |
| `prompt_chars` / `response_chars` | int | Character counts |
| `rough_cost_estimate_usd` | float | Per-example cost estimate (CLI doesn't report actuals) |
| `teacher_model` | str | `gpt-5.4` |
| `teacher_backend` | str | `codex-cli` |
| `methodology` | str | `opensearch_structured_5field_v1_ddl` |
| `timestamp` | str | ISO-8601 generation timestamp |
`prompt_tokens`, `completion_tokens`, `reasoning_tokens`, and `total_cost_usd` are present but `null`, because the Codex CLI does not reliably report these values per call.
## Configurations
- `default` — full 2,462 examples
- `valid_only` — same 2,462 (since 100% are `VALID_IDENTICAL` in this release)
```python
from datasets import load_dataset
ds = load_dataset("wenyupapa/BIRD-Verified-CoT-2462-GPT5.4", split="train")
print(ds[0])
```
## Intended Uses
- **Few-shot retrieval pool** for Text-to-SQL systems (replaces noisy raw BIRD few-shots)
- **SFT training data** for distilling reasoning capabilities into smaller SQL models
- **Schema-linking training signal** — `#columns` field provides ground-truth column relevance
- **Method comparison** — clean baseline for comparing CoT generation strategies
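The SFT and schema-linking uses above can be sketched with two small helpers that operate on a single record. The field names (`schema`, `question`, `evidence`, `reasoning`, `parsed["columns"]`) come from the Data Fields table. The prompt layout and the helper names `to_sft_pair` and `linked_columns` are hypothetical choices for this example, not part of the dataset.

```python
def to_sft_pair(example: dict) -> dict:
    """Build a prompt/completion pair from one dataset record.

    The prompt layout is an illustrative assumption, not a prescribed format.
    """
    prompt = (
        f"/* Database schema */\n{example['schema']}\n\n"
        f"/* Question */\n{example['question']}\n\n"
        f"/* Evidence */\n{example['evidence']}\n"
    )
    # The completion is the full 6-section CoT, which ends with the gold SQL.
    return {"prompt": prompt, "completion": example["reasoning"]}


def linked_columns(example: dict) -> set:
    """Return (table, column) pairs listed in the `#columns` section.

    Assumes plain comma-separated `table.column` references; identifiers
    that contain commas or dots would need a real SQL tokenizer.
    """
    pairs = set()
    for ref in example["parsed"]["columns"].split(","):
        ref = ref.strip()
        if "." in ref:
            table, column = ref.split(".", 1)
            pairs.add((table.strip(), column.strip()))
    return pairs


# Hypothetical record with only the fields these helpers read.
record = {
    "schema": "CREATE TABLE movies (movie_id INTEGER, movie_title TEXT);",
    "question": "Name the titles of the movies with the most ratings.",
    "evidence": "movie with the most rating refers to MAX(COUNT(ratings));",
    "reasoning": "#reason: ...",
    "parsed": {"columns": "movies.movie_title, movies.movie_id, ratings.movie_id"},
}
pair = to_sft_pair(record)
columns = linked_columns(record)
```

With the `datasets` library, `ds.map(to_sft_pair)` would apply the first helper to every record.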
### Out of Scope
- This dataset is **not** for evaluating SQL execution correctness directly — see BIRD's official dev/test sets for that
- The `#SQL` field equals the verified gold by design — do not use it as a generation target without considering this
- The 69 BIRD train databases themselves are **not** included; use ReViSQL's `ddls/` (also publicly available) for the schemas
## Methodology
### Source
- **Questions / SQL**: `ReViSQL/data/bird-verified-{train,val}.json` (UIUC, [github.com/uiuc-kang-lab/ReViSQL](https://github.com/uiuc-kang-lab/ReViSQL))
- **Schema (DDL)**: `ReViSQL/data/ddls/{db_id}.sql`
### Distillation Pipeline
```text
For each of 2,462 examples:
  1. Load gold + question + evidence + DDL schema
  2. Build OpenSearch-SQL `prompts_fewshot_parse2` template (4 in-context examples)
  3. Call GPT-5.4 via codex CLI (reasoning_effort=medium, prompt via stdin)
  4. Parse 6 sections (#reason, #columns, #SELECT, #values, #SQL-like, #SQL)
  5. String-match validation (normalize_sql(cot.sql) == normalize_sql(gold))
```
### Validation
Because the BIRD train databases were not available in the generation environment, **execution-based validation is not performed**. Instead, we use **normalized string match**:
```python
import re

def normalize_sql(sql):
    """Collapse whitespace, drop a trailing semicolon, and lowercase."""
    sql = re.sub(r"\s+", " ", sql).strip()
    sql = sql.rstrip(";").rstrip()
    return sql.lower()

# 100% of examples pass normalize_sql(cot.sql) == normalize_sql(gold_sql)
```
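This confirms that the `#SQL` section of every trace reproduces the verified gold SQL, up to whitespace, letter case, and a trailing semicolon (no rewritten or hallucinated queries). The check covers only `#SQL`; the other five sections are not validated.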
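To make the scope of this check concrete, the sketch below applies the same normalization steps to a few hypothetical query strings (the table and column names are invented). It also shows a side effect: because the whole string is lowercased, two queries that differ only in the case of a string literal compare as equal. SQLite's default `=` comparison on text is case-sensitive, so such a pair would not be equivalent at execution time.

```python
import re


def normalize_sql(sql: str) -> str:
    """Same steps as the validation snippet: whitespace, semicolon, case."""
    sql = re.sub(r"\s+", " ", sql).strip()
    sql = sql.rstrip(";").rstrip()
    return sql.lower()


gold = "SELECT name FROM city WHERE country = 'USA';"

# Differs from gold only in whitespace, keyword case, and the semicolon.
reformatted = "select  name\nfrom city where country = 'USA'"
matches_reformatted = normalize_sql(reformatted) == normalize_sql(gold)

# Differs from gold in the case of the string literal; lowercasing hides it.
literal_case = "SELECT name FROM city WHERE country = 'usa'"
matches_literal_case = normalize_sql(literal_case) == normalize_sql(gold)

# A structurally different query is rejected.
different = "SELECT name FROM city"
matches_different = normalize_sql(different) == normalize_sql(gold)

print(matches_reformatted, matches_literal_case, matches_different)
```

A stricter variant could lowercase only the text outside quoted literals, at the cost of a more involved tokenizer.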
## Comparison with Related Datasets
| Dataset | Source | Noise | CoT | Methodology |
|---|---|---|---|---|
| `koookiy/BIRD-SQL-data-train-CoT` | Raw BIRD train (~50% noisy) | Inherited | DeepSeek-R1 (?) | Undocumented |
| `seeklhy/SynSQL-2.5M` | Synthetic (no BIRD) | None | Yes | OmniSQL multi-vote |
| `shehab44/bird23-train-filtered` | BIRD train + execution filter | Reduced | **No** | Arctic-Text2SQL filter |
| **This dataset** | **ReViSQL Verified** (cleaned) | **None** | **GPT-5.4 (verified)** | **OpenSearch 5-field** |
## Limitations
1. **Domain coverage** — Limited to 69 BIRD train databases; not all SQL domains represented
2. **No execution verification** — String match used as proxy (justified: gold is expert-verified)
3. **`#SQL` ≡ gold** — Cannot be used as supervised generation target naively
4. **English only** — Questions are English; multi-lingual SQL not addressed
## Citation
If you use this dataset, please cite the upstream sources:
```bibtex
@article{revisql2025,
title={ReViSQL: Verified BIRD Training Data via Expert Review},
author={Zhu and Jin and Choi and Kang},
journal={arXiv preprint arXiv:2603.20004},
year={2026}
}
@article{opensearchsql2025,
title={OpenSearch-SQL: Enhancing Text-to-SQL with Dynamic Few-shot and Consistency Alignment},
author={Xie, Xiangjin and Xu, Guangwei and Zhao, Lingyan and Guo, Ruijie},
journal={Proceedings of the ACM on Management of Data (SIGMOD)},
year={2025}
}
@article{birdbench2024,
title={Can LLM Already Serve as A Database Interface? A BIg Bench for Large-Scale Database Grounded Text-to-SQLs},
author={Li, Jinyang and others},
journal={NeurIPS},
year={2023}
}
```
## License
CC BY-SA 4.0 (inheriting from BIRD-bench's original license).
## Acknowledgments
- **ReViSQL team** (UIUC Kang Lab) for the expert-verified BIRD subset
- **OpenSearch-SQL team** (Alibaba Cloud) for the structured 5-field CoT methodology
- **BIRD-bench team** for the underlying benchmark
- **OpenAI** for GPT-5.4
- **Anthropic** for Claude (used in dataset card preparation)
## Contact
Issues / feedback: please open a discussion on this dataset's HuggingFace page.
Provider: wenyupapa



