jk200201/spider-dpo-1040

Name: jk200201/spider-dpo-1040
Creator: jk200201
Published: 2026-03-26 13:21:09
License: No description provided

Hugging Face · Updated 2026-03-26 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/jk200201/spider-dpo-1040

Resource description:

---
license: apache-2.0
language:
- en
tags:
- text-to-sql
- sql
- dpo
- preference
- spider
- code
task_categories:
- text-generation
pretty_name: Spider DPO — Frontier Model Preference Pairs
size_categories:
- 1K<n<10K
---

# Spider DPO: Frontier Model Preference Pairs

1,040 DPO preference pairs and 7,000 SFT examples built from the [Spider V1](https://github.com/taoyds/spider) benchmark by running two frontier models (Grok-4 and DeepSeek V3) and extracting preference signal from their disagreements.

The dataset was used to fine-tune [jk200201/qwen2.5-coder-7b-sql-dpo](https://huggingface.co/jk200201/qwen2.5-coder-7b-sql-dpo), a 7B model that outperforms both frontier models whose outputs supplied its preference signal.

| Model | Spider V1 Result Accuracy |
|-------|--------------------------|
| **Qwen2.5-Coder-7B (fine-tuned on this dataset)** | **78.2%** |
| Grok-4 (xAI) | 73.7% |
| DeepSeek V3 | 71.8% |

**GitHub:** [finetuning-text-to-sql](https://github.com/jenishkothari/finetuning-text-to-sql)

---

## Files

| File | Examples | Description |
|------|----------|-------------|
| `dpo_data.json` | 1,040 | DPO preference pairs (chosen / rejected SQL) |
| `sft_data.json` | 7,000 | SFT examples (Spider train set, gold SQL) |
| `dataset_info.json` | n/a | LLaMA-Factory dataset registry |

---

## How the DPO Pairs Were Built

1. Ran **Grok-4** and **DeepSeek V3** on all 1,034 Spider dev questions.
2. Executed both generated SQL queries and the gold SQL against the SQLite databases.
3. Categorized every question into one of four buckets:

| Category | Count | How handled |
|----------|-------|-------------|
| One model correct, one wrong | 118 | Direct preference pair |
| Both correct, identical SQL | 215 | Skipped (no preference signal) |
| Both correct, different SQL | 478 | Sent to LLM judge |
| Both wrong | 223 questions → 446 pairs | Gold SQL as chosen, each model's SQL as rejected |

4. For the 478 "both correct, different SQL" pairs, used **Gemini 2.5 Flash** as a third-party judge to pick the better query (position bias controlled via random A/B swap, seed 42).
5. Filtered out 2 degenerate pairs, leaving **1,040 usable DPO pairs** (118 + 478 + 446 − 2).

### Chosen source breakdown

| Source | Count |
|--------|-------|
| Gold SQL | 445 |
| DeepSeek V3 | 354 |
| Grok-4 | 241 |

---

## Data Format

### dpo_data.json: DPO pairs (LLaMA-Factory alpaca format)

```json
{
  "instruction": "You are an expert SQL query generator...\n\n### Schema:\nCREATE TABLE ...\n\n### Question:\nHow many singers performed in 2015?",
  "input": "",
  "chosen": "SELECT COUNT(DISTINCT singer_id) FROM singer_in_concert WHERE year = 2015",
  "rejected": "SELECT COUNT(*) FROM singers WHERE year = 2015"
}
```

### sft_data.json: SFT examples (LLaMA-Factory alpaca format)

```json
{
  "instruction": "You are an expert SQL query generator...\n\n### Schema:\nCREATE TABLE ...\n\n### Question:\nHow many singers performed in 2015?",
  "input": "",
  "output": "SELECT COUNT(DISTINCT singer_id) FROM singer_in_concert WHERE year = 2015"
}
```

---

## How to Use

```python
import json

# Load DPO pairs (a JSON list of dicts)
with open("dpo_data.json", encoding="utf-8") as f:
    dpo_pairs = json.load(f)

print(f"Total DPO pairs: {len(dpo_pairs)}")
print(dpo_pairs[0].keys())
# dict_keys(['instruction', 'input', 'chosen', 'rejected'])

# Load SFT examples
with open("sft_data.json", encoding="utf-8") as f:
    sft_examples = json.load(f)

print(f"Total SFT examples: {len(sft_examples)}")
```

With LLaMA-Factory, register the files via `dataset_info.json` and reference them as `spider_dpo` / `spider_sft` in your training config.
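If you train outside LLaMA-Factory, the alpaca-style records may need reshaping first. The sketch below is one possible way to map them to the `prompt` / `chosen` / `rejected` layout that preference-tuning libraries such as TRL commonly expect. The helper name `to_preference_records` and the rule for joining a non-empty `input` onto the instruction are assumptions made for illustration; they are not part of the dataset or its repository.

```python
from typing import Dict, List


def to_preference_records(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Map alpaca-style DPO rows to prompt/chosen/rejected dicts (hypothetical helper)."""
    records = []
    for row in rows:
        prompt = row["instruction"]
        # Assumption: a non-empty "input" is appended to the instruction.
        if row.get("input"):
            prompt = f"{prompt}\n\n{row['input']}"
        records.append(
            {"prompt": prompt, "chosen": row["chosen"], "rejected": row["rejected"]}
        )
    return records


# One record shaped like the dpo_data.json example in this card.
sample = [
    {
        "instruction": "You are an expert SQL query generator...\n\n"
        "### Question:\nHow many singers performed in 2015?",
        "input": "",
        "chosen": "SELECT COUNT(DISTINCT singer_id) FROM singer_in_concert WHERE year = 2015",
        "rejected": "SELECT COUNT(*) FROM singers WHERE year = 2015",
    }
]

converted = to_preference_records(sample)
print(sorted(converted[0].keys()))
```

In practice you would pass the list loaded from `dpo_data.json` instead of `sample`.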
---

## Citation

If you use this dataset, please cite the Spider benchmark:

```bibtex
@inproceedings{Yu2018Spider,
  title     = {Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task},
  author    = {Tao Yu and Rui Zhang and Kai Yang and Michihiro Yasunaga and Dongxu Wang and Zifan Li and James Ma and Irene Li and Qingning Yao and Shanelle Roman and Zilin Zhang and Dragomir Radev},
  booktitle = {EMNLP},
  year      = {2018}
}
```
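As a supplement to step 3 of "How the DPO Pairs Were Built", the following is a minimal sketch of how execution-based bucketing could work. It is not the author's pipeline: the function names, the bucket labels, the order-insensitive comparison of result rows, and the whitespace-only normalization used to decide whether two queries are "identical" are all assumptions, and the toy `singer` table stands in for a real Spider database.

```python
import sqlite3


def run_query(conn, sql):
    """Execute sql; return rows in a canonical order, or None if it fails."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    # Assumption: row order is ignored when comparing results.
    return sorted(rows, key=repr)


def normalize(sql):
    """Collapse whitespace and drop a trailing semicolon (assumed notion of 'identical')."""
    return " ".join(sql.strip().rstrip(";").split())


def categorize(conn, gold_sql, sql_a, sql_b):
    """Assign one question to one of the four buckets (hypothetical labels)."""
    gold = run_query(conn, gold_sql)
    if gold is None:
        raise ValueError("gold SQL failed to execute")
    a_ok = run_query(conn, sql_a) == gold
    b_ok = run_query(conn, sql_b) == gold
    if a_ok and b_ok:
        if normalize(sql_a) == normalize(sql_b):
            return "both_correct_identical"
        return "both_correct_different"
    if a_ok or b_ok:
        return "one_correct"
    return "both_wrong"


# Toy in-memory database standing in for a Spider SQLite file.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE singer (singer_id INTEGER PRIMARY KEY, name TEXT, age INTEGER);
    INSERT INTO singer VALUES (1, 'Ann', 30), (2, 'Bob', 45), (3, 'Cy', 45);
    """
)

gold = "SELECT COUNT(*) FROM singer WHERE age > 40"
alt = "SELECT COUNT(singer_id) FROM singer WHERE age >= 41"
wrong = "SELECT COUNT(*) FROM singer"
broken = "SELECT name FROM no_such_table"

print(categorize(conn, gold, gold, wrong))
print(categorize(conn, gold, gold, alt))
print(categorize(conn, gold, wrong, broken))
```

A query that raises an SQLite error is treated as wrong rather than aborting the run, so a single malformed generation does not stop the whole categorization pass.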

Provider:

jk200201
