
jk200201/spider-dpo-1040

Source: Hugging Face. Updated 2026-03-26; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/jk200201/spider-dpo-1040
Description:
---
license: apache-2.0
language:
- en
tags:
- text-to-sql
- sql
- dpo
- preference
- spider
- code
task_categories:
- text-generation
pretty_name: "Spider DPO: Frontier Model Preference Pairs"
size_categories:
- 1K<n<10K
---

# Spider DPO: Frontier Model Preference Pairs

1,040 DPO preference pairs and 7,000 SFT examples built from the [Spider V1](https://github.com/taoyds/spider) benchmark by running two frontier models (Grok-4 and DeepSeek V3) and extracting preference signal from their disagreements.

Used to fine-tune [jk200201/qwen2.5-coder-7b-sql-dpo](https://huggingface.co/jk200201/qwen2.5-coder-7b-sql-dpo), a 7B model that outperforms both frontier models it was trained on.

| Model | Spider V1 Result Accuracy |
|-------|--------------------------|
| **Qwen2.5-Coder-7B (fine-tuned on this dataset)** | **78.2%** |
| Grok-4 (xAI) | 73.7% |
| DeepSeek V3 | 71.8% |

**GitHub:** [finetuning-text-to-sql](https://github.com/jenishkothari/finetuning-text-to-sql)

---

## Files

| File | Examples | Description |
|------|----------|-------------|
| `dpo_data.json` | 1,040 | DPO preference pairs (chosen / rejected SQL) |
| `sft_data.json` | 7,000 | SFT examples (Spider train set, gold SQL) |
| `dataset_info.json` | n/a | LLaMA-Factory dataset registry |

---

## How the DPO Pairs Were Built

1. Ran **Grok-4** and **DeepSeek V3** on all 1,034 Spider dev questions
2. Executed both generated SQLs and the gold SQL against the SQLite databases
3. Categorized every question into one of four buckets:

   | Category | Count | How handled |
   |----------|-------|-------------|
   | One model correct, one wrong | 118 | Direct preference pair |
   | Both correct, identical SQL | 215 | Skipped (no signal) |
   | Both correct, different SQL | 478 | Sent to LLM judge |
   | Both wrong | 223 questions → 446 pairs | Gold SQL as chosen |

4. For the 478 "both correct, different SQL" pairs: used **Gemini 2.5 Flash** as a third-party judge to pick the better query (position bias controlled via random A/B swap, seed 42)
5. Filtered 2 degenerate pairs → **1,040 usable DPO pairs**

### Chosen source breakdown

| Source | Count |
|--------|-------|
| Gold SQL | 445 |
| DeepSeek V3 | 354 |
| Grok-4 | 241 |

---

## Data Format

### dpo_data.json: DPO pairs (LLaMA-Factory alpaca format)

```json
{
  "instruction": "You are an expert SQL query generator...\n\n### Schema:\nCREATE TABLE ...\n\n### Question:\nHow many singers performed in 2015?",
  "input": "",
  "chosen": "SELECT COUNT(DISTINCT singer_id) FROM singer_in_concert WHERE year = 2015",
  "rejected": "SELECT COUNT(*) FROM singers WHERE year = 2015"
}
```

### sft_data.json: SFT examples (LLaMA-Factory alpaca format)

```json
{
  "instruction": "You are an expert SQL query generator...\n\n### Schema:\nCREATE TABLE ...\n\n### Question:\nHow many singers performed in 2015?",
  "input": "",
  "output": "SELECT COUNT(DISTINCT singer_id) FROM singer_in_concert WHERE year = 2015"
}
```

---

## How to Use

```python
import json

# Load DPO pairs
with open("dpo_data.json") as f:
    dpo_pairs = json.load(f)

print(f"Total DPO pairs: {len(dpo_pairs)}")
print(dpo_pairs[0].keys())
# dict_keys(['instruction', 'input', 'chosen', 'rejected'])

# Load SFT examples
with open("sft_data.json") as f:
    sft_examples = json.load(f)

print(f"Total SFT examples: {len(sft_examples)}")
```

With LLaMA-Factory, register via `dataset_info.json` and reference as `spider_dpo` / `spider_sft` in your training config.
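To reproduce or audit the four buckets described under "How the DPO Pairs Were Built", a minimal sketch of execution-based categorization follows. It is an illustration under stated assumptions, not the repository's actual script: the helper names `run_query`, `normalize`, and `categorize` are hypothetical, the toy `singer` table stands in for a Spider SQLite database, result rows are compared as unordered multisets, and "identical SQL" is taken to mean equal after whitespace normalization.

```python
import sqlite3
from collections import Counter


def run_query(conn, sql):
    """Execute sql and return its rows as an unordered multiset, or None on error."""
    try:
        return Counter(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None


def normalize(sql):
    """Collapse whitespace and drop a trailing semicolon (assumed notion of 'identical')."""
    return " ".join(sql.split()).rstrip(";").strip()


def categorize(conn, gold_sql, sql_a, sql_b):
    """Place one question into one of the four buckets by execution match against gold."""
    gold = run_query(conn, gold_sql)
    a_ok = gold is not None and run_query(conn, sql_a) == gold
    b_ok = gold is not None and run_query(conn, sql_b) == gold
    if a_ok != b_ok:
        return "one_correct"
    if not a_ok:
        return "both_wrong"
    if normalize(sql_a) == normalize(sql_b):
        return "both_correct_identical"
    return "both_correct_different"


# Toy database standing in for one Spider SQLite file (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE singer (singer_id INTEGER PRIMARY KEY, name TEXT, age INTEGER);
    INSERT INTO singer VALUES (1, 'Ann', 30), (2, 'Bo', 41), (3, 'Cy', 41);
    """
)
gold = "SELECT COUNT(*) FROM singer WHERE age > 35"

print(categorize(conn, gold, gold, "SELECT COUNT(*) FROM singer"))
print(categorize(conn, gold, gold, "SELECT COUNT(singer_id) FROM singer WHERE age >= 36"))
print(categorize(conn, gold, gold, gold + ";"))
print(categorize(conn, gold, "SELECT COUNT(*) FROM singers", "SELECT 0"))
```

Comparing rows as multisets ignores row order, which may be too lenient for questions that require an `ORDER BY`; an order-sensitive comparison would be the stricter alternative.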
---

## Citation

If you use this dataset, please cite the Spider benchmark:

```bibtex
@inproceedings{Yu2018Spider,
  title     = {Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task},
  author    = {Tao Yu and Rui Zhang and Kai Yang and Michihiro Yasunaga and Dongxu Wang and Zifan Li and James Ma and Irene Li and Qingning Yao and Shanelle Roman and Zilin Zhang and Dragomir Radev},
  booktitle = {EMNLP},
  year      = {2018}
}
```
Provided by:
jk200201