zerolink/zsql-sqlite-dpo

Name: zerolink/zsql-sqlite-dpo
Creator: zerolink
Published: 2024-02-02 18:37:15
License: No description available

Hugging Face, updated 2024-02-02, indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/zerolink/zsql-sqlite-dpo

Resource overview:

---
license: other
license_name: other
license_link: https://github.com/zerolink-io/zsql-sqlite-dpo
dataset_info:
  features:
  - name: schema
    dtype: string
  - name: question
    dtype: string
  - name: rejected
    dtype: string
  - name: chosen
    dtype: string
  - name: weight
    dtype: float64
  splits:
  - name: train
    num_bytes: 244244555.38278434
    num_examples: 234268
  - name: test
    num_bytes: 27138515.617215652
    num_examples: 26030
  download_size: 86245275
  dataset_size: 271383071
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
language_creators:
- crowdsourced
- expert-generated
task_categories:
- text2text-generation
- text-generation
language:
- en
tags:
- dpo
- text-to-sql
- sql
size_categories:
- 100K<n<1M
---

# zsql-sqlite-dpo

This is a dataset for training machine learning models to convert natural-language English text into SQL queries in the SQLite dialect.

This dataset comprises over 200,000 DPO pairs (234,268 train and 26,030 test examples) curated to support the rapid development of text-to-SQL generation models. What sets it apart is its optimization process: the "chosen" field of each pair contains a SQL query that has been canonicalized and optimized, and that was selected from the candidate set as the one that minimizes syntactic cyclomatic and asymptotic complexity against the given schema.

Direct Preference Optimization (see [Rafailov et al., 2023](https://arxiv.org/abs/2305.18290)) is a novel approach to refining large-scale unsupervised language models from positive and negative samples so that their behavior aligns with human preferences. The method simplifies fine-tuning, making it more stable and computationally efficient without extensive hyperparameter tuning or LM sampling, and it has been shown to control model outputs effectively, matching or surpassing existing methods.

The source data is cleaned and filtered based on the following criteria:

- Remove queries that are not in English.
- Remove queries that are not valid SQL queries.
- Remove queries that are not executable against the given schema.
- Remove queries that are executed against tables with non-Latin characters.
- Remove queries that use features not supported by the given database.
- Remove long queries that contain domain-specific knowledge and cause model confusion.
- Remove queries that do not fit within a 4096-token context window.

## Usage

To load the dataset using the HuggingFace `datasets` library:

```python
from datasets import load_dataset

dataset = load_dataset("zerolink/zsql-sqlite-dpo")
```

To use in model fine-tuning, apply the following chat tokenizer:

```python
from transformers import AutoTokenizer

# Placeholder (not specified in the original): name or path of a chat model
# whose tokenizer defines a chat template.
model = "<base-model-name-or-path>"
tokenizer = AutoTokenizer.from_pretrained(model)

def tokenize(element):
    schema = element["schema"]
    question = element["question"]
    answer = element["chosen"]

    prompt = f"""
    Using the schema:
    {schema}
    Generate SQL for the following question:
    {question}
    """

    system = "Translate English to SQLite SQL."
    message = [
        {"role": "system", "content": system},
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": answer},
    ]
    # With tokenize=True the template is rendered and returned as token IDs.
    output = tokenizer.apply_chat_template(
        message, add_generation_prompt=False, tokenize=True
    )
    return {"text": output}
```

## Fields

The fields in this dataset are as follows:

| Field Name | Description |
| ---------- | ----------- |
| schema | The schema of the database. |
| question | The natural language question. |
| chosen | The DPO preferred SQL query. |
| rejected | The DPO rejected SQL query. |
| weight | The weight of the query in the reward function. |

## Sources

This dataset is derived from the following sources:

| Source | License | External Link |
| ------ | ------- | ------------- |
| wikisql | BSD 3-Clause | [https://github.com/salesforce/WikiSQL](https://github.com/salesforce/WikiSQL) |
| spider | CC-BY-SA-4.0 | [https://huggingface.co/datasets/spider](https://huggingface.co/datasets/spider) |
| sql_create_context | CC-BY-4.0 | [https://huggingface.co/datasets/b-mc2/sql-create-context](https://huggingface.co/datasets/b-mc2/sql-create-context) |
| squall | CC-BY-SA-4.0 | [https://github.com/tzshi/squall](https://github.com/tzshi/squall) |
| sede | Apache-2.0 | [https://github.com/hirupert/sede](https://github.com/hirupert/sede) |
| nvbench | MIT | [https://github.com/TsinghuaDatabaseGroup/nvBench](https://github.com/TsinghuaDatabaseGroup/nvBench) |
| imdb | Not Found | [https://github.com/jkkummerfeld/text2sql-data](https://github.com/jkkummerfeld/text2sql-data) |
| advising | CC-BY-4.0 | [https://github.com/jkkummerfeld/text2sql-data](https://github.com/jkkummerfeld/text2sql-data) |
| atis | Not Found | [https://github.com/jkkummerfeld/text2sql-data](https://github.com/jkkummerfeld/text2sql-data) |
| restaurants | Not Found | [https://github.com/jkkummerfeld/text2sql-data](https://github.com/jkkummerfeld/text2sql-data) |
| scholar | Not Found | [https://github.com/jkkummerfeld/text2sql-data](https://github.com/jkkummerfeld/text2sql-data) |
| yelp | Not Found | [https://github.com/jkkummerfeld/text2sql-data](https://github.com/jkkummerfeld/text2sql-data) |
| academic | Not Found | [https://github.com/jkkummerfeld/text2sql-data](https://github.com/jkkummerfeld/text2sql-data) |
| criteria2sql | Apache-2.0 | [https://github.com/xiaojingyu92/Criteria2SQL](https://github.com/xiaojingyu92/Criteria2SQL) |
| eICU | CC-BY-4.0 | [https://github.com/glee4810/EHRSQL](https://github.com/glee4810/EHRSQL) |
| mimic_iii | CC-BY-4.0 | [https://github.com/glee4810/EHRSQL](https://github.com/glee4810/EHRSQL) |
| mimicsql_data | MIT | [https://github.com/wangpinggl/TREQS](https://github.com/wangpinggl/TREQS) |
| worldsoccerdatabase | CC-BY-SA-4.0 | [https://github.com/chiahsuan156/KaggleDBQA](https://github.com/chiahsuan156/KaggleDBQA) |
| whatcdhiphop | CC-BY-SA-4.0 | [https://github.com/chiahsuan156/KaggleDBQA](https://github.com/chiahsuan156/KaggleDBQA) |
| studentmathscore | CC-BY-SA-4.0 | [https://github.com/chiahsuan156/KaggleDBQA](https://github.com/chiahsuan156/KaggleDBQA) |
| pesticide | CC-BY-SA-4.0 | [https://github.com/chiahsuan156/KaggleDBQA](https://github.com/chiahsuan156/KaggleDBQA) |
| thehistoryofbaseball | CC-BY-SA-4.0 | [https://github.com/chiahsuan156/KaggleDBQA](https://github.com/chiahsuan156/KaggleDBQA) |
| uswildfires | CC-BY-SA-4.0 | [https://github.com/chiahsuan156/KaggleDBQA](https://github.com/chiahsuan156/KaggleDBQA) |
| geonucleardata | CC-BY-SA-4.0 | [https://github.com/chiahsuan156/KaggleDBQA](https://github.com/chiahsuan156/KaggleDBQA) |
| greatermanchestercrime | CC-BY-SA-4.0 | [https://github.com/chiahsuan156/KaggleDBQA](https://github.com/chiahsuan156/KaggleDBQA) |

Composition:

![Composition](https://raw.githubusercontent.com/zerolink-io/zsql-sqlite-dpo/d8eb36601fc5cfc35da9bb9d98cc5d72451f7dd4/composition.png)

## License

This dataset is provided for academic and research purposes. Please adhere to the specified license terms and conditions for usage and distribution.

Provider:

zerolink

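Summary of Original Information

zsql-sqlite-dpo Dataset Overview

Dataset Description

The zsql-sqlite-dpo dataset is used to train machine learning models that convert natural-language text into SQL queries in the SQLite dialect. It contains over 200,000 DPO pairs and is intended to support the rapid development of text-to-SQL generation models. The "chosen" field of each pair contains a SQL query that has been canonicalized, optimized, and selected from the candidate set as the one that minimizes syntactic cyclomatic and asymptotic complexity against the given schema.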
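For context, the way a chosen/rejected pair is typically consumed in training can be illustrated with the per-example DPO loss from Rafailov et al. (2023). The following is a minimal sketch in plain Python, not this dataset's training code: the function name `dpo_loss`, its argument names, and the default `beta = 0.1` are illustrative assumptions, and in practice the log-probabilities would come from the policy being trained and a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log(sigmoid(beta * (chosen - rejected log-ratio)))."""
    # Log-ratios of the policy against the reference model (assumed inputs).
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(x)) equals log(1 + exp(-x)); branch to avoid overflow.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and the reference assign identical log-probabilities, the margin is zero and the loss equals log 2; the loss falls as the policy favors the chosen query over the rejected one more strongly than the reference does.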

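Dataset Features

schema: The database schema; data type string.
question: The natural language question; data type string.
rejected: The rejected SQL query; data type string.
chosen: The chosen SQL query; data type string.
weight: The weight of the query in the reward function; data type floating point (float64).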
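As an illustration of how these fields fit together, preference-training libraries such as TRL's `DPOTrainer` generally expect records with `prompt`, `chosen`, and `rejected` keys. The sketch below maps one row into that shape, reusing the prompt wording from the dataset card's tokenizer example. The helper name `to_preference_record` and the sample row are hypothetical and are not taken from the dataset.

```python
def to_preference_record(row):
    """Map one dataset row to a prompt/chosen/rejected record."""
    prompt = (
        "Using the schema:\n"
        f"{row['schema']}\n"
        "Generate SQL for the following question:\n"
        f"{row['question']}\n"
    )
    return {"prompt": prompt, "chosen": row["chosen"], "rejected": row["rejected"]}


# Hypothetical row for illustration only; not an actual dataset record.
example_row = {
    "schema": "CREATE TABLE singer (id INTEGER, name TEXT, age INTEGER)",
    "question": "How many singers are older than 30?",
    "chosen": "SELECT COUNT(*) FROM singer WHERE age > 30",
    "rejected": "SELECT COUNT(id) FROM singer WHERE age >= 30",
    "weight": 1.0,
}
record = to_preference_record(example_row)
```

The `weight` field is deliberately left out of the record here, because the source does not specify how it enters the reward function.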

Dataset Splits

train: Training split with 234,268 examples, 244,244,555.38 bytes.
test: Test split with 26,030 examples, 27,138,515.62 bytes.

Dataset Size

Download size: 86,245,275 bytes.
Dataset size: 271,383,071 bytes.
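The split and size figures above can be cross-checked with a few lines of arithmetic. This is only a sanity-check sketch using the numbers quoted in the metadata; the variable names are illustrative.

```python
# Figures copied from the dataset metadata.
train_examples, test_examples = 234_268, 26_030
train_bytes, test_bytes = 244_244_555.38278434, 27_138_515.617215652
download_size, dataset_size = 86_245_275, 271_383_071

total_examples = train_examples + test_examples
test_fraction = test_examples / total_examples
# The per-split byte counts are fractional, so compare with a tolerance.
bytes_match = abs((train_bytes + test_bytes) - dataset_size) < 1.0
size_ratio = download_size / dataset_size

print(f"total examples: {total_examples}")
print(f"test fraction: {test_fraction:.3f}")
print(f"split bytes sum to dataset_size: {bytes_match}")
print(f"download/dataset size ratio: {size_ratio:.2f}")
```

The two splits total 260,298 examples, with the test split holding about 10% of them, and the per-split byte counts add up to the reported dataset size.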

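Dataset Configuration

default: The default configuration, with data file paths for the train and test splits.
- train: data/train-*
- test: data/test-*

Language and Task Categories

Language creators: crowdsourced and expert-generated.
Task categories: text2text-generation, text-generation.
Language: English.
Tags: dpo, text-to-sql, sql.
Size category: 100K<n<1M.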

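Data Cleaning and Filtering Criteria

Remove queries that are not in English.
Remove queries that are not valid SQL.
Remove queries that cannot be executed against the given schema.
Remove queries that run against tables containing non-Latin characters.
Remove queries that use features not supported by the given database.
Remove long queries whose domain-specific knowledge causes model confusion.
Remove queries that exceed a 4096-token context window.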
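Two of the criteria above (valid SQL, and executable against the given schema) can be approximated with Python's standard `sqlite3` module. The sketch below is an assumption about how such a check might look, not the authors' actual pipeline: it builds the schema in an in-memory database and asks SQLite to compile the query with `EXPLAIN`, which fails on syntax errors and on unknown tables or columns. The function name `is_executable` is hypothetical.

```python
import sqlite3


def is_executable(schema, query):
    """Return True if `query` compiles against `schema` in SQLite."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema)         # create empty tables from the DDL
        conn.execute("EXPLAIN " + query)   # compile the query; no data needed
        return True
    except (sqlite3.Error, sqlite3.Warning):
        # Syntax errors, missing tables/columns, or multi-statement input.
        return False
    finally:
        conn.close()
```

A filter along these lines would keep a row only when both its `chosen` and `rejected` queries pass the check. The other criteria (language detection, token counts, domain-specific content) would need separate tooling.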

Usage

Load the dataset with the HuggingFace datasets library:

    from datasets import load_dataset

    dataset = load_dataset("zerolink/zsql-sqlite-dpo")

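License

This dataset is provided for academic and research purposes only. Please adhere to the specified license terms and conditions for usage and distribution.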
