motherduckdb/duckdb-text2sql-25k

Name: motherduckdb/duckdb-text2sql-25k
Creator: motherduckdb
Published: 2024-04-07 09:56:40
License: cc-by-sa-4.0

Source: Hugging Face. Updated: 2024-04-07. Indexed: 2024-04-19.

Download link:

https://hf-mirror.com/datasets/motherduckdb/duckdb-text2sql-25k

Resource description:

---
license: cc-by-sa-4.0
task_categories:
- text2text-generation
language:
- en
tags:
- text-2-sql
pretty_name: duckdb-text2sql-25k
size_categories:
- 10K<n<100K
---

# Dataset Summary

The duckdb-text2sql-25k dataset contains 25,000 DuckDB text-2-sql pairs covering diverse aspects of DuckDB's SQL syntax. We synthesized this dataset using [Mixtral 8x7B](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1), based on [DuckDB's v0.9.2 documentation](https://duckdb.org/docs/archive/0.9/) and [Spider](https://huggingface.co/datasets/spider) schemas that were translated to DuckDB syntax and enriched with nested type columns. Each training sample consists of a natural language prompt, a corresponding (optional) schema, and a resulting query. Each pair also has a category property that indicates which part of the documentation was used to generate the sample. We applied various techniques to validate the syntactic and semantic correctness of the synthesized statements.

# How to use it

```python
from datasets import load_dataset

dataset = load_dataset("motherduckdb/duckdb-text2sql-25k")
```

We recommend using a prompt template similar to the one used for [DuckDB-NSQL-7B](https://huggingface.co/motherduckdb/DuckDB-NSQL-7B-v0.1#how-to-use) training.

# Dataset Structure

## Data Fields

- `prompt` (string): the instruction to generate SQL.
- `query` (string): the SQL statement.
- `schema` (string): the associated schema as CREATE TABLE statements.
- `category` (string): the category of the query.

# Languages

The language of the data is primarily English.

# Source Data and Licensing Information

Schemas in this dataset are derived from [Spider](https://huggingface.co/datasets/spider), which is published under the CC-BY-SA-4.0 License. We publish our dataset under the same license.

Provider:

motherduckdb

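Summary of original information

Dataset overview

Dataset name

pretty_name: duckdb-text2sql-25k

Dataset contents

Description: Contains 25,000 DuckDB text-to-SQL pairs covering many aspects of DuckDB's SQL syntax.
Generation: Synthesized with the Mixtral 8x7B model, based on the DuckDB v0.9.2 documentation and on Spider dataset schemas that were translated to DuckDB syntax and enriched with nested type columns.
Sample composition: Each sample consists of a natural language prompt, a corresponding (optional) schema, and the resulting query, plus a category property indicating which part of the documentation was used to generate the sample.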
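
Because every sample carries a category property, a quick way to see how the documentation areas are covered is to tally that field. The following is a minimal sketch, not part of the dataset card: `category_counts` is a hypothetical helper, and the category labels in `sample_records` are invented placeholders rather than real values from the dataset.

```python
from collections import Counter
from typing import Iterable, Mapping


def category_counts(records: Iterable[Mapping[str, str]]) -> Counter:
    """Count how many samples belong to each `category` value."""
    return Counter(record["category"] for record in records)


# Hypothetical in-memory records that mimic the four documented fields.
# The category labels are placeholders, not actual dataset categories.
sample_records = [
    {"prompt": "Return the number 42.", "query": "SELECT 42;",
     "schema": "", "category": "category_a"},
    {"prompt": "Return the text 'duck'.", "query": "SELECT 'duck';",
     "schema": "", "category": "category_a"},
    {"prompt": "Count the rows in t.", "query": "SELECT count(*) FROM t;",
     "schema": "CREATE TABLE t (id INTEGER);", "category": "category_b"},
]

counts = category_counts(sample_records)
for category, n in counts.most_common():
    print(category, n)
```

The same function should work on any iterable of records with a `category` key, for example a split of the loaded dataset, assuming the split yields dictionary-like rows.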

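Dataset structure

Data fields:
- prompt (string): the instruction to generate SQL.
- query (string): the SQL statement.
- schema (string): the associated schema, expressed as CREATE TABLE statements.
- category (string): the category of the query.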
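
To illustrate how these fields might be combined for training, here is a minimal sketch that turns one record into a model input, leaving the schema section out when the schema is empty. The function name `build_prompt` and the `### Schema` / `### Question` / `### SQL` layout are assumptions made for illustration. This is not the DuckDB-NSQL-7B template that the dataset card recommends consulting.

```python
from typing import Mapping, Optional


def build_prompt(record: Mapping[str, Optional[str]]) -> str:
    """Assemble a model input from the `schema` and `prompt` fields.

    The section headers below are a hypothetical layout chosen for
    this sketch, not the template used to train DuckDB-NSQL-7B.
    """
    schema = (record.get("schema") or "").strip()
    parts = []
    if schema:  # the schema is optional, so skip the section when empty
        parts.append("### Schema:\n" + schema)
    parts.append("### Question:\n" + (record["prompt"] or "").strip())
    parts.append("### SQL:\n")  # the model is expected to continue from here
    return "\n\n".join(parts)


# Hypothetical records that follow the documented field names.
with_schema = build_prompt({
    "prompt": "Count the rows in t.",
    "query": "SELECT count(*) FROM t;",
    "schema": "CREATE TABLE t (id INTEGER);",
    "category": "category_b",
})
without_schema = build_prompt({
    "prompt": "Return the number 42.",
    "query": "SELECT 42;",
    "schema": "",
    "category": "category_a",
})
print(with_schema)
```

During fine-tuning, the `query` field would serve as the target text that follows the final `### SQL:` header.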

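Language

Primary language: English

License information

License: CC-BY-SA-4.0
Source data license: The schemas are derived from the Spider dataset, which is also published under the CC-BY-SA-4.0 license.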
