motherduckdb/duckdb-text2sql-25k
Source: Hugging Face. Updated 2024-04-07; indexed 2024-04-19.
Download link:
https://hf-mirror.com/datasets/motherduckdb/duckdb-text2sql-25k
Resource description:
---
license: cc-by-sa-4.0
task_categories:
- text2text-generation
language:
- en
tags:
- text-2-sql
pretty_name: duckdb-text2sql-25k
size_categories:
- 10K<n<100K
---
# Dataset Summary
The duckdb-text2sql-25k dataset contains 25,000 DuckDB text-2-sql pairs covering diverse aspects of DuckDB's SQL syntax.
We synthesized this dataset using [Mixtral 8x7B](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1), based on [DuckDB's v0.9.2 documentation](https://duckdb.org/docs/archive/0.9/) and [Spider](https://huggingface.co/datasets/spider) schemas that were translated to DuckDB syntax and enriched with nested type columns.
Each training sample consists of a natural language prompt, a corresponding (optional) schema, and a resulting query. Each pair also has a category property that indicates which part of the documentation was used to generate the sample.
We applied various techniques to validate the syntactical and semantic correctness of the synthesized statements.
# How to use it
```python
from datasets import load_dataset

# Fetch the dataset from the Hugging Face Hub (requires the `datasets` package).
dataset = load_dataset("motherduckdb/duckdb-text2sql-25k")
```
We recommend using a prompt template similar to the one used for [DuckDB-NSQL-7B](https://huggingface.co/motherduckdb/DuckDB-NSQL-7B-v0.1#how-to-use) training.
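As a rough illustration of what such a template might look like, the sketch below assembles a prompt string from a sample's `schema` and `prompt` fields. The section headings, the instruction wording, and the `build_prompt` helper are assumptions made for this example, not the exact template from the DuckDB-NSQL-7B model card; consult the linked page for the wording used in training. The sample record is made up.

```python
def build_prompt(sample: dict) -> str:
    """Format one record (a dict with `prompt` and `schema` keys) as a model prompt.

    The layout below is a hypothetical template, not the official one.
    """
    sections = [
        "### Instruction:\n"
        "Your task is to generate valid DuckDB SQL to answer the following question."
    ]
    # The schema is optional, so the schema section is added only when one is present.
    schema = (sample.get("schema") or "").strip()
    if schema:
        sections.append(
            "### Input:\n"
            "Here is the database schema that the SQL query will run on:\n"
            + schema
        )
    sections.append("### Question:\n" + sample["prompt"].strip())
    # The model is expected to continue the text after this heading.
    sections.append("### Response:")
    return "\n\n".join(sections) + "\n"


# Made-up record for illustration; real records come from the loaded dataset.
example = {
    "prompt": "How many singers are there?",
    "schema": "CREATE TABLE singer (singer_id INTEGER, name VARCHAR);",
    "query": "SELECT count(*) FROM singer;",
    "category": "sql/aggregates",  # hypothetical category label
}
print(build_prompt(example))
```

For fine-tuning, the `query` field would typically be appended after the final heading as the target completion.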
# Dataset Structure
## Data Fields
- `prompt` (string): the instruction to generate SQL.
- `query` (string): the SQL statement.
- `schema` (string): the associated schema as CREATE TABLE statements.
- `category` (string): the category of the query.
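To get a feel for these fields, a small helper can count samples per `category` and check how many carry a schema. This is a minimal sketch: the `summarize` function, the sample records, and the category labels are made up for illustration, and it assumes that a missing schema shows up as an empty string or `None`.

```python
from collections import Counter

FIELDS = ("prompt", "query", "schema", "category")


def summarize(records):
    """Return (samples per category, number of samples with a non-empty schema)."""
    per_category = Counter()
    with_schema = 0
    for record in records:
        # Fail early if a record does not have the four documented fields.
        missing = [field for field in FIELDS if field not in record]
        if missing:
            raise KeyError(f"record is missing fields: {missing}")
        per_category[record["category"]] += 1
        # Assumption: an absent schema is stored as "" or None.
        if (record["schema"] or "").strip():
            with_schema += 1
    return per_category, with_schema


# Made-up records; the category labels are hypothetical.
records = [
    {"prompt": "How many singers are there?",
     "query": "SELECT count(*) FROM singer;",
     "schema": "CREATE TABLE singer (singer_id INTEGER, name VARCHAR);",
     "category": "sql/aggregates"},
    {"prompt": "Sum the numbers 1 to 3.",
     "query": "SELECT sum(x) FROM (VALUES (1), (2), (3)) t(x);",
     "schema": "",
     "category": "sql/aggregates"},
    {"prompt": "List each concert name.",
     "query": "SELECT name FROM concert;",
     "schema": "CREATE TABLE concert (concert_id INTEGER, name VARCHAR);",
     "category": "sql/query_syntax/select"},
]

per_category, with_schema = summarize(records)
print(per_category.most_common())
print(f"{with_schema} of {len(records)} records include a schema")
```

The same function should work on any iterable of row dictionaries, for example a split of the object returned by `load_dataset`.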
# Languages
The language of the data is primarily English.
# Source Data and Licensing Information
Schemas in this dataset are derived from [Spider](https://huggingface.co/datasets/spider), which is released under the CC-BY-SA-4.0 license. We publish our dataset under the same license.
Provider:
motherduckdb
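Summary of original information
Dataset overview
Dataset name
- pretty_name: duckdb-text2sql-25k
Dataset content
- Description: 25,000 DuckDB text-to-SQL pairs covering many aspects of DuckDB's SQL syntax.
- Generation: synthesized with Mixtral 8x7B from the DuckDB v0.9.2 documentation and Spider schemas that were translated to DuckDB syntax and enriched.
- Sample composition: each sample comprises a natural language prompt, a corresponding (optional) schema, the resulting query, and a category property indicating which part of the documentation was used to generate it.
Dataset structure
- Data fields:
  - prompt (string): the instruction to generate SQL.
  - query (string): the SQL statement.
  - schema (string): the associated schema as CREATE TABLE statements.
  - category (string): the category of the query.
Language
- Primary language: English
License information
- License: CC-BY-SA-4.0
- Source data license: schemas are derived from the Spider dataset, which is also licensed under CC-BY-SA-4.0.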



