
motherduckdb/duckdb-text2sql-25k

Source: Hugging Face. Updated 2024-04-07. Indexed 2024-04-19.
Download link:
https://hf-mirror.com/datasets/motherduckdb/duckdb-text2sql-25k
Resource summary:
---
license: cc-by-sa-4.0
task_categories:
- text2text-generation
language:
- en
tags:
- text-2-sql
pretty_name: duckdb-text2sql-25k
size_categories:
- 10K<n<100K
---

# Dataset Summary

The duckdb-text2sql-25k dataset contains 25,000 DuckDB text-2-sql pairs covering diverse aspects of DuckDB's SQL syntax. We synthesized this dataset using [Mixtral 8x7B](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1), based on [DuckDB's v0.9.2 documentation](https://duckdb.org/docs/archive/0.9/) and [Spider](https://huggingface.co/datasets/spider) schemas that were translated to DuckDB syntax and enriched with nested type columns. Each training sample consists of a natural language prompt, a corresponding (optional) schema, and a resulting query. Each pair also has a category property that indicates which part of the documentation was used to generate the sample. We applied various techniques to validate the syntactic and semantic correctness of the synthesized statements.

# How to use it

```python
from datasets import load_dataset

dataset = load_dataset("motherduckdb/duckdb-text2sql-25k")
```

We recommend using a prompt template similar to the one used for [DuckDB-NSQL-7B](https://huggingface.co/motherduckdb/DuckDB-NSQL-7B-v0.1#how-to-use) training.

# Dataset Structure

## Data Fields

- `prompt` (string): the instruction to generate SQL.
- `query` (string): the SQL statement.
- `schema` (string): the associated schema as CREATE TABLE statements.
- `category` (string): the category of the query.

# Languages

The language of the data is primarily English.

# Source Data and Licensing Information

Schemas in this dataset are derived from [Spider](https://huggingface.co/datasets/spider), which is released under the CC-BY-SA-4.0 License. We publish our dataset under the same license.
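The card recommends a prompt template but does not spell one out. As a minimal sketch, assuming a simple schema-then-question layout, a training prompt could be assembled from a record's fields as follows. The function name `build_prompt`, the `###` section headers, and the sample record (including its category value) are illustrative assumptions, not the template documented on the DuckDB-NSQL-7B model card.

```python
def build_prompt(record: dict) -> str:
    """Assemble a text-to-SQL prompt from one dataset record.

    The layout is an assumption for illustration only; see the
    DuckDB-NSQL-7B model card for the template used in training.
    """
    schema = (record.get("schema") or "").strip()
    question = record["prompt"].strip()

    parts = []
    if schema:  # the schema field is optional, so skip it when empty
        parts.append("### Schema:\n" + schema)
    parts.append("### Question:\n" + question)
    parts.append("### SQL:\n")  # the model is expected to continue from here
    return "\n\n".join(parts)


# Hypothetical record carrying the four documented fields.
sample = {
    "prompt": "Count the rows in the singer table.",
    "query": "SELECT count(*) FROM singer;",
    "schema": "CREATE TABLE singer (id INTEGER, name VARCHAR);",
    "category": "sql/aggregates",  # made-up category value
}
text = build_prompt(sample)
```

For supervised fine-tuning, the record's `query` field would typically serve as the target completion appended after the final header.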
Provider:
motherduckdb
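Summary of Original Information

Dataset Overview

Dataset Name

  • pretty_name: duckdb-text2sql-25k

Dataset Content

  • Description: Contains 25,000 DuckDB text-to-SQL pairs covering diverse aspects of DuckDB's SQL syntax.
  • Generation method: Synthesized with the Mixtral 8x7B model, based on the DuckDB v0.9.2 documentation and Spider dataset schemas that were translated to DuckDB syntax and enriched.
  • Sample composition: Each sample includes a natural language prompt, a corresponding (optional) schema, the resulting query, and a category property indicating which part of the documentation was used to generate the sample.

Dataset Structure

  • Data fields:
    • prompt (string): the instruction to generate SQL.
    • query (string): the SQL statement.
    • schema (string): the associated schema, expressed as CREATE TABLE statements.
    • category (string): the category of the query.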
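
Because every record carries a `category` field, a quick sanity check is to confirm that the four documented fields are present and to count samples per category. The sketch below is one way to do that over any iterable of record dictionaries. The helper name `summarize` and the sample records (including their category values) are hypothetical.

```python
from collections import Counter

# The four fields documented for this dataset.
REQUIRED_FIELDS = ("prompt", "query", "schema", "category")


def summarize(records):
    """Verify each record has the documented fields and count categories."""
    counts = Counter()
    for i, rec in enumerate(records):
        missing = [f for f in REQUIRED_FIELDS if f not in rec]
        if missing:
            raise KeyError(f"record {i} is missing fields: {missing}")
        counts[rec["category"]] += 1
    return counts


# Hypothetical records; real category values may differ.
records = [
    {"prompt": "List all tables.", "query": "SHOW TABLES;",
     "schema": "", "category": "guides/meta"},
    {"prompt": "Count singers.", "query": "SELECT count(*) FROM singer;",
     "schema": "CREATE TABLE singer (id INTEGER);", "category": "sql/aggregates"},
    {"prompt": "Sum ages.", "query": "SELECT sum(age) FROM singer;",
     "schema": "CREATE TABLE singer (age INTEGER);", "category": "sql/aggregates"},
]
category_counts = summarize(records)
```

The same function should accept a split loaded with `load_dataset`, since iterating a split yields one dictionary per record.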

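Languages

  • Primary language: English

Licensing Information

  • License: CC-BY-SA-4.0
  • Source data license: Schemas are derived from the Spider dataset, which also uses the CC-BY-SA-4.0 license.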