tjaffri/wikisql-generate
Source: Hugging Face. Updated 2023-06-09; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/tjaffri/wikisql-generate
Resource description:
---
license: bsd-3-clause
dataset_info:
  features:
  - name: input
    dtype: string
  - name: table_info
    dtype: string
  - name: sql_cmd
    dtype: string
  splits:
  - name: test
    num_bytes: 9526974
    num_examples: 15462
  - name: validation
    num_bytes: 5034756
    num_examples: 8243
  - name: train
    num_bytes: 33996901
    num_examples: 54963
  download_size: 11329076
  dataset_size: 48558631
---
# WikiSQL Dataset (Reformatted for Generative Models)
This dataset is derived from WikiSQL (https://huggingface.co/datasets/wikisql), with the data reformatted for direct use with text-generation LLMs and with queries that failed to execute removed. The license and credits of the original dataset remain in place.
Specifically, the changes from standard WikiSQL are:
1. WikiSQL stores table details as dictionaries, but tools like [LangChain](https://python.langchain.com/en/latest/modules/chains/examples/sqlite.html) and [LlamaIndex](https://medium.com/llamaindex-blog/combining-text-to-sql-with-semantic-search-for-retrieval-augmented-generation-c60af30ec3b) build their prompts from a SQL DESCRIBE of the tables. This dataset includes that description as `table_info`.
1. SQL commands in WikiSQL that were not syntactically valid (e.g. because identifiers were not quoted) were removed. Specifically, we created in-memory SQLite tables from the SQL DESCRIBE of the tables, then ran the WikiSQL human-readable SQL query against them. Queries that threw an exception for any reason were discarded; those that ran without exceptions are included in this dataset as `sql_cmd`.
1. The SQL queries under `sql_cmd` were also formatted with [SQLParse](https://sqlparse.readthedocs.io/en/latest/), which capitalizes keywords and applies other pretty-printing, to make the SQL more standard and easier for smaller models to learn.
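The filtering and formatting steps above can be sketched roughly as follows. This is a minimal illustration, not the script actually used to build the dataset: the helper names `runs_without_error` and `pretty`, the sample table, and the sample queries are all hypothetical, and it assumes `table_info` begins with an executable `CREATE TABLE` statement. The `sqlparse` import is optional here so that the filtering part runs with the standard library alone.

```python
import sqlite3

try:
    import sqlparse  # third-party; only needed for the pretty-printing step
except ImportError:
    sqlparse = None


def runs_without_error(table_info: str, sql_cmd: str) -> bool:
    """Return True if sql_cmd executes against tables built from table_info."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(table_info)     # build the in-memory table(s)
        conn.execute(sql_cmd).fetchall()   # run the candidate query
        return True
    except (sqlite3.Error, sqlite3.Warning):
        return False                       # any failure means: discard the query
    finally:
        conn.close()


def pretty(sql_cmd: str) -> str:
    """Upper-case keywords and reindent, if sqlparse is installed."""
    if sqlparse is None:
        return sql_cmd
    return sqlparse.format(sql_cmd, keyword_case="upper", reindent=True)


# Hypothetical table description and candidate queries (not taken from the dataset).
TABLE_INFO = 'CREATE TABLE games ("Home team" TEXT, "Score" INTEGER);'
candidates = [
    'select "Home team" from games where "Score" > 2',  # identifier quoted
    'select Home team from games where Score > 2',      # identifier not quoted
]

kept = [pretty(q) for q in candidates if runs_without_error(TABLE_INFO, q)]
print(kept)
```

In this sketch the second query is dropped because SQLite reads `Home team` as a column `Home` aliased to `team`, and no such column exists.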
# Suggested Uses
This dataset may be used for the following purposes:
1. Combining SQL queries with text-based retrieval, using techniques like the [LlamaIndex SQLAutoVectorQueryEngine](https://gpt-index.readthedocs.io/en/latest/examples/query_engine/SQLAutoVectorQueryEngine.html).
1. Fine-tuning LLMs to generate SQL commands from natural language inputs, given a SQL DESCRIBE of the tables and various rows. This is exactly the use case for the [LangChain](https://python.langchain.com/en/latest/modules/chains/examples/sqlite.html) SQLChain, so once fine-tuned, these LLMs may be used directly with these chains for theoretically better results (not tried at the time of writing).
1. Few-shot prompt seeding of LLMs used to generate SQL commands from natural language inputs.
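As an illustration of the few-shot use case, the sketch below assembles a prompt from records that have the dataset's three fields (`input`, `table_info`, `sql_cmd`). The prompt layout, the function name `build_few_shot_prompt`, and the sample records are assumptions made for this example; they are not a format prescribed by the dataset or by LangChain. In practice the records would come from the `train` split rather than being written by hand.

```python
from typing import Dict, List

Record = Dict[str, str]  # keys: "input", "table_info", "sql_cmd"


def build_few_shot_prompt(examples: List[Record], question: str, table_info: str) -> str:
    """Concatenate solved examples, then the new question with an empty SQL slot."""
    parts = []
    for ex in examples:
        parts.append(
            f"Table info:\n{ex['table_info']}\n"
            f"Question: {ex['input']}\n"
            f"SQL: {ex['sql_cmd']}\n"
        )
    # The final block leaves "SQL:" open for the model to complete.
    parts.append(f"Table info:\n{table_info}\nQuestion: {question}\nSQL:")
    return "\n".join(parts)


# Hypothetical records shaped like rows of this dataset.
examples = [
    {
        "input": "Which home teams scored more than 2?",
        "table_info": 'CREATE TABLE games ("Home team" TEXT, "Score" INTEGER)',
        "sql_cmd": 'SELECT "Home team" FROM games WHERE "Score" > 2',
    },
]

prompt = build_few_shot_prompt(
    examples,
    question="How many players are listed?",
    table_info='CREATE TABLE players ("Player" TEXT, "Position" TEXT)',
)
print(prompt)
```

The same record-to-text formatting could also serve as the input side of a fine-tuning pair, with `sql_cmd` as the target.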
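Provider:
tjaffri
Summary of original information
Dataset overview
Dataset name
- WikiSQL Dataset (Reformatted for Generative Models)
Dataset features
- input: string
- table_info: string
- sql_cmd: string
Dataset splits
- test: 15462 examples, 9526974 bytes in total
- validation: 8243 examples, 5034756 bytes in total
- train: 54963 examples, 33996901 bytes in total
Dataset size
- Download size: 11329076 bytes
- Total dataset size: 48558631 bytes
Dataset uses
- Combining SQL queries with text-based retrieval
- Fine-tuning language models to generate SQL commands
- Few-shot prompting for SQL command generation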



