tjaffri/wikisql-generate

Hugging Face | Updated 2023-06-09 | Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/tjaffri/wikisql-generate
Resource description:
---
license: bsd-3-clause
dataset_info:
  features:
  - name: input
    dtype: string
  - name: table_info
    dtype: string
  - name: sql_cmd
    dtype: string
  splits:
  - name: test
    num_bytes: 9526974
    num_examples: 15462
  - name: validation
    num_bytes: 5034756
    num_examples: 8243
  - name: train
    num_bytes: 33996901
    num_examples: 54963
  download_size: 11329076
  dataset_size: 48558631
---

# WikiSQL Dataset (Reformatted for Generative Models)

This is the exact same dataset as WikiSQL: https://huggingface.co/datasets/wikisql, but with the data reformatted to allow direct use with text generation LLMs. The original license and credits for the original dataset remain in place.

Specifically, the changes from standard WikiSQL are:

1. The table details in WikiSQL were included as dictionaries but tools like [LangChain](https://python.langchain.com/en/latest/modules/chains/examples/sqlite.html) and [LlamaIndex](https://medium.com/llamaindex-blog/combining-text-to-sql-with-semantic-search-for-retrieval-augmented-generation-c60af30ec3b) build their prompts using a SQL DESCRIBE of the tables, which is included in this dataset as the table_info.
1. In addition, some of the SQL commands in WikiSQL that were not syntactically valid (e.g. due to identifiers not quoted) were removed. Specifically, we created in-memory (SQLite) tables using the SQL DESCRIBE of the tables, then ran the WikiSQL human readable SQL query against these in-memory tables. Any SQL queries that threw exceptions for any reason were discarded, and the rest that ran without exceptions were included in this dataset as the sql_cmd.
1. The SQL queries under sql_cmd were also formatted to capitalize keywords and do other pretty printing of the SQL using [SQLParse](https://sqlparse.readthedocs.io/en/latest/) to make the SQL more standard and easier to learn for smaller models.

# Suggested Uses

This dataset may be used for the following purposes:

1. Combine SQL queries with text based retrieval, using techniques like the [LlamaIndex SQLAutoVectorQueryEngine](https://gpt-index.readthedocs.io/en/latest/examples/query_engine/SQLAutoVectorQueryEngine.html).
1. Fine tuning LLMs to generate SQL commands from natural language inputs, given SQL DESCRIBE of tables and various rows. This is exactly the use case for the [LangChain](https://python.langchain.com/en/latest/modules/chains/examples/sqlite.html) SQLChain, so once fine tuned these LLMs may be used directly with these chains for theoretically better results (not tried at the time of writing).
1. Few shot prompt seeding of LLMs used to generate SQL commands from natural language inputs.
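
The filtering step described above (build an in-memory SQLite table, run the query, discard anything that raises) can be sketched roughly as follows. This is a minimal illustration and not the dataset author's actual script. It assumes that `table_info` holds DDL that SQLite can execute directly; the helper name `runs_without_error` and the two sample rows are hypothetical.

```python
import sqlite3


def runs_without_error(table_info: str, sql_cmd: str) -> bool:
    """Return True if sql_cmd executes against tables created from table_info."""
    conn = sqlite3.connect(":memory:")
    try:
        # Assumption: table_info is executable DDL (e.g. a CREATE TABLE statement).
        conn.executescript(table_info)
        conn.execute(sql_cmd).fetchall()
        return True
    except (sqlite3.Error, sqlite3.Warning):
        # Any failure (syntax error, unknown column, ...) means the row is discarded.
        return False
    finally:
        conn.close()


# Hypothetical rows: the second query leaves the identifier "No." unquoted,
# which SQLite rejects as a syntax error.
DDL = 'CREATE TABLE players ("Player" TEXT, "No." TEXT)'
rows = [
    {"table_info": DDL, "sql_cmd": 'SELECT "Player" FROM players WHERE "No." = 42'},
    {"table_info": DDL, "sql_cmd": "SELECT Player FROM players WHERE No. = 42"},
]

kept = [r for r in rows if runs_without_error(r["table_info"], r["sql_cmd"])]
print(len(kept))
```

A fresh connection per row keeps the tables of one example from leaking into the next, at the cost of some speed.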
Provider:
tjaffri
Summary of original information

Dataset overview

Dataset name

  • WikiSQL Dataset (Reformatted for Generative Models)

Dataset features

  • input: string
  • table_info: string
  • sql_cmd: string

Dataset splits

  • test: 15462 examples, 9526974 bytes in total
  • validation: 8243 examples, 5034756 bytes in total
  • train: 54963 examples, 33996901 bytes in total

Dataset size

  • Download size: 11329076 bytes
  • Total dataset size: 48558631 bytes
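
As a quick consistency check on the figures above, the per-split byte counts can be summed and compared with the reported total dataset size. The snippet below is only a sketch that hard-codes the numbers listed on this page; the variable names are illustrative.

```python
# Figures copied from the split and size listings above.
splits = {
    "test": {"num_examples": 15462, "num_bytes": 9526974},
    "validation": {"num_examples": 8243, "num_bytes": 5034756},
    "train": {"num_examples": 54963, "num_bytes": 33996901},
}
reported_dataset_size = 48558631

total_examples = sum(s["num_examples"] for s in splits.values())
total_bytes = sum(s["num_bytes"] for s in splits.values())

# The per-split byte counts should add up to the reported dataset size.
print(total_examples, total_bytes == reported_dataset_size)

# Share of examples in each split, as a fraction of all examples.
shares = {name: s["num_examples"] / total_examples for name, s in splits.items()}
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```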

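Dataset uses

  • Combining SQL queries with text-based retrieval
  • Fine-tuning language models to generate SQL commands
  • Few-shot prompting for SQL command generation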
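
The few-shot use case can be illustrated with a small prompt builder. This is a sketch under assumptions: it supposes that `input` holds the natural-language question, and the prompt layout, the function name `build_few_shot_prompt`, and the sample rows are all hypothetical rather than anything prescribed by the dataset.

```python
def build_few_shot_prompt(examples, table_info, question):
    """Join solved examples and one unsolved question into a single prompt string."""
    parts = []
    for ex in examples:
        parts.append(
            f"Table info:\n{ex['table_info']}\n"
            f"Question: {ex['input']}\n"
            f"SQL: {ex['sql_cmd']}"
        )
    # The final block leaves the SQL empty for the model to complete.
    parts.append(f"Table info:\n{table_info}\nQuestion: {question}\nSQL:")
    return "\n\n".join(parts)


# Hypothetical row shaped like the dataset's three string fields.
shots = [
    {
        "input": "Which player wears number 42?",
        "table_info": 'CREATE TABLE players ("Player" TEXT, "No." TEXT)',
        "sql_cmd": 'SELECT "Player" FROM players WHERE "No." = 42',
    }
]

prompt = build_few_shot_prompt(
    shots,
    table_info='CREATE TABLE teams ("Team" TEXT, "City" TEXT)',
    question="Which team is based in Boston?",
)
print(prompt)
```

In practice the shots would be drawn from the train split, so that validation and test rows stay unseen.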