five

detakarang/sql-create-context-id

收藏
Hugging Face2023-11-18 更新2024-03-04 收录
下载链接:
https://hf-mirror.com/datasets/detakarang/sql-create-context-id
下载链接
链接失效反馈
官方服务:
资源简介:
--- license: cc-by-4.0 task_categories: - text-generation - question-answering - table-question-answering language: - id tags: - SQL - code - NLP - text-to-sql - context-sql - spider - wikisql - sqlglot pretty_name: sql-create-context-id size_categories: - 10K<n<100K --- #### Overview This dataset is a fork from [sql-create-context](https://huggingface.co/datasets/b-mc2/sql-create-context) <br> This dataset builds from [WikiSQL](https://huggingface.co/datasets/wikisql) and [Spider](https://huggingface.co/datasets/spider). There are 78,577 examples of natural language queries, SQL CREATE TABLE statements, and SQL Query answering the question using the CREATE statement as context. This dataset was built with text-to-sql LLMs in mind, intending to prevent hallucination of column and table names often seen when trained on text-to-sql datasets. The CREATE TABLE statement can often be copy and pasted from different DBMS and provides table names, column names and their data types. By providing just the CREATE TABLE statement as context, we can hopefully provide better grounding for models without having to provide actual rows of data, limiting token usage and exposure to private, sensitive, or proprietary data. #### Cleansing and Augmentation Cleansing and data augmentation has been done on the combined WikiSQL and Spider data. I used [SQLGlot](https://github.com/tobymao/sqlglot) on queries from Spider and WikiSQL and parsed them into different tables and columns, I then inferred column data types based on usage of `>` `<` operators as well as the use of `MIN()` `MAX()` `AVG()` `SUM()` on columns. While this isn't perfect, it increases the likelihood of inferring the correct datatype for a column, the columns otherwise default to VARCHAR type. These tables and columns are then used to generate CREATE TABLE statements using the inferred types. SQLGlot is used again to ensure both the SQL queries and CREATE TABLE statements parse without errors. Some queries that do not have column names, e.g. SELECT * FROM table, have a default Id column added to the CREATE TABLE statement. Some other queries which use the generic `table` as the FROM table have instead been changed to a variation of `table_name_1` or some other number which is also reflected in the CREATE TABLE statement. #### TODO - Further augment the data by converting queries and CREATE TABLE statements into different SQL dialects, this can be done with SQLGlot. Reference to the dialect might also be added to the question. - Support other informative contexts beyond CREATE TABLE - Better parse datatypes to clean up things like numbers for column names and other numbers as strings If you have any edits you'd like to see in a version 2 of this dataset, let me know. Random sample: ```json { "question": "Berapa banyak kepala departemen yang lebih tua dari 56 tahun?", "context": "CREATE TABLE head (age INTEGER)", "answer": "SELECT COUNT(*) FROM head WHERE age > 56" }, { "question": "Sebutkan nama, lahir negara bagian mana dan usia kepala departemen yang dipesan berdasarkan usia.", "context": "CREATE TABLE head (name VARCHAR, born_state VARCHAR, age VARCHAR)", "answer": "SELECT name, born_state, age FROM head ORDER BY age" }, ```
提供机构:
detakarang
原始信息汇总

数据集概述

该数据集是从sql-create-context分叉而来,基于WikiSQLSpider构建。包含78,577个自然语言查询、SQL CREATE TABLE语句及使用CREATE语句作为上下文的SQL查询示例。旨在为文本到SQL的LLMs提供数据,以减少在文本到SQL数据集训练时常出现的列和表名幻觉问题。通过仅提供CREATE TABLE语句作为上下文,可以在不提供实际数据行的情况下,为模型提供更好的基础,从而限制令牌使用和避免暴露敏感或专有数据。

数据清洗与增强

对合并的WikiSQL和Spider数据进行了清洗和数据增强。使用SQLGlot解析来自Spider和WikiSQL的查询,并将其转换为不同的表和列,然后根据> <操作符以及MIN() MAX() AVG() SUM()的使用推断列数据类型。这些表和列用于生成使用推断类型的CREATE TABLE语句,并再次使用SQLGlot确保SQL查询和CREATE TABLE语句无错误解析。对于没有列名的查询(如SELECT * FROM table),默认添加了Id列到CREATE TABLE语句中。对于使用通用table作为FROM表的查询,已更改为table_name_1或其他数字,并在CREATE TABLE语句中反映。

未来计划

  • 通过将查询和CREATE TABLE语句转换为不同的SQL方言来进一步增强数据,可以使用SQLGlot完成。
  • 支持除CREATE TABLE之外的其他信息性上下文。
  • 更好地解析数据类型,以清理列名和其他数字作为字符串的问题。

示例

json { "question": "Berapa banyak kepala departemen yang lebih tua dari 56 tahun?", "context": "CREATE TABLE head (age INTEGER)", "answer": "SELECT COUNT(*) FROM head WHERE age > 56" }, { "question": "Sebutkan nama, lahir negara bagian mana dan usia kepala departemen yang dipesan berdasarkan usia.", "context": "CREATE TABLE head (name VARCHAR, born_state VARCHAR, age VARCHAR)", "answer": "SELECT name, born_state, age FROM head ORDER BY age" }

5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作