five

Salesforce/wikisql

收藏
Hugging Face2024-01-18 更新2024-06-15 收录
下载链接:
https://hf-mirror.com/datasets/Salesforce/wikisql
下载链接
链接失效反馈
官方服务:
资源简介:
--- annotations_creators: - crowdsourced language: - en language_creators: - found - machine-generated license: - unknown multilinguality: - monolingual pretty_name: WikiSQL size_categories: - 10K<n<100K source_datasets: - original task_categories: - text2text-generation task_ids: [] paperswithcode_id: wikisql tags: - text-to-sql dataset_info: features: - name: phase dtype: int32 - name: question dtype: string - name: table struct: - name: header sequence: string - name: page_title dtype: string - name: page_id dtype: string - name: types sequence: string - name: id dtype: string - name: section_title dtype: string - name: caption dtype: string - name: rows sequence: sequence: string - name: name dtype: string - name: sql struct: - name: human_readable dtype: string - name: sel dtype: int32 - name: agg dtype: int32 - name: conds sequence: - name: column_index dtype: int32 - name: operator_index dtype: int32 - name: condition dtype: string splits: - name: test num_bytes: 32234761 num_examples: 15878 - name: validation num_bytes: 15159314 num_examples: 8421 - name: train num_bytes: 107345917 num_examples: 56355 download_size: 26164664 dataset_size: 154739992 --- # Dataset Card for "wikisql" ## Table of Contents - [Dataset Description](#dataset-description) - [Dataset Summary](#dataset-summary) - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards) - [Languages](#languages) - [Dataset Structure](#dataset-structure) - [Data Instances](#data-instances) - [Data Fields](#data-fields) - [Data Splits](#data-splits) - [Dataset Creation](#dataset-creation) - [Curation Rationale](#curation-rationale) - [Source Data](#source-data) - [Annotations](#annotations) - [Personal and Sensitive Information](#personal-and-sensitive-information) - [Considerations for Using the Data](#considerations-for-using-the-data) - [Social Impact of Dataset](#social-impact-of-dataset) - [Discussion of Biases](#discussion-of-biases) - [Other Known Limitations](#other-known-limitations) - [Additional Information](#additional-information) - [Dataset Curators](#dataset-curators) - [Licensing Information](#licensing-information) - [Citation Information](#citation-information) - [Contributions](#contributions) ## Dataset Description - **Repository:** https://github.com/salesforce/WikiSQL - **Paper:** [Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning](https://arxiv.org/abs/1709.00103) - **Point of Contact:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) - **Size of downloaded dataset files:** 26.16 MB - **Size of the generated dataset:** 154.74 MB - **Total amount of disk used:** 180.90 MB ### Dataset Summary A large crowd-sourced dataset for developing natural language interfaces for relational databases. WikiSQL is a dataset of 80654 hand-annotated examples of questions and SQL queries distributed across 24241 tables from Wikipedia. ### Supported Tasks and Leaderboards [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) ### Languages [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) ## Dataset Structure ### Data Instances #### default - **Size of downloaded dataset files:** 26.16 MB - **Size of the generated dataset:** 154.74 MB - **Total amount of disk used:** 180.90 MB An example of 'validation' looks as follows. ``` This example was too long and was cropped: { "phase": 1, "question": "How would you answer a second test question?", "sql": { "agg": 0, "conds": { "column_index": [2], "condition": ["Some Entity"], "operator_index": [0] }, "human_readable": "SELECT Header1 FROM table WHERE Another Header = Some Entity", "sel": 0 }, "table": "{\"caption\": \"L\", \"header\": [\"Header1\", \"Header 2\", \"Another Header\"], \"id\": \"1-10015132-9\", \"name\": \"table_10015132_11\", \"page_i..." } ``` ### Data Fields The data fields are the same among all splits. #### default - `phase`: a `int32` feature. - `question`: a `string` feature. - `header`: a `list` of `string` features. - `page_title`: a `string` feature. - `page_id`: a `string` feature. - `types`: a `list` of `string` features. - `id`: a `string` feature. - `section_title`: a `string` feature. - `caption`: a `string` feature. - `rows`: a dictionary feature containing: - `feature`: a `string` feature. - `name`: a `string` feature. - `human_readable`: a `string` feature. - `sel`: a `int32` feature. - `agg`: a `int32` feature. - `conds`: a dictionary feature containing: - `column_index`: a `int32` feature. - `operator_index`: a `int32` feature. - `condition`: a `string` feature. ### Data Splits | name |train|validation|test | |-------|----:|---------:|----:| |default|56355| 8421|15878| ## Dataset Creation ### Curation Rationale [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) ### Source Data #### Initial Data Collection and Normalization [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) #### Who are the source language producers? [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) ### Annotations #### Annotation process [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) #### Who are the annotators? [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) ### Personal and Sensitive Information [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) ## Considerations for Using the Data ### Social Impact of Dataset [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) ### Discussion of Biases [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) ### Other Known Limitations [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) ## Additional Information ### Dataset Curators [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) ### Licensing Information [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards) ### Citation Information ``` @article{zhongSeq2SQL2017, author = {Victor Zhong and Caiming Xiong and Richard Socher}, title = {Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning}, journal = {CoRR}, volume = {abs/1709.00103}, year = {2017} } ``` ### Contributions Thanks to [@lewtun](https://github.com/lewtun), [@ghomasHudson](https://github.com/ghomasHudson), [@thomwolf](https://github.com/thomwolf) for adding this dataset.
提供机构:
Salesforce
原始信息汇总

数据集卡片概述

数据集描述

数据集摘要

WikiSQL 是一个包含 80654 个手工标注的示例数据集,用于开发自然语言接口以与关系数据库交互。该数据集包含来自 Wikipedia 的 24241 个表格,分布在各种问题和 SQL 查询中。

支持的任务和排行榜

更多信息需补充

语言

数据集主要包含英语内容。

数据集结构

数据实例

默认

  • 下载的数据集文件大小: 26.16 MB
  • 生成的数据集大小: 154.74 MB
  • 总磁盘使用量: 180.90 MB

一个 validation 数据实例的示例如下:

json { "phase": 1, "question": "How would you answer a second test question?", "sql": { "agg": 0, "conds": { "column_index": [2], "condition": ["Some Entity"], "operator_index": [0] }, "human_readable": "SELECT Header1 FROM table WHERE Another Header = Some Entity", "sel": 0 }, "table": "{"caption": "L", "header": ["Header1", "Header 2", "Another Header"], "id": "1-10015132-9", "name": "table_10015132_11", "page_i..." }

数据字段

所有分割的数据字段相同。

默认

  • phase: 一个 int32 特征。
  • question: 一个 string 特征。
  • header: 一个 liststring 特征。
  • page_title: 一个 string 特征。
  • page_id: 一个 string 特征。
  • types: 一个 liststring 特征。
  • id: 一个 string 特征。
  • section_title: 一个 string 特征。
  • caption: 一个 string 特征。
  • rows: 一个字典特征,包含:
    • feature: 一个 string 特征。
  • name: 一个 string 特征。
  • human_readable: 一个 string 特征。
  • sel: 一个 int32 特征。
  • agg: 一个 int32 特征。
  • conds: 一个字典特征,包含:
    • column_index: 一个 int32 特征。
    • operator_index: 一个 int32 特征。
    • condition: 一个 string 特征。

数据分割

名称 训练集 验证集 测试集
默认 56355 8421 15878

数据集创建

策划理由

更多信息需补充

源数据

初始数据收集和规范化

更多信息需补充

源语言生产者是谁?

更多信息需补充

标注

标注过程

更多信息需补充

标注者是谁?

更多信息需补充

个人和敏感信息

更多信息需补充

使用数据的注意事项

数据集的社会影响

更多信息需补充

偏见的讨论

更多信息需补充

其他已知限制

更多信息需补充

附加信息

数据集策展人

更多信息需补充

许可信息

更多信息需补充

引用信息

bibtex @article{zhongSeq2SQL2017, author = {Victor Zhong and Caiming Xiong and Richard Socher}, title = {Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning}, journal = {CoRR}, volume = {abs/1709.00103}, year = {2017} }

贡献

感谢 @lewtun, @ghomasHudson, @thomwolf 添加此数据集。

5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作