xlangai/spider
Source: Hugging Face | Updated 2024-03-27 | Indexed 2024-04-19
Download link:
https://hf-mirror.com/datasets/xlangai/spider
Resource description:
---
annotations_creators:
- expert-generated
language_creators:
- expert-generated
- machine-generated
language:
- en
license:
- cc-by-sa-4.0
multilinguality:
- monolingual
size_categories:
- 1K<n<10K
source_datasets:
- original
task_categories:
- text2text-generation
task_ids: []
paperswithcode_id: spider-1
pretty_name: Spider
tags:
- text-to-sql
dataset_info:
  config_name: spider
  features:
  - name: db_id
    dtype: string
  - name: query
    dtype: string
  - name: question
    dtype: string
  - name: query_toks
    sequence: string
  - name: query_toks_no_value
    sequence: string
  - name: question_toks
    sequence: string
  splits:
  - name: train
    num_bytes: 4743786
    num_examples: 7000
  - name: validation
    num_bytes: 682090
    num_examples: 1034
  download_size: 957246
  dataset_size: 5425876
configs:
- config_name: spider
  data_files:
  - split: train
    path: spider/train-*
  - split: validation
    path: spider/validation-*
  default: true
---
# Dataset Card for Spider
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** https://yale-lily.github.io/spider
- **Repository:** https://github.com/taoyds/spider
- **Paper:** https://www.aclweb.org/anthology/D18-1425/
- **Paper:** https://arxiv.org/abs/1809.08887
- **Point of Contact:** [Yale LILY](https://yale-lily.github.io/)
### Dataset Summary
Spider is a large-scale, complex, cross-domain dataset for semantic parsing and text-to-SQL, annotated by 11 Yale students.
The goal of the Spider challenge is to develop natural language interfaces to cross-domain databases.
### Supported Tasks and Leaderboards
The leaderboard can be seen at https://yale-lily.github.io/spider
### Languages
The text in the dataset is in English.
## Dataset Structure
### Data Instances
**What do the instances that comprise the dataset represent?**
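Each instance is a natural language question paired with its equivalent SQL query.
**How many instances are there in total?**
8,034 in this configuration: 7,000 in the train split and 1,034 in the validation split.
**What data does each instance consist of?**
A database identifier, the question, the target SQL query, and tokenized forms of the question and the query; see Data Fields.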
### Data Fields
* **db_id**: Database name
* **question**: Natural language to interpret into SQL
* **query**: Target SQL query
* **query_toks**: List of tokens for the query
* **query_toks_no_value**: List of tokens for the query, without literal values
* **question_toks**: List of tokens for the question
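
As an illustration of how these fields relate to each other, here is a minimal sketch of one record as a plain Python dict. The values, including the `db_id`, are invented for illustration and are not copied from the dataset. The helper `mask_values` is hypothetical and not part of Spider: it assumes that `query_toks_no_value` is roughly the lowercased `query_toks` with numeric and quoted literals replaced by the placeholder `value`, which may not match the official preprocessing in every case.

```python
import re

# Hypothetical record shaped like the fields listed above.
# All values are invented for illustration, not taken from the dataset.
example = {
    "db_id": "example_db",
    "question": "How many singers are older than 30?",
    "query": "SELECT count(*) FROM singer WHERE age > 30",
    "query_toks": ["SELECT", "count", "(", "*", ")", "FROM", "singer",
                   "WHERE", "age", ">", "30"],
    "question_toks": ["How", "many", "singers", "are", "older", "than",
                      "30", "?"],
}

# Integer or decimal literal, optionally negative.
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def _is_literal(tok):
    """True for numeric tokens and for tokens wrapped in matching quotes."""
    quoted = len(tok) >= 2 and tok[0] == tok[-1] and tok[0] in ("'", '"')
    return quoted or _NUMBER.fullmatch(tok) is not None


def mask_values(query_toks):
    """Lowercase SQL tokens and replace literals with the word 'value'.

    Assumption: this only approximates how query_toks_no_value is derived;
    the official preprocessing may differ.
    """
    return ["value" if _is_literal(tok) else tok.lower()
            for tok in query_toks]


example["query_toks_no_value"] = mask_values(example["query_toks"])
print(example["query_toks_no_value"])
```

Masking literals this way keeps the structure of the query while dropping the concrete constants, which is useful when only the SQL skeleton matters.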
### Data Splits
**train**: 7,000 question and SQL query pairs
**dev** (exposed as the `validation` split): 1,034 question and SQL query pairs
[More Information Needed]
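
A minimal sketch of loading both splits and checking their sizes against the numbers above. It assumes the Hugging Face `datasets` library is installed and that the dataset is reachable under the id `xlangai/spider` (the id in the page title). The helper names `size_mismatches` and `load_split_sizes` are hypothetical. Note that the dev set is named `validation` in the configuration.

```python
# Split sizes as stated in this card.
EXPECTED_SIZES = {"train": 7000, "validation": 1034}


def size_mismatches(actual, expected=EXPECTED_SIZES):
    """Return {split: (expected, actual)} for every split whose size differs."""
    return {
        split: (n, actual.get(split))
        for split, n in expected.items()
        if actual.get(split) != n
    }


def load_split_sizes(dataset_id="xlangai/spider"):
    """Download the dataset and return {split_name: number_of_rows}.

    Needs network access and `pip install datasets`.
    """
    from datasets import load_dataset  # imported lazily: optional dependency

    dataset = load_dataset(dataset_id)
    return {name: split.num_rows for name, split in dataset.items()}


# Usage (requires network access):
# sizes = load_split_sizes()
# print(size_mismatches(sizes))  # an empty dict means the sizes match
```

Keeping the comparison in a separate pure function means it can be checked without downloading anything.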
## Dataset Creation
### Curation Rationale
[More Information Needed]
### Source Data
#### Initial Data Collection and Normalization
#### Who are the source language producers?
[More Information Needed]
### Annotations
The dataset was annotated by 11 college students at Yale University.
#### Annotation process
#### Who are the annotators?
### Personal and Sensitive Information
[More Information Needed]
## Considerations for Using the Data
### Social Impact of Dataset
### Discussion of Biases
[More Information Needed]
### Other Known Limitations
## Additional Information
The authors listed on the homepage maintain and support the dataset.
### Dataset Curators
[More Information Needed]
### Licensing Information
The Spider dataset is licensed under [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/legalcode).
[More Information Needed]
### Citation Information
```
@inproceedings{yu-etal-2018-spider,
title = "{S}pider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-{SQL} Task",
author = "Yu, Tao and
Zhang, Rui and
Yang, Kai and
Yasunaga, Michihiro and
Wang, Dongxu and
Li, Zifan and
Ma, James and
Li, Irene and
Yao, Qingning and
Roman, Shanelle and
Zhang, Zilin and
Radev, Dragomir",
editor = "Riloff, Ellen and
Chiang, David and
Hockenmaier, Julia and
Tsujii, Jun{'}ichi",
booktitle = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing",
month = oct # "-" # nov,
year = "2018",
address = "Brussels, Belgium",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/D18-1425",
doi = "10.18653/v1/D18-1425",
pages = "3911--3921",
archivePrefix={arXiv},
eprint={1809.08887},
primaryClass={cs.CL},
}
```
### Contributions
Thanks to [@olinguyen](https://github.com/olinguyen) for adding this dataset.
Provider: xlangai
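Background overview
Spider is a large-scale, cross-domain dataset for the text-to-SQL task. It pairs natural language questions with their corresponding SQL queries and aims to advance the development of natural language interfaces. The dataset was annotated by 11 Yale students, is split into 7,000 training instances and 1,034 development instances, and is released under the CC BY-SA 4.0 license.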



