xlangai/spider

Source: Hugging Face. Updated 2024-03-27. Indexed 2024-04-19.
Download link:
https://hf-mirror.com/datasets/xlangai/spider
Resource description:
---
annotations_creators:
- expert-generated
language_creators:
- expert-generated
- machine-generated
language:
- en
license:
- cc-by-sa-4.0
multilinguality:
- monolingual
size_categories:
- 1K<n<10K
source_datasets:
- original
task_categories:
- text2text-generation
task_ids: []
paperswithcode_id: spider-1
pretty_name: Spider
tags:
- text-to-sql
dataset_info:
  config_name: spider
  features:
  - name: db_id
    dtype: string
  - name: query
    dtype: string
  - name: question
    dtype: string
  - name: query_toks
    sequence: string
  - name: query_toks_no_value
    sequence: string
  - name: question_toks
    sequence: string
  splits:
  - name: train
    num_bytes: 4743786
    num_examples: 7000
  - name: validation
    num_bytes: 682090
    num_examples: 1034
  download_size: 957246
  dataset_size: 5425876
configs:
- config_name: spider
  data_files:
  - split: train
    path: spider/train-*
  - split: validation
    path: spider/validation-*
  default: true
---

# Dataset Card for Spider

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://yale-lily.github.io/spider
- **Repository:** https://github.com/taoyds/spider
- **Paper:** https://www.aclweb.org/anthology/D18-1425/
- **Paper:** https://arxiv.org/abs/1809.08887
- **Point of Contact:** [Yale LILY](https://yale-lily.github.io/)

### Dataset Summary

Spider is a large-scale, complex, cross-domain semantic parsing and text-to-SQL dataset annotated by 11 Yale students. The goal of the Spider challenge is to develop natural language interfaces to cross-domain databases.

### Supported Tasks and Leaderboards

The leaderboard can be seen at https://yale-lily.github.io/spider

### Languages

The text in the dataset is in English.

## Dataset Structure

### Data Instances

**What do the instances that comprise the dataset represent?**

Each instance is a natural language question and the equivalent SQL query.

**How many instances are there in total?**

**What data does each instance consist of?**

[More Information Needed]

### Data Fields

* **db_id**: Database name
* **question**: Natural language to interpret into SQL
* **query**: Target SQL query
* **query_toks**: List of tokens for the query
* **query_toks_no_value**: List of tokens for the query, without values
* **question_toks**: List of tokens for the question

### Data Splits

**train**: 7000 question and SQL query pairs

**dev** (named `validation` in the dataset configuration): 1034 question and SQL query pairs

[More Information Needed]

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

#### Who are the source language producers?

[More Information Needed]

### Annotations

The dataset was annotated by 11 college students at Yale University.

#### Annotation process

#### Who are the annotators?

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

## Additional Information

The authors listed on the homepage maintain and support the dataset.

### Dataset Curators

[More Information Needed]

### Licensing Information

The Spider dataset is licensed under [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/legalcode).

[More Information Needed]

### Citation Information

```
@inproceedings{yu-etal-2018-spider,
    title = "{S}pider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-{SQL} Task",
    author = "Yu, Tao and Zhang, Rui and Yang, Kai and Yasunaga, Michihiro and Wang, Dongxu and Li, Zifan and Ma, James and Li, Irene and Yao, Qingning and Roman, Shanelle and Zhang, Zilin and Radev, Dragomir",
    editor = "Riloff, Ellen and Chiang, David and Hockenmaier, Julia and Tsujii, Jun{'}ichi",
    booktitle = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing",
    month = oct # "-" # nov,
    year = "2018",
    address = "Brussels, Belgium",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/D18-1425",
    doi = "10.18653/v1/D18-1425",
    pages = "3911--3921",
    archivePrefix={arXiv},
    eprint={1809.08887},
    primaryClass={cs.CL},
}
```

### Contributions

Thanks to [@olinguyen](https://github.com/olinguyen) for adding this dataset.
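
The card tags the dataset as `text2text-generation`, so one common way to use the fields described under Data Fields is to turn each instance into a (source, target) string pair. The following is a minimal sketch, not an official recipe. It assumes the Hugging Face `datasets` library is installed and that the repository id is `xlangai/spider`, as shown in the page title and download link. The `"db_id | question"` source format and the helper names `load_spider` and `to_seq2seq_pair` are illustrative assumptions, not part of the dataset.

```python
from typing import Tuple


def load_spider(split: str = "train"):
    """Load one split from the Hugging Face Hub.

    Requires network access and the `datasets` package. The import is done
    lazily so that the helper below can be used without either.
    The configuration names the splits "train" and "validation".
    """
    from datasets import load_dataset

    return load_dataset("xlangai/spider", split=split)


def to_seq2seq_pair(example: dict) -> Tuple[str, str]:
    """Map one instance to a (source, target) pair for text-to-SQL training.

    Assumption: the source is "db_id | question". This format is only an
    illustration; the dataset card does not prescribe any prompt format.
    """
    source = f"{example['db_id']} | {example['question']}"
    target = example["query"]
    return source, target


if __name__ == "__main__":
    # Hypothetical instance with made-up values, used so the demo runs offline.
    demo = {
        "db_id": "example_db",
        "question": "How many rows are in table t?",
        "query": "SELECT count(*) FROM t",
    }
    print(to_seq2seq_pair(demo))
```

With network access, `load_spider("validation")` would return the 1034-example development split, and `to_seq2seq_pair` can be applied to each of its rows.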

Provider:
xlangai
Summary of original information

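Dataset overview

Basic dataset information

  • Name: Spider
  • Language: English
  • License: CC BY-SA 4.0
  • Multilinguality: monolingual
  • Size: 1K<n<10K
  • Source datasets: original
  • Task category: text-to-text generation
  • Tags: text-to-SQL

Dataset structure

  • Features:
    • db_id: string
    • query: string
    • question: string
    • query_toks: sequence of strings
    • query_toks_no_value: sequence of strings
    • question_toks: sequence of strings
  • Data splits:
    • Training set: 7000 instances, 4743786 bytes in total
    • Validation set: 1034 instances, 682090 bytes in total
  • Download size: 957246 bytes
  • Dataset size: 5425876 bytes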
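
The structure listed above can be written down as a small typed record together with a consistency check on the split statistics. This is a sketch under stated assumptions: the class name `SpiderExample`, the helpers `from_record` and `split_total`, and the sample record are hypothetical and chosen for illustration. Only the field names, types, and split numbers come from the listing.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SpiderExample:
    """One instance with the six features listed above."""
    db_id: str
    query: str
    question: str
    query_toks: List[str]
    query_toks_no_value: List[str]
    question_toks: List[str]


STRING_FIELDS = ("db_id", "query", "question")
SEQUENCE_FIELDS = ("query_toks", "query_toks_no_value", "question_toks")

# Split statistics copied from the listing above.
SPLITS = {
    "train": {"num_examples": 7000, "num_bytes": 4743786},
    "validation": {"num_examples": 1034, "num_bytes": 682090},
}
DATASET_SIZE = 5425876


def from_record(record: dict) -> SpiderExample:
    """Build a SpiderExample, raising ValueError on missing or mistyped fields."""
    for name in STRING_FIELDS:
        if not isinstance(record.get(name), str):
            raise ValueError(f"{name} must be a string")
    for name in SEQUENCE_FIELDS:
        value = record.get(name)
        if not isinstance(value, list) or not all(isinstance(t, str) for t in value):
            raise ValueError(f"{name} must be a list of strings")
    return SpiderExample(**{n: record[n] for n in STRING_FIELDS + SEQUENCE_FIELDS})


def split_total(key: str) -> int:
    """Sum one statistic ("num_examples" or "num_bytes") over all splits."""
    return sum(split[key] for split in SPLITS.values())


if __name__ == "__main__":
    # Hypothetical record with made-up values; the token lists are illustrative
    # and do not claim to reproduce the dataset's own tokenization.
    record = {
        "db_id": "example_db",
        "query": "SELECT count(*) FROM t",
        "question": "How many rows are in table t?",
        "query_toks": ["SELECT", "count", "(", "*", ")", "FROM", "t"],
        "query_toks_no_value": ["select", "count", "(", "*", ")", "from", "t"],
        "question_toks": ["How", "many", "rows", "are", "in", "table", "t", "?"],
    }
    print(from_record(record).db_id)
    print(split_total("num_examples"))               # 8034
    print(split_total("num_bytes") == DATASET_SIZE)  # True
```

The two split byte counts add up exactly to the listed dataset size, and the 8034 total instances fall inside the stated 1K<n<10K size category.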

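Dataset creation

  • Annotation creators: expert-generated
  • Language creators: expert-generated and machine-generated
  • Annotations: annotated by 11 Yale University students

License information

  • License: CC BY-SA 4.0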


Collected summary
Dataset introduction
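Background and challenges
Background overview
Spider is a large-scale, cross-domain dataset for the text-to-SQL task. It pairs natural language questions with their corresponding SQL queries and aims to advance the development of natural language interfaces. The dataset was annotated by 11 Yale students, is split into 7000 training instances and 1034 development instances, and is released under the CC BY-SA 4.0 license.
The content above was collected and summarized by 遇见数据集.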