composite/pauq

Source: Hugging Face. Updated 2024-02-26; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/composite/pauq
Resource description:
---
dataset_info:
- config_name: ru_os
  features:
  - name: id
    dtype: string
  - name: db_id
    dtype: string
  - name: source
    dtype: string
  - name: type
    dtype: string
  - name: question
    dtype: string
  - name: query
    dtype: string
  - name: sql
    sequence: string
  - name: question_toks
    sequence: string
  - name: query_toks
    sequence: string
  - name: query_toks_no_values
    sequence: string
  - name: template
    dtype: string
  splits:
  - name: train
    num_examples: 8800
  - name: test
    num_examples: 1074
- config_name: en_os
  features:
  - name: id
    dtype: string
  - name: db_id
    dtype: string
  - name: source
    dtype: string
  - name: type
    dtype: string
  - name: question
    dtype: string
  - name: query
    dtype: string
  - name: sql
    sequence: string
  - name: question_toks
    sequence: string
  - name: query_toks
    sequence: string
  - name: query_toks_no_values
    sequence: string
  - name: template
    dtype: string
  splits:
  - name: train
    num_examples: 8800
  - name: test
    num_examples: 1076
- config_name: ru_trl
  features:
  - name: id
    dtype: string
  - name: db_id
    dtype: string
  - name: source
    dtype: string
  - name: type
    dtype: string
  - name: question
    dtype: string
  - name: query
    dtype: string
  - name: sql
    sequence: string
  - name: question_toks
    sequence: string
  - name: query_toks
    sequence: string
  - name: query_toks_no_values
    sequence: string
  - name: template
    dtype: string
  splits:
  - name: train
    num_examples: 7890
  - name: test
    num_examples: 1971
- config_name: en_trl
  features:
  - name: id
    dtype: string
  - name: db_id
    dtype: string
  - name: source
    dtype: string
  - name: type
    dtype: string
  - name: question
    dtype: string
  - name: query
    dtype: string
  - name: sql
    sequence: string
  - name: question_toks
    sequence: string
  - name: query_toks
    sequence: string
  - name: query_toks_no_values
    sequence: string
  - name: template
    dtype: string
  splits:
  - name: train
    num_examples: 7890
  - name: test
    num_examples: 1974
- config_name: ru_tsl
  features:
  - name: id
    dtype: string
  - name: db_id
    dtype: string
  - name: source
    dtype: string
  - name: type
    dtype: string
  - name: question
    dtype: string
  - name: query
    dtype: string
  - name: sql
    sequence: string
  - name: question_toks
    sequence: string
  - name: query_toks
    sequence: string
  - name: query_toks_no_values
    sequence: string
  - name: template
    dtype: string
  splits:
  - name: train
    num_examples: 7900
  - name: test
    num_examples: 1969
- config_name: en_tsl
  features:
  - name: id
    dtype: string
  - name: db_id
    dtype: string
  - name: source
    dtype: string
  - name: type
    dtype: string
  - name: question
    dtype: string
  - name: query
    dtype: string
  - name: sql
    sequence: string
  - name: question_toks
    sequence: string
  - name: query_toks
    sequence: string
  - name: query_toks_no_values
    sequence: string
  - name: template
    dtype: string
  splits:
  - name: train
    num_examples: 7900
  - name: test
    num_examples: 1974
---

# Dataset Card for PAUQ

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Languages](#languages)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
- [Additional Information](#additional-information)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:**
- **Repository:**
- **Paper:**
- **Leaderboard:**
- **Point of Contact:**

Link to databases: https://drive.google.com/file/d/1Xjbp207zfCaBxhPgt-STB_RxwNo2TIW2/view

### Dataset Summary

The Russian version of [Spider](https://yale-lily.github.io/spider), the Yale Semantic Parsing and Text-to-SQL dataset. Major changes:

- Adding (not replacing) new Russian-language values in DB tables. Table and DB names remain as in the original.
- Localization of natural language questions into Russian. All DB values are replaced by the new ones.
- Changes to the filters in SQL queries.
- Filling empty tables with values.
- Complementing the dataset with new samples of underrepresented types.

### Languages

Russian

## Dataset Creation

### Curation Rationale

The translation from English to Russian was done by a professional human translator with SQL competence. Four computer science students verified the translated questions and their conformity with the queries, and updated the databases. Details are in [Section 3 of the paper](https://aclanthology.org/2022.findings-emnlp.175.pdf).

## Additional Information

### Licensing Information

The presented dataset has been collected in a manner consistent with the terms of use of the original Spider, which is distributed under the CC BY-SA 4.0 license.

### Citation Information

[Paper link](https://aclanthology.org/2022.findings-emnlp.175.pdf)

```
@inproceedings{bakshandaeva-etal-2022-pauq,
    title = "{PAUQ}: Text-to-{SQL} in {R}ussian",
    author = "Bakshandaeva, Daria and Somov, Oleg and Dmitrieva, Ekaterina and Davydova, Vera and Tutubalina, Elena",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.findings-emnlp.175",
    pages = "2355--2376",
    abstract = "Semantic parsing is an important task that allows to democratize human-computer interaction. One of the most popular text-to-SQL datasets with complex and diverse natural language (NL) questions and SQL queries is Spider. We construct and complement a Spider dataset for Russian, thus creating the first publicly available text-to-SQL dataset for this language. While examining its components - NL questions, SQL queries and databases content - we identify limitations of the existing database structure, fill out missing values for tables and add new requests for underrepresented categories. We select thirty functional test sets with different features that can be used for the evaluation of neural models{'} abilities. To conduct the experiments, we adapt baseline architectures RAT-SQL and BRIDGE and provide in-depth query component analysis. On the target language, both models demonstrate strong results with monolingual training and improved accuracy in multilingual scenario. In this paper, we also study trade-offs between machine-translated and manually-created NL queries. At present, Russian text-to-SQL is lacking in datasets as well as trained models, and we view this work as an important step towards filling this gap.",
}
```

### Contributions

Thanks to [@gugutse](https://github.com/Gugutse), [@runnerup96](https://github.com/runnerup96), [@dmi3eva](https://github.com/dmi3eva), [@veradavydova](https://github.com/VeraDavydova), [@tutubalinaev](https://github.com/tutubalinaev) for adding this dataset.
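
The schema declared in the `dataset_info` block can be written down as a small validation helper. This is a minimal sketch, not part of the dataset's tooling: `validate_record` and `load_pauq` are hypothetical helper names, and the loader assumes that the repository `composite/pauq` is still reachable on the Hugging Face Hub and loads through the standard `datasets.load_dataset` call.

```python
from typing import Any

# Field names and kinds as listed in the card's `dataset_info` block.
STRING_FIELDS = ("id", "db_id", "source", "type", "question", "query", "template")
SEQUENCE_FIELDS = ("sql", "question_toks", "query_toks", "query_toks_no_values")

# Config names from the card: a language prefix plus a split-scheme suffix.
CONFIGS = tuple(
    f"{lang}_{scheme}" for scheme in ("os", "trl", "tsl") for lang in ("ru", "en")
)


def validate_record(record: dict[str, Any]) -> list[str]:
    """Return a list of schema problems; an empty list means the record conforms."""
    problems = []
    for name in STRING_FIELDS:
        if not isinstance(record.get(name), str):
            problems.append(f"{name}: expected string")
    for name in SEQUENCE_FIELDS:
        value = record.get(name)
        if not (isinstance(value, list) and all(isinstance(v, str) for v in value)):
            problems.append(f"{name}: expected sequence of strings")
    return problems


def load_pauq(config: str, split: str = "train"):
    """Load one config with the `datasets` library (assumption: needs network access)."""
    if config not in CONFIGS:
        raise ValueError(f"unknown config {config!r}")
    # Imported lazily so the schema helpers above work without the library installed.
    from datasets import load_dataset

    return load_dataset("composite/pauq", config, split=split)
```

Calling `load_pauq("ru_os", "test")` would then return the test split of the `ru_os` config, provided the assumptions above hold.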
Provider:
composite
Summary of original information

Dataset overview

Dataset configurations

  • config_name: ru_os

    • Features:
      • id: string
      • db_id: string
      • source: string
      • type: string
      • question: string
      • query: string
      • sql: sequence of strings
      • question_toks: sequence of strings
      • query_toks: sequence of strings
      • query_toks_no_values: sequence of strings
      • template: string
    • Splits:
      • train: 8800 examples
      • test: 1074 examples
  • config_name: en_os

    • Features:
      • id: string
      • db_id: string
      • source: string
      • type: string
      • question: string
      • query: string
      • sql: sequence of strings
      • question_toks: sequence of strings
      • query_toks: sequence of strings
      • query_toks_no_values: sequence of strings
      • template: string
    • Splits:
      • train: 8800 examples
      • test: 1076 examples
  • config_name: ru_trl

    • Features:
      • id: string
      • db_id: string
      • source: string
      • type: string
      • question: string
      • query: string
      • sql: sequence of strings
      • question_toks: sequence of strings
      • query_toks: sequence of strings
      • query_toks_no_values: sequence of strings
      • template: string
    • Splits:
      • train: 7890 examples
      • test: 1971 examples
  • config_name: en_trl

    • Features:
      • id: string
      • db_id: string
      • source: string
      • type: string
      • question: string
      • query: string
      • sql: sequence of strings
      • question_toks: sequence of strings
      • query_toks: sequence of strings
      • query_toks_no_values: sequence of strings
      • template: string
    • Splits:
      • train: 7890 examples
      • test: 1974 examples
  • config_name: ru_tsl

    • Features:
      • id: string
      • db_id: string
      • source: string
      • type: string
      • question: string
      • query: string
      • sql: sequence of strings
      • question_toks: sequence of strings
      • query_toks: sequence of strings
      • query_toks_no_values: sequence of strings
      • template: string
    • Splits:
      • train: 7900 examples
      • test: 1969 examples
  • config_name: en_tsl

    • Features:
      • id: string
      • db_id: string
      • source: string
      • type: string
      • question: string
      • query: string
      • sql: sequence of strings
      • question_toks: sequence of strings
      • query_toks: sequence of strings
      • query_toks_no_values: sequence of strings
      • template: string
    • Splits:
      • train: 7900 examples
      • test: 1974 examples

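Dataset languages

  • Russian

Dataset creation

  • Translation and verification:
    • The English-to-Russian translation was done by a professional human translator with SQL competence.
    • Four computer science students verified that the translated questions conform to the queries and updated the databases.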
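
One mechanical part of such a verification can be automated: checking that a query still executes, and still returns rows, once Russian values have been added to a database. The sketch below is illustrative only and is not the authors' procedure; the `singer` table, its rows, and the helper `query_runs` are hypothetical, and an in-memory SQLite database stands in for the real database files.

```python
import sqlite3


def query_runs(conn: sqlite3.Connection, query: str) -> bool:
    """Return True if `query` executes against the database without an SQL error."""
    try:
        conn.execute(query).fetchall()
        return True
    except sqlite3.Error:
        return False


# Hypothetical toy database standing in for one of the real databases.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE singer (singer_id INTEGER PRIMARY KEY, name TEXT, country TEXT)"
)
conn.executemany(
    "INSERT INTO singer (name, country) VALUES (?, ?)",
    [("Joe Sharp", "Netherlands"), ("Анна Иванова", "Россия")],
)

# A filter on a Russian value matches only if that value was added to the table.
rows = conn.execute("SELECT name FROM singer WHERE country = 'Россия'").fetchall()
print(rows)

# A query that references a misspelled column fails the execution check.
print(query_runs(conn, "SELECT nme FROM singer"))
```

A check like this catches only queries that fail to run or return nothing; whether a question and its query mean the same thing still needs a human reader.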

License information

  • Follows the CC BY-SA 4.0 license of the original Spider dataset.

