
Marchanjo/spider-FIT-en-extra-3enr-1enb

Source: Hugging Face. Updated 2024-01-16. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/Marchanjo/spider-FIT-en-extra-3enr-1enb
Resource description:
License: cc-by-sa-4.0

Distributed under the Creative Commons CC BY-SA 4.0 license, respecting the ShareAlike terms of the [Spider Dataset](https://yale-lily.github.io/spider).

Code explanations and links to the model checkpoints and datasets are on GitHub: [mRAT-SQL](https://github.com/C4AI/gap-text2sql).

The model checkpoints and datasets can be downloaded from the [Hugging Face collection](https://huggingface.co/collections/Marchanjo/mrat-sql-65a671743bb0e70b416561f6), but for an explanation of how to use them it is better to go to GitHub: [mRAT-SQL](https://github.com/C4AI/gap-text2sql).

# mRAT-SQL-FIT

## A Multilingual Translator to SQL with Database Schema Pruning to Improve Self-Attention

Marcelo Archanjo Jose, Fabio Gagliardi Cozman

Long sequences of text are challenging in the context of transformers, due to the quadratic memory increase in the self-attention mechanism. As this issue directly affects the translation from natural language to SQL queries (techniques usually take as input a concatenated text with the question and the database schema), we present techniques that allow long text sequences to be handled by transformers with up to 512 input tokens. We propose a training process with database schema pruning (removal of table and column names that are useless for the query of interest). In addition, we used a multilingual approach with the mT5-large model fine-tuned with a data-augmented Spider dataset in four languages simultaneously: English, Portuguese, Spanish, and French. Our proposed technique used the Spider dataset and increased the exact set match accuracy from 0.718 to 0.736 on a validation dataset (Dev). Source code, evaluations, and checkpoints are available at: [mRAT-SQL](https://github.com/C4AI/gap-text2sql).

[Paper published in Springer-Nature - International Journal of Information Technology](https://doi.org/10.1007/s41870-023-01342-3), [SharedIt link](https://rdcu.be/dff19), [pre-print on arXiv](https://arxiv.org/abs/2306.14256).
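The schema pruning idea described in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation (see the GitHub repository for that); it assumes a schema represented as a plain dictionary, a gold SQL query available at training time, and a naive rule that keeps a table or column only if its name appears as an identifier in the query. The names `prune_schema` and `serialize`, the toy schema, and the ` | ` separator are all hypothetical.

```python
import re


def prune_schema(schema, sql):
    """Keep only tables and columns whose names occur as identifiers in `sql`.

    Assumption: name matching on the gold query is a good enough proxy
    for "useful for the query of interest". Real systems may differ.
    """
    # Drop quoted string literals so values like 'France' are not
    # mistaken for table or column names.
    sql_no_literals = re.sub(r"'[^']*'|\"[^\"]*\"", " ", sql)
    tokens = {
        t.lower()
        for t in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", sql_no_literals)
    }
    pruned = {}
    for table, columns in schema.items():
        if table.lower() not in tokens:
            continue  # table never mentioned in the query
        pruned[table] = [c for c in columns if c.lower() in tokens]
    return pruned


def serialize(question, schema):
    """Concatenate the question and schema into one input string (hypothetical format)."""
    parts = [question]
    for table, columns in schema.items():
        parts.append(f"{table}: {', '.join(columns)}")
    return " | ".join(parts)


# Toy schema, invented for illustration only.
schema = {
    "singer": ["singer_id", "name", "country", "age"],
    "concert": ["concert_id", "concert_name", "year"],
    "stadium": ["stadium_id", "location", "capacity"],
}
question = "How many singers are from France?"
sql = "SELECT count(*) FROM singer WHERE country = 'France'"

pruned = prune_schema(schema, sql)
full_input = serialize(question, schema)
pruned_input = serialize(question, pruned)
print(pruned_input)
print(len(full_input), "->", len(pruned_input), "characters")
```

Because the pruned input drops unused tables and columns, the concatenated sequence is shorter, which is what makes a fixed input budget such as 512 tokens easier to respect.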
# mRAT-SQL+GAP

## mRAT-SQL+GAP: A Portuguese Text-to-SQL Transformer

Marcelo Archanjo José, Fabio Gagliardi Cozman

The translation of natural language questions to SQL queries has attracted growing attention, in particular in connection with transformers and similar language models. A large number of techniques are geared towards the English language; in this work, we thus investigated translation to SQL when input questions are given in the Portuguese language. To do so, we properly adapted state-of-the-art tools and resources. We changed the RAT-SQL+GAP system by relying on a multilingual BART model (we report tests with other language models), and we produced a translated version of the Spider dataset. Our experiments expose interesting phenomena that arise when non-English languages are targeted; in particular, it is better to train with original and translated training datasets together, even if a single target language is desired. This multilingual BART model, fine-tuned with a double-size training dataset (English and Portuguese), achieved 83% of the baseline when making inferences for the Portuguese test dataset. This investigation can help other researchers to produce results in Machine Learning in a language other than English. Our multilingual-ready version of RAT-SQL+GAP and the data are available, open-sourced as mRAT-SQL+GAP, at: [mRAT-SQL](https://github.com/C4AI/gap-text2sql).

BRACIS 2021: [paper published in Springer Lecture Notes in Computer Science](https://link.springer.com/chapter/10.1007%2F978-3-030-91699-2_35), [pre-print on arXiv](https://arxiv.org/abs/2110.03546).

Based on RAT-SQL+GAP: [GitHub](https://github.com/awslabs/gap-text2sql). Paper: [AAAI 2021 paper](https://arxiv.org/abs/2012.10309).
Provided by:
Marchanjo
Summary of the original information

mRAT-SQL-FIT

A Multilingual Translator to SQL with Database Schema Pruning to Improve Self-Attention

Marcelo Archanjo Jose, Fabio Gagliardi Cozman

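  • Challenge: long text sequences are hard for transformer models because the memory required by the self-attention mechanism grows quadratically with sequence length.
  • Solution: a training process with database schema pruning (removing table and column names that are useless for the query), so that long text sequences fit within a limit of 512 input tokens.
  • Model: mT5-large, fine-tuned on a data-augmented Spider dataset in four languages simultaneously: English, Portuguese, Spanish, and French.
  • Results: exact set match accuracy on the validation dataset (Dev) increased from 0.718 to 0.736.
  • Resources: source code, evaluations, and checkpoints are available at mRAT-SQL.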
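The quadratic growth named in the first bullet can be made concrete with a small back-of-the-envelope calculation. The sketch below assumes standard (dense) self-attention, where each head stores one score per pair of tokens, and 4 bytes per value (float32); it ignores batch size, the number of heads and layers, and memory-efficient attention variants. The function names are hypothetical.

```python
def attention_score_entries(seq_len: int) -> int:
    """Number of entries in one dense attention score matrix (seq_len x seq_len)."""
    return seq_len * seq_len


def attention_score_mib(seq_len: int, bytes_per_value: int = 4) -> float:
    """Size of that matrix in MiB, assuming `bytes_per_value` bytes per entry."""
    return attention_score_entries(seq_len) * bytes_per_value / 2**20


for n in (512, 1024, 2048):
    print(f"{n:5d} tokens: {attention_score_entries(n):9d} entries, "
          f"{attention_score_mib(n):5.1f} MiB per head per layer")
```

Doubling the input length quadruples the size of each score matrix, which is why shortening the serialized schema helps more than a linear cost model would suggest.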

mRAT-SQL+GAP

mRAT-SQL+GAP: A Portuguese Text-to-SQL Transformer

Marcelo Archanjo José, Fabio Gagliardi Cozman

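  • Background: the translation of natural language questions to SQL queries has attracted growing attention, in particular in connection with transformers and similar language models.
  • Goal: investigate translation to SQL when the input questions are given in Portuguese.
  • Method: the RAT-SQL+GAP system was changed to rely on a multilingual BART model, and a translated version of the Spider dataset was produced.
  • Results: the multilingual BART model, fine-tuned with a double-size training dataset (English and Portuguese), achieved 83% of the baseline when making inferences for the Portuguese test dataset.
  • Resources: the multilingual version of RAT-SQL+GAP and its data are available at mRAT-SQL.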