Marchanjo/spider-FIT-en-extra-3enr-1enb

Name: Marchanjo/spider-FIT-en-extra-3enr-1enb
Creator: Marchanjo
Published: 2024-01-16 12:42:35
License: 暂无描述

Hugging Face2024-01-16 更新2024-03-04 收录

下载链接：

https://hf-mirror.com/datasets/Marchanjo/spider-FIT-en-extra-3enr-1enb

下载链接

链接失效反馈

官方服务：

资源简介：

--- license: cc-by-sa-4.0 --- Distributed under the Creative Commons-by-sa-4.0 respecting the ShareAlike of the [Spider Dataset](https://yale-lily.github.io/spider). Code explanations and links for the model's checkpoints and datasets are on Github [mRAT-SQL](https://github.com/C4AI/gap-text2sql) Here is the [Hugging Face collection](https://huggingface.co/collections/Marchanjo/mrat-sql-65a671743bb0e70b416561f6), you can download the model's checkpoints and datasets, but to understand is better to go to Github [mRAT-SQL](https://github.com/C4AI/gap-text2sql). # mRAT-SQL-FIT ## A Multilingual Translator to SQL with Database Schema Pruning to Improve Self-Attention Marcelo Archanjo Jose, Fabio Gagliardi Cozman Long sequences of text are challenging in the context of transformers, due to quadratic memory increase in the self-attention mechanism. As this issue directly affects the translation from natural language to SQL queries (as techniques usually take as input a concatenated text with the question and the database schema), we present techniques that allow long text sequences to be handled by transformers with up to 512 input tokens. We propose a training process with database schema pruning (removal of tables and columns names that are useless for the query of interest). In addition, we used a multilingual approach with the mT5-large model fine-tuned with a data-augmented Spider dataset in four languages simultaneously: English, Portuguese, Spanish, and French. Our proposed technique used the Spider dataset and increased the exact set match accuracy results from 0.718 to 0.736 in a validation dataset (Dev). Source code, evaluations, and checkpoints are available at: [mRAT-SQL](https://github.com/C4AI/gap-text2sql). [paper published in Springer-Nature - International Journal of Information Technology](https://doi.org/10.1007/s41870-023-01342-3), [here the SharedIt link](https://rdcu.be/dff19). [here the pre-print in arXiv](https://arxiv.org/abs/2306.14256). # mRAT-SQL+GAP ## mRAT-SQL+GAP:A Portuguese Text-to-SQL Transformer Marcelo Archanjo José, Fabio Gagliardi Cozman The translation of natural language questions to SQL queries has attracted growing attention, in particular in connection with transformers and similar language models. A large number of techniques are geared towards the English language; in this work, we thus investigated translation to SQL when input questions are given in the Portuguese language. To do so, we properly adapted state-of-the-art tools and resources. We changed the RAT-SQL+GAP system by relying on a multilingual BART model (we report tests with other language models), and we produced a translated version of the Spider dataset. Our experiments expose interesting phenomena that arise when non-English languages are targeted; in particular, it is better to train with original and translated training datasets together, even if a single target language is desired. This multilingual BART model fine-tuned with a double-size training dataset (English and Portuguese) achieved 83% of the baseline, making inferences for the Portuguese test dataset. This investigation can help other researchers to produce results in Machine Learning in a language different from English. Our multilingual ready version of RAT-SQL+GAP and the data are available, open-sourced as mRAT-SQL+GAP at: [mRAT-SQL](https://github.com/C4AI/gap-text2sql). BRACIS 2021: [paper published in Springer Lecture Notes in Computer Science](https://link.springer.com/chapter/10.1007%2F978-3-030-91699-2_35), [here the pre-print in arXiv](https://arxiv.org/abs/2110.03546). Based on: RAT-SQL+GAP: [Github](https://github.com/awslabs/gap-text2sql). Paper: [AAAI 2021 paper](https://arxiv.org/abs/2012.10309)

提供机构：

Marchanjo

原始信息汇总