Marchanjo/spider-pt
收藏Hugging Face2024-01-16 更新2024-03-04 收录
下载链接:
https://hf-mirror.com/datasets/Marchanjo/spider-pt
下载链接
链接失效反馈官方服务:
资源简介:
---
license: cc-by-sa-4.0
---
Distributed under the Creative Commons-by-sa-4.0 respecting the ShareAlike of the [Spider Dataset](https://yale-lily.github.io/spider).
Code explanations and links for the model's checkpoints and datasets are on Github [mRAT-SQL](https://github.com/C4AI/gap-text2sql)
Here is the [Hugging Face collection](https://huggingface.co/collections/Marchanjo/mrat-sql-65a671743bb0e70b416561f6), you can download the model's checkpoints and datasets, but to understand is better to go to Github [mRAT-SQL](https://github.com/C4AI/gap-text2sql).
# mRAT-SQL-FIT
## A Multilingual Translator to SQL with Database Schema Pruning to Improve Self-Attention
Marcelo Archanjo Jose, Fabio Gagliardi Cozman
Long sequences of text are challenging in the context of transformers, due to quadratic memory increase in the self-attention mechanism. As this issue directly affects the translation from natural language to SQL queries (as techniques usually take as input a concatenated text with the question and the database schema), we present techniques that allow long text sequences to be handled by transformers with up to 512 input tokens. We propose a training process with database schema pruning (removal of tables and columns names that are useless for the query of interest). In addition, we used a multilingual approach with the mT5-large model fine-tuned with a data-augmented Spider dataset in four languages simultaneously: English, Portuguese, Spanish, and French. Our proposed technique used the Spider dataset and increased the exact set match accuracy results from 0.718 to 0.736 in a validation dataset (Dev). Source code, evaluations, and checkpoints are available at: [mRAT-SQL](https://github.com/C4AI/gap-text2sql).
[paper published in Springer-Nature - International Journal of Information Technology](https://doi.org/10.1007/s41870-023-01342-3), [here the SharedIt link](https://rdcu.be/dff19). [here the pre-print in arXiv](https://arxiv.org/abs/2306.14256).
# mRAT-SQL+GAP
## mRAT-SQL+GAP:A Portuguese Text-to-SQL Transformer
Marcelo Archanjo José, Fabio Gagliardi Cozman
The translation of natural language questions to SQL queries has attracted growing attention, in particular in connection with transformers and similar language models. A large number of techniques are geared towards the English language; in this work, we thus investigated translation to SQL when input questions are given in the Portuguese language. To do so, we properly adapted state-of-the-art tools and resources. We changed the RAT-SQL+GAP system by relying on a multilingual BART model (we report tests with other language models), and we produced a translated version of the Spider dataset. Our experiments expose interesting phenomena that arise when non-English languages are targeted; in particular, it is better to train with original and translated training datasets together, even if a single target language is desired. This multilingual BART model fine-tuned with a double-size training dataset (English and Portuguese) achieved 83% of the baseline, making inferences for the Portuguese test dataset. This investigation can help other researchers to produce results in Machine Learning in a language different from English. Our multilingual ready version of RAT-SQL+GAP and the data are available, open-sourced as mRAT-SQL+GAP at: [mRAT-SQL](https://github.com/C4AI/gap-text2sql).
BRACIS 2021: [paper published in Springer Lecture Notes in Computer Science](https://link.springer.com/chapter/10.1007%2F978-3-030-91699-2_35), [here the pre-print in arXiv](https://arxiv.org/abs/2110.03546).
Based on: RAT-SQL+GAP: [Github](https://github.com/awslabs/gap-text2sql). Paper: [AAAI 2021 paper](https://arxiv.org/abs/2012.10309)
提供机构:
Marchanjo
原始信息汇总
mRAT-SQL-FIT
A Multilingual Translator to SQL with Database Schema Pruning to Improve Self-Attention
Marcelo Archanjo Jose, Fabio Gagliardi Cozman
- 数据集描述:该数据集用于多语言SQL转换器,通过数据库模式剪枝来改进自注意力机制。处理长文本序列时,由于自注意力机制的二次内存增加问题,直接影响了自然语言到SQL查询的转换。
- 技术方法:提出了一种训练过程,包括数据库模式剪枝(移除对查询无用的表和列名),并使用多语言方法,对mT5-large模型进行微调,数据增强的Spider数据集包含四种语言:英语、葡萄牙语、西班牙语和法语。
- 性能提升:在验证数据集(Dev)上,精确集匹配准确率从0.718提高到0.736。
- 资源链接:源代码、评估和检查点可在mRAT-SQL获取。
mRAT-SQL+GAP
mRAT-SQL+GAP: A Portuguese Text-to-SQL Transformer
Marcelo Archanjo José, Fabio Gagliardi Cozman
- 数据集描述:该数据集用于葡萄牙语文本到SQL的转换器。针对自然语言问题到SQL查询的转换,特别是与转换器和类似语言模型相关的技术,大多数技术面向英语语言。
- 技术方法:通过依赖多语言BART模型(报告了与其他语言模型的测试),并制作了Spider数据集的翻译版本,进行了适当的工具和资源适应。
- 性能表现:使用双倍大小的训练数据集(英语和葡萄牙语)进行微调的多语言BART模型,在葡萄牙语测试数据集上达到了83%的基线水平。
- 资源链接:多语言版本的RAT-SQL+GAP和数据集可在mRAT-SQL获取。



