
Spanish to Mexican Sign Language (MSL) glosses corpus for NLP tasks.

Source: Figshare. Updated 2025-03-02. Indexed 2026-04-28.
Download link:
https://figshare.com/articles/dataset/Spanish_to_Mexican_Sign_Language_MSL_glosses_corpus_for_NLP_tasks_/28519580
Description:
This work shares a dataset of sentence pairs that map Spanish (SPA) to Mexican Sign Language (MSL) glosses, that is, transcribed MSL, for use in a downstream task. The corpus was built as a SPA-to-MSL resource with a specific representation of the Spanish language for MSL interpretation, and it is proposed as a reference dataset for evaluating diverse neural machine translation (NMT) system variants. With the support of MSL grammar books and advice from MSL interpreters, this study developed a dataset of 3000 SPA-to-MSL sentence pairs. The distribution of the 3000 sentences in the corpus follows the linguistic composition of the Spanish language.

To test the corpus as a data source for NMT, two neural transformer models for Spanish paraphrasing were used. The first NMT model uses a Helsinki-NLP SPA-SPA transformer developed by the Language Technologies Research Group at the University of Helsinki. The second NMT model uses a SPA-to-SPA pre-trained neural transformer presented as the BARTO approach. Both evaluations used a transfer learning strategy, which has been shown to be effective for modeling low-resource languages, achieving state-of-the-art results in translation quality.

Spanish-MSL glosses dataset: an .xlsx file that contains 3000 Spanish-MSL gloss pairs. To use the dataset, it must first be converted to .csv format.

Model M1: a Colab file that contains the code for fine-tuning Helsinki-NLP/opus-mt-es-es, available on Hugging Face: https://huggingface.co/Helsinki-NLP/opus-mt-es-es. The model was fine-tuned on the MSL-Spanish glosses corpus using the Hugging Face transformers library and its Trainer API for translation and evaluation. Translation quality was measured with ROUGE, TER, and BLEU. The model card of M1 is available at https://huggingface.co/VaniLara/esp-to-lsm-model and includes a guide on how to use the model with the transformers library. To run the Colab, you will need to create an access token, use your own Google Drive account, and create a repository on Hugging Face.

Model M2: a Colab file that contains the code for fine-tuning vgaraujov/bart-base-spanish, available on Hugging Face: https://huggingface.co/vgaraujov/bart-base-spanish. The model was fine-tuned on the MSL-Spanish glosses corpus using the Hugging Face transformers library and its Trainer API for translation and evaluation. Translation quality was measured with ROUGE, TER, and BLEU. The model card of M2 is available at https://huggingface.co/VaniLara/esp-to-lsm-barto-model and includes a guide on how to use the model with the transformers library. To run the Colab, you will need to create an access token, use your own Google Drive account, and create a repository on Hugging Face.

Model M1-split-version and Model M2-split-version: variants that use the dataset split into 80% training, 10% validation, and 10% testing. Model cards are available at https://huggingface.co/vania2911/esp-to-lsm-barto-model and https://huggingface.co/vania2911/esp-to-lsm-model-split.

Translations M1 and M2: the reference and predicted translations for each model.

esp-msl analysis: statistical analysis of the dataset.
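The 80% / 10% / 10% split mentioned in the description can be sketched with the Python standard library once the .xlsx file has been exported to .csv. This is a minimal sketch, not the procedure used in the Colab files: the column names `spanish` and `msl_gloss`, the seed, and both function names are assumptions made for illustration, and the listing does not say whether the original split was shuffled.

```python
import csv
import io
import random


def read_pairs(csv_file, source_col="spanish", target_col="msl_gloss"):
    """Read (Spanish, MSL gloss) pairs from an open CSV file.

    The column names are hypothetical; adjust them to the header row
    of the exported .csv file.
    """
    reader = csv.DictReader(csv_file)
    return [(row[source_col], row[target_col]) for row in reader]


def split_pairs(pairs, train_pct=80, val_pct=10, seed=42):
    """Shuffle the pairs and split them into train/validation/test lists.

    Integer percentages avoid floating-point rounding surprises.
    Whatever is left after train and validation becomes the test set.
    """
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = len(shuffled) * train_pct // 100
    n_val = len(shuffled) * val_pct // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Placeholder rows only; these are not sentences from the corpus.
placeholder_csv = io.StringIO(
    "spanish,msl_gloss\n" + "".join(f"oracion {i},GLOSA {i}\n" for i in range(3000))
)
train, val, test = split_pairs(read_pairs(placeholder_csv))
print(len(train), len(val), len(test))
```

With 3000 pairs, these percentages give 2400 training, 300 validation, and 300 test pairs.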
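The description lists TER among the reported metrics. As a rough illustration of what the score measures, the sketch below computes a simplified, shift-free variant: the word-level edit distance between a predicted gloss sequence and its reference, divided by the reference length. Full TER also counts block shifts as single edits, so this simplification can score higher than a standard implementation would. The listing does not say how the Colab files compute the metric, and the function names here are hypothetical.

```python
def edit_distance(hyp_tokens, ref_tokens):
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(ref_tokens) + 1))
    for i, hyp_tok in enumerate(hyp_tokens, start=1):
        curr = [i]
        for j, ref_tok in enumerate(ref_tokens, start=1):
            cost = 0 if hyp_tok == ref_tok else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis token
                            curr[j - 1] + 1,      # insert a reference token
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def simple_ter(hypothesis, reference):
    """Shift-free approximation of TER: edits divided by reference length."""
    hyp = hypothesis.split()
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one token")
    return edit_distance(hyp, ref) / len(ref)


# Placeholder tokens, not real MSL glosses.
print(simple_ter("A X C", "A B C"))
```

Lower values are better: 0.0 means the prediction matches the reference token for token.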
Created: 2025-03-02