Supplemental Files for the article "Slovak morphological tokenizer using the Byte-Pair Encoding algorithm"
Source: DataCite Commons. Updated: 2025-06-01. Indexed: 2024-08-26.
Download link:
https://figshare.com/articles/dataset/Supplemental_Files_for_the_article_Slovak_morphological_tokenizer_using_the_Byte-Pair_Encoding_algorithm_/26805724/1
Description:
This repository contains four ZIP archives for Slovak language processing, covering subword tokenization, model pre-training, and model fine-tuning. The first archive (Tokenizers) includes the SKMT Tokenizer, word root dictionaries, and PureBPE tokenization files. The second archive (Text Tokenization and Analysis) contains a script for tokenizing text with three different tokenizers, along with statistical analysis and comparison results. The third archive (Source Codes and Datasets for Training and Fine-Tuning Models) provides source code and datasets for training and fine-tuning RoBERTa-based models, including sentiment datasets from SlovakBERT and an STS dataset. The final archive (Pre-trained models) contains two pre-trained RoBERTa models, SK_BPE and SK_Morph, each trained for 10 epochs.
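For readers unfamiliar with the Byte-Pair Encoding algorithm named in the title, the following is a minimal character-level sketch of how BPE learns and applies merge rules. It is an illustration only: it is not the SKMT Tokenizer or the PureBPE implementation shipped in the archives, it does not model the word root dictionaries, and the function names `train_bpe`, `merge_pair`, and `encode` are hypothetical.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (illustrative only)."""
    # Each distinct word starts as a tuple of single characters, with its frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Pick the most frequent pair; ties are broken lexicographically.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Tokenize `word` by applying the learned merges in the order they were learned."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    rules = train_bpe(["ab", "ab", "abc"], num_merges=5)
    print(rules)
    print(encode("cab", rules))
```

A morphological tokenizer such as the one described here would additionally constrain segmentation using word roots; that step is outside the scope of this sketch.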
Provider:
figshare
Created:
2024-08-22



