Supplemental Files for the article "Slovak morphological tokenizer using the Byte-Pair Encoding algorithm"

Name: Supplemental Files for the article "Slovak morphological tokenizer using the Byte-Pair Encoding algorithm"
Creator: figshare
Published: 2025-06-01 04:35:48
License: No description available

DataCite Commons (updated 2025-06-01, indexed 2024-08-26)

Download link:

https://figshare.com/articles/dataset/Supplemental_Files_for_the_article_Slovak_morphological_tokenizer_using_the_Byte-Pair_Encoding_algorithm_/26805724/1

Description:

This repository contains four ZIP archives for Slovak language processing, covering subword tokenization, model pre-training, and model fine-tuning. The first archive (Tokenizers) includes the SKMT Tokenizer, word root dictionaries, and PureBPE tokenization files. The second archive (Text Tokenization and Analysis) contains a script for tokenizing text with three different tokenizers, along with statistical analysis and comparison results. The third archive (Source Codes and Datasets for Training and Fine-Tuning Models) provides source code and datasets for training and fine-tuning RoBERTa-based models, including sentiment datasets from SlovakBERT and an STS dataset. The final archive (Pre-trained models) contains two pre-trained RoBERTa models, SK_BPE and SK_Morph, each trained for 10 epochs.

Provider:

figshare

Created:

2024-08-22
