
Supplemental Files for the article "Slovak morphological tokenizer using the Byte-Pair Encoding algorithm"

DataCite Commons | Updated 2025-06-01 | Indexed 2024-08-26
Download link:
https://figshare.com/articles/dataset/Supplemental_Files_for_the_article_Slovak_morphological_tokenizer_using_the_Byte-Pair_Encoding_algorithm_/26805724/1
Resource description:
This repository contains multiple ZIP archives for Slovak language processing, specifically subword tokenization, model pre-training, and model fine-tuning. The first archive (Tokenizers) includes the SKMT Tokenizer, word root dictionaries, and PureBPE tokenization files. The second archive (Text Tokenization and Analysis) contains a script for tokenizing text with three different tokenizers, along with statistical analysis and comparison results. The third archive (Source Codes and Datasets for Training and Fine-Tuning Models) provides source code and datasets for training and fine-tuning RoBERTa-based models, including sentiment datasets from SlovakBERT and an STS dataset. The final archive (Pre-trained models) contains two pre-trained RoBERTa models, SK_BPE and SK_Morph, both trained for 10 epochs.
Provider:
figshare
Created:
2024-08-22