MorphBPE Evaluation Datasets
Source: arXiv (indexed 2025-09-30)
Download link:
https://github.com/llm-lab-org/MorphBPE
Description:
This dataset contains evaluations of MorphBPE on morphological edit distance and morphological consistency across several languages, including English, Russian, Hungarian, and Arabic. It is designed to compare MorphBPE with standard BPE on morphological accuracy and training efficiency. It is built on approximately 14 billion tokens and includes training runs at two model sizes, 3 billion and 10 billion parameters. The core task is tokenization and the evaluation of morphological structure.
Provider:
LLM Lab



