The database of English words in Croatian.xlsx
Source: Figshare. Updated 2022-06-07; indexed 2026-04-08.
Download link:
https://figshare.com/articles/dataset/The_database_of_English_words_in_Croatian_xlsx/20014364/1
Description:
To build the training and test data for the model, three independent evaluators manually labelled 60,000 words by language membership. An n-gram feature representation was combined with a linear Support Vector Machine (SVM) classifier (Smola & Schölkopf, 2004) to extract English words from the ENGRI corpus (Bogunović & Kučić, 2021; Kučić, 2021). The model achieved an F1 score of 0.9669 on the test set. The database contains 9,453 English words together with their absolute and relative frequencies.
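
The description names three quantities without defining them: n-gram features, absolute and relative frequencies, and the F1 score. The following is a minimal sketch of how each could be computed, not the authors' implementation. It assumes character-level n-grams with boundary padding, relative frequency as count divided by total tokens, and the standard F1 definition; the function names, the padding symbols, and the n-gram range are all hypothetical choices, since the description does not state them.

```python
from collections import Counter


def char_ngrams(word, n_min=1, n_max=3):
    """Count the character n-grams of one word.

    Assumption: the word is lowercased and padded with '<' and '>' so that
    word-initial and word-final sequences become distinct features.
    """
    padded = f"<{word.lower()}>"
    grams = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            grams[padded[i:i + n]] += 1
    return grams


def frequencies(tokens, vocabulary):
    """Map each vocabulary word to (absolute, relative) frequency in tokens.

    Assumption: relative frequency = count / total number of tokens.
    The database may use another normalisation, such as per million words.
    """
    counts = Counter(t.lower() for t in tokens)
    total = len(tokens)
    return {
        w: (counts[w], counts[w] / total if total else 0.0)
        for w in vocabulary
    }


def f1_score(tp, fp, fn):
    """F1 as the harmonic mean of precision and recall: 2TP / (2TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


if __name__ == "__main__":
    # Toy input, invented for illustration only (not taken from ENGRI).
    print(char_ngrams("ok", 2, 2))
    print(frequencies(["Email", "je", "stigao", "email"], ["email"]))
    print(f1_score(tp=8, fp=2, fn=2))
```

In a full pipeline, n-gram counts of this kind would be turned into feature vectors and passed to a linear SVM; scikit-learn's `LinearSVC` would be one possible choice, though the description does not say which implementation was used.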
Provided by:
Bogunović, Irena; Kučić, Mario
Created:
2022-06-07



