The database of English words in Croatian.xlsx

Name: The database of English words in Croatian.xlsx
Creator: Bogunović, Irena; Kučić, Mario
Published: 2022-06-07 00:00:00
License: No description available

Source: Figshare. Updated: 2022-06-07. Indexed: 2026-04-08.

Download link:

https://figshare.com/articles/dataset/The_database_of_English_words_in_Croatian_xlsx/20014364/1

Description:

To build a dataset to train and test the model, 60,000 words were manually labelled according to language membership by three independent evaluators. N-gram feature representation was used in combination with a linear Support Vector Machine classification algorithm (SVM) (Smola & Schölkopf, 2004) to extract English words from the ENGRI corpus (Bogunović & Kučić, 2021; Kučić, 2021). An F1 score of 0.9669 was achieved on the test set. The database contains 9,453 English words as well as their absolute and relative frequencies.
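
The two ideas in the description, an n-gram representation of words and absolute versus relative frequencies, can be illustrated with a minimal sketch. Several details here are assumptions rather than the authors' method: the description does not say whether the n-grams are character-level or which sizes were used, and it does not define how relative frequency was normalised. The sketch assumes character n-grams with boundary markers, and relative frequency as a word's count divided by the total token count. The function names and the toy token list are hypothetical and are not taken from the ENGRI corpus.

```python
from collections import Counter


def char_ngrams(word, n_min=1, n_max=3):
    """Count character n-grams of one word (assumed feature type).

    The word is lower-cased and wrapped in '<' and '>' so that
    word-initial and word-final n-grams are distinguishable.
    """
    padded = f"<{word.lower()}>"
    grams = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            grams[padded[i:i + n]] += 1
    return grams


def frequencies(tokens):
    """Map each token to (absolute count, count / total tokens).

    Dividing by the total token count is an assumed normalisation.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: (count, count / total) for word, count in counts.items()}


# Hypothetical toy input for illustration only.
tokens = ["show", "vikend", "show", "tim"]
features = char_ngrams("show", n_min=2, n_max=2)
freqs = frequencies(tokens)
```

Feature counts of this kind could be vectorised and passed to a linear SVM implementation to separate English from non-English words; that training step is omitted here because the description gives no hyperparameters or feature settings.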

Provided by:

Bogunović, Irena; Kučić, Mario

Created:

2022-06-07
