MOROCO (MOldavian and ROmanian Dialectal COrpus)
Source: OpenDataLab | Updated: 2026-05-31 | Indexed: 2024-05-09
Download link:
https://opendatalab.org.cn/OpenDataLab/MOROCO
Resource description:
In this work, we introduce the Moldavian and Romanian Dialectal Corpus (MOROCO), which is freely available for download at https://github.com/butnaruandrei/MOROCO. The corpus contains 33,564 text samples (over 10 million tokens) collected from the news domain. Each sample belongs to one of six topics: culture, finance, politics, science, sports, and tech. The dataset is split into 21,719 samples for training, 5,921 for validation, and 5,924 for testing. For each sample, we provide the corresponding dialect and category labels. This allows us to perform empirical studies on several classification tasks: (i) binary discrimination of Moldavian versus Romanian text samples, (ii) intra-dialect multi-class categorization by topic, and (iii) cross-dialect multi-class categorization by topic. We perform experiments using a shallow approach based on string kernels and a novel deep approach based on character-level convolutional neural networks with Squeeze-and-Excitation blocks. We also present and analyze the most discriminative features of our best-performing model, before and after named entity removal.
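The description mentions string kernels without saying which variants are used. As an illustration only, the sketch below implements two common character n-gram kernels (a presence-bits kernel and an intersection kernel). The function names and the n-gram range are assumptions made for this example; they are not taken from the MOROCO repository.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count every contiguous character n-gram in `text`."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def presence_bits_kernel(a, b, n_values=(3, 4, 5)):
    """Number of distinct n-grams that occur in both strings, summed over n.

    The default n-gram range is an assumption, not a value from the source.
    """
    total = 0
    for n in n_values:
        total += len(char_ngrams(a, n).keys() & char_ngrams(b, n).keys())
    return total


def intersection_kernel(a, b, n_values=(3, 4, 5)):
    """Sum over shared n-grams of the smaller of the two occurrence counts."""
    total = 0
    for n in n_values:
        counts_a, counts_b = char_ngrams(a, n), char_ngrams(b, n)
        shared = counts_a.keys() & counts_b.keys()
        total += sum(min(counts_a[g], counts_b[g]) for g in shared)
    return total
```

Either function yields a similarity score for a pair of texts. A full kernel matrix over the training samples could then be passed to any kernel-based classifier.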
Provided by:
OpenDataLab
Created:
2022-05-23
Collected summary
Dataset introduction

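Background and challenges
Background overview
MOROCO is a public Moldavian and Romanian dialectal corpus containing 33,564 news text samples across six topic categories, split into training, validation, and test sets. The dataset supports empirical studies of dialect discrimination and topic classification, such as binary dialect classification and multi-class topic classification.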
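To make the task definitions concrete, here is a minimal sketch of how labeled samples could be turned into the binary dialect task, the intra-dialect topic task, and the cross-dialect topic task. The `Sample` record, its field names, and the dialect codes "MD" and "RO" are hypothetical. The actual file layout of the corpus is not described here.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    text: str
    dialect: str  # hypothetical codes: "MD" (Moldavian) or "RO" (Romanian)
    topic: str    # one of the six topic labels


def binary_dialect_task(samples):
    """Task (i): predict the dialect of each text."""
    return [(s.text, s.dialect) for s in samples]


def intra_dialect_topic_task(train, test, dialect):
    """Task (ii): train and test on topic labels within a single dialect."""
    def keep(split):
        return [(s.text, s.topic) for s in split if s.dialect == dialect]
    return keep(train), keep(test)


def cross_dialect_topic_task(train, test, source, target):
    """Task (iii): train on one dialect's topics, test on the other's."""
    train_pairs = [(s.text, s.topic) for s in train if s.dialect == source]
    test_pairs = [(s.text, s.topic) for s in test if s.dialect == target]
    return train_pairs, test_pairs
```

Each function returns plain `(text, label)` pairs, so the same classifier code can be reused across all three tasks.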



