MOROCO (MOldavian and ROmanian Dialectal COrpus)

Name: MOROCO (MOldavian and ROmanian Dialectal COrpus)
Creator: OpenDataLab
Published: 2026-05-31 06:30:10
License: Not specified

OpenDataLab. Updated: 2026-05-31. Indexed: 2024-05-09.

Download link:

https://opendatalab.org.cn/OpenDataLab/MOROCO

Description:

In this work, we introduce the Moldavian and Romanian Dialectal Corpus (MOROCO), which is freely available for download at https://github.com/butnaruandrei/MOROCO. The corpus contains 33,564 text samples (over 10 million tokens) collected from the news domain. Each sample belongs to one of six topics: culture, finance, politics, science, sports, and technology. The dataset is split into 21,719 samples for training, 5,921 for validation, and 5,924 for testing. Each sample comes with its dialect label and its category label. This allows empirical studies on several classification tasks: (i) binary discrimination between Moldavian and Romanian text samples, (ii) intra-dialect multi-class classification by topic, and (iii) cross-dialect multi-class classification by topic. We run experiments with a shallow approach based on string kernels and with a novel deep approach based on character-level convolutional neural networks containing Squeeze-and-Excitation blocks. We also present and analyze the most discriminative features of our best-performing model, before and after named entity removal.
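The three classification tasks described above can be framed from the two labels attached to each sample. The following is a minimal sketch, not the authors' code: the `Sample` record, the `"MD"`/`"RO"` label encoding, and the helper function names are all hypothetical, and reading task (iii) as "train on one dialect, evaluate on the other" is an assumption about what "cross-dialect" means here. Only the split sizes and the topic list come from the description.

```python
from typing import NamedTuple


class Sample(NamedTuple):
    text: str
    dialect: str   # hypothetical encoding: "MD" (Moldavian) or "RO" (Romanian)
    category: str  # one of the six topics listed in the description


TOPICS = ("culture", "finance", "politics", "science", "sports", "technology")

# Split sizes as stated in the description; they sum to the 33,564 total.
SPLIT_SIZES = {"train": 21719, "validation": 5921, "test": 5924}


def binary_dialect_task(samples):
    """Task (i): pair each text with its dialect label."""
    return [(s.text, s.dialect) for s in samples]


def intra_dialect_topic_task(train, test, dialect):
    """Task (ii): topic classification, train and test within one dialect."""
    tr = [(s.text, s.category) for s in train if s.dialect == dialect]
    te = [(s.text, s.category) for s in test if s.dialect == dialect]
    return tr, te


def cross_dialect_topic_task(train, test, source, target):
    """Task (iii), as assumed here: train on one dialect, test on the other."""
    tr = [(s.text, s.category) for s in train if s.dialect == source]
    te = [(s.text, s.category) for s in test if s.dialect == target]
    return tr, te
```

Any text classifier can then be trained on the returned (text, label) pairs; the sketch only fixes which samples and which label each task uses.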

Provided by:

OpenDataLab

Created:

2022-05-23

Collected summary

Dataset introduction