Egyptian Arabic Wikipedia Dataset
Source: arXiv | Updated: 2024-03-31 | Indexed: 2024-08-06
Download link:
http://arxiv.org/abs/2404.00565v1
Resource description:
The Egyptian Arabic Wikipedia Dataset was created by researchers at Clarkson University and other institutions and covers articles from the Egyptian Arabic edition of Wikipedia. It contains approximately 736,107 articles and is intended primarily for research on template translation. Articles in the dataset were generated through automated template translation, which results in low content quality and content that does not reflect Egyptian culture. By analyzing article density, quality, and human contributions, the researchers built multivariate machine learning classifiers that automatically detect template-translated articles, with the aim of improving the dataset's quality and representativeness and addressing cultural inaccuracy and shallow content.
Provided by:
Clarkson University
Created:
2024-03-31



