Moroccan YouTube Corpus (MYC)
Source: arXiv | Updated: 2023-11-07 | Indexed: 2024-06-21
Download link:
https://github.com/MouadJb/MYC
Description:
Moroccan YouTube Corpus (MYC) is a large public dataset created at Sultan Moulay Slimane University for sentiment analysis of the Moroccan dialect. It contains 20,000 manually annotated Moroccan dialect texts written in both Arabic and Latin scripts, reflecting the linguistic diversity of Moroccan online content. The research team collected YouTube comments on a variety of topics, and native speakers of the Moroccan dialect annotated them by sentiment. In addition to the text data, the dataset provides a stopword list for the Moroccan dialect. It is intended for machine learning and natural language processing research and aims to address the complexity and multilingual nature of Moroccan dialect sentiment analysis.
Provider:
Sultan Moulay Slimane University
Created:
2023-03-28



