OffMix-3L
Source: arXiv | Updated: 2023-11-25 | Indexed: 2024-06-21
Download link:
https://github.com/LanguageTechnologyLab/OffMix-3L
Resource overview:
OffMix-3L is a trilingual code-mixed dataset for offensive language identification, created at George Mason University and containing 1,001 entries. The texts mix Bengali, English, and Hindi, and the dataset targets the problem of identifying offensive language in multilingual communities. To build it, a group of students fluent in all three languages were asked to write posts on everyday topics in the style of social media and were encouraged to mix languages. Its main intended use is content moderation in multilingual settings, where identifying offensive language supports safer online content and healthier community environments.
Provided by:
George Mason University
Created:
2023-10-27



