Rustem-Kaimolla/kazakh-swear-words
Hugging Face. Updated 2025-12-17. Indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/Rustem-Kaimolla/kazakh-swear-words
Resource description:
This is a low-resource language dataset containing Kazakh profanity, swear words, and offensive expressions along with neutral examples for binary classification tasks. The dataset is divided into two files: data.jsonl (56 entries, individual words) and sentences.jsonl (63 entries, phrases and expressions). Each entry includes text (Kazakh word or phrase), label (1 for profanity/offensive, 0 for neutral), category (vulgar, offensive, insult, or neutral), language code (kk), and severity (severe, moderate, mild, or none). The dataset is suitable for content moderation systems, chatbot filtering, toxicity detection models, text classification, and low-resource language NLP research.
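The record layout described above can be checked with a short Python sketch. The JSON key names (`text`, `label`, `category`, `lang`, `severity`) are assumptions: the description lists the fields but not the exact keys, so adjust them to match the actual files. The sample line is a made-up neutral record for illustration, not an entry copied from the dataset.

```python
import json

# Allowed values as listed in the dataset description.
ALLOWED_CATEGORIES = {"vulgar", "offensive", "insult", "neutral"}
ALLOWED_SEVERITIES = {"severe", "moderate", "mild", "none"}


def parse_jsonl(lines):
    """Parse an iterable of JSONL lines into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def validate_record(record):
    """Return a list of problems found in one record (empty if it looks valid).

    Key names here are assumed, not confirmed by the dataset card.
    """
    problems = []
    text = record.get("text")
    if not isinstance(text, str) or not text:
        problems.append("text must be a non-empty string")
    label = record.get("label")
    if type(label) is not int or label not in (0, 1):
        problems.append("label must be 0 (neutral) or 1 (profanity/offensive)")
    if record.get("category") not in ALLOWED_CATEGORIES:
        problems.append("unexpected category")
    if record.get("lang") != "kk":
        problems.append("language code must be 'kk'")
    if record.get("severity") not in ALLOWED_SEVERITIES:
        problems.append("unexpected severity")
    return problems


# Hypothetical neutral record, written for this example only.
sample_lines = [
    '{"text": "кітап", "label": 0, "category": "neutral", "lang": "kk", "severity": "none"}',
]

records = parse_jsonl(sample_lines)
for rec in records:
    print(rec["text"], validate_record(rec))
```

To run the same check on the real files, pass an open file handle instead of `sample_lines`, for example `parse_jsonl(open("data.jsonl", encoding="utf-8"))`, after downloading the file from the link above.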
Provided by:
Rustem-Kaimolla



