A Multi-Source Synthetic Dataset for Uzbek Sentiment Analysis, Named Entity Recognition, and Normalization
Source: NIAID Data Ecosystem. Indexed: 2026-05-10
Download link:
https://data.mendeley.com/datasets/khnmnp4t7v
Description:
This repository provides a multi-source synthetic Uzbek dataset for (i) sentiment classification (Positive/Neutral/Negative) and (ii) named entity recognition with PER/LOC/ORG/DATE labels, plus auxiliary resources for emoji-aware modeling and text normalization. The main file contains 10,000 unique sentences with aligned entity spans (surface forms + types) and an emoji-aware score in [-1,1]. Emoji usage is source-dependent (news ~15%, social ~75%, dialog ~55%) to better reflect real communication styles. All data were generated programmatically from rule-based templates and lexicons; no copyrighted or real user content was used. Primary formats are CSV and JSONL (XLSX provided only for convenience).
Created:
2026-01-02
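
The description names the label sets and the score range but not the file schema, so the following is only a minimal sketch of how a JSONL record could be sanity-checked after download. The field names (`text`, `source`, `sentiment`, `emoji_score`, `entities`) and the two sample records are assumptions made for illustration; they are not taken from the dataset and should be adjusted to the actual column names.

```python
import io
import json

# Label sets and score range as stated in the dataset description.
SENTIMENTS = {"Positive", "Neutral", "Negative"}
ENTITY_TYPES = {"PER", "LOC", "ORG", "DATE"}


def read_jsonl(handle):
    """Yield one parsed record per non-empty line of a JSONL stream."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


def validate_record(rec):
    """Return a list of problems found in one record (empty means OK).

    All field names here are hypothetical; rename them to match the files.
    """
    problems = []
    text = rec.get("text", "")
    if rec.get("sentiment") not in SENTIMENTS:
        problems.append("unknown sentiment label")
    score = rec.get("emoji_score")
    if not isinstance(score, (int, float)) or not -1.0 <= score <= 1.0:
        problems.append("emoji score outside [-1, 1]")
    for ent in rec.get("entities", []):
        if ent.get("type") not in ENTITY_TYPES:
            problems.append("unknown entity type")
        surface = ent.get("text")
        # An aligned entity's surface form should occur in the sentence.
        if not surface or surface not in text:
            problems.append("entity surface form not found in text")
    return problems


# Invented sample records, for illustration only (not from the dataset).
SAMPLE = """\
{"text": "Toshkent juda chiroyli shahar", "source": "social", "sentiment": "Positive", "emoji_score": 0.8, "entities": [{"text": "Toshkent", "type": "LOC"}]}
{"text": "Bugun yig'ilish bo'ladi", "source": "news", "sentiment": "Neutral", "emoji_score": 1.5, "entities": []}
"""

if __name__ == "__main__":
    for i, rec in enumerate(read_jsonl(io.StringIO(SAMPLE)), start=1):
        print(i, validate_record(rec) or "ok")
```

To run the same check on a downloaded file, the `io.StringIO(SAMPLE)` argument can be replaced with `open(path, encoding="utf-8")`, where `path` points at the JSONL file.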



