A Multi-Source Synthetic Dataset for Uzbek Sentiment Analysis, Named Entity Recognition, and Normalization

NIAID Data Ecosystem, indexed 2026-05-10

Download link:

https://data.mendeley.com/datasets/khnmnp4t7v

Description:

This repository provides a multi-source synthetic Uzbek dataset for (i) sentiment classification (Positive/Neutral/Negative) and (ii) named entity recognition with PER/LOC/ORG/DATE labels, plus auxiliary resources for emoji-aware modeling and text normalization. The main file contains 10,000 unique sentences with aligned entity spans (surface forms + types) and an emoji-aware score in [-1,1]. Emoji usage is source-dependent (news ~15%, social ~75%, dialog ~55%) to better reflect real communication styles. All data were generated programmatically from rule-based templates and lexicons; no copyrighted or real user content was used. Primary formats are CSV and JSONL (XLSX provided only for convenience).
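As a rough illustration of how the JSONL records described above might be sanity-checked, the sketch below validates the label sets and the score range stated in the description. The field names (`text`, `sentiment`, `emoji_score`, `entities`, `surface`, `type`) are assumptions made for this example and may not match the actual column names in the repository; the two sample records are invented for the demonstration and are not taken from the dataset.

```python
import io
import json

# Label sets and score range as stated in the dataset description.
SENTIMENTS = {"Positive", "Neutral", "Negative"}
ENTITY_TYPES = {"PER", "LOC", "ORG", "DATE"}


def read_jsonl(handle):
    """Yield one parsed record per non-empty line of a JSONL stream."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


def validate_record(rec):
    """Return a list of problems found in one record (empty if none).

    All field names used here are hypothetical.
    """
    problems = []
    if rec.get("sentiment") not in SENTIMENTS:
        problems.append("unknown sentiment label")
    score = rec.get("emoji_score")
    if not isinstance(score, (int, float)) or not -1.0 <= score <= 1.0:
        problems.append("emoji score outside [-1, 1]")
    for ent in rec.get("entities", []):
        if ent.get("type") not in ENTITY_TYPES:
            problems.append("unknown entity type")
        if ent.get("surface", "") not in rec.get("text", ""):
            problems.append("entity surface form not found in text")
    return problems


# Invented sample records: the first is well-formed, the second has a bad
# sentiment label and an out-of-range score.
sample_records = [
    {
        "text": "Toshkent chiroyli shahar",
        "sentiment": "Positive",
        "emoji_score": 0.8,
        "entities": [{"surface": "Toshkent", "type": "LOC"}],
    },
    {
        "text": "Toshkent",
        "sentiment": "Mixed",
        "emoji_score": 1.5,
        "entities": [],
    },
]
sample_jsonl = "\n".join(json.dumps(r, ensure_ascii=False) for r in sample_records)

results = [validate_record(r) for r in read_jsonl(io.StringIO(sample_jsonl))]
for i, problems in enumerate(results):
    print(i, problems or "ok")
```

To run the same check on the real file, replace the in-memory stream with an opened JSONL file (UTF-8 encoding) and adjust the field names to whatever the repository actually uses.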

Created:

2026-01-02
