UzSentiNER v1: A 30,000-Sentence Uzbek Corpus for Joint Sentiment Analysis and NER (6 Entity Types) with Emoji Signals

Mendeley Data, indexed 2026-04-18

Download link:

https://data.mendeley.com/datasets/htkpynj4vb

Description:

UzSentiNER v1 is a 30,000-sentence Uzbek corpus designed for joint sentiment analysis (Positive/Negative/Neutral) and named entity recognition with six entity types: PERSON, POSITION, ORGANIZATION, LOCATION, DATE, PRODUCT. The dataset combines 24,000 synthetic sentences (generated and calibrated with two rounds of manual quality control) and 6,000 linguist-curated sentences, merged into a single unified schema. Emoji usage is included to simulate real user writing patterns; emojis are placed naturally within sentences (not appended at the end). The package provides multiple formats for reproducible research: XLSX, CSV, JSONL, plus token-level BIO tags and CoNLL (train/val/test) exports for NER training. Character-level entity spans are also included. A manual audit on a random subset of 1,000 sentences reports 95.1% acceptable samples; remaining issues relate only to sentiment calibration, with 0% NER-type errors observed. This dataset is intended for benchmarking and developing Uzbek NLP models for sentiment classification, NER, and multi-task learning. See README and documentation files for schema, annotation guidelines, validation report, and reproducibility instructions.
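The token-level BIO tags and CoNLL exports mentioned above can be turned into entity spans with a few lines of Python. The following is a minimal sketch, not the dataset's own tooling. It assumes a whitespace-separated CoNLL layout with the token in the first column and the tag in the last, blank lines between sentences, and tag names of the form B-PERSON / I-PERSON built from the six entity types listed in the description. The actual column layout and tag spelling should be checked against the README. The sample sentences are invented for illustration and are not taken from the corpus.

```python
from typing import List, Tuple

# Invented sample in an ASSUMED CoNLL layout (token first, tag last).
# These sentences are illustrative only and do not come from UzSentiNER.
SAMPLE = """\
Anvar B-PERSON
Karimov I-PERSON
Toshkentda B-LOCATION
ishlaydi O

Bugun B-DATE
yaxshi O
"""


def read_conll(text: str) -> List[List[Tuple[str, str]]]:
    """Split CoNLL-style text into sentences of (token, tag) pairs."""
    sentences: List[List[Tuple[str, str]]] = []
    current: List[Tuple[str, str]] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split()
        current.append((parts[0], parts[-1]))
    if current:
        sentences.append(current)
    return sentences


def bio_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Convert BIO tags to (label, start, end) token spans, end exclusive.

    A stray I- tag that does not continue the open entity is treated
    leniently as the start of a new entity.
    """
    spans: List[Tuple[str, int, int]] = []
    start = None
    label = None
    for i, tag in enumerate(tags):
        if tag.startswith("I-") and label == tag[2:]:
            continue  # still inside the open entity
        if label is not None:
            spans.append((label, start, i))
            start, label = None, None
        if tag.startswith("B-") or tag.startswith("I-"):
            start, label = i, tag[2:]
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans


if __name__ == "__main__":
    for sentence in read_conll(SAMPLE):
        tokens = [tok for tok, _ in sentence]
        tags = [tag for _, tag in sentence]
        for ent_label, s, e in bio_to_spans(tags):
            print(ent_label, " ".join(tokens[s:e]))
```

Token-index spans like these are not the same as the character-level spans the package also provides. Mapping between the two depends on the tokenization used by the dataset, which this sketch does not assume.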

Created:

2026-02-23