UzSentiNER v1: A 30,000-Sentence Uzbek Corpus for Joint Sentiment Analysis and NER (6 Entity Types) with Emoji Signals
Source: Mendeley Data (indexed 2026-04-18)
Download link:
https://data.mendeley.com/datasets/htkpynj4vb
Description:
UzSentiNER v1 is a 30,000-sentence Uzbek corpus designed for joint sentiment analysis (Positive/Negative/Neutral) and named entity recognition with six entity types: PERSON, POSITION, ORGANIZATION, LOCATION, DATE, PRODUCT.
The dataset combines 24,000 synthetic sentences (generated and calibrated with two rounds of manual quality control) and 6,000 linguist-curated sentences, merged into a single unified schema. Emoji usage is included to simulate real user writing patterns; emojis are placed naturally within sentences (not appended at the end).
The package provides multiple formats for reproducible research: XLSX, CSV, JSONL, plus token-level BIO tags and CoNLL (train/val/test) exports for NER training. Character-level entity spans are also included.
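To illustrate how character-level spans relate to token-level BIO tags, here is a minimal Python sketch. It rests on assumptions that the README should confirm or correct: spans are taken to be `(start, end, type)` triples with an exclusive end offset, tokens are taken to be whitespace-delimited, and the example sentence and its offsets are invented for illustration rather than drawn from the corpus. The function names are hypothetical and not part of the dataset package.

```python
from typing import List, Tuple

# Assumed span layout: (start, end, entity_type), with end exclusive.
Span = Tuple[int, int, str]


def whitespace_tokens(text: str) -> List[Tuple[int, int, str]]:
    """Split on whitespace and keep each token's character offsets."""
    tokens = []
    pos = 0
    for tok in text.split():
        start = text.index(tok, pos)
        end = start + len(tok)
        tokens.append((start, end, tok))
        pos = end
    return tokens


def spans_to_bio(text: str, spans: List[Span]) -> List[Tuple[str, str]]:
    """Tag every token that overlaps a span: B- for the first, I- for the rest."""
    tokens = whitespace_tokens(text)
    tags = ["O"] * len(tokens)
    for s_start, s_end, label in spans:
        first = True
        for i, (t_start, t_end, _) in enumerate(tokens):
            if t_start < s_end and t_end > s_start:  # character ranges overlap
                tags[i] = ("B-" if first else "I-") + label
                first = False
    return [(tok, tag) for (_, _, tok), tag in zip(tokens, tags)]


# Invented example sentence (not from the corpus).
text = "Anvar Karimov Toshkentda ishlaydi"
spans = [(0, 13, "PERSON"), (14, 22, "LOCATION")]
example = spans_to_bio(text, spans)
for token, tag in example:
    print(f"{token}\t{tag}")
```

Tagging by overlap means a suffixed token such as "Toshkentda" still receives the LOCATION tag even though the span covers only the stem. The released BIO and CoNLL files may use a different tokenization, so treat them as authoritative over this sketch.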
A manual audit of a random subset of 1,000 sentences found 95.1% of samples acceptable. The remaining 4.9% had sentiment-calibration issues only; no NER type errors were observed.
This dataset is intended for benchmarking and developing Uzbek NLP models for sentiment classification, NER, and multi-task learning. See README and documentation files for schema, annotation guidelines, validation report, and reproducibility instructions.
Created:
2026-02-23



