
uznlp-uz/UzMedSentiment

Source: Hugging Face. Updated 2026-04-08. Indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/uznlp-uz/UzMedSentiment
Resource description:
---
pretty_name: UzMedSentiment
language:
- uz
license: cc-by-4.0
task_categories:
- text-classification
task_ids:
- sentiment-classification
size_categories:
- 1K<n<10K
configs:
- config_name: default
  data_files:
  - split: train
    path: UzMedSentiment.tsv
  sep: "\t"
---

# UzMedSentiment

## Dataset Summary

UzMedSentiment is an Uzbek medical-domain dataset for sentiment classification, aspect-based sentiment analysis, and auxiliary cue detection.

The current release contains **4,791** annotated rows in a single TSV file. Each row includes:

- source metadata
- a medical-domain text snippet
- an aspect label
- a 3-way sentiment label
- a 5-point polarity score
- adverse drug reaction (ADR) and severity annotations
- negation, speculation, sarcasm, and cue-span annotations

## Supported Tasks

- sentiment classification
- aspect-based sentiment analysis (ABSA)
- ADR signal detection
- negation detection
- speculation detection
- sarcasm detection

## Files

- `UzMedSentiment.tsv`: main release file in tab-separated format

## Dataset Structure

### Columns

| Column | Type | Description |
|---|---|---|
| `id` | integer | Unique identifier |
| `source` | categorical | Data source such as `telegram`, `instagram`, `forum`, `web` |
| `lang` | categorical | Language/script tag in the released file |
| `text` | string | De-identified patient comment or medical-domain text |
| `aspect` | categorical | Aspect label |
| `sentiment` | categorical | `POS`, `NEG`, or `NEU` |
| `polarity_score` | integer | Polarity intensity in the range `-2` to `+2` |
| `adr_flag` | binary | `0` or `1`, whether an ADR is present |
| `severity` | categorical | ADR severity |
| `negation` | binary | `0` or `1` |
| `speculation` | binary | `0` or `1` |
| `sarcasm` | binary | `0` or `1` |
| `cue_span` | string | Cue phrase such as a negation or speculation trigger |

## Tagsets

### Sentiment Labels

| Label | Meaning | Description | Example |
|---|---|---|---|
| `POS` | Positive | Helpful, convenient, or beneficial outcome | `Dori yordam berdi.` |
| `NEG` | Negative | Complaint, harm, problem, or adverse outcome | `Navbat juda uzun.` |
| `NEU` | Neutral | Question, factual statement, or neutral information | `Bu dori xavfsizmi?` |

### Polarity Intensity

| Score | Meaning | Description | Example |
|---|---|---|---|
| `2` | Strong positive | Very good result or clear benefit | `Ajoyib natija!` |
| `1` | Positive | Good or satisfactory outcome | `Og‘riq kamaydi.` |
| `0` | Neutral | Neutral statement or question | `Dorini ichdim.` |
| `-1` | Negative | Negative experience or mild complaint | `Yon ta’sir paydo bo‘ldi.` |
| `-2` | Strong negative | Severe complaint or very bad experience | `Bu dori juda yomon ta’sir qildi.` |

### Aspect Labels

| Aspect | Description | Example context |
|---|---|---|
| `dori` | Drug, medicine, or pharmacotherapy | tabletka, sirop, kapsula, ukol |
| `simptom` | Clinical symptom reported by the patient | og‘riq, isitma, yo‘tal, toshma |
| `muolaja` | Treatment process or intervention | fizioterapiya, in’eksiya, davolanish |
| `diagnostika` | Test, screening, or diagnostic process | analiz, test, rentgen, UTT |
| `shifokor-munosabati` | Doctor or staff behavior and communication | shifokor, hamshira, muomala |
| `xizmat` | Administrative or service interaction | registratura, yozilish, operator |
| `narx` | Financial aspect | narx, to‘lov, arzon/qimmat |
| `kutish-vaqti` | Waiting time or delay | navbat, kechikish, tezkorlik |
| `infratuzilma` | Facility or physical conditions | palata, joylashuv, tozalik, sovuq |
| `parhez` | Diet or regimen | ovqat, rejim, parhez |

### Clinical Risk and Linguistic Cues

| Field | Values | Meaning |
|---|---|---|
| `adr_flag` | `0`, `1` | Whether an adverse drug reaction is present |
| `severity` | `engil`, `o‘rta`, `og‘ir`, `null` | Severity of the adverse event |
| `negation` | `0`, `1` | Negation is present |
| `speculation` | `0`, `1` | Speculation or uncertainty is present |
| `sarcasm` | `0`, `1` | Sarcasm or irony is present |
| `cue_span` | free text | Trigger phrase such as `ehtimol`, `hech qanday`, `yo‘q` |

## Statistics

### Overview

- Rows: **4,791**
- Columns: **13**
- Split: **train** only
- Average text length: **132.12** characters
- Median text length: **96** characters
- Maximum text length: **1,549** characters
- Non-empty `cue_span`: **3,663**

### Sentiment Distribution

| Label | Count |
|---|---:|
| `NEG` | 1,970 |
| `NEU` | 1,783 |
| `POS` | 1,038 |

### Aspect Distribution

| Aspect | Count |
|---|---:|
| `simptom` | 1,319 |
| `muolaja` | 978 |
| `diagnostika` | 626 |
| `dori` | 541 |
| `xizmat` | 469 |
| `shifokor-munosabati` | 424 |
| `infratuzilma` | 197 |
| `narx` | 104 |
| `kutish-vaqti` | 77 |
| `parhez` | 53 |

### Source Distribution

| Source | Count |
|---|---:|
| `forum` | 3,688 |
| `telegram` | 927 |
| `instagram` | 89 |
| `web` | 74 |
| `web-komment` | 13 |

### Additional Annotation Counts

| Field | Positive / non-default value | Count |
|---|---|---:|
| `adr_flag` | `1` | 158 |
| `negation` | `1` | 816 |
| `speculation` | `1` | 1,321 |
| `sarcasm` | `1` | 48 |

## Notes on the Current Release

- No official train, validation, or test split is provided.
- The `id` column is normalized to sequential unique values from `1` to `4,791`.
- Text content was whitespace-normalized in the current release: tab, carriage return, and newline characters were replaced with spaces, and repeated whitespace was collapsed.
- The `lang` column is heterogeneous in the released TSV and contains both `uz` and `uz-latin`.
- The official aspect tagset contains 10 labels, but the released TSV also includes three outlier `aspect` values: `Simptom`, `ijtimoiy`, and `adr_flag`.
- The official `severity` tagset is `engil`, `o‘rta`, `og‘ir`, and `null`, but a small number of rows contain `jiddiy`, `yuqori`, `0`, and `o'rta`.
- One row still uses a Unicode minus sign in `polarity_score` (`−1`) instead of ASCII `-1`.

These inconsistencies are minor in count, but users may want to normalize labels before training or evaluation.

## Usage

```python
from datasets import load_dataset

dataset = load_dataset(
    "csv",
    data_files={"train": "UzMedSentiment.tsv"},
    delimiter="\t",
    encoding="utf-8-sig",
)

print(dataset["train"][0])
```

## License

This dataset is released under the `CC-BY-4.0` license.
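
## Normalizing Outlier Labels (Sketch)

The release notes above suggest normalizing labels before training or evaluation. The following is a minimal sketch of what that could look like, not an official cleaning script. The helper names `normalize_row` and `off_tagset` are hypothetical, and the two remappings (`Simptom` to `simptom`, `o'rta` to `o‘rta`) are assumptions that cover only case and apostrophe variants. The card does not say how `ijtimoiy`, `adr_flag`, `jiddiy`, `yuqori`, or `0` should be mapped, so the sketch only reports them.

```python
# Official tagsets as listed in the dataset card.
ORTA = "o‘rta"
ASPECTS = {
    "dori", "simptom", "muolaja", "diagnostika", "shifokor-munosabati",
    "xizmat", "narx", "kutish-vaqti", "infratuzilma", "parhez",
}
SEVERITIES = {"engil", ORTA, "og‘ir", "null"}

# Assumed remappings (not specified by the card): case and apostrophe variants only.
ASPECT_FIXES = {"Simptom": "simptom"}
SEVERITY_FIXES = {"o'rta": ORTA}


def normalize_row(row):
    """Return a copy of `row` with the conservative label fixes applied."""
    fixed = dict(row)
    fixed["aspect"] = ASPECT_FIXES.get(row["aspect"], row["aspect"])
    fixed["severity"] = SEVERITY_FIXES.get(row["severity"], row["severity"])
    # Replace the Unicode minus sign (U+2212) with ASCII '-' before parsing.
    fixed["polarity_score"] = int(str(row["polarity_score"]).replace("\u2212", "-"))
    return fixed


def off_tagset(row):
    """List the fields whose values still fall outside the official tagsets."""
    problems = []
    if row["aspect"] not in ASPECTS:
        problems.append("aspect")
    # Some CSV readers parse the literal string `null` as a missing value (None).
    if row["severity"] is not None and row["severity"] not in SEVERITIES:
        problems.append("severity")
    return problems
```

Both helpers work on plain dictionaries, such as the rows obtained by iterating over `dataset["train"]`. Rows for which `off_tagset` returns a non-empty list can then be reviewed by hand or dropped, depending on the experiment.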
Provider:
uznlp-uz