uznlp-uz/UzMedSentiment
Source: Hugging Face. Updated 2026-04-08; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/uznlp-uz/UzMedSentiment
Description:
---
pretty_name: UzMedSentiment
language:
- uz
license: cc-by-4.0
task_categories:
- text-classification
task_ids:
- sentiment-classification
size_categories:
- 1K<n<10K
configs:
- config_name: default
  data_files:
  - split: train
    path: UzMedSentiment.tsv
  sep: "\t"
---
# UzMedSentiment
## Dataset Summary
UzMedSentiment is an Uzbek medical-domain dataset for sentiment classification, aspect-based sentiment analysis, and auxiliary cue detection. The current release contains **4,791** annotated rows in a single TSV file.
Each row includes:
- source metadata
- a medical-domain text snippet
- an aspect label
- a 3-way sentiment label
- a 5-point polarity score
- adverse drug reaction (ADR) and severity annotations
- negation, speculation, sarcasm, and cue-span annotations
## Supported Tasks
- sentiment classification
- aspect-based sentiment analysis (ABSA)
- ADR signal detection
- negation detection
- speculation detection
- sarcasm detection
## Files
- `UzMedSentiment.tsv`: main release file in tab-separated format
## Dataset Structure
### Columns
| Column | Type | Description |
|---|---|---|
| `id` | integer | Unique identifier |
| `source` | categorical | Data source such as `telegram`, `instagram`, `forum`, `web` |
| `lang` | categorical | Language/script tag (`uz` or `uz-latin` in the released file) |
| `text` | string | De-identified patient comment or medical-domain text |
| `aspect` | categorical | Aspect label |
| `sentiment` | categorical | `POS`, `NEG`, or `NEU` |
| `polarity_score` | integer | Polarity intensity in the range `-2` to `+2` |
| `adr_flag` | binary | `0` or `1`, whether an ADR is present |
| `severity` | categorical | ADR severity |
| `negation` | binary | `0` or `1` |
| `speculation` | binary | `0` or `1` |
| `sarcasm` | binary | `0` or `1` |
| `cue_span` | string | Cue phrase such as a negation or speculation trigger |
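
The column table can be turned into a quick header and value check before any modelling. The sketch below uses only the standard library and assumes the TSV has a header row with the 13 column names in the order listed above; the two sample rows are hypothetical stand-ins built from example sentences in this card, and writing `severity` as a literal `null` is likewise an assumption.

```python
import csv
import io

# Column names in the order given by the table above (assumed to match the header).
EXPECTED_COLUMNS = [
    "id", "source", "lang", "text", "aspect", "sentiment", "polarity_score",
    "adr_flag", "severity", "negation", "speculation", "sarcasm", "cue_span",
]
BINARY_COLUMNS = ("adr_flag", "negation", "speculation", "sarcasm")


def read_rows(handle):
    """Yield one dict per TSV row after checking the header and binary flags."""
    reader = csv.DictReader(handle, delimiter="\t")
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    for row in reader:
        for name in BINARY_COLUMNS:
            if row[name] not in ("0", "1"):
                raise ValueError(f"row {row['id']}: {name}={row[name]!r}")
        yield row


# Hypothetical sample rows, NOT taken from the released file.
SAMPLE = (
    "\t".join(EXPECTED_COLUMNS) + "\n"
    "1\tforum\tuz\tDori yordam berdi.\tdori\tPOS\t1\t0\tnull\t0\t0\t0\t\n"
    "2\tforum\tuz\tBu dori xavfsizmi?\tdori\tNEU\t0\t0\tnull\t0\t0\t0\t\n"
)
rows = list(read_rows(io.StringIO(SAMPLE)))
print(len(rows), rows[0]["sentiment"])
```

For the real file, replace the `io.StringIO` sample with `open("UzMedSentiment.tsv", encoding="utf-8-sig", newline="")`.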
## Tagsets
### Sentiment Labels
| Label | Meaning | Description | Example |
|---|---|---|---|
| `POS` | Positive | Helpful, convenient, or beneficial outcome | `Dori yordam berdi.` |
| `NEG` | Negative | Complaint, harm, problem, or adverse outcome | `Navbat juda uzun.` |
| `NEU` | Neutral | Question, factual statement, or neutral information | `Bu dori xavfsizmi?` |
### Polarity Intensity
| Score | Meaning | Description | Example |
|---|---|---|---|
| `2` | Strong positive | Very good result or clear benefit | `Ajoyib natija!` |
| `1` | Positive | Good or satisfactory outcome | `Og‘riq kamaydi.` |
| `0` | Neutral | Neutral statement or question | `Dorini ichdim.` |
| `-1` | Negative | Negative experience or mild complaint | `Yon ta’sir paydo bo‘ldi.` |
| `-2` | Strong negative | Severe complaint or very bad experience | `Bu dori juda yomon ta’sir qildi.` |
### Aspect Labels
| Aspect | Description | Example context |
|---|---|---|
| `dori` | Drug, medicine, or pharmacotherapy | tabletka, sirop, kapsula, ukol |
| `simptom` | Clinical symptom reported by the patient | og‘riq, isitma, yo‘tal, toshma |
| `muolaja` | Treatment process or intervention | fizioterapiya, in’eksiya, davolanish |
| `diagnostika` | Test, screening, or diagnostic process | analiz, test, rentgen, UTT |
| `shifokor-munosabati` | Doctor or staff behavior and communication | shifokor, hamshira, muomala |
| `xizmat` | Administrative or service interaction | registratura, yozilish, operator |
| `narx` | Financial aspect | narx, to‘lov, arzon/qimmat |
| `kutish-vaqti` | Waiting time or delay | navbat, kechikish, tezkorlik |
| `infratuzilma` | Facility or physical conditions | palata, joylashuv, tozalik, sovuq |
| `parhez` | Diet or regimen | ovqat, rejim, parhez |
### Clinical Risk and Linguistic Cues
| Field | Values | Meaning |
|---|---|---|
| `adr_flag` | `0`, `1` | Whether an adverse drug reaction is present |
| `severity` | `engil`, `o‘rta`, `og‘ir`, `null` | Severity of the adverse event |
| `negation` | `0`, `1` | Negation is present |
| `speculation` | `0`, `1` | Speculation or uncertainty is present |
| `sarcasm` | `0`, `1` | Sarcasm or irony is present |
| `cue_span` | free text | Trigger phrase such as `ehtimol`, `hech qanday`, `yo‘q` |
## Statistics
### Overview
- Rows: **4,791**
- Columns: **13**
- Split: **train** only
- Average text length: **132.12** characters
- Median text length: **96** characters
- Maximum text length: **1,549** characters
- Non-empty `cue_span`: **3,663**
### Sentiment Distribution
| Label | Count |
|---|---:|
| `NEG` | 1,970 |
| `NEU` | 1,783 |
| `POS` | 1,038 |
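
The label counts are uneven, so a weighted loss may help. As a rough illustration, the sketch below derives inverse-frequency class weights from the counts in the table above, using the same `n_samples / (n_classes * count)` heuristic that scikit-learn calls "balanced". Whether weighting improves results on this dataset is not something the card states.

```python
# Counts copied from the sentiment distribution table above.
counts = {"NEG": 1970, "NEU": 1783, "POS": 1038}

total = sum(counts.values())  # 4,791 rows
num_classes = len(counts)

# "Balanced" weighting: a class with exactly average frequency gets weight 1.0,
# rarer classes get larger weights.
weights = {label: total / (num_classes * n) for label, n in counts.items()}

for label, weight in weights.items():
    share = counts[label] / total
    print(f"{label}: share={share:.3f} weight={weight:.3f}")
```

Under this scheme `POS`, at about 21.7% of rows, receives a weight of roughly 1.54, while `NEG` and `NEU` fall below 1.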
### Aspect Distribution
| Aspect | Count |
|---|---:|
| `simptom` | 1,319 |
| `muolaja` | 978 |
| `diagnostika` | 626 |
| `dori` | 541 |
| `xizmat` | 469 |
| `shifokor-munosabati` | 424 |
| `infratuzilma` | 197 |
| `narx` | 104 |
| `kutish-vaqti` | 77 |
| `parhez` | 53 |
### Source Distribution
| Source | Count |
|---|---:|
| `forum` | 3,688 |
| `telegram` | 927 |
| `instagram` | 89 |
| `web` | 74 |
| `web-komment` | 13 |
### Additional Annotation Counts
| Field | Positive / non-default value | Count |
|---|---|---:|
| `adr_flag` | `1` | 158 |
| `negation` | `1` | 816 |
| `speculation` | `1` | 1,321 |
| `sarcasm` | `1` | 48 |
## Notes on the Current Release
- No official train/validation/test partition is provided; all rows are exposed as a single `train` split.
- The `id` column is normalized to sequential unique values from `1` to `4,791`.
- Text content was whitespace-normalized in the current release: tab, carriage return, and newline characters were replaced with spaces, and repeated whitespace was collapsed.
- The `lang` column is heterogeneous in the released TSV and contains both `uz` and `uz-latin`.
- The official aspect tagset contains 10 labels, but the released TSV also includes three outlier `aspect` values: `Simptom`, `ijtimoiy`, and `adr_flag`.
- The official `severity` tagset is `engil`, `o‘rta`, `og‘ir`, and `null`, but a small number of rows contain `jiddiy`, `yuqori`, `0`, and `o'rta` (the last written with an ASCII apostrophe instead of `‘`).
- One row still uses a Unicode minus sign in `polarity_score` (`−1`) instead of ASCII `-1`.
These inconsistencies are minor in count, but users may want to normalize labels before training or evaluation.
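
One possible normalization pass is sketched below. It repairs only the variants whose intended value is unambiguous from the notes above (the Unicode minus sign, the capitalized `Simptom`, and the ASCII-apostrophe `o'rta`). The remaining off-tagset values (`ijtimoiy`, `adr_flag`, `jiddiy`, `yuqori`, `0`) are reported rather than remapped, because the card does not say which official label they correspond to. The function name `normalize_row` is hypothetical.

```python
# Official tagsets as listed in this card; \u2018 is the ‘ used in Uzbek Latin.
ASPECTS = {
    "dori", "simptom", "muolaja", "diagnostika", "shifokor-munosabati",
    "xizmat", "narx", "kutish-vaqti", "infratuzilma", "parhez",
}
SEVERITIES = {"engil", "o\u2018rta", "og\u2018ir", "null"}


def normalize_row(row):
    """Return (clean_row, issues) for one row given as a dict of strings."""
    clean = dict(row)
    issues = []

    # Unicode minus (U+2212) -> ASCII hyphen-minus, then parse as int.
    clean["polarity_score"] = int(row["polarity_score"].replace("\u2212", "-"))

    # Case-only variant such as `Simptom` -> `simptom`.
    aspect = row["aspect"].strip().lower()
    clean["aspect"] = aspect
    if aspect not in ASPECTS:
        issues.append(f"aspect={aspect!r}")

    # ASCII apostrophe variant `o'rta` -> `o‘rta`.
    severity = row["severity"].strip().replace("'", "\u2018")
    clean["severity"] = severity
    if severity not in SEVERITIES:
        issues.append(f"severity={severity!r}")

    return clean, issues


# Illustrative inputs that reproduce the variants named in the notes.
fixable = {"polarity_score": "\u22121", "aspect": "Simptom", "severity": "o'rta"}
unmapped = {"polarity_score": "0", "aspect": "ijtimoiy", "severity": "jiddiy"}
print(normalize_row(fixable))
print(normalize_row(unmapped)[1])
```

Rows that come back with a non-empty `issues` list can then be dropped or relabeled by hand, depending on the task.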
## Usage
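```python
from datasets import load_dataset

# Load the TSV with the generic CSV builder; "utf-8-sig" strips a leading BOM if present.
dataset = load_dataset(
    "csv",
    data_files={"train": "UzMedSentiment.tsv"},
    delimiter="\t",
    encoding="utf-8-sig",
)

print(dataset["train"][0])  # first row as a dict keyed by column name
```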
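
Since the release ships a single split, evaluation needs a held-out partition. The following is a minimal sketch of a stratified 80/10/10 split in plain Python. The fractions, the seed, and the helper name `stratified_split` are arbitrary choices for illustration, not part of the release, and the stand-in rows only mirror the published label counts.

```python
import random
from collections import defaultdict


def stratified_split(rows, key="sentiment", fractions=(0.8, 0.1, 0.1), seed=13):
    """Split rows into train/validation/test, preserving the label mix of `key`."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[key]].append(row)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    splits = {"train": [], "validation": [], "test": []}
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_train = int(len(group) * fractions[0])
        n_val = int(len(group) * fractions[1])
        splits["train"] += group[:n_train]
        splits["validation"] += group[n_train:n_train + n_val]
        splits["test"] += group[n_train + n_val:]  # remainder goes to test
    return splits


# Stand-in rows that only reproduce the published sentiment counts.
labels = ["NEG"] * 1970 + ["NEU"] * 1783 + ["POS"] * 1038
rows = [{"id": i, "sentiment": s} for i, s in enumerate(labels, start=1)]

splits = stratified_split(rows)
print({name: len(part) for name, part in splits.items()})
```

With the real data, `rows` can be built as `list(dataset["train"])`. Stratifying on `aspect` instead is possible by passing `key="aspect"`, although the rarest aspects would then contribute only a handful of test rows.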
## License
This dataset is released under the `CC-BY-4.0` license.
Provider:
uznlp-uz



