LHRS-UM-FERI/MENTHOS-dataset-sid
Source: Hugging Face. Updated 2026-04-05; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/LHRS-UM-FERI/MENTHOS-dataset-sid
Description:
---
language:
- en
- sl
tags:
- menthos
- sid
- multi-label-classification
size_categories:
- 1K<n<10K
---
# MENTHOS-dataset-sid
## English
### About
MENTHOS-SID is a multi-label text classification dataset for Sensitive Information Detection (SID). Each sample can contain zero, one, or multiple sensitive information types.
### Source Data
- https://media.githubusercontent.com/media/nv-morpheus/Morpheus/refs/heads/branch-25.10/models/datasets/training-data/sid-sample-training-data.csv
- https://media.githubusercontent.com/media/nv-morpheus/Morpheus/refs/heads/branch-25.10/models/datasets/validation-data/sid-validation-data.csv
### Processing
- Input files are concatenated.
- Split into 70% train, 15% validation, 15% test.
- No class balancing/downsampling step is applied.
This means the label frequencies remain naturally uneven across the multi-label targets.
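
The processing steps above can be sketched as follows. This is a minimal illustration, not the authors' script: the shuffle, the seed, the rounding rule, and the `split_rows` helper are assumptions, and the stand-in rows replace the real contents of the two source CSV files.

```python
import random


def split_rows(rows, seed=0, train_frac=0.70, val_frac=0.15):
    """Shuffle rows and cut them into train/validation/test partitions.

    Hypothetical helper: the dataset card does not say how the split was
    drawn, so the seeded shuffle here is only one reasonable choice.
    """
    shuffled = list(rows)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder, about 15%
    return train, val, test


# Stand-ins for the rows of the two source files (invented for illustration).
training_rows = [{"data": f"training sample {i}"} for i in range(2000)]
validation_rows = [{"data": f"validation sample {i}"} for i in range(2000)]

# Step 1: concatenate the input files. Step 2: split 70/15/15.
rows = training_rows + validation_rows
train, val, test = split_rows(rows)
print(len(train), len(val), len(test))
```

With 4,000 rows in total, these fractions give partitions of 2,800, 600, and 600 rows, matching the split sizes reported in the label-count table. No resampling happens at any point, so label frequencies are whatever the shuffle produces.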
### Schema
- Text column: `data`
- Multi-label columns:
  - `si_address`
  - `si_bank_acct`
  - `si_credit_card`
  - `si_email`
  - `si_govt_id`
  - `si_name`
  - `si_password`
  - `si_phone_num`
  - `si_secret_keys`
  - `si_user`
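
As a rough sketch of how this schema maps onto a multi-label target, the snippet below turns one row into a text string and a 10-element indicator vector. It assumes each label cell holds `0` or `1` (as text when read from CSV), which the card does not state explicitly. The `row_to_example` helper and the sample row are invented for illustration.

```python
# Label columns in the order listed in the schema.
LABEL_COLUMNS = [
    "si_address",
    "si_bank_acct",
    "si_credit_card",
    "si_email",
    "si_govt_id",
    "si_name",
    "si_password",
    "si_phone_num",
    "si_secret_keys",
    "si_user",
]


def row_to_example(row):
    """Return (text, targets) where targets is a 0/1 list over LABEL_COLUMNS.

    Assumption: label cells are encoded as 0/1 values or "0"/"1" strings.
    """
    text = row["data"]
    targets = [int(row[col]) for col in LABEL_COLUMNS]
    return text, targets


# Hypothetical row carrying two sensitive information types at once.
row = {"data": "Reach me at jane@example.com or 555-0100."}
row.update({col: "0" for col in LABEL_COLUMNS})
row["si_email"] = "1"
row["si_phone_num"] = "1"

text, targets = row_to_example(row)
active = [col for col, flag in zip(LABEL_COLUMNS, targets) if flag]
print(targets)
print(active)
```

A row with no sensitive content yields an all-zero vector, which is the "zero types" case mentioned in the dataset description.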
### Splits and Label Counts
The prepared train, validation, and test splits are included with the dataset release. The table below gives the number of positive labels per split.
| label | train (n=2800) | val (n=600) | test (n=600) |
| -------------- | -------------: | ----------: | -----------: |
| si_address | 263 | 68 | 63 |
| si_bank_acct | 285 | 62 | 43 |
| si_credit_card | 269 | 67 | 67 |
| si_email | 288 | 41 | 60 |
| si_govt_id | 252 | 56 | 83 |
| si_name | 306 | 59 | 43 |
| si_password | 304 | 60 | 72 |
| si_phone_num | 282 | 71 | 59 |
| si_secret_keys | 277 | 56 | 65 |
| si_user | 269 | 52 | 58 |
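
To make the imbalance easier to read, the counts can be converted to per-split positive rates. The sketch below copies the numbers from the table and assumes, as the Slovenian section states, that they count positive labels per split. The `positive_rates` helper is hypothetical.

```python
# Split sizes and positive counts copied from the table above,
# ordered as (train, val, test).
SPLIT_SIZES = {"train": 2800, "val": 600, "test": 600}
POSITIVE_COUNTS = {
    "si_address": (263, 68, 63),
    "si_bank_acct": (285, 62, 43),
    "si_credit_card": (269, 67, 67),
    "si_email": (288, 41, 60),
    "si_govt_id": (252, 56, 83),
    "si_name": (306, 59, 43),
    "si_password": (304, 60, 72),
    "si_phone_num": (282, 71, 59),
    "si_secret_keys": (277, 56, 65),
    "si_user": (269, 52, 58),
}


def positive_rates(counts, sizes):
    """Map each label to {split: fraction of samples carrying that label}."""
    split_names = list(sizes)
    return {
        label: {name: n / sizes[name] for name, n in zip(split_names, row)}
        for label, row in counts.items()
    }


rates = positive_rates(POSITIVE_COUNTS, SPLIT_SIZES)
for label, by_split in rates.items():
    cells = "  ".join(f"{name}={rate:.1%}" for name, rate in by_split.items())
    print(f"{label:<15} {cells}")
```

Since every label is positive in only a minority of samples, metrics that account for per-label imbalance (for example, per-label F1) are likely to be more informative than plain accuracy.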
### Citation
```
@misc{borovic_li-dobnik_kranjec_ferme_2026,
title = {MENTHOS-dataset-sid},
author = {Borovic and {Li Dobnik} and Kranjec and Ferme},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/datasets/LHRS-UM-FERI/MENTHOS-dataset-sid}}
}
```
---
## Slovenščina
### O datasetu
MENTHOS-SID je multi-label dataset za detekcijo občutljivih informacij (SID). Posamezen vzorec lahko vsebuje nič, eno ali več vrst občutljivih podatkov.
### Izvorni podatki
- https://media.githubusercontent.com/media/nv-morpheus/Morpheus/refs/heads/branch-25.10/models/datasets/training-data/sid-sample-training-data.csv
- https://media.githubusercontent.com/media/nv-morpheus/Morpheus/refs/heads/branch-25.10/models/datasets/validation-data/sid-validation-data.csv
### Obdelava
- Vhodni datoteki se združita.
- Razdelitev 70% train, 15% validation, 15% test.
- Posebnega uravnoteženja razredov (downsampling) ni.
To pomeni, da frekvence oznak ostanejo naravno neenakomerne pri multi-label ciljih.
### Shema
- Besedilni stolpec: `data`
- Multi-label stolpci:
  - `si_address`
  - `si_bank_acct`
  - `si_credit_card`
  - `si_email`
  - `si_govt_id`
  - `si_name`
  - `si_password`
  - `si_phone_num`
  - `si_secret_keys`
  - `si_user`
### Delitve in število oznak
Pripravljene train, validation in test delitve so vključene v izdajo nabora podatkov.
| label | train (n=2800) | val (n=600) | test (n=600) |
| -------------- | -------------: | ----------: | -----------: |
| si_address | 263 | 68 | 63 |
| si_bank_acct | 285 | 62 | 43 |
| si_credit_card | 269 | 67 | 67 |
| si_email | 288 | 41 | 60 |
| si_govt_id | 252 | 56 | 83 |
| si_name | 306 | 59 | 43 |
| si_password | 304 | 60 | 72 |
| si_phone_num | 282 | 71 | 59 |
| si_secret_keys | 277 | 56 | 65 |
| si_user | 269 | 52 | 58 |
Tabela zgoraj prikazuje število pozitivnih oznak po splitih.
### Citiranje
```
@misc{borovic_li-dobnik_kranjec_ferme_2026,
title = {MENTHOS-dataset-sid},
author = {Borovic and {Li Dobnik} and Kranjec and Ferme},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/datasets/LHRS-UM-FERI/MENTHOS-dataset-sid}}
}
```
Provider:
LHRS-UM-FERI



