LHRS-UM-FERI/MENTHOS-dataset-sid

Hugging Face · Updated 2026-04-05 · Indexed 2026-04-12
Download link: https://hf-mirror.com/datasets/LHRS-UM-FERI/MENTHOS-dataset-sid
Description:
---
language:
- en
- sl
tags:
- menthos
- sid
- multi-label-classification
size_categories:
- 1K<n<10K
---

# MENTHOS-dataset-sid

## English

### About

MENTHOS-SID is a multi-label text classification dataset for Sensitive Information Detection (SID). Each sample can contain zero, one, or multiple types of sensitive information.

### Source Data

- https://media.githubusercontent.com/media/nv-morpheus/Morpheus/refs/heads/branch-25.10/models/datasets/training-data/sid-sample-training-data.csv
- https://media.githubusercontent.com/media/nv-morpheus/Morpheus/refs/heads/branch-25.10/models/datasets/validation-data/sid-validation-data.csv

### Processing

- The two input files are concatenated.
- The combined data is split into 70% train, 15% validation, and 15% test.
- No class balancing or downsampling is applied, so label frequencies remain naturally uneven across the multi-label targets.

### Schema

- Text column: `data`
- Multi-label columns:
  - `si_address`
  - `si_bank_acct`
  - `si_credit_card`
  - `si_email`
  - `si_govt_id`
  - `si_name`
  - `si_password`
  - `si_phone_num`
  - `si_secret_keys`
  - `si_user`

### Splits and Label Counts

The prepared train, validation, and test splits are included with the dataset release. The table below shows the number of positive labels per split.

| label          | train (n=2800) | val (n=600) | test (n=600) |
| -------------- | -------------: | ----------: | -----------: |
| si_address     |            263 |          68 |           63 |
| si_bank_acct   |            285 |          62 |           43 |
| si_credit_card |            269 |          67 |           67 |
| si_email       |            288 |          41 |           60 |
| si_govt_id     |            252 |          56 |           83 |
| si_name        |            306 |          59 |           43 |
| si_password    |            304 |          60 |           72 |
| si_phone_num   |            282 |          71 |           59 |
| si_secret_keys |            277 |          56 |           65 |
| si_user        |            269 |          52 |           58 |

### Citation

```
@misc{borovic_li-dobnik_kranjec_ferme_2026,
  title = {MENTHOS-dataset-sid},
  author = {Borovic, Li Dobnik, Kranjec, Ferme},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/datasets/LHRS-UM-FERI/MENTHOS-dataset-sid}}
}
```
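
### Example: split sizes and per-label counts

The processing steps and schema described in this card can be illustrated with a minimal sketch. It is not the authors' pipeline: the card does not state whether the data was shuffled, which seed was used, or how the label columns are encoded, so the shuffling, the `seed` parameter, and the 0/1 label encoding below are assumptions. The helpers `make_toy_rows`, `split_rows`, and `label_counts` are hypothetical names, and the rows are synthetic stand-ins, not samples from the dataset.

```python
import random
from collections import Counter

# Label columns as listed in the Schema section of the card.
LABEL_COLUMNS = [
    "si_address", "si_bank_acct", "si_credit_card", "si_email", "si_govt_id",
    "si_name", "si_password", "si_phone_num", "si_secret_keys", "si_user",
]


def make_toy_rows(n, seed=0):
    """Build synthetic rows with the card's schema (assumed 0/1 labels)."""
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        row = {"data": f"sample text {i}"}
        for label in LABEL_COLUMNS:
            # Each label is set independently, so a row may carry
            # zero, one, or several positive labels.
            row[label] = 1 if rng.random() < 0.1 else 0
        rows.append(row)
    return rows


def split_rows(rows, seed=0):
    """Shuffle and split 70/15/15; shuffling and seed are assumptions."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids floating-point rounding in the split sizes.
    n_train = len(shuffled) * 70 // 100
    n_val = len(shuffled) * 15 // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


def label_counts(rows):
    """Count positive rows per label column."""
    counts = Counter({label: 0 for label in LABEL_COLUMNS})
    for row in rows:
        for label in LABEL_COLUMNS:
            counts[label] += int(row[label])
    return counts


rows = make_toy_rows(4000)
train, val, test = split_rows(rows)
print(len(train), len(val), len(test))  # 2800 600 600

# Per-label positive counts and rates for the toy train split.
train_counts = label_counts(train)
for label in LABEL_COLUMNS:
    rate = train_counts[label] / len(train)
    print(f"{label:15s} {train_counts[label]:5d} {rate:.3f}")
```

With 4000 rows, a 70/15/15 split gives the 2800/600/600 sizes reported in the table. Because no balancing is applied, per-label counts differ between splits, which is why computing them per split (as `label_counts` does) is useful before choosing metrics or loss weights.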
Provider: LHRS-UM-FERI