tanaos/synthetic-spam-detection-dataset-italian

Name: tanaos/synthetic-spam-detection-dataset-italian
Creator: tanaos
Published: 2026-03-28 18:54:51
License: MIT

Source: Hugging Face. Updated 2026-03-28; indexed 2026-04-05.

Download link:

https://hf-mirror.com/datasets/tanaos/synthetic-spam-detection-dataset-italian


Description:

---
language:
- it
license: mit
tags:
- spam-detection
- text-classification
- content-moderation
- synthetic-data
- tanaos
pretty_name: tanaos-spam-detection-italian Training Dataset
task_categories:
- text-classification
task_ids:
- acceptability-classification
size_categories:
- 10K<n<20K
---

<p align="center">
  <img src="https://raw.githubusercontent.com/tanaos/.github/master/assets/logo.png" width="250px" alt="Tanaos – Train task specific LLMs without training data, for offline NLP and Text Classification">
</p>

# Tanaos Spam Detection Italian Training Dataset

This dataset was created synthetically by Tanaos with the [Artifex](https://github.com/tanaos/artifex) Python library.

The dataset is designed to **train and evaluate spam detection systems** for Italian text: models that detect, classify, or filter unsolicited commercial advertisements, fraudulent messages, or other unwanted content.

Our Italian spam detection model, [tanaos-spam-detection-italian](https://huggingface.co/tanaos/tanaos-spam-detection-italian), was trained on this dataset.

## Dataset Summary

The dataset contains text samples labeled as either `0` (`not_spam`) or `1` (`spam`). The following categories are considered spam:

1. Unsolicited commercial advertisement or non-commercial proselytizing.
2. Fraudulent schemes, including get-rich-quick and pyramid schemes.
3. Phishing attempts, unrealistic offers or announcements.
4. Content with deceptive or misleading information.
5. Malware or harmful links.
6. Adult content or explicit material.
7. Excessive use of capitalization or punctuation to grab attention.

---

## How to Use

```python
from datasets import load_dataset

# Download the dataset from the Hugging Face Hub (cached after the first call).
dataset = load_dataset("tanaos/synthetic-spam-detection-dataset-italian")
# Inspect the first training example.
print(dataset["train"][0])
```

## Intended Use

This dataset is intended for training and evaluating spam detection models. Common use cases:

- Training machine learning models to classify text messages as spam or not spam.
- Evaluating the performance of spam detection algorithms.
- Fine-tuning pre-trained language models for spam detection tasks.
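For the evaluation use case, a minimal sketch of scoring a classifier against this dataset's label convention (`0` = `not_spam`, `1` = `spam`) might look like the following. The function name `spam_metrics`, the `ID2LABEL` mapping, and the toy label lists are illustrative assumptions, not part of the dataset or of the Artifex library. In practice the gold labels would come from the loaded dataset, whose column names are not documented here.

```python
from typing import Dict, Sequence

# Label convention taken from the dataset card: 0 = not_spam, 1 = spam.
ID2LABEL = {0: "not_spam", 1: "spam"}


def spam_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Hypothetical helper: accuracy, precision, recall and F1 for the spam class (label 1)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one example is required")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # legitimate text flagged as spam
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam that slipped through
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # legitimate text left alone

    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / len(pairs)

    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Toy labels for illustration only; not drawn from the dataset.
    gold = [1, 0, 1, 1, 0, 0]
    pred = [1, 0, 0, 1, 1, 0]
    for name, value in spam_metrics(gold, pred).items():
        print(f"{name}: {value:.3f}")
```

Reporting precision and recall separately is useful for spam filtering because the two error types have different costs: a false positive hides a legitimate message, while a false negative lets spam through.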

Provider:

tanaos
