arkamath2026/sms_spam
Source: Hugging Face. Updated 2026-03-26; indexed 2026-03-29.
Download link: https://hf-mirror.com/datasets/arkamath2026/sms_spam
Resource description:
---
annotations_creators:
- crowdsourced
- found
language_creators:
- crowdsourced
- found
language:
- en
license:
- unknown
multilinguality:
- monolingual
size_categories:
- 1K<n<10K
source_datasets:
- extended|other-nus-sms-corpus
task_categories:
- text-classification
task_ids:
- intent-classification
paperswithcode_id: sms-spam-collection-data-set
pretty_name: SMS Spam Collection Data Set
dataset_info:
  config_name: plain_text
  features:
  - name: sms
    dtype: string
  - name: label
    dtype:
      class_label:
        names:
          '0': ham
          '1': spam
  splits:
  - name: train
    num_bytes: 521752
    num_examples: 5574
  download_size: 358869
  dataset_size: 521752
configs:
- config_name: plain_text
  data_files:
  - split: train
    path: plain_text/train-*
  default: true
train-eval-index:
- config: plain_text
  task: text-classification
  task_id: binary_classification
  splits:
    train_split: train
  col_mapping:
    sms: text
    label: target
  metrics:
  - type: accuracy
    name: Accuracy
  - type: f1
    name: F1 macro
    args:
      average: macro
  - type: f1
    name: F1 micro
    args:
      average: micro
  - type: f1
    name: F1 weighted
    args:
      average: weighted
  - type: precision
    name: Precision macro
    args:
      average: macro
  - type: precision
    name: Precision micro
    args:
      average: micro
  - type: precision
    name: Precision weighted
    args:
      average: weighted
  - type: recall
    name: Recall macro
    args:
      average: macro
  - type: recall
    name: Recall micro
    args:
      average: micro
  - type: recall
    name: Recall weighted
    args:
      average: weighted
---
# Dataset Card for SMS Spam Collection Data Set
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
- **Repository:**
- **Paper:** Almeida, T.A., Gomez Hidalgo, J.M., Yamakami, A. Contributions to the study of SMS Spam Filtering: New Collection and Results. Proceedings of the 2011 ACM Symposium on Document Engineering (ACM DOCENG'11), Mountain View, CA, USA, 2011.
- **Leaderboard:**
- **Point of Contact:**
### Dataset Summary
The SMS Spam Collection v.1 is a public set of labeled SMS messages collected for mobile phone spam research.
It consists of a single collection of 5,574 real, non-encoded English messages, each tagged as either legitimate (ham) or spam.
### Supported Tasks and Leaderboards
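According to the card metadata, the dataset supports text classification, evaluated as binary classification (ham vs. spam) with accuracy and macro-, micro-, and weighted-averaged F1, precision, and recall. No leaderboard is listed.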
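
The `train-eval-index` metadata of this card requests F1 under three averaging modes (`macro`, `micro`, `weighted`). The sketch below shows how those three averages differ on a two-class problem. It is a minimal pure-Python illustration, not the evaluation code used by any particular library; the function name `f1_scores` and the toy label lists are assumptions made for the example and are not taken from the dataset.

```python
from collections import Counter


def f1_scores(y_true, y_pred, labels=(0, 1)):
    """Return macro, micro, and weighted F1 for integer class labels.

    Hypothetical helper for illustration only.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1          # correct prediction for class t
        else:
            fp[p] += 1          # predicted class p wrongly
            fn[t] += 1          # missed the true class t

    def f1(tp_c, fp_c, fn_c):
        denom = 2 * tp_c + fp_c + fn_c
        return 2 * tp_c / denom if denom else 0.0

    per_class = {c: f1(tp[c], fp[c], fn[c]) for c in labels}
    support = Counter(y_true)   # number of true examples per class

    # macro: unweighted mean of per-class F1
    macro = sum(per_class.values()) / len(labels)
    # micro: F1 computed from counts pooled over all classes
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # weighted: per-class F1 weighted by class support
    weighted = sum(per_class[c] * support[c] for c in labels) / len(y_true)
    return {"macro": macro, "micro": micro, "weighted": weighted}


# Toy labels only (0 = ham, 1 = spam); not drawn from the dataset.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
scores = f1_scores(y_true, y_pred)
print(scores)
```

On imbalanced data the macro average is the most sensitive to performance on the minority class, since every class counts equally regardless of its size.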
### Languages
English
## Dataset Structure
### Data Instances
[More Information Needed]
### Data Fields
- `sms`: the text of the SMS message (string)
- `label`: class label indicating whether the message is ham (`0`, not spam) or spam (`1`)
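
The card metadata declares `label` as a class label with names `'0': ham` and `'1': spam`. A minimal sketch of converting between the integer ids and the names follows. The helper names `decode_label` and `encode_label` are hypothetical, and the two records are invented for illustration rather than taken from the dataset.

```python
# Index in this list = integer label id, following the card's class_label names.
LABEL_NAMES = ["ham", "spam"]


def decode_label(label_id: int) -> str:
    """Map an integer label id to its name (hypothetical helper)."""
    if not 0 <= label_id < len(LABEL_NAMES):
        raise ValueError(f"unknown label id: {label_id}")
    return LABEL_NAMES[label_id]


def encode_label(name: str) -> int:
    """Map a label name back to its integer id (hypothetical helper)."""
    return LABEL_NAMES.index(name)


# Invented records with the two documented fields; not real dataset rows.
records = [
    {"sms": "Are we still meeting for lunch today?", "label": 0},
    {"sms": "You have won a prize! Reply now to claim.", "label": 1},
]
decoded = [decode_label(r["label"]) for r in records]
print(decoded)
```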
### Data Splits
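According to the card metadata, the dataset has a single `train` split with 5,574 examples (521,752 bytes). No validation or test split is defined.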
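
The card metadata lists only a `train` split, so anyone who wants a held-out evaluation set has to create one. One common option, shown here as a sketch and not as a split prescribed by the dataset authors, is a stratified split that keeps the ham/spam ratio the same in both parts. The function name `stratified_split`, the 20% test fraction, the seed, and the synthetic records are all assumptions for illustration.

```python
import random
from collections import defaultdict


def stratified_split(records, test_fraction=0.2, seed=0):
    """Split records into (train, test), preserving per-label proportions.

    Hypothetical helper; expects dicts with a "label" key.
    """
    rng = random.Random(seed)           # local RNG for reproducibility
    by_label = defaultdict(list)
    for record in records:
        by_label[record["label"]].append(record)

    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label][:]      # copy so the input is not mutated
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])

    rng.shuffle(train)                  # mix the classes back together
    rng.shuffle(test)
    return train, test


# Synthetic placeholder records, not real dataset rows.
records = (
    [{"sms": f"ham message {i}", "label": 0} for i in range(80)]
    + [{"sms": f"spam message {i}", "label": 1} for i in range(20)]
)
train, test = stratified_split(records, test_fraction=0.2, seed=0)
print(len(train), len(test))
```

Fixing the seed keeps the split reproducible across runs, which matters when comparing models on the same held-out set.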
## Dataset Creation
### Curation Rationale
[More Information Needed]
### Source Data
#### Initial Data Collection and Normalization
[More Information Needed]
#### Who are the source language producers?
[More Information Needed]
### Annotations
#### Annotation process
[More Information Needed]
#### Who are the annotators?
[More Information Needed]
### Personal and Sensitive Information
[More Information Needed]
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed]
### Discussion of Biases
[More Information Needed]
### Other Known Limitations
[More Information Needed]
## Additional Information
### Dataset Curators
[More Information Needed]
### Licensing Information
[More Information Needed]
### Citation Information
@inproceedings{Almeida2011SpamFiltering,
title={Contributions to the Study of SMS Spam Filtering: New Collection and Results},
author={Tiago A. Almeida and Jose Maria Gomez Hidalgo and Akebo Yamakami},
year={2011},
booktitle = "Proceedings of the 2011 ACM Symposium on Document Engineering (DOCENG'11)",
}
### Contributions
Thanks to [@czabo](https://github.com/czabo) for adding this dataset.
Provider: arkamath2026



