noumenon-labs/WORM
Source: Hugging Face · Updated 2026-02-20 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/noumenon-labs/WORM
Description:
---
license: apache-2.0
language:
- en
task_categories:
- text-classification
tags:
- classification
- ai
- detection
- human
size_categories:
- 1M<n<10M
---
# 🐛 WORM Dataset
**Wait, Original or Machine?**
*A large-scale dataset for AI text detection.*
WORM stands for **Wait, Original or Machine?**
It also plays on *worm* — the food caught by the Earlybird classifier.
Built for binary AI-text detection at scale.
---
## Overview
* **Task:** Binary classification
* **Goal:** Detect Human vs AI text
* **Total documents:** 2,046,995
* **Format:** CSV
* **Columns:** `text`, `label`
---
## File Structure
| Column | Type | Description |
| ------ | ------ | ---------------------- |
| text | string | Raw text sample |
| label | int | 0 = Human, 1 = Machine |
### Labels
* `0` → Original (Human-written)
* `1` → Machine (AI-generated)
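A quick sanity check after loading can confirm the two-column schema and show the class balance. The sketch below is a minimal example, assuming the file has a header row with `text` and `label`; the in-memory CSV and its rows are hypothetical stand-ins for `worm.csv`.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for worm.csv (hypothetical rows, same two columns).
csv_text = io.StringIO(
    "text,label\n"
    '"A short human-written sample.",0\n'
    '"A short machine-generated sample.",1\n'
    '"Another human-written sample.",0\n'
)
df = pd.read_csv(csv_text, dtype={"text": "string", "label": "int8"})

# Every label should be 0 (human) or 1 (machine).
labels_valid = bool(df["label"].isin([0, 1]).all())

# Share of each class, useful for spotting imbalance before training.
label_share = df["label"].value_counts(normalize=True).sort_index()
print(labels_valid)
print(label_share)
```

For the real file, replace `csv_text` with the path to the downloaded CSV.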
---
## Token Length Statistics
* **Minimum:** 12
* **Average:** 372
* **90% under:** 653
* **95% under:** 776
* **99% under:** 1118
* **Max:** 4780
### Training Notes
* `max_length=512` → safer, lower memory
* `max_length=776` → covers 95% of samples
* On Colab T4, reduce batch size (2–4) if using 776
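To see what a given `max_length` costs on your own tokenizer, you can measure how many samples fit without truncation. This is a rough sketch, assuming token counts have already been computed per sample; the card does not name the tokenizer behind its statistics, and the `lengths` values below are hypothetical, not taken from WORM.

```python
import numpy as np

def truncation_coverage(token_lengths, max_length):
    """Fraction of samples whose token count fits within max_length."""
    lengths = np.asarray(token_lengths)
    return float((lengths <= max_length).mean())

# Hypothetical token counts for ten samples.
lengths = [12, 200, 372, 500, 653, 776, 900, 1118, 2000, 4780]

for max_length in (512, 776):
    print(max_length, truncation_coverage(lengths, max_length))
```

Samples longer than `max_length` are not lost; they are truncated, so the coverage figure only tells you how many samples are seen in full.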
---
# Data Preparation Guide
Below are optional preprocessing steps.
Use them carefully. Some cleaning choices may affect detection signals.
---
## 1️⃣ Normalize Quotation Marks (Optional)
AI text often uses curly quotes:
* “ ”
* ‘ ’
You may convert them to straight quotes:
* "
* '
### Why?
Curly quotes can leak formatting patterns that models may overfit on.
### Example
```python
import pandas as pd
df = pd.read_csv("worm.csv")
df["text"] = (
df["text"]
.str.replace("“", '"', regex=False)
.str.replace("”", '"', regex=False)
.str.replace("‘", "'", regex=False)
.str.replace("’", "'", regex=False)
)
```
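Before normalizing, it may be worth checking whether curly quotes actually differ between the two classes in your copy of the data. The following is a minimal sketch on hypothetical rows; only the `text` and `label` column names come from the card.

```python
import pandas as pd

# Hypothetical rows standing in for WORM (0 = human, 1 = machine).
df = pd.DataFrame({
    "text": ["He said “hi”.", 'She said "hi".', "It’s fine.", "It's fine."],
    "label": [1, 0, 1, 0],
})

# True where the text contains at least one curly quote character.
has_curly = df["text"].str.contains(r"[“”‘’]", regex=True, na=False)

# Fraction of rows with curly quotes, per label.
curly_rate = has_curly.groupby(df["label"]).mean()
print(curly_rate)
```

A large gap between the two rates suggests a classifier could lean on quote style alone, which is the case where normalization is most likely to matter.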
---
## 2️⃣ Remove Rows Starting With Special Characters
Some rows may begin with symbols such as:
* `#`
* `*`
* `_`
* unusual unicode characters
These can be formatting artifacts.
### Remove rows where text starts with non-alphanumeric characters:
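```python
# Keep only rows whose first character is an ASCII letter or digit.
# na=False also drops rows with missing text.
df = df[df["text"].str.match(r"^[A-Za-z0-9]", na=False)]
```
This keeps rows that start with an ASCII letter or digit.
Rows that begin with whitespace, a quotation mark, or a non-ASCII letter are dropped too, so trim whitespace first and adjust the regex if needed.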
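As one possible adjustment, the pattern can be widened to accept any Unicode letter or digit, so that text starting with an accented letter is not discarded. This is a sketch, not part of the original card; the helper name `starts_alphanumeric` and the sample rows are hypothetical.

```python
import pandas as pd

def starts_alphanumeric(texts: pd.Series) -> pd.Series:
    """Boolean mask: True where the text starts with a Unicode letter or digit."""
    # [^\W_] matches any word character except the underscore;
    # \w is Unicode-aware for str patterns in Python 3.
    return texts.str.match(r"^[^\W_]", na=False)

sample = pd.DataFrame({
    "text": ["Émile wrote this.", "# Heading", "_italic_ start", "42 is a number.", None],
})
kept = sample[starts_alphanumeric(sample["text"])]
print(kept)
```

Rows starting with `#`, `*`, or `_` are still removed, and missing values are dropped because of `na=False`.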
---
## 3️⃣ Deduplicate Text Samples
Duplicate rows can bias training.
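### Exact Deduplication
```python
rows_before = len(df)
# Keep the first occurrence of each distinct text
df = df.drop_duplicates(subset="text")
```
### Check how many were removed
```python
print("Removed rows:", rows_before - len(df))
print("Remaining rows:", len(df))
```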
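Note that `drop_duplicates(subset="text")` keeps the first occurrence regardless of its label, so a text that appears under both labels silently keeps whichever label came first. The sketch below flags such texts before deduplicating. Dropping them entirely is one possible policy and an assumption here, not something the card prescribes; the rows are hypothetical.

```python
import pandas as pd

# Hypothetical rows: the first text appears under both labels.
df = pd.DataFrame({
    "text": [
        "Same sentence here.",
        "Same sentence here.",
        "Unique sentence.",
        "Repeated human text.",
        "Repeated human text.",
    ],
    "label": [0, 1, 1, 0, 0],
})

# Count distinct labels per text; more than one means the labels disagree.
labels_per_text = df.groupby("text")["label"].nunique()
conflicting = labels_per_text[labels_per_text > 1].index.tolist()

# Drop conflicting texts entirely, then deduplicate what is left.
deduped = df[~df["text"].isin(conflicting)].drop_duplicates(subset="text")
print(conflicting)
print(deduped)
```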
---
## 4️⃣ Trim Whitespace
```python
df["text"] = df["text"].str.strip()
```
---
## 5️⃣ Remove Very Short Samples (Optional)
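To drop samples with fewer than 12 whitespace-separated words:
```python
df = df[df["text"].str.split().str.len() >= 12]
```
This mirrors the dataset's 12-token minimum only approximately: the filter counts whitespace-separated words, and tokenizer counts may differ.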
---
# Important Note on Cleaning
Be careful not to remove stylistic signals that help detect AI.
For example:
* Over-normalizing punctuation may reduce detection accuracy.
* Removing formatting patterns may remove real signals.
* Semantic deduplication is **not recommended** if your goal is style detection.
WORM focuses on **writing style**, not topic similarity.
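One way to check whether a cleaning step risks removing signal is to compare how many rows it removes from each class. The following is a minimal sketch, assuming the `label` column from the card; the helper name `removal_rate_by_label` and the sample rows are hypothetical.

```python
import pandas as pd

def removal_rate_by_label(df: pd.DataFrame, keep_mask: pd.Series) -> pd.Series:
    """Fraction of rows each label would lose if only keep_mask rows were kept."""
    return (~keep_mask).groupby(df["label"]).mean()

# Hypothetical rows (0 = human, 1 = machine).
df = pd.DataFrame({
    "text": [
        "# Title line",
        "Plain opening sentence.",
        "* bullet item",
        "Another plain sentence.",
    ],
    "label": [1, 0, 1, 0],
})

# The filter from step 2: keep rows starting with an ASCII letter or digit.
keep = df["text"].str.match(r"^[A-Za-z0-9]", na=False)
rates = removal_rate_by_label(df, keep)
print(rates)
```

If a filter removes far more rows from one class than the other, the pattern it targets is probably correlated with the label, and removing it changes what the classifier can learn.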
---
# Example Full Cleaning Script
```python
import pandas as pd

# Load the raw dataset (columns: text, label)
df = pd.read_csv("worm.csv")
# Normalize quotes
df["text"] = (
df["text"]
.str.replace("“", '"', regex=False)
.str.replace("”", '"', regex=False)
.str.replace("‘", "'", regex=False)
.str.replace("’", "'", regex=False)
)
# Strip whitespace
df["text"] = df["text"].str.strip()
# Remove rows starting with special characters
df = df[df["text"].str.match(r"^[A-Za-z0-9]", na=False)]
# Remove short samples
df = df[df["text"].str.split().str.len() >= 12]
# Deduplicate
df = df.drop_duplicates(subset="text")
df.to_csv("worm_cleaned.csv", index=False)
```
---
## Intended Use
* Train AI detection classifiers
* Benchmark detection systems
* Research in stylometry
* Fine-tune transformer models
---
## Naming Concept
* **WORM** → *Wait, Original or Machine?*
* **Earlybird** → The model that catches the worm
Detect machine text early.
Provider: noumenon-labs



