noumenon-labs/WORM

Name: noumenon-labs/WORM
Creator: noumenon-labs
Published: 2026-02-20 18:48:20
License: Apache-2.0 (per the dataset card below)

Source: Hugging Face. Updated: 2026-02-20. Indexed: 2026-03-29.

Download link:

https://hf-mirror.com/datasets/noumenon-labs/WORM


Description:

---
license: apache-2.0
language:
- en
task_categories:
- text-classification
tags:
- classification
- ai
- detection
- human
size_categories:
- 1M<n<10M
---

# 🐛 WORM Dataset

**Wait, Original or Machine?**
*A large-scale dataset for AI text detection.*

WORM stands for **Wait, Original or Machine?** The name also plays on *worm*, the food caught by the Earlybird classifier. The dataset is built for binary AI-text detection at scale.

---

## Overview

* **Task:** Binary classification
* **Goal:** Detect Human vs AI text
* **Total documents:** 2,046,995
* **Format:** CSV
* **Columns:** `text`, `label`

---

## File Structure

| Column | Type   | Description            |
| ------ | ------ | ---------------------- |
| text   | string | Raw text sample        |
| label  | int    | 0 = Human, 1 = Machine |

### Labels

* `0` → Original (Human-written)
* `1` → Machine (AI-generated)

---

## Token Length Statistics

* **Minimum:** 12
* **Average:** 372
* **90% under:** 653
* **95% under:** 776
* **99% under:** 1118
* **Max:** 4780

### Training Notes

* `max_length=512` → safer, lower memory; truncates at least 10% of samples, since the 90th percentile is 653 tokens
* `max_length=776` → covers 95% of samples
* On Colab T4, reduce batch size (2–4) if using 776

---

# Data Preparation Guide

Below are optional preprocessing steps. Use them carefully: some cleaning choices may affect detection signals.

---

## 1️⃣ Normalize Quotation Marks (Optional)

AI text often uses curly quotes:

* “ ”
* ‘ ’

You may convert them to straight quotes:

* "
* '

### Why?

Curly quotes can leak formatting patterns that models may overfit on.

### Example

```python
import pandas as pd

df = pd.read_csv("worm.csv")

# Replace curly double and single quotes with their straight equivalents
df["text"] = (
    df["text"]
    .str.replace("“", '"', regex=False)
    .str.replace("”", '"', regex=False)
    .str.replace("‘", "'", regex=False)
    .str.replace("’", "'", regex=False)
)
```

---

## 2️⃣ Remove Rows Starting With Special Characters

Some rows may begin with symbols such as:

* `#`
* `*`
* `_`
* unusual unicode characters

These can be formatting artifacts.
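Before dropping such rows, it may be worth checking which leading characters actually occur and how they split across the two labels, since Markdown-style openings could themselves carry detection signal. The sketch below is one way to do that. The function name `leading_char_report` and the toy `sample` frame are illustrative and not part of the dataset; stripping leading whitespace before taking the first character is also an assumption.

```python
import pandas as pd


def leading_char_report(df: pd.DataFrame) -> pd.DataFrame:
    """Count rows whose first character is not an ASCII letter or digit,
    grouped by that first character (rows) and by label (columns)."""
    # Assumption: ignore leading whitespace so it is not counted as a symbol.
    first = df["text"].fillna("").str.lstrip().str[:1]
    # Empty strings do not match and are therefore flagged as well.
    flagged = ~first.str.match(r"[A-Za-z0-9]")
    report = pd.crosstab(first[flagged], df.loc[flagged, "label"])
    report.index.name = "first_char"
    return report


# Toy data for illustration only; replace with the real DataFrame.
sample = pd.DataFrame(
    {
        "text": ["# Heading", "* bullet", "Plain text", "# Another", "_under_"],
        "label": [1, 1, 0, 0, 1],
    }
)
report = leading_char_report(sample)
print(report)
```

If one label dominates a given leading character, filtering those rows changes the class balance, which is worth knowing before applying the filter.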
### Remove rows where text starts with non-alphanumeric characters:

```python
# Keep only rows whose text starts with an ASCII letter or digit;
# missing values (NaN) are treated as non-matching and dropped.
df = df[df["text"].str.match(r"^[A-Za-z0-9]", na=False)]
```

This keeps rows that start with a letter or number. Adjust the regex if needed.

---

## 3️⃣ Deduplicate Text Samples

Duplicate rows can bias training.

### Exact Deduplication

```python
rows_before = len(df)  # remember the size so removals can be reported
df = df.drop_duplicates(subset="text")
```

### Check how many were removed

```python
print("Removed rows:", rows_before - len(df))
print("Remaining rows:", len(df))
```

---

## 4️⃣ Trim Whitespace

```python
df["text"] = df["text"].str.strip()
```

Run this before the leading-character filter in step 2, as the full script below does; otherwise rows that merely start with whitespace are dropped.

---

## 5️⃣ Remove Very Short Samples (Optional)

If needed:

```python
# Count whitespace-separated words, not tokenizer tokens
df = df[df["text"].str.split().str.len() >= 12]
```

This uses a word count as a rough proxy for the dataset's minimum length of 12 tokens; word and token counts are not identical.

---

# Important Note on Cleaning

Be careful not to remove stylistic signals that help detect AI. For example:

* Over-normalizing punctuation may reduce detection accuracy.
* Removing formatting patterns may remove real signals.
* Semantic deduplication is **not recommended** if your goal is style detection.

WORM focuses on **writing style**, not topic similarity.

---

# Example Full Cleaning Script

```python
import pandas as pd

df = pd.read_csv("worm.csv")

# Normalize quotes
df["text"] = (
    df["text"]
    .str.replace("“", '"', regex=False)
    .str.replace("”", '"', regex=False)
    .str.replace("‘", "'", regex=False)
    .str.replace("’", "'", regex=False)
)

# Strip whitespace
df["text"] = df["text"].str.strip()

# Remove rows starting with special characters
df = df[df["text"].str.match(r"^[A-Za-z0-9]", na=False)]

# Remove short samples (fewer than 12 whitespace-separated words)
df = df[df["text"].str.split().str.len() >= 12]

# Deduplicate
df = df.drop_duplicates(subset="text")

df.to_csv("worm_cleaned.csv", index=False)
```

---

## Intended Use

* Train AI detection classifiers
* Benchmark detection systems
* Research in stylometry
* Fine-tune transformer models

---

## Naming Concept

* **WORM** → *Wait, Original or Machine?*
* **Earlybird** → The model that catches the worm

Detect machine text early.
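As a supplement to the full cleaning script, the sketch below applies the same steps in the same order but records how many rows survive each one, which makes it easier to see which step removes the most data. The function name `clean_with_report`, the `min_words` parameter, and the toy `sample` frame are illustrative assumptions, not part of the dataset or its tooling.

```python
import pandas as pd

# Map curly quotes to straight quotes in a single pass.
QUOTE_MAP = str.maketrans({"“": '"', "”": '"', "‘": "'", "’": "'"})


def clean_with_report(df: pd.DataFrame, min_words: int = 12):
    """Apply the cleaning steps and return (cleaned_df, row counts per step)."""
    counts = {"start": len(df)}
    df = df.copy()  # avoid mutating the caller's DataFrame

    # Normalize quotes and strip whitespace (these do not change the row count).
    df["text"] = df["text"].str.translate(QUOTE_MAP).str.strip()

    # Keep rows that start with an ASCII letter or digit; NaN rows are dropped.
    df = df[df["text"].str.match(r"^[A-Za-z0-9]", na=False)]
    counts["after_leading_char_filter"] = len(df)

    # Keep rows with at least `min_words` whitespace-separated words.
    df = df[df["text"].str.split().str.len() >= min_words]
    counts["after_min_words"] = len(df)

    # Exact deduplication on the text column.
    df = df.drop_duplicates(subset="text")
    counts["after_dedup"] = len(df)
    return df, counts


# Toy data for illustration only; min_words is lowered to fit the short texts.
sample = pd.DataFrame(
    {
        "text": [
            "  Hello there world  ",
            "Hello there world",
            "# heading line here",
            "Too short",
            None,
            "“Quoted” text is here",
        ],
        "label": [0, 0, 1, 0, 1, 1],
    }
)
cleaned, counts = clean_with_report(sample, min_words=3)
print(counts)
```

On the real data, call `clean_with_report(pd.read_csv("worm.csv"))` and compare the counts per step before deciding which optional steps to keep.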

