noumenon-labs/WORM

Source: Hugging Face. Updated 2026-02-20. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/noumenon-labs/WORM
Resource description:
---
license: apache-2.0
language:
- en
task_categories:
- text-classification
tags:
- classification
- ai
- detection
- human
size_categories:
- 1M<n<10M
---

# 🐛 WORM Dataset

**Wait, Original or Machine?**
*A large-scale dataset for AI text detection.*

WORM stands for **Wait, Original or Machine?** It also plays on *worm*: the food caught by the Earlybird classifier.

Built for binary AI-text detection at scale.

---

## Overview

* **Task:** Binary classification
* **Goal:** Detect Human vs AI text
* **Total documents:** 2,046,995
* **Format:** CSV
* **Columns:** `text`, `label`

---

## File Structure

| Column | Type   | Description            |
| ------ | ------ | ---------------------- |
| text   | string | Raw text sample        |
| label  | int    | 0 = Human, 1 = Machine |

### Labels

* `0` → Original (Human-written)
* `1` → Machine (AI-generated)

---

## Token Length Statistics

* **Minimum:** 12
* **Average:** 372
* **90% under:** 653
* **95% under:** 776
* **99% under:** 1118
* **Max:** 4780

### Training Notes

* `max_length=512` → safer, lower memory
* `max_length=776` → covers 95% of samples
* On Colab T4, reduce batch size (2–4) if using 776

---

# Data Preparation Guide

Below are optional preprocessing steps. Use them carefully. Some cleaning choices may affect detection signals.

---

## 1️⃣ Normalize Quotation Marks (Optional)

AI text often uses curly quotes:

* “ ”
* ‘ ’

You may convert them to straight quotes:

* "
* '

### Why?

Curly quotes can leak formatting patterns that models may overfit on.

### Example

```python
import pandas as pd

df = pd.read_csv("worm.csv")

df["text"] = (
    df["text"]
    .str.replace("“", '"', regex=False)
    .str.replace("”", '"', regex=False)
    .str.replace("‘", "'", regex=False)
    .str.replace("’", "'", regex=False)
)
```

---

## 2️⃣ Remove Rows Starting With Special Characters

Some rows may begin with symbols such as:

* `#`
* `*`
* `_`
* unusual unicode characters

These can be formatting artifacts.
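Before dropping anything, it can help to check which leading characters actually occur and whether one label would lose more rows than the other. The following is a minimal sketch, assuming only the `text` and `label` columns described in the file structure; the sample rows and the variable names (`starts_alnum`, `leading_counts`, `dropped_per_label`) are invented for illustration and are not part of the dataset card.

```python
import pandas as pd

# Tiny invented sample; in practice load the real file with pd.read_csv("worm.csv").
df = pd.DataFrame(
    {
        "text": [
            "# Heading-like row",
            "* bullet-like row",
            "Plain sentence.",
            "42 is a number.",
            None,
        ],
        "label": [1, 1, 0, 0, 0],
    }
)

# True where the text starts with an ASCII letter or digit; missing text counts as False.
starts_alnum = df["text"].str.match(r"^[A-Za-z0-9]", na=False)

# Leading characters of the rows that the filter would drop.
leading_counts = df.loc[~starts_alnum, "text"].str[:1].value_counts(dropna=False)

# Number of rows each label would lose.
dropped_per_label = df.loc[~starts_alnum, "label"].value_counts()

print(leading_counts)
print(dropped_per_label)
```

If the dropped rows fall mostly under one label, the leading symbol may itself be a detection signal, and removing those rows is a trade-off rather than a pure cleanup.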
### Remove rows where text starts with non-alphanumeric characters

```python
# Keep only rows whose text starts with an ASCII letter or digit.
df = df[df["text"].str.match(r"^[A-Za-z0-9]", na=False)]
```

This keeps rows that start with a letter or number. Adjust the regex if needed.

---

## 3️⃣ Deduplicate Text Samples

Duplicate rows can bias training.

### Exact Deduplication

```python
df = df.drop_duplicates(subset="text")
```

### Check how many rows remain

```python
print("Remaining rows:", len(df))
```

---

## 4️⃣ Trim Whitespace

```python
df["text"] = df["text"].str.strip()
```

---

## 5️⃣ Remove Very Short Samples (Optional)

If needed:

```python
# Keep rows with at least 12 whitespace-separated words.
df = df[df["text"].str.split().str.len() >= 12]
```

This mirrors the dataset's minimum length of 12. Note that the filter counts whitespace-separated words, which only approximates a token count; the exact number of tokens depends on the tokenizer.

---

# Important Note on Cleaning

Be careful not to remove stylistic signals that help detect AI.

For example:

* Over-normalizing punctuation may reduce detection accuracy.
* Removing formatting patterns may remove real signals.
* Semantic deduplication is **not recommended** if your goal is style detection.

WORM focuses on **writing style**, not topic similarity.

---

# Example Full Cleaning Script

```python
import pandas as pd

df = pd.read_csv("worm.csv")

# Normalize quotes
df["text"] = (
    df["text"]
    .str.replace("“", '"', regex=False)
    .str.replace("”", '"', regex=False)
    .str.replace("‘", "'", regex=False)
    .str.replace("’", "'", regex=False)
)

# Strip whitespace
df["text"] = df["text"].str.strip()

# Remove rows starting with special characters
df = df[df["text"].str.match(r"^[A-Za-z0-9]", na=False)]

# Remove short samples (fewer than 12 whitespace-separated words)
df = df[df["text"].str.split().str.len() >= 12]

# Deduplicate
df = df.drop_duplicates(subset="text")

df.to_csv("worm_cleaned.csv", index=False)
```

---

## Intended Use

* Train AI detection classifiers
* Benchmark detection systems
* Research in stylometry
* Fine-tune transformer models

---

## Naming Concept

* **WORM** → *Wait, Original or Machine?*
* **Earlybird** → The model that catches the worm

Detect machine text early.
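
Since the cleaning steps in the Data Preparation Guide are optional and some may remove useful signals, it can be worth measuring what each one costs. The sketch below applies the same filters in the order used by the full cleaning script and records how many rows each step removes. The helper `run_steps`, the step names, and the sample rows are hypothetical and exist only for this illustration.

```python
import pandas as pd


def run_steps(df, steps):
    """Apply each (name, func) step in order and record how many rows it removed."""
    report = []
    for name, func in steps:
        before = len(df)
        df = func(df)
        report.append((name, before - len(df)))
    return df, report


# Invented sample rows; replace with pd.read_csv("worm.csv") for the real data.
long_text = "one two three four five six seven eight nine ten eleven twelve"
sample = pd.DataFrame(
    {
        "text": [
            long_text,
            long_text,                     # exact duplicate
            "# " + long_text,              # starts with a special character
            "too short",                   # fewer than 12 words
            "  " + long_text + " again  ", # surrounding whitespace
        ],
        "label": [0, 0, 1, 1, 0],
    }
)

steps = [
    ("strip", lambda d: d.assign(text=d["text"].str.strip())),
    ("leading_char", lambda d: d[d["text"].str.match(r"^[A-Za-z0-9]", na=False)]),
    ("min_words", lambda d: d[d["text"].str.split().str.len() >= 12]),
    ("dedupe", lambda d: d.drop_duplicates(subset="text")),
]

cleaned, report = run_steps(sample, steps)
for name, removed in report:
    print(f"{name}: removed {removed} rows")
print("Remaining rows:", len(cleaned))
```

Keeping each step as a separate entry makes it easy to drop or reorder one and compare the resulting row counts before committing to a cleaned file.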

Provider:
noumenon-labs