ShoaibSSM/AIvsHuman-SuperCorpus

Name: ShoaibSSM/AIvsHuman-SuperCorpus
Creator: ShoaibSSM
Published: 2025-11-27 18:13:21
License: No description provided

Source: Hugging Face
Updated: 2025-11-27
Indexed: 2025-12-20

Download link:

https://hf-mirror.com/datasets/ShoaibSSM/AIvsHuman-SuperCorpus

Resource description:

---
license: apache-2.0
task_categories:
- text-classification
size_categories:
- 1M<n<10M
---

# 🧠 AIvsHuman-SuperCorpus

**A 2.7M-example corpus for distinguishing AI-generated text from human-written text.**

## 📦 Dataset Summary

**AIvsHuman-SuperCorpus** is a *large-scale, multi-source*, **2.72 million example** dataset designed for **AI-vs-Human text classification**, safety research, LLM detection, hallucination analysis, and authenticity scoring.

This dataset merges *11 major public datasets* across both AI-generated and human-written sources. They were cleaned, deduplicated, and balanced using a custom high-performance streaming pipeline.

It enables training models similar to:

* 🔹 *GPTZero-style detectors*
* 🔹 *OpenAI classifier replacements*
* 🔹 *Microsoft's DeBERTa-based detectors*
* 🔹 *LLM-authorship attribution models*

This is one of the **largest publicly available corpora** for AI-content detection.

## 🧩 Dataset Size

| Split     | Total Examples | AI        | Human     |
| --------- | -------------- | --------- | --------- |
| **train** | 2,178,857      | 889,984   | 1,288,873 |
| **val**   | 273,066        | 111,034   | 162,032   |
| **test**  | 272,046        | 111,300   | 160,746   |
| **TOTAL** | **2,724,0xx**  | **1.11M** | **1.61M** |

*(Exact numbers may vary slightly depending on dedup pass.)*

# 📁 Dataset Structure

Each row follows a **simple and consistent schema**:

```json
{
  "id": "96f41b01-0707-465d-8856-069b30d43c1f",
  "source": "dolly15k",
  "text": "Camels use the fat in their humps to...",
  "label_ai": 1,
  "meta": {
    "length_chars": 105
  }
}
```

### Fields

| Field               | Type   | Description                                                             |
| ------------------- | ------ | ----------------------------------------------------------------------- |
| `id`                | string | Unique identifier                                                       |
| `source`            | string | Origin dataset (e.g., *openhermes*, *slimorca*, *agnews*, *yelp*, etc.) |
| `text`              | string | The text sample (cleaned and normalized)                                |
| `label_ai`          | int    | `1 = AI-generated`, `0 = Human-written`                                 |
| `meta.length_chars` | int    | Character length for filtering/metadata                                 |

# 🏗 Source Datasets

### **AI-generated corpora**

* OpenHermes-2.5
* SlimOrca
* Dolly-15k
* UltraChat 200k
* WizardLM Evol-Instruct 70k
* (Cleaned & flattened via custom extractors)

### **Human-written corpora**

* AGNews
* Amazon Reviews
* BookSum
* CNN/DailyMail
* WikiText-103
* Yelp Reviews

Total Raw Sources:

```
AI    : 2,040,591 lines
Human : 1,894,545 lines
```

After dedup + filtering:

```
Final merged: ~2.7M lines
```

# 🧹 Preprocessing Pipeline

All preprocessing was done using a **zero-RAM / streaming-first** pipeline:

### ✔ Streamed JSONL reading (no memory blowup)

### ✔ Global SHA256 deduplication

### ✔ Length filtering (< 30 chars removed)

### ✔ Normalization and whitespace cleaning

### ✔ Balanced split using hash-based deterministic sharding

### ✔ Final train/val/test split ensures **zero leakage**

# 🧪 Example Usage

## Load in Python

```python
from datasets import load_dataset

# Downloads the dataset from the Hugging Face Hub and prints the first training row.
ds = load_dataset("ShoaibSSM/AIvsHuman-SuperCorpus")
print(ds["train"][0])
```

## Fine-tuning a classifier (DeBERTa recommended)

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Two output labels: 0 = human-written, 1 = AI-generated (matches `label_ai`).
tok = AutoTokenizer.from_pretrained("microsoft/deberta-v3-large")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-large",
    num_labels=2
)
```

# 🔥 Ideal Use Cases

### ✓ AI-generated content detection

### ✓ Misinformation / deepfake text filtering

### ✓ Academic integrity / exam proctoring models

### ✓ LLM hallucination analysis

### ✓ Authorship detection research

### ✓ LLM safety classifier training

### ✓ “Human-likeness” scoring for generated text

# ⚠️ Limitations

* Not all “AI text” reflects modern 2024–2025 LLM behavior
* Human datasets include mixed-quality, domain-specific writing
* Not intended for censorship or punitive decisions
* English-centric
* Assumes binary AI/Human classification (does not include hybrid human-edited AI text)

# 📚 Citation

If you use this dataset in research, please cite it:

```
@dataset{ShoaibSSM_AIvsHuman_SuperCorpus_2025,
  title  = {AIvsHuman-SuperCorpus},
  author = {Shoaib Sadiq Salehmohamed},
  year   = {2025},
  url    = {https://huggingface.co/datasets/ShoaibSSM/AIvsHuman-SuperCorpus},
  note   = {A 2.7M-example corpus for AI vs Human text classification}
}
```

# 📄 License

This dataset is released under the **Apache 2.0** license. Individual source datasets retain their original licenses.

# 💬 Contact

Creator: **Shoaib Sadiq Salehmohamed (ShoaibSSM)**

Feel free to open issues or discussions on the Hugging Face repo.
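
# 🧪 Appendix: Preprocessing Sketch

The steps named under Preprocessing Pipeline (streamed JSONL reading, whitespace normalization, length filtering, SHA256 deduplication, hash-based splitting) could look roughly like the sketch below. This is not the author's code, which is not published in this card. The function names (`normalize`, `sha256_hex`, `assign_split`, `process`) are hypothetical, the 80/10/10 ratio is only inferred from the split table, and hashing the normalized text as the dedup key is an assumption. The sketch also keeps seen digests in an in-memory set, so it streams the input but is not literally zero-RAM.

```python
import hashlib
import json
import re
from typing import Iterable, Iterator

MIN_CHARS = 30  # the card states that texts under 30 characters are removed


def normalize(text: str) -> str:
    """Collapse every run of whitespace to one space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def sha256_hex(text: str) -> str:
    """Hex SHA256 digest of the UTF-8 encoded text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def assign_split(digest: str, val_pct: int = 10, test_pct: int = 10) -> str:
    """Map a digest to a split deterministically (assumed 80/10/10 ratio)."""
    bucket = int(digest[:8], 16) % 100  # stable bucket in the range 0..99
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "val"
    return "train"


def process(lines: Iterable[str]) -> Iterator[tuple[str, dict]]:
    """Yield (split, row) pairs from an iterable of JSONL lines."""
    seen: set[str] = set()  # digests only; assumption, held in memory
    for line in lines:
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        text = normalize(row["text"])
        if len(text) < MIN_CHARS:
            continue  # length filter
        digest = sha256_hex(text)
        if digest in seen:
            continue  # global exact-duplicate removal
        seen.add(digest)
        row["text"] = text
        row.setdefault("meta", {})["length_chars"] = len(text)
        yield assign_split(digest), row
```

Because the split is a pure function of the text's digest and each digest is emitted at most once, an identical text cannot land in two different splits, which is one way to obtain the leakage-free property the card describes. Passing an open file handle as `lines` keeps the read itself streaming.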

Provider:

ShoaibSSM
