
inokusan/human_ai_text_classification

Source: Hugging Face. Updated 2026-03-12; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/inokusan/human_ai_text_classification
Resource description:
---
pretty_name: Human AI Text Classification
language:
- en
license: other
task_categories:
- text-classification
task_ids:
- binary-classification
size_categories:
- 100K<n<1M
tags:
- ai-generated-text
- human-vs-ai
- text-classification
- llm-detection
- english
dataset_info:
  features:
  - name: text
    dtype: string
  - name: label
    dtype: int64
  splits:
  - name: train
    num_bytes: 97511886
    num_examples: 90648
  - name: test
    num_bytes: 24329778
    num_examples: 22662
  download_size: 68869429
  dataset_size: 121841664
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
---

# Human AI Text Classification

## Dataset Summary

`human_ai_text_classification` is a binary text classification dataset for distinguishing human-written text from AI-generated text.

It was created by combining three public datasets, standardizing them into a common schema, balancing the class labels, removing duplicate texts, and performing a stratified 80/20 train-test split.

Labels:

- `0` = human-written text
- `1` = AI-generated text

## Dataset Structure

### Data Fields

- `text`: the input text
- `label`: binary class label
  - `0` for human-written text
  - `1` for AI-generated text

### Splits

- `train`: 90,648 rows
- `test`: 22,662 rows

### Label Distribution

This final dataset is globally balanced:

- Total rows: `113,310`
- Human (`0`): `56,655`
- AI (`1`): `56,655`

Split-level balance:

- Train: `45,324` human, `45,324` AI
- Test: `11,331` human, `11,331` AI

## Dataset Creation

### Source Datasets

This dataset was built from the following original sources:

1. `NicolaiSivesind/human-vs-machine`
   https://huggingface.co/datasets/NicolaiSivesind/human-vs-machine
2. `thedrcat/daigt-v2-train-dataset`
   https://www.kaggle.com/datasets/thedrcat/daigt-v2-train-dataset
3. `shahxeebhassan/human_vs_ai_sentences`
   https://huggingface.co/datasets/shahxeebhassan/human_vs_ai_sentences

### Processing Steps

The dataset was created with the following pipeline:

1. Load the three source datasets.
2. Standardize columns to `text` and `label`.
3. Standardize labels to:
   - `0` for human
   - `1` for AI
4. Sample each source to keep it internally balanced:
   - `NicolaiSivesind/human-vs-machine`: `20,000` human + `20,000` AI
   - `thedrcat/daigt-v2-train-dataset`: `17,497` human + `17,497` AI
   - `shahxeebhassan/human_vs_ai_sentences`: `20,000` human + `20,000` AI
5. Merge all sampled subsets.
6. Remove duplicate texts using `drop_duplicates(subset=["text"])`.
7. Rebalance globally after deduplication to preserve exact class balance.
8. Shuffle the full dataset.
9. Perform a stratified 80/20 train-test split.

### Resulting Dataset Size

- Before deduplication: `114,994`
- After deduplication: `113,886`
- Final balanced size: `113,310`

## Intended Use

This dataset is intended for:

- training baseline AI-text detectors
- benchmarking binary human-vs-AI text classification
- experiments on generalization across mixed-source human and machine-generated writing

## Limitations

- The dataset combines multiple source datasets with different collection methods and writing styles.
- It should not be treated as a universal detector for all LLM-generated text.
- Label quality depends on the correctness of the original source datasets.
- Some source datasets are themselves aggregated from earlier datasets or competition resources.

## License and Attribution

This dataset is a derived compilation of multiple public datasets. Please review the original dataset pages for the applicable licenses, usage terms, and attribution requirements before reuse or redistribution.

Original dataset references:

- `NicolaiSivesind/human-vs-machine`
  https://huggingface.co/datasets/NicolaiSivesind/human-vs-machine
- `thedrcat/daigt-v2-train-dataset`
  https://www.kaggle.com/datasets/thedrcat/daigt-v2-train-dataset
- `shahxeebhassan/human_vs_ai_sentences`
  https://huggingface.co/datasets/shahxeebhassan/human_vs_ai_sentences

## Loading the Dataset

```python
from datasets import load_dataset

dataset = load_dataset("inokusan/human_ai_text_classification")
print(dataset)
```
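
The deduplicate, rebalance, shuffle, and stratified-split steps described under Processing Steps can be illustrated with a small dependency-free sketch. This is not the author's script: the card only names `drop_duplicates(subset=["text"])`, so the helper names (`dedupe_by_text`, `rebalance`, `stratified_split`), the random seed, the toy rows, and the choice to rebalance by downsampling the larger class are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def dedupe_by_text(rows):
    """Keep the first occurrence of each text, similar in spirit to
    pandas' drop_duplicates(subset=["text"]) with its default keep="first"."""
    seen = set()
    unique = []
    for text, label in rows:
        if text not in seen:
            seen.add(text)
            unique.append((text, label))
    return unique


def group_by_label(rows):
    """Collect rows into one list per label."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[1]].append(row)
    return groups


def rebalance(rows, rng):
    """Assumed strategy: downsample every class to the smallest class size."""
    groups = group_by_label(rows)
    target = min(len(group) for group in groups.values())
    balanced = []
    for label in sorted(groups):
        balanced.extend(rng.sample(groups[label], target))
    return balanced


def stratified_split(rows, test_fraction, rng):
    """Split each class separately so both splits keep the class ratio."""
    train, test = [], []
    groups = group_by_label(rows)
    for label in sorted(groups):
        group = list(groups[label])
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test


# Hypothetical toy rows: 60 human (0), 50 AI (1), plus two duplicated texts.
rng = random.Random(0)
rows = [(f"human text {i}", 0) for i in range(60)]
rows += [(f"ai text {i}", 1) for i in range(50)]
rows += [("human text 0", 0), ("ai text 1", 1)]

deduped = dedupe_by_text(rows)      # 112 rows -> 110 rows
balanced = rebalance(deduped, rng)  # 50 rows per class
rng.shuffle(balanced)
train, test = stratified_split(balanced, 0.2, rng)

print(len(rows), len(deduped), len(balanced))
print(Counter(label for _, label in train))
print(Counter(label for _, label in test))
```

Applying the same 20% per-class rule to the card's 56,655 rows per class gives 11,331 test rows and 45,324 train rows per class, which agrees with the split-level counts reported above.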
Provider:
inokusan