
validname/reddit-ai-detection

Hugging Face · Updated 2026-03-27 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/validname/reddit-ai-detection
Description:
---
language: en
license: other
task_categories:
- text-classification
tags:
- ai-detection
- reddit
- human-written
- nlp
size_categories:
- 10K<n<100K
configs:
- config_name: default
  data_files:
  - split: pre_2022
    path: data/pre_2022-*
  - split: post_2022
    path: data/post_2022-*
dataset_info:
  features:
  - name: text
    dtype: large_string
  - name: subreddit
    dtype: large_string
  - name: domain
    dtype: large_string
  - name: post_type
    dtype: large_string
  - name: year
    dtype: int64
  - name: word_count
    dtype: int64
  - name: length_bin
    dtype: large_string
  - name: score
    dtype: int64
  - name: created_utc
    dtype: int64
  - name: id
    dtype: large_string
  splits:
  - name: pre_2022
    num_bytes: 14860525
    num_examples: 19960
  - name: post_2022
    num_bytes: 13977964
    num_examples: 19985
  download_size: 16462550
  dataset_size: 28838489
---

# Reddit AI-Detection Dataset

Human-written Reddit posts and comments collected for AI-generated text detection research (CSCI 544 – *Who Wrote This?*). All records in the **pre-2022** split pre-date widespread LLM deployment and can be treated as ground-truth human-authored text for detector calibration.

## Splits

| File | Records | Period |
|------|--------:|--------|
| `data/reddit_pre_2022.zip` | 2,257 | 2005 – 2021 |
| `data/reddit_post_2022.zip` | 79 | 2022 – 2026 |
| `data/reddit_combined.zip` | 2,336 | 2005 – 2026 |

## Schema

| Column | Type | Description |
|--------|------|-------------|
| `text` | str | Post / comment body |
| `subreddit` | str | Source subreddit |
| `domain` | str | technology · news · science · finance · entertainment |
| `post_type` | str | `submission` or `comment` |
| `year` | int | UTC year of posting |
| `word_count` | int | Approximate word count |
| `length_bin` | str | short / medium / long / very_long |
| `score` | int | Reddit score at collection time |
| `created_utc` | int | Unix timestamp |
| `id` | str | Reddit post ID |

## Citation

Data originally collected by Pushshift / u/raiderbdev, packaged by u/Watchful1.
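The schema lists `created_utc` as a Unix timestamp and `year` as the UTC year of posting, and the splits cover 2005 – 2021 and 2022 – 2026. A minimal sketch of how those fields could relate is shown below. The function names are hypothetical, the "year before 2022" split rule is inferred from the split periods rather than stated by the dataset card, and whitespace splitting is only one plausible reading of "approximate word count".

```python
from datetime import datetime, timezone


def utc_year(created_utc: int) -> int:
    """Return the UTC calendar year for a Unix timestamp (seconds)."""
    return datetime.fromtimestamp(created_utc, tz=timezone.utc).year


def split_for(created_utc: int) -> str:
    """Assign a split name from the timestamp.

    Assumption: records posted before 2022 (UTC) belong to `pre_2022`,
    everything else to `post_2022`. The card does not state this rule.
    """
    return "pre_2022" if utc_year(created_utc) < 2022 else "post_2022"


def approx_word_count(text: str) -> int:
    """Approximate word count by splitting on whitespace (an assumption)."""
    return len(text.split())


if __name__ == "__main__":
    # Hypothetical record, not taken from the dataset.
    record = {"text": "Who wrote this?", "created_utc": 1640995199}
    print(utc_year(record["created_utc"]))
    print(split_for(record["created_utc"]))
    print(approx_word_count(record["text"]))
```

Deriving the year in UTC rather than local time matters at the boundary: a post made seconds before midnight UTC on 31 December 2021 stays in the pre-2022 period regardless of the machine's time zone.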
Provided by:
validname