Mharis205/ai-text-detection-pile
收藏Hugging Face2026-01-04 更新2026-03-29 收录
下载链接:
https://hf-mirror.com/datasets/Mharis205/ai-text-detection-pile
下载链接
链接失效反馈官方服务:
资源简介:
---
license: mit
---
# Dataset Card for AI Text Dectection Pile
## Dataset Description
- **Point of Contact:artem9k@gmail.com
### Dataset Summary
This is a large scale dataset intended for AI Text Detection tasks, geared toward long-form text and essays. It contains samples of both human text and AI-generated text from GPT2, GPT3, ChatGPT, GPTJ.
Here is the (tentative) breakdown:
#### Human Text
| Dataset | Num Samples | Link |
| ----------- | ----------- | ----------- |
| Reddit WritingPromps | 570k | [Link](https://www.kaggle.com/datasets/ratthachat/writing-prompts) |
| OpenAI Webtext | 260k | [Link](https://github.com/openai/gpt-2-output-dataset) |
| HC3 (Human Responses) | 58k | [Link](https://huggingface.co/datasets/Hello-SimpleAI/HC3) |
| ivypanda-essays | TODO | TODO |
| **Total** | **990k** | **-** |
#### AI-Generated Text
| Model | Dataset | Num Samples | Link |
| ----------- | ----------- | ----------- | ----------- |
| GPT2 | OpenAI gpt2-output-dataset | 260k | [Link](https://github.com/openai/gpt-2-output-dataset) |
| GPT3 | pairwise-davinci | 44k | TODO |
| GPT3 | synthetic-instruct-davinci-pairwise | 30k | [Link](https://huggingface.co/datasets/Dahoas/instruct-synthetic-prompt-responses) |
| GPTJ | synthetic-instruct-gptj-pairwise | 44k | [Link](https://huggingface.co/datasets/Dahoas/synthetic-instruct-gptj-pairwise) |
| ChatGPT | Scraped from twitter | 5k | **-** |
| ChatGPT | HC3 (ChatGPT Responses) | 27k | [Link](https://huggingface.co/datasets/Hello-SimpleAI/HC3) |
| ChatGPT | ChatGPT Prompts/emergentmind | 500 | [Link](https://huggingface.co/datasets/MohamedRashad/ChatGPT-prompts/tree/main) |
| **Total** | **340k** | **-** | **-** |
### Supported Tasks and Leaderboards
Text Classification, AI Text Detection.
### Languages
English.
### Data Fields
TEXT: The text of the sample.
SOURCE: either "human" or "ai"
提供机构:
Mharis205



