tanishy7777/IITGN_GPT_dataset
Source: Hugging Face. Updated 2025-11-15; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/tanishy7777/IITGN_GPT_dataset
---
license: mit
task_categories:
- text-classification
language:
- en
tags:
- plagarism
- humanvsai
size_categories:
- 100K<n<1M
---
# Dataset Summary
This dataset contains four classes of text samples designed for research on machine-generated text detection, stylometric transformations, and LLM-based text post-processing.
Each text sample belongs to one of the following categories:
- `hw`: Human-written. Sourced directly from the publicly available human-authored dataset on Zenodo.
- `mw`: Machine-written. Generated by prompting LLMs with the first sentence of a human-written text.
- `hw_mp`: Human-written, machine-polished. Human-written text polished by an LLM to improve clarity, coherence, or readability.
- `mw_mh`: Machine-written, machine-humanized. Machine-written text further rewritten by an LLM to make it sound more human.

Each transformed sample includes metadata identifying which model produced it.
# Supported Tasks
## Primary Tasks
1. Machine-generated text detection (binary & multi-class)
2. LLM transformation classification
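
The tasks above each need a different label derived from the `class` column. The following is a minimal sketch of one possible mapping, not the authors' setup. The helper names (`CLASS_TO_ID`, `to_binary`, `transformation_label`) are hypothetical, and the card does not say how `hw_mp` should be treated in the binary task, so that choice is exposed as a parameter.

```python
# Class codes as documented in this card.
CLASSES = ["hw", "mw", "hw_mp", "mw_mh"]

# Multi-class detection: one integer id per class code.
CLASS_TO_ID = {name: idx for idx, name in enumerate(CLASSES)}


def to_binary(class_code: str, polished_is_machine: bool = True) -> int:
    """Return 0 for human text and 1 for machine-involved text.

    ASSUMPTION: the card does not define the binary label for hw_mp,
    so the caller decides via polished_is_machine.
    """
    if class_code not in CLASS_TO_ID:
        raise ValueError(f"unknown class code: {class_code!r}")
    if class_code == "hw":
        return 0
    if class_code == "hw_mp":
        return 1 if polished_is_machine else 0
    return 1  # mw and mw_mh


def transformation_label(class_code: str) -> str:
    """Label for transformation classification (an assumed reading of the task)."""
    if class_code not in CLASS_TO_ID:
        raise ValueError(f"unknown class code: {class_code!r}")
    return {"hw_mp": "polished", "mw_mh": "humanized"}.get(class_code, "none")
```

Whether polished human text counts as "machine" changes what a binary detector learns, so it is worth reporting which setting was used.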
# Dataset Structure
## Columns
| Column | Description |
|---|---|
| `text` | The text sample (original or transformed). |
| `class` | One of `hw`, `mw`, `hw_mp`, `mw_mh`. |
| `model` | The model used for generation or transformation. For `hw`, the value is `"human"`. |
## Classes
- `hw`: Human-written
- `mw`: Machine-written
- `hw_mp`: Human-written → Machine-polished
- `mw_mh`: Machine-written → Machine-humanized
# Models Used
Machine-generated and machine-transformed samples come from four open-source or research LLMs:
- `meta-llama/Llama-2-7b-chat-hf`
- `CohereLabs/aya-expanse-8b`
- `tiiuae/Falcon3-7B-Instruct`
- `Mistral-7B-Instruct-v0.3`

For `hw`, the model value is `"human"`.
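
The column and model conventions described here can be checked row by row. The sketch below is illustrative only. It assumes the `model` column stores the identifiers exactly as listed in this card, which the card does not confirm, and `row_problems` is a hypothetical helper name.

```python
ALLOWED_CLASSES = {"hw", "mw", "hw_mp", "mw_mh"}

# ASSUMPTION: model values match the identifiers listed in the card verbatim.
LLM_NAMES = {
    "meta-llama/Llama-2-7b-chat-hf",
    "CohereLabs/aya-expanse-8b",
    "tiiuae/Falcon3-7B-Instruct",
    "Mistral-7B-Instruct-v0.3",
}


def row_problems(row: dict) -> list:
    """Return a list of human-readable problems found in one row (empty if none)."""
    problems = []
    cls = row.get("class")
    model = row.get("model")
    text = row.get("text")

    if not isinstance(text, str) or not text.strip():
        problems.append("empty or missing text")
    if cls not in ALLOWED_CLASSES:
        problems.append(f"unknown class: {cls!r}")
    elif cls == "hw" and model != "human":
        problems.append("hw rows should have model == 'human'")
    elif cls != "hw" and model not in LLM_NAMES:
        problems.append(f"unexpected model for {cls}: {model!r}")
    return problems
```

If real rows fail the model check, the likely cause is a different spelling of the model identifier rather than bad data, so the set should be adjusted to the values actually present.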
# Data Sources
## Human-written data
Human data was sourced from the Zenodo record https://zenodo.org/records/8285326.
The source comprises 60k diverse human-written samples randomly selected from the OpenWebText dataset, whose entries were collected from user-generated content on Reddit before 2019. These texts serve as the base for all transformations that create the `mw`, `hw_mp`, and `mw_mh` samples.
# Generated Data Pipeline
1. Machine-Written (`mw`)
   - Input: First sentence of a human-written text
   - Process: Passed to each LLM to generate a continuation
   - Output: A pure machine-generated article or paragraph
2. Human-Written Machine-Polished (`hw_mp`)
   - Input: Full human-written text
   - Prompt: A polishing/editing instruction (e.g., clarify, improve coherence)
   - Output: A rewritten text preserving meaning but adopting LLM stylistic patterns
3. Machine-Written Machine-Humanized (`mw_mh`)
   - Input: The `mw` machine-written text
   - Prompt: An instruction to humanize the text, making it more natural and human-like
   - Output: A deliberately human-like machine rewrite
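
The three steps can be sketched as a single function that turns one human-written text into three rows for one model. This is an illustration under stated assumptions, not the authors' code. The instruction strings are hypothetical (the card lists the real prompts as TODO), `generate` stands in for whatever LLM call was used, and the sentence splitter is a naive regex that will mis-split on abbreviations.

```python
import re
from typing import Callable, Dict, List

# HYPOTHETICAL instructions: the actual prompts are not published in this card.
POLISH_INSTRUCTION = "Polish the following text for clarity and coherence:\n\n"
HUMANIZE_INSTRUCTION = "Rewrite the following text so it sounds more human:\n\n"


def first_sentence(text: str) -> str:
    """Return the text up to the first '.', '!' or '?' that ends a sentence."""
    stripped = text.strip()
    match = re.search(r"[.!?](?=\s|$)", stripped)
    return stripped[: match.end()] if match else stripped


def build_samples(
    human_text: str, model_name: str, generate: Callable[[str], str]
) -> List[Dict[str, str]]:
    """Produce mw, hw_mp and mw_mh rows from one human-written text."""
    # Step 1: continue from the first sentence only.
    mw_text = generate(first_sentence(human_text))
    # Step 2: polish the full human-written text.
    hw_mp_text = generate(POLISH_INSTRUCTION + human_text)
    # Step 3: humanize the machine-written text from step 1.
    mw_mh_text = generate(HUMANIZE_INSTRUCTION + mw_text)
    return [
        {"text": mw_text, "class": "mw", "model": model_name},
        {"text": hw_mp_text, "class": "hw_mp", "model": model_name},
        {"text": mw_mh_text, "class": "mw_mh", "model": model_name},
    ]
```

Passing `generate` as a callable keeps the sketch independent of any particular inference library; a stub function is enough to exercise the data flow.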
# Ethical Considerations
1. Misuse Risk: Text Origin Surveillance
Machine-generated text detection should not be used to:
- Penalize students, writers, or employees
- Infer authorship in sensitive or personal contexts

Classifiers trained on this dataset may produce false positives, especially for polished or humanized text.
2. Synthetic Data Disclosure
Non-human classes (mw, hw_mp, mw_mh) are synthetically generated using LLMs.
They carry:
- LLM stylistic biases
- Model-specific phrasing patterns
- Risk of hallucinations or misleading content
3. Licensing Compliance
The human data source follows the license provided in the Zenodo dataset.
Users must ensure compliance with:
- Zenodo dataset license
- Licenses attached to the LLMs used for generation
4. Bias & Representativeness
Because LLMs reflect their training data:
- Styles may favor Western writing norms
- Outputs may show socioeconomic, cultural, or topical biases
- Not suitable for claims about global writing styles
# Dataset Creation
## Human Data Preprocessing
- No semantic modifications
- Minor cleaning (if applied, describe)
## Model Prompts
(TODO)
- MW generation prompt
- HW_MP polishing prompt
- MW_MH humanization prompt
# Dataset Size
Total: 115,426 generated samples

Class distribution:
- `hw`: 0 samples (use the human-written samples from the Zenodo record instead)
- `mw`: 44,429
- `hw_mp`: 51,754
- `mw_mh`: 19,243
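
The class counts are uneven, which matters when training a classifier. The sketch below recomputes the total and each class's share from the counts listed above, then derives inverse-frequency class weights. The weighting scheme is an assumption added for illustration (a common balancing heuristic), not something this card prescribes.

```python
# Counts copied from the class distribution in this card.
CLASS_COUNTS = {"hw": 0, "mw": 44_429, "hw_mp": 51_754, "mw_mh": 19_243}

total = sum(CLASS_COUNTS.values())

# Fraction of all generated samples that falls in each class.
shares = {name: count / total for name, count in CLASS_COUNTS.items()}

# Inverse-frequency weights over the classes that actually have samples:
# weight = total / (number_of_nonempty_classes * count).
nonempty = {name: count for name, count in CLASS_COUNTS.items() if count > 0}
weights = {
    name: total / (len(nonempty) * count) for name, count in nonempty.items()
}

for name in CLASS_COUNTS:
    weight = weights.get(name)
    weight_str = f"{weight:.3f}" if weight is not None else "n/a"
    print(f"{name:6s} count={CLASS_COUNTS[name]:6d} "
          f"share={shares[name]:.3f} weight={weight_str}")
```

Once the human-written samples from Zenodo are merged in, the counts and weights should be recomputed, since `hw` is empty in this release.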
Model-wise distribution:

# Licensing
Licensed under the MIT License
# Citation
If used in academic or research work, please cite:
@dataset{Tanish_Yelgoe_2025,
title = {IITGNGPT: Human–Machine Generated & Transformed Text Dataset},
author = {Tanish Yelgoe},
year = {2025},
publisher = {Hugging Face Datasets},
howpublished = {\url{https://huggingface.co/datasets/tanishy7777/IITGN_GPT_dataset}}
}
Also cite the Zenodo source and the models used.
# Contact
For questions, collaborations, or issues:
GitHub: https://github.com/tanishy7777
Email: tanish.yelgoe@iitgn.ac.in
Provider:
tanishy7777



