tanishy7777/IITGN_GPT_dataset

Name: tanishy7777/IITGN_GPT_dataset
Creator: tanishy7777
Published: 2025-11-15 15:50:14
License: No description available

Hugging Face · Updated 2025-11-15 · Indexed 2025-12-20

Download link:

https://hf-mirror.com/datasets/tanishy7777/IITGN_GPT_dataset


Resource description:

---
license: mit
task_categories:
- text-classification
language:
- en
tags:
- plagarism
- humanvsai
size_categories:
- 100K<n<1M
---

# Dataset Summary

This dataset contains four classes of text samples designed for research on machine-generated text detection, stylometric transformations, and LLM-based text post-processing.

Each text sample belongs to one of the following categories:

- `hw` (Human-written): sourced directly from the publicly available human-authored dataset on Zenodo.
- `mw` (Machine-written): generated by prompting LLMs with the first sentence of a human-written text.
- `hw_mp` (Human-written, machine-polished): human-written text polished by an LLM to improve clarity, coherence, or readability.
- `mw_mh` (Machine-written, machine-humanized): machine-written text further rewritten by an LLM to make it sound more human.

Each transformed sample includes metadata identifying which model produced it.

# Supported Tasks

## Primary Tasks

1. Machine-generated text detection (binary & multi-class)
2. LLM transformation classification

# Dataset Structure

## Columns

| Column | Description |
|---|---|
| `text` | The text sample (original or transformed). |
| `class` | One of `hw`, `mw`, `hw_mp`, `mw_mh`. |
| `model` | The model used for generation/transformation. For `hw`, the value is `"human"`. |

## Classes

- `hw`: Human-written
- `mw`: Machine-written
- `hw_mp`: Human-written → Machine-polished
- `mw_mh`: Machine-written → Machine-humanized

# Models Used

Machine-based samples come from four open-source or research LLMs:

- `meta-llama/Llama-2-7b-chat-hf`
- `CohereLabs/aya-expanse-8b`
- `tiiuae/Falcon3-7B-Instruct`
- `Mistral-7B-Instruct-v0.3`

For `hw`, the model value is `"human"`.

# Data Sources

## Human-written data

Human data was sourced from:

Zenodo Record: https://zenodo.org/records/8285326

The source comprises 60k diverse human-written samples randomly selected from the OpenWebText dataset. These entries were collected from user-generated content on Reddit before 2019.

This data serves as the base for all transformations that create the `mw`, `hw_mp`, and `mw_mh` samples.

# Generated Data Pipeline

## 1. Machine-Written (mw)

- Input: First sentence of a human-written text
- Process: Passed to each LLM to generate a continuation
- Output: A pure machine-generated article or paragraph

## 2. Human-Written Machine-Polished (hw_mp)

- Input: Full human-written text
- Prompt: A polishing/editing instruction (e.g., clarify, improve coherence)
- Output: A rewritten text preserving meaning but adopting LLM stylistic patterns

## 3. Machine-Written Machine-Humanized (mw_mh)

- Input: The `mw` machine-written text
- Prompt: An instruction to humanize the text, i.e. make it more natural and human-like
- Output: A deliberately human-like machine rewrite

# Ethical Considerations

## 1. Misuse Risk: Text Origin Surveillance

Machine-generated text detection should not be used to:

- Penalize students, writers, or employees
- Infer authorship in sensitive or personal contexts

Classifiers trained on this dataset may produce false positives, especially for polished or humanized text.

## 2. Synthetic Data Disclosure

Non-human classes (`mw`, `hw_mp`, `mw_mh`) are synthetically generated using LLMs. They carry:

- LLM stylistic biases
- Model-specific phrasing patterns
- Risk of hallucinations or misleading content

## 3. Licensing Compliance

The human data source follows the license provided in the Zenodo dataset. Users must ensure compliance with:

- Zenodo dataset license
- Licenses attached to the LLMs used for generation

## 4. Bias & Representativeness

Because LLMs reflect their training data:

- Styles may favor Western writing norms
- Outputs may show socioeconomic, cultural, or topical biases
- The data is not suitable for claims about global writing styles

# Dataset Creation

## Human Data Preprocessing

- No semantic modifications
- Minor cleaning (if applied, describe)

## Model Prompts (TODO)

- MW generation prompt
- HW_MP polishing prompt
- MW_MH humanization prompt

# Dataset Size

Total samples: 115,426 generated samples

Class distribution:

- `hw`: 0 samples (please use the human-written samples from Zenodo)
- `mw`: 44,429
- `hw_mp`: 51,754
- `mw_mh`: 19,243

Model-wise distribution:

![image](https://cdn-uploads.huggingface.co/production/uploads/68eaad45444c04bdce464d1f/IarHJjOblnDzh4gB84VGh.png)

# Licensing

Licensed under the MIT License.

# Citation

If used in academic or research work, please cite:

    @dataset{Tanish_Yelgoe_2025,
      title = {IITGNGPT: Human–Machine Generated & Transformed Text Dataset},
      author = {Tanish Yelgoe},
      year = {2025},
      publisher = {Hugging Face Datasets},
      howpublished = {\url{https://huggingface.co/datasets/IITGN_GPT_dataset}}
    }

Also cite the Zenodo source and the models used.

# Contact

For questions, collaborations, or issues:

- GitHub: https://github.com/tanishy7777
- Email: tanish.yelgoe@iitgn.ac.in
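The card names binary and multi-class detection as primary tasks but does not define a label encoding. The following is a minimal sketch of one possible encoding, together with a consistency check of the reported class counts. The functions `to_binary_label`, `to_multiclass_label`, and `class_distribution`, the class ordering, and the choice to treat `hw_mp` as machine-involved in the binary scheme are all assumptions made for illustration, not part of the dataset. The sample rows are hypothetical and only mimic the documented `text`, `class`, and `model` columns.

```python
from collections import Counter

# Class names and counts copied from the card's "Classes" and "Dataset Size" sections.
CLASSES = ("hw", "mw", "hw_mp", "mw_mh")
REPORTED_COUNTS = {"hw": 0, "mw": 44_429, "hw_mp": 51_754, "mw_mh": 19_243}
REPORTED_TOTAL = 115_426


def to_multiclass_label(cls: str) -> int:
    """Map a class name to its index in CLASSES (the ordering is an assumption)."""
    if cls not in CLASSES:
        raise ValueError(f"unknown class: {cls!r}")
    return CLASSES.index(cls)


def to_binary_label(cls: str) -> int:
    """Assumed binary scheme: 0 = untouched human text, 1 = any LLM involvement."""
    if cls not in CLASSES:
        raise ValueError(f"unknown class: {cls!r}")
    return 0 if cls == "hw" else 1


def class_distribution(rows):
    """Count rows per class; each row is a dict with a 'class' key."""
    return Counter(row["class"] for row in rows)


# Hypothetical rows shaped like the documented columns (not real dataset samples).
SAMPLE_ROWS = [
    {"text": "An example human text.", "class": "hw", "model": "human"},
    {"text": "An example continuation.", "class": "mw",
     "model": "tiiuae/Falcon3-7B-Instruct"},
    {"text": "An example polished text.", "class": "hw_mp",
     "model": "CohereLabs/aya-expanse-8b"},
    {"text": "An example humanized text.", "class": "mw_mh",
     "model": "meta-llama/Llama-2-7b-chat-hf"},
    {"text": "Another continuation.", "class": "mw",
     "model": "Mistral-7B-Instruct-v0.3"},
]

if __name__ == "__main__":
    # The per-class counts reported in the card should add up to the reported total.
    print("counts match total:", sum(REPORTED_COUNTS.values()) == REPORTED_TOTAL)
    for row in SAMPLE_ROWS:
        print(row["class"], to_binary_label(row["class"]),
              to_multiclass_label(row["class"]))
    print(dict(class_distribution(SAMPLE_ROWS)))
```

In practice the rows would come from the published data, for example via `datasets.load_dataset("tanishy7777/IITGN_GPT_dataset")`; the split names are not documented in the card, so inspect the returned object before indexing into it. Because the release contains 0 `hw` rows, human-written samples have to be added from the Zenodo record before either labeling scheme yields a usable negative class.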

Provided by:

tanishy7777
