LinguisticFootprintsOfChatGPT
Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/record/11109704
Description:
This dataset was produced for the research paper "Tracing Linguistic Footprints of ChatGPT Across Tasks, Domains and Personas in English and German." The project explores how the output of large language models like ChatGPT differs from human-generated text and analyzes the impact of task-specific prompting on linguistic features in both English and German texts.
The dataset contains human-written texts collected from several publicly available datasets, together with their ChatGPT-generated counterparts. The human data comes from the following corpora:
E3C: The European Clinical Case Corpus (Minard et al., 2021)
GGPONC: The German Guideline Program in Oncology NLP Corpus (Borchert et al., 2022)
20 Minuten: articles from a free Swiss daily newspaper (Kew et al., 2023)
CNN: articles (Hermann et al., 2015)
CSB: The Credit Suisse Bulletin corpus (Volk et al., 2016)
Additional human-written texts were collected from the PubMed Central database and the Zurich Open Repository and Archive.
The generated texts were produced by ChatGPT-3 under three distinct tasks: continuing a text, explaining a text, and creating a new text. Depending on the task, the prompts contained different sections of the original human text. For the continuation and creation tasks, the model was given the title and the first paragraph. For the explanation task, it was given the main part of the text.
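The task-dependent selection of prompt input described above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' code: the `Document` class, the `select_prompt_input` function, and the task labels `"continue"`, `"explain"`, and `"create"` are hypothetical names, and the "main part of the text" is assumed here to mean all body paragraphs. The actual prompt wording is in the linked repository and is not reproduced.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Document:
    """Hypothetical container for one human-written text."""
    title: str
    paragraphs: List[str]  # body paragraphs, in reading order


def select_prompt_input(doc: Document, task: str) -> str:
    """Return the section of the human text that a task's prompt would include."""
    if task in ("continue", "create"):
        # Continuation and creation: title plus first paragraph only.
        first = doc.paragraphs[0] if doc.paragraphs else ""
        return f"{doc.title}\n\n{first}".strip()
    if task == "explain":
        # Explanation: the main part of the text.
        # Assumption: "main part" is taken to be all body paragraphs.
        return "\n\n".join(doc.paragraphs)
    raise ValueError(f"unknown task: {task!r}")
```

For example, `select_prompt_input(doc, "continue")` returns only the title and opening paragraph, so the model never sees the rest of the human text when it continues or creates one.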
For more information see our paper at tbd
Code: https://github.com/shaitarAn/LinguisticFootprintsChatGPT
Created:
2024-05-03



