hannxu/hc_var

Name: hannxu/hc_var
Creator: hannxu
Published: 2023-10-03 16:33:15
License: No description available

Source: Hugging Face. Updated: 2023-10-03. Indexed: 2024-03-04.

Download link:

https://hf-mirror.com/datasets/hannxu/hc_var

Resource description:

---
license: apache-2.0
task_categories:
- text-classification
language:
- en
size_categories:
- 100M<n<1B
---

# Dataset Card for HC-Var (Human and ChatGPT Texts with Variety)

This is a collection of human texts and ChatGPT (GPT3.5-Turbo) generated texts, intended to facilitate studies such as generated-text detection. It includes texts that were generated or human-written to accomplish various language tasks with various approaches. The included language tasks and topics are summarized below.

Note: For each language task, this dataset uses 3 different prompts to elicit ChatGPT outputs.

Example code for training binary classification models is in [this repository](https://github.com/hannxu123/hc_var). A technical report on some representative detection methods can be found in [this paper](https://arxiv.org/abs/2310.01307).

This dataset was collected by Han Xu from Michigan State University. Potential issues and suggestions are welcome in the community panel or by email to xuhan1@msu.edu.

## Key variables in the dataset:

**text**: The text body (either a human text or a ChatGPT text).\
**domain**: The language tasks included in this dataset: News, Review, (Essay) Writing, QA.\
**topic**: The topic in each task.\
**prompt**: The prompt used to obtain ChatGPT outputs. "N/A" for human texts.\
**pp_id**: Each task has 3 prompts to elicit ChatGPT outputs. The "pp_id" denotes the index of the prompt. "0" for human texts. "1-3" for ChatGPT texts.\
**label**: "0" for human texts. "1" for ChatGPT texts.

## To cite this dataset

```
@misc{xu2023generalization,
      title={On the Generalization of Training-based ChatGPT Detection Methods},
      author={Han Xu and Jie Ren and Pengfei He and Shenglai Zeng and Yingqian Cui and Amy Liu and Hui Liu and Jiliang Tang},
      year={2023},
      eprint={2310.01307},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
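The card says the data is meant for training binary classifiers that separate human from ChatGPT texts across several language tasks. As a rough illustration of how the `domain` and `label` variables could be used together, the sketch below holds out one task as a test set and trains on the rest. This is not the authors' training code: `split_by_domain`, `label_counts`, and `SAMPLE_ROWS` are hypothetical names, the rows are placeholders rather than real dataset content, and the exact strings stored in `domain` are assumed to match the task names in the card.

```python
from collections import Counter

# Hypothetical rows shaped like the card's key variables. Texts, topics, and
# prompts are placeholders, not real dataset content.
SAMPLE_ROWS = [
    {"text": "human news text", "domain": "News", "topic": "placeholder-topic",
     "prompt": "N/A", "pp_id": 0, "label": 0},
    {"text": "generated news text", "domain": "News", "topic": "placeholder-topic",
     "prompt": "placeholder prompt 1", "pp_id": 1, "label": 1},
    {"text": "human review text", "domain": "Review", "topic": "placeholder-topic",
     "prompt": "N/A", "pp_id": 0, "label": 0},
    {"text": "generated review text", "domain": "Review", "topic": "placeholder-topic",
     "prompt": "placeholder prompt 2", "pp_id": 2, "label": 1},
    {"text": "generated answer text", "domain": "QA", "topic": "placeholder-topic",
     "prompt": "placeholder prompt 3", "pp_id": 3, "label": 1},
]


def split_by_domain(rows, held_out_domain):
    """Return (train, test); test holds every row whose domain is held out."""
    train, test = [], []
    for row in rows:
        if row["domain"] == held_out_domain:
            test.append(row)
        else:
            train.append(row)
    return train, test


def label_counts(rows):
    """Count rows per label (0 = human, 1 = ChatGPT, as stated in the card)."""
    # int() tolerates labels stored as either strings or integers (an assumption).
    return dict(Counter(int(row["label"]) for row in rows))


train_rows, test_rows = split_by_domain(SAMPLE_ROWS, "News")
print("train:", label_counts(train_rows))
print("test:", label_counts(test_rows))
```

Checking the label counts of each split before training is a cheap way to notice a held-out task that contains only one class.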

Provider:

hannxu

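Summary of original information

Dataset Card for HC-Var (Human and ChatGPT Texts with Variety)

Overview

HC-Var is a collection of human texts and ChatGPT (GPT3.5-Turbo) generated texts, intended to facilitate studies such as generated-text detection. It includes both generated and human-written texts produced to accomplish various language tasks, covering multiple language tasks and topics.

Language tasks and topics

The dataset covers the following language tasks and topics:

Tasks: News, Review, (Essay) Writing, QA
Topics: the specific topics within each task

Key variables

text: The text body (either a human text or a ChatGPT text)
domain: The language tasks included in the dataset: News, Review, (Essay) Writing, QA
topic: The topic in each task
prompt: The prompt used to obtain ChatGPT outputs. "N/A" for human texts
pp_id: Each task has 3 prompts to elicit ChatGPT outputs. "pp_id" denotes the index of the prompt. "0" for human texts, "1-3" for ChatGPT texts
label: "0" for human texts, "1" for ChatGPT texts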
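
The key variables above imply a few consistency rules: human rows carry label 0, pp_id 0, and the prompt "N/A", while ChatGPT rows carry label 1 and a pp_id from 1 to 3. A minimal sketch of checking one record against those rules follows. The function name `check_row` is hypothetical, and whether `label` and `pp_id` are stored as strings or integers is an assumption, so both are normalized with `int()`.

```python
def check_row(row):
    """Return a list of rule violations for one record (empty if consistent)."""
    problems = []
    # Assumption: values may arrive as "0"/"1" strings or as integers.
    label = int(row["label"])
    pp_id = int(row["pp_id"])

    if label == 0:
        # Human text: no prompt was used, so pp_id is 0 and prompt is "N/A".
        if pp_id != 0:
            problems.append("human text must have pp_id 0")
        if row["prompt"] != "N/A":
            problems.append('human text must have prompt "N/A"')
    elif label == 1:
        # ChatGPT text: produced by one of the 3 prompts of its task.
        if pp_id not in (1, 2, 3):
            problems.append("ChatGPT text must have pp_id in 1-3")
        if row["prompt"] == "N/A":
            problems.append("ChatGPT text must have a real prompt")
    else:
        problems.append("label must be 0 or 1")
    return problems


# Hypothetical records for illustration only.
print(check_row({"prompt": "N/A", "pp_id": "0", "label": "0"}))
print(check_row({"prompt": "N/A", "pp_id": 0, "label": 1}))
```

Returning a list of messages rather than raising on the first failure makes it easier to report every inconsistency in a record at once.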

Citation

@misc{xu2023generalization,
      title={On the Generalization of Training-based ChatGPT Detection Methods},
      author={Han Xu and Jie Ren and Pengfei He and Shenglai Zeng and Yingqian Cui and Amy Liu and Hui Liu and Jiliang Tang},
      year={2023},
      eprint={2310.01307},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
