
hannxu/hc_var

Source: Hugging Face. Updated: 2023-10-03. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/hannxu/hc_var
Resource description:
---
license: apache-2.0
task_categories:
- text-classification
language:
- en
size_categories:
- 100M<n<1B
---

# Dataset Card for HC-Var (Human and ChatGPT Texts with Variety)

This is a collection of human texts and ChatGPT (GPT3.5-Turbo) generated texts, intended to facilitate studies such as generated-text detection. It includes texts that were generated or human-written to accomplish various language tasks with various approaches. The included language tasks and topics are summarized below.

Note: For each language task, this dataset uses 3 different prompts to elicit ChatGPT outputs.

The example code to train binary classification models is in [this website](https://github.com/hannxu123/hc_var). A technical report on some representative detection methods can be found in [this paper](https://arxiv.org/abs/2310.01307).

This dataset was collected by Han Xu from Michigan State University. Potential issues and suggestions are welcome and can be discussed in the community panel or sent by email to xuhan1@msu.edu.

## Key variables in the dataset:

**text**: The text body (either human or ChatGPT text).\
**domain**: The language tasks included in this dataset: News, Review, (Essay) Writing, QA\
**topic**: The topic in each task.\
**prompt**: The prompt used to obtain ChatGPT outputs. "N/A" for human texts.\
**pp_id**: Each task has 3 prompts used to elicit ChatGPT outputs. The "pp_id" denotes the index of the prompt. "0" for human texts. "1-3" for ChatGPT texts.\
**label**: "0" for human texts. "1" for ChatGPT texts.

## To cite this dataset

```
@misc{xu2023generalization,
  title={On the Generalization of Training-based ChatGPT Detection Methods},
  author={Han Xu and Jie Ren and Pengfei He and Shenglai Zeng and Yingqian Cui and Amy Liu and Hui Liu and Jiliang Tang},
  year={2023},
  eprint={2310.01307},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```
Provider:
hannxu
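Summary of original information

Dataset Card for HC-Var (Human and ChatGPT Texts with Variety)

Overview

The HC-Var dataset is a collection of human texts and ChatGPT (GPT3.5-Turbo) generated texts, intended to facilitate studies such as generated-text detection. It includes generated and human-written texts produced to accomplish various language tasks, covering multiple tasks and topics.

Language tasks and topics

The dataset covers the following language tasks and topics:

  • Tasks: News, Review, (Essay) Writing, QA
  • Topics: the specific topics within each task

Key variables

  • text: The text body (either human or ChatGPT text)
  • domain: The language tasks included in the dataset: News, Review, (Essay) Writing, QA
  • topic: The topic in each task
  • prompt: The prompt used to obtain ChatGPT outputs. "N/A" for human texts
  • pp_id: Each task has 3 prompts used to elicit ChatGPT outputs. "pp_id" denotes the index of the prompt. "0" for human texts, "1-3" for ChatGPT texts
  • label: "0" for human texts, "1" for ChatGPT texts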
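
The relationship between label, pp_id and prompt described above can be sketched as a small consistency check. This is a minimal illustration, not the authors' code: the `Record` class, `is_consistent`, `split_by_label` and the toy rows are hypothetical, and it assumes `pp_id` and `label` are stored as integers (the card writes them as quoted values, so the actual storage type may differ). It also reads the card as implying that ChatGPT rows carry a real prompt rather than "N/A".

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Record:
    """One row, using the six field names listed in the dataset card."""
    text: str
    domain: str
    topic: str
    prompt: str
    pp_id: int   # assumption: integer; the card quotes it as "0" / "1-3"
    label: int   # assumption: integer; the card quotes it as "0" / "1"


def is_consistent(r: Record) -> bool:
    """Check that label, pp_id and prompt agree with the card's description."""
    if r.label == 0:  # human text: no prompt, prompt index 0
        return r.pp_id == 0 and r.prompt == "N/A"
    if r.label == 1:  # ChatGPT text: one of the three prompts
        return r.pp_id in (1, 2, 3) and r.prompt != "N/A"
    return False      # any other label value is not described in the card


def split_by_label(records: List[Record]) -> Tuple[List[Record], List[Record]]:
    """Return (human_rows, chatgpt_rows)."""
    human_rows = [r for r in records if r.label == 0]
    chatgpt_rows = [r for r in records if r.label == 1]
    return human_rows, chatgpt_rows


# Made-up rows for illustration only; they are not taken from HC-Var.
toy_records = [
    Record("A human-written review.", "Review", "movies", "N/A", 0, 0),
    Record("A generated review.", "Review", "movies", "Write a movie review.", 2, 1),
    Record("A mislabeled row.", "News", "sports", "N/A", 3, 0),
]

human_rows, chatgpt_rows = split_by_label(toy_records)
print(len(human_rows), len(chatgpt_rows))
print([is_consistent(r) for r in toy_records])
```

A check like this may be useful before training a binary classifier on `label`, since rows that violate the stated convention would suggest a loading or parsing problem.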

