aadityaubhat/GPT-wiki-intro
Source: Hugging Face. Updated 2023-10-03; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/aadityaubhat/GPT-wiki-intro
Resource description:
---
license: cc
task_categories:
- text-classification
- zero-shot-classification
- text-generation
pretty_name: GPT Wiki Intro
size_categories:
- 100K<n<1M
language:
- en
---
# GPT Wiki Intro
## Overview
Dataset for training models to classify human-written vs. GPT/ChatGPT-generated text.
This dataset contains Wikipedia introductions and GPT (Curie) generated introductions for 150k topics.
Prompt used for generating text:
```
200 word wikipedia style introduction on '{title}'
{starter_text}
```
where `title` is the title of the Wikipedia page, and `starter_text` is the first seven words of the Wikipedia introduction.
Here's an example of the prompt used to generate the introduction paragraph for 'Secretory protein':
>'200 word wikipedia style introduction on Secretory protein
>
> A secretory protein is any protein, whether'
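The prompt construction described above can be sketched in a few lines of Python. This is an illustrative sketch, not the author's code: the function name `build_prompt` is hypothetical, "words" is assumed to mean whitespace-separated tokens, and a single newline between the two template lines is assumed. The sketch follows the template, which quotes the title, whereas the example prompt above shows the title unquoted.

```python
def build_prompt(title: str, wiki_intro: str, n_starter_words: int = 7) -> str:
    """Fill the prompt template with a title and the first words of the intro."""
    # Assumption: "words" are whitespace-separated tokens.
    starter_text = " ".join(wiki_intro.split()[:n_starter_words])
    # Assumption: the two template lines are joined by a single newline.
    return f"200 word wikipedia style introduction on '{title}'\n{starter_text}"


# The text after the seventh word is a placeholder, not the real Wikipedia intro.
prompt = build_prompt(
    "Secretory protein",
    "A secretory protein is any protein, whether PLACEHOLDER REST OF INTRO",
)
print(prompt)
```

Only the first seven words of the introduction reach the model, so the generated text starts from the same opening as the human-written text.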
Configuration used for the GPT model:
```
model="text-curie-001",
prompt=prompt,
temperature=0.7,
max_tokens=300,
top_p=1,
frequency_penalty=0.4,
presence_penalty=0.1
```
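For readers who want to reuse these settings, the sketch below collects them into a dictionary of keyword arguments. The helper name `completion_params` is hypothetical, and the card does not show the actual API call the author made, so no client call is included here.

```python
def completion_params(prompt: str) -> dict:
    """Return the sampling settings listed on the card as keyword arguments."""
    return {
        "model": "text-curie-001",
        "prompt": prompt,
        "temperature": 0.7,
        "max_tokens": 300,
        "top_p": 1,
        "frequency_penalty": 0.4,
        "presence_penalty": 0.1,
    }


params = completion_params(
    "200 word wikipedia style introduction on 'Secretory protein'\n"
    "A secretory protein is any protein, whether"
)
# Assumption: these keyword arguments would be unpacked into a completions
# client of your choice; how the author invoked the API is not shown on the card.
print(sorted(params))
```

Keeping the settings in one place makes it easier to compare them against a different model or sampling configuration when generating comparable text.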
Schema for the dataset:
|Column |Datatype|Description |
|---------------------|--------|-------------------------------------------|
|id |int64 |ID |
|url |string |Wikipedia URL |
|title |string |Title |
|wiki_intro |string |Introduction paragraph from wikipedia |
|generated_intro |string |Introduction generated by GPT (Curie) model|
|title_len |int64 |Number of words in title |
|wiki_intro_len |int64 |Number of words in wiki_intro |
|generated_intro_len |int64 |Number of words in generated_intro |
|prompt |string |Prompt used to generate intro |
|generated_text |string |Text continued after the prompt |
|prompt_tokens |int64 |Number of tokens in the prompt |
|generated_text_tokens|int64 |Number of tokens in generated text |
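
Since each row pairs a human-written introduction with a generated one, a row can be unrolled into two labeled examples for a classifier. The following is a minimal sketch under stated assumptions: the label values (0 for human, 1 for generated), the helper names, and the `"train"` split name are not specified by the card and are chosen here for illustration.

```python
from typing import Dict, Iterable, Iterator, Tuple

HUMAN, GENERATED = 0, 1  # hypothetical label convention, not defined by the card


def to_labeled_examples(rows: Iterable[Dict]) -> Iterator[Tuple[str, int]]:
    """Yield (text, label) pairs: one human and one generated intro per row."""
    for row in rows:
        yield row["wiki_intro"], HUMAN
        yield row["generated_intro"], GENERATED


def load_rows(split: str = "train"):
    """Load the dataset from the Hugging Face Hub (needs `datasets` and network)."""
    # Imported lazily so the helper above works without `datasets` installed.
    from datasets import load_dataset

    # Assumption: the data is exposed under a split named "train".
    return load_dataset("aadityaubhat/GPT-wiki-intro", split=split)


# In-memory placeholder row so the sketch runs offline.
sample_rows = [
    {"wiki_intro": "placeholder human text",
     "generated_intro": "placeholder generated text"},
]
examples = list(to_labeled_examples(sample_rows))
print(examples)
```

Calling `to_labeled_examples(load_rows())` would apply the same unrolling to the full dataset, producing a class-balanced set because every topic contributes one example per label.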
## Credits
* [Wikipedia dataset](https://huggingface.co/datasets/wikipedia#licensing-information)
## Code
The code used to create this dataset can be found on [GitHub](https://github.com/aadityaubhat/wiki_gpt).
## Citation
```
@misc {aaditya_bhat_2023,
author = { {Aaditya Bhat} },
title = { GPT-wiki-intro (Revision 0e458f5) },
year = 2023,
url = { https://huggingface.co/datasets/aadityaubhat/GPT-wiki-intro },
doi = { 10.57967/hf/0326 },
publisher = { Hugging Face }
}
```
Provider:
aadityaubhat
Summary of original information
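GPT Wiki Intro dataset overview
Dataset description
- Purpose: training models to distinguish human-written text from GPT/ChatGPT-generated text.
- Content: Wikipedia introductions and GPT (Curie) generated introductions for 150k topics.
- Prompt used for generating text:
  200 word wikipedia style introduction on {title} {starter_text}
  where title is the title of the Wikipedia page and starter_text is the first seven words of the Wikipedia introduction.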
Dataset characteristics
- Language: English
- Size: 100K<n<1M
- Task categories:
  - Text classification
  - Zero-shot classification
  - Text generation
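Dataset structure
| Column | Datatype | Description |
|---|---|---|
| id | int64 | ID |
| url | string | Wikipedia URL |
| title | string | Title |
| wiki_intro | string | Introduction paragraph from Wikipedia |
| generated_intro | string | Introduction generated by the GPT (Curie) model |
| title_len | int64 | Number of words in title |
| wiki_intro_len | int64 | Number of words in wiki_intro |
| generated_intro_len | int64 | Number of words in generated_intro |
| prompt | string | Prompt used to generate the intro |
| generated_text | string | Text continued after the prompt |
| prompt_tokens | int64 | Number of tokens in the prompt |
| generated_text_tokens | int64 | Number of tokens in the generated text |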
License
- Type: CC



