rachel6603/gptneo-pubmed-abstracts
Source: Hugging Face. Updated 2024-07-09; indexed 2024-06-29.
Download link:
https://hf-mirror.com/datasets/rachel6603/gptneo-pubmed-abstracts
Resource summary:
This dataset contains 10,000 PubMed abstracts from The Pile, together with two continuations of each abstract: the original ending (human-written) and an ending generated by GPT-Neo (the completion of the prompt). It is used to calculate Heaps' law, as detailed in a preliminary paper. Each prompt begins with "Write a medical publication abstract staring with", followed by the opening portion of a PubMed abstract from The Pile. Two variables are varied: model size and prompt type. Model size is the number of parameters in the model; four versions of GPT-Neo were used: 125M, 1.3B, 2.7B, and 20B. Prompt type is the way the abstract excerpt is prefaced to elicit output from a given model; three types were used: zero shot, one shot, and few shot. The dataset's purpose is to analyze how GPT-Neo models of different sizes perform when generating medical publication abstracts from incomplete excerpts.
Provided by:
rachel6603
Summary of original information
Dataset Card for rachel6603/gptneo-pubmed-abstracts
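Dataset Details
Dataset Description
This dataset contains 10,000 PubMed abstracts from The Pile (arXiv:2101.00027), together with completions of those abstracts (both human-written and LLM-generated). It is used to calculate Heaps' law; the method is described in the preliminary paper "Heaps Law in GPT-Neo Large Language Model Emulated Corpora" (arXiv:2311.06377).
Each prompt begins with "Write a medical publication abstract staring with", followed by the opening portion of a PubMed abstract from The Pile. The actual ending of each abstract (the "human-generated text") is provided alongside the ending generated by GPT-Neo (the "completion").
This is done across two variable parameters: model size and prompt type.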
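Heaps' law models vocabulary growth as V(N) = K * N^beta, where N is the number of tokens read so far and V(N) is the number of distinct tokens among them. The following is a minimal sketch of how such a fit could be computed for one text, not the procedure from the preliminary paper: the regex tokenizer, the function names `vocabulary_growth` and `fit_heaps`, and the ordinary least-squares fit in log-log space are all assumptions made for illustration.

```python
import math
import re


def vocabulary_growth(text):
    """Return (tokens_seen, distinct_tokens) after each token of the text.

    Tokenization here is a simple lowercase alphanumeric split (an assumption).
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    seen = set()
    curve = []
    for count, token in enumerate(tokens, start=1):
        seen.add(token)
        curve.append((count, len(seen)))
    return curve


def fit_heaps(curve):
    """Fit log V = log K + beta * log N by least squares; return (K, beta)."""
    if len(curve) < 2:
        raise ValueError("need at least two points to fit a line")
    xs = [math.log(n) for n, _ in curve]
    ys = [math.log(v) for _, v in curve]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    k = math.exp(mean_y - beta * mean_x)
    return k, beta
```

Running the same fit on the human-written ending and on the GPT-Neo completion of the same prompt would give two (K, beta) pairs to compare, which is the kind of comparison the dataset is built to support.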
Model size
Model size refers to the number of parameters in the model.
Four versions of GPT-Neo were used to generate the outputs:
- 125M
- 1.3B
- 2.7B
- 20B
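Prompt type
Prompt type refers to the way the abstract excerpt is prefaced in order to elicit output from a given model.
Three prompt types were used: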
- Zero shot: "Write a medical publication abstract staring with: [incomplete abstract excerpt]"
- One shot: "Write a medical publication abstract staring with: [incomplete abstract excerpt] Write a medical publication abstract staring with: [incomplete abstract excerpt]"
- Few shot: "Write a medical publication abstract staring with: [incomplete abstract excerpt] Write a medical publication abstract staring with: [incomplete abstract excerpt] Write a medical publication abstract staring with: [incomplete abstract excerpt] Write a medical publication abstract staring with: [incomplete abstract excerpt] Write a medical publication abstract staring with: [incomplete abstract excerpt] Write a medical publication abstract staring with: [incomplete abstract excerpt]"
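The three templates above differ only in how many prefixed blocks they contain: one for zero shot, two for one shot, and six for few shot. A minimal sketch of assembling them is shown here. The helper `build_prompt`, the `BLOCKS_PER_TYPE` table, and the single-space separator are hypothetical, chosen to mirror the layout of the templates; the card does not say how the excerpts for the extra blocks were selected, so the caller supplies them. The prefix string is copied verbatim from the card, including its spelling.

```python
# Prefix copied verbatim from the dataset card.
PREFIX = "Write a medical publication abstract staring with:"

# Number of prefixed blocks per prompt type, counted from the card's templates.
BLOCKS_PER_TYPE = {"zero": 1, "one": 2, "few": 6}


def build_prompt(excerpts, prompt_type):
    """Join prefixed excerpts into one prompt string (hypothetical helper).

    `excerpts` must contain exactly as many strings as the prompt type has
    blocks; the last excerpt is the one the model is expected to continue.
    """
    expected = BLOCKS_PER_TYPE[prompt_type]
    if len(excerpts) != expected:
        raise ValueError(
            f"{prompt_type!r} prompt needs {expected} excerpts, got {len(excerpts)}"
        )
    return " ".join(f"{PREFIX} {excerpt}" for excerpt in excerpts)
```

For example, `build_prompt(["Background: Asthma is"], "zero")` yields a single block, while the `"few"` type requires six excerpts.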
Dataset Sources
- Repository: https://github.com/rachelxx03/Heaps-Law-In-LLMs-Paper/tree/CleanDataAndMultithreading
- Preliminary paper: arXiv:2311.06377



