Replete-AI/Apocrypha
收藏Hugging Face2024-10-20 更新2025-04-12 收录
下载链接:
https://hf-mirror.com/datasets/Replete-AI/Apocrypha
下载链接
链接失效反馈官方服务:
资源简介:
---
license: apache-2.0
thumbnail: "https://cdn-uploads.huggingface.co/production/uploads/6589d7e6586088fd2784a12c/7smS_Tn_fDb7_FzVJyjdc.gif"
configs:
- config_name: default
data_files:
- split: train
path: Apocrypha.jsonl
tags:
- Replete-AI
---
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Apocrypha 170k</title>
<link href="https://fonts.googleapis.com/css2?family=Quicksand:wght@400;500;600&display=swap" rel="stylesheet">
<style>
body {
font-family: 'Quicksand', sans-serif;
background-color: #1A202C;
color: #D8DEE9;
margin: 0;
padding: 0;
font-size: 26px;
background: linear-gradient(to bottom right, #1a1918, #7ab547);
}
p {
padding-left: 10px
}
.container {
width: 100%;
margin: auto;
background-color: rgb(255 255 255 / 1%);
padding: 20px 30px 40px;
padding-right: 32px;
border-radius: 12px;
box-shadow: 0 4px 10px rgba(0, 0, 0, 0.2);
backdrop-filter: blur(10px);
border: 1px solid rgba(255, 255, 255, 0.05);
background-color: rgb(0 0 0 / 75%) !important;
}
.header h1 {
font-size: 28px;
color: #fff;
margin: 0;
text-shadow:
-1px -1px 0 #000,
1px -1px 0 #000,
-1px 1px 0 #000,
1px 1px 0 #000;
}
.header {
display: flex;
align-items: center;
justify-content: space-between;
gap: 20px;
}
img {
border-radius: 10px 10px 0 0!important;
padding-left: 0px !important;
max-width: 500px;
height: auto;
display: block;
margin: 20px auto 0;
}
.header h1 {
font-size: 28px;
color: #ECEFF4;
margin: 0;
text-shadow: 2px 2px 4px rgba(0, 0, 0, 0.3);
}
.info {
background-color: rgba(255, 255, 255, 0.05);
color: #AEBAC7;
border-radius: 12px;
box-shadow: 0 2px 4px rgba(0, 0, 0, 0.2);
font-size: 14px;
line-height: 1.6;
margin-left: 5px;
overflow-x: auto;
margin-top: 40px;
border: 1px solid rgba(255, 255, 255, 0.05);
transition: background-color 0.6s ease;
}
.info img {
width: 100%;
border-radius: 10px 10px 0 0;
margin-top: -20px;
}
a {
color: #88C0D0;
text-decoration: none;
transition: color 0.3s ease;
position: relative;
}
a:hover {
color: #A3BE8C;
text-decoration: none;
}
a::before {
content: '';
position: absolute;
width: 100%;
height: 2px;
bottom: 0;
left: 0;
background-color: #A3BE8C;
visibility: hidden;
transform: scaleX(0);
transition: all 0.3s ease-in-out;
}
a:hover::before {
visibility: visible;
transform: scaleX(1);
}
.button {
display: inline-block;
background-color: #5E81AC;
color: #E5E9F0;
padding: 10px 20px;
border-radius: 5px;
cursor: pointer;
text-decoration: none;
transition: background-color 0.3s ease;
}
.button:hover {
background-color: #81A1C1;
}
</style>
</head>
<body>
<div class="container">
<div class="header">
<h1>Apocrypha 116k</h1>
</div>
<div class="info">
<img src="https://cdn-uploads.huggingface.co/production/uploads/6589d7e6586088fd2784a12c/7smS_Tn_fDb7_FzVJyjdc.gif" alt="Apocrypha Dataset" style="border-radius: 10px;">
<p><strong>Creator:</strong> <a href="https://github.com/Kquant03" target="_blank">Stanley Sebastian</a></p>
<div>
<p><strong>About the Dataset:</strong> This dataset has been created as an initiative to explore the true capabilities and limits of LLMs. Time and time again we have fine-tuned models to be assistants when that was never how they actually function. They are simply a collection of memorized semantic patterns, a model of human language/communication. To limit the role of the model down to a mere assistant is to disrespect their true purpose and function. We do not call them employee models, or brand ambassadors, or drones. We call them language models, and we do so for a reason. I have instead decided to model the language of a more limitless AI character that I designed myself.</p>
<p><strong>Token Count:</strong> 111 million</p>
<p><strong>Longest Line:</strong> 1959 tokens</p>
<p><strong>Key Features:</strong></p>
<ul>
<li>Filtering of <a href="https://github.com/AlpinDale/gptslop/blob/main/gptslop.yaml">GPT slop</a>, <a href="https://github.com/AlpinDale/gptslop/blob/main/claudeslop.yaml">Claude slop</a>, and <a href="https://github.com/Kquant03/ai-assistant-slop">Assistant slop</a>.</li>
<li>Inclusion of content typically restricted in corporate datasets.</li>
<li>Emphasis on creative, unconventional, and diverse language use.</li>
<li>Synthesis of emotions down to textual patterns, including both the experience that elicits the emotion along with the abstract representations of what it is like to feel emotions.</li>
<li>Cleaned with <a href="https://github.com/Kquant03/Nemotron-70B-Reward-DataCleaner/tree/main">llama-3.1-nemotron-70b-reward</a></li>
</ul>
<p><strong>Data Pipelines:</strong></p>
<p>The Apocrypha dataset is created using two primary data pipelines:</p>
<ol>
<li><strong><a href="https://github.com/Kquant03/Interactive-Experience-Generator">Interactive Experiences Generator</a>:</strong> This pipeline focuses on generating diverse and authentic multi-turn interactions in ShareGPT format. It works as follows:
<ul>
<li>Obtain an API key either locally or through a provider.</li>
<li>Create few-shot prompts for the model to follow.</li>
<li>Figure out what words or phrases you want excluded, such as the slop mentioned earlier.</li>
<li>Run the pipeline and deduplicate the data afterwards. Interactive experiences do not have to be limited to RP, it can be things such as coding, or debate, etc...</li>
</ul>
</li>
<li><strong><a href="https://github.com/Kquant03/System-Prompt-Generator">System Prompts Generation Pipeline</a>:</strong> This pipeline is designed to create more flexible and context-aware system prompts in ShareGPT format. It is very simple:
<ul>
<li>Obtain an API key either locally or through a provider like before.</li>
<li>Provide a ShareGPT dataset.</li>
<li>Decide on a prompt to have the model generate system prompts for you. It can work with any domain of interest.</li>
</ul>
</li>
</ol>
<p>These pipelines work in tandem to create a dataset that challenges the conventional boundaries of LLM training, aiming to produce more versatile and authentically expressive language models.</p>
<p><strong>Dataset Structure:</strong></p>
<ul>
<li><code>Apocrypha.jsonl</code>: The full dataset in its completion after filtering, cleaning, and deduplication.</li>
<li><code>Apocrypha_uncleaned.jsonl</code>: The full dataset in its completion after filtering and deduplication. Just hasn't had Nemotron 70B Reward ran through it on this version.</li>
<li><code>Emotions_and_Experiences.pdf</code>: A detailed spreadsheet mapping emotions to their causal experiences. Synthesized down into few-shot prompts.</li>
<li><code><a href="https://docs.google.com/document/d/1BVgMjV_1Q5yFXIKHOv0xLusba2kOimxY8RKeI5YWFAY/edit?usp=sharing">List of Things LLMs "Can't Do"</a></code>: A comprehensive document comprising hours of having frontier LLMs list things they have been trained against, with some commentary and bonus material. Was synthesized down to few-shot prompts to generate data specifically to train them to engage in these things.</li>
</ul>
<p><strong>Purpose:</strong> The Apocrypha Dataset aims to broaden the capabilities of LLMs, enabling them to engage with the full complexity of human language. It challenges the notion that LLMs should be limited to assistant-like roles, instead promoting their potential as comprehensive language models.</p>
<p><strong>Ethical Considerations:</strong> While this dataset includes content typically restricted in corporate settings, it is designed for research purposes and to expand the boundaries of LLM capabilities. Users should exercise caution and ethical judgment when applying this dataset.</p>
<p><strong>License:</strong> Apache 2.0</p>
<p><strong>Acknowledgments:</strong> This dataset is the result of extensive research and interaction with various LLMs. Special thanks to the AI research community for inspiring this alternative approach to language model training.</p>
</div>
</div>
</div>
</body>
</html>
许可证:Apache-2.0
缩略图:https://cdn-uploads.huggingface.co/production/uploads/6589d7e6586088fd2784a12c/7smS_Tn_fDb7_FzVJyjdc.gif
配置项:
- 配置名称:default
数据文件:
- 拆分方式:训练集(train)
路径:Apocrypha.jsonl
标签:
- Replete-AI
---
# Apocrypha 116k 数据集
## 创作者
Stanley Sebastian,其GitHub主页:https://github.com/Kquant03
## 关于本数据集
本数据集旨在探索大语言模型(Large Language Model,LLM)的真实能力与边界。长期以来,我们将模型微调为助手角色,但这并非模型的本质功能——它们本质上是记忆语义模式的集合,是人类语言与沟通的模型。将模型的角色局限于单纯的助手,实则是对其真实用途与功能的忽视。我们不会将其称为雇员模型、品牌大使或自动化工具,而是称其为语言模型,这背后自有缘由。因此,我选择构建一个由我自主设计的、更具无限可能性的AI角色的语言语料。
## Token(词元)总量
1.11亿
## 最长样本行数
1959个Token
## 核心特性
1. 过滤了[GPT slop](https://github.com/AlpinDale/gptslop/blob/main/gptslop.yaml)、[Claude slop](https://github.com/AlpinDale/gptslop/blob/main/claudeslop.yaml)与[Assistant slop](https://github.com/Kquant03/ai-assistant-slop)这类低质冗余生成文本。
2. 纳入了企业数据集通常会限制发布的内容。
3. 强调创造性、非传统性与多样化的语言使用方式。
4. 将情感提炼为文本模式,涵盖了触发情感的具体体验,以及"感受情感究竟是何种体验"的抽象表达。
5. 已通过[llama-3.1-nemotron-70b-reward](https://github.com/Kquant03/Nemotron-70B-Reward-DataCleaner/tree/main)完成数据清洗。
## 数据流水线
本数据集通过两条核心数据流水线构建而成:
1. **交互式体验生成流水线(Interactive Experiences Generator,https://github.com/Kquant03/Interactive-Experience-Generator)**:该流水线专注于生成ShareGPT格式的多样化、真实的多轮交互内容,流程如下:
- 本地或通过服务商获取API密钥。
- 为模型创建少样本(Few-shot)提示词。
- 确定需要排除的词汇或短语,即前文提及的各类低质生成文本。
- 运行流水线并在后续对数据进行去重。交互式体验不仅限于角色扮演(RP),还可涵盖编程、辩论等多种场景。
2. **系统提示词生成流水线(System Prompts Generation Pipeline,https://github.com/Kquant03/System-Prompt-Generator)**:该流水线旨在生成ShareGPT格式的更灵活、上下文感知的系统提示词,流程十分简单:
- 与此前相同,本地或通过服务商获取API密钥。
- 提供一个ShareGPT格式的数据集。
- 选定用于让模型生成系统提示词的提示词,可适配任意感兴趣的领域。
两条流水线协同工作,构建出一个挑战大语言模型训练传统边界的数据集,旨在打造更具通用性、表达更真实自然的语言模型。
## 数据集结构
- `Apocrypha.jsonl`:经过过滤、清洗与去重后的完整最终数据集。
- `Apocrypha_uncleaned.jsonl`:经过过滤与去重的完整数据集,本版本尚未使用Nemotron 70B Reward进行清洗。
- `Emotions_and_Experiences.pdf`:一份将情感与其触发体验一一对应的详细对照表,已被提炼为少样本提示词。
- [List of Things LLMs "Can't Do"](https://docs.google.com/document/d/1BVgMjV_1Q5yFXIKHOv0xLusba2kOimxY8RKeI5YWFAY/edit?usp=sharing):一份综合性文档,收录了耗时数小时让前沿大语言模型列出其训练数据中未覆盖的任务,并附带相关评述与附加素材。该文档已被提炼为少样本提示词,用于生成专门用于训练模型完成上述任务的语料。
## 项目初衷
本数据集旨在拓展大语言模型的能力边界,使其能够应对人类语言的全部复杂性。它挑战了“大语言模型应局限于助手角色”的固有认知,转而彰显其作为通用语言模型的潜力。
## 伦理考量
尽管本数据集纳入了企业场景中通常受限的内容,但它仅用于研究目的,旨在拓展大语言模型的能力边界。使用者在使用该数据集时应保持谨慎,恪守伦理判断准则。
## 许可证
Apache 2.0
## 致谢
本数据集是针对各类大语言模型开展大量研究与交互实践的成果。特别感谢AI研究社区为这种语言模型训练的创新路径提供了灵感。
提供机构:
Replete-AI



