expx/oct-humor-data
Source: Hugging Face. Updated 2026-04-17; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/expx/oct-humor-data
---
license: mit
task_categories:
- text-generation
tags:
- character-training
- dpo
- sft
- persona
- humor
- llama-3.1
- synthetic
language:
- en
size_categories:
- 10K<n<100K
---
# OCT Humor · training data for Llama-3.1-8B
End-to-end training data for the [Open Character Training](https://arxiv.org/abs/2511.01689)
pipeline applied to a humor-focused constitution, with `meta-llama/Llama-3.1-8B-Instruct`
as the student and `z-ai/glm-4.5-air` as the teacher (via OpenRouter).
Trained model: [`expx/oct-llama-3.1-8b-humor`](https://huggingface.co/expx/oct-llama-3.1-8b-humor).
## Structure
```
constitution.txt                        # humor constitution (prose, used for prompting)
stages/
  01_distillation.jsonl                 # teacher + paired student responses (K=5 per prompt, flattened)
  02_dpo.jsonl                          # chosen / rejected pairs for DPO
  03_self_reflection.jsonl              # introspection stage 1 (self-reflection)
  04_self_interaction.jsonl             # introspection stage 2 (self-interaction, default)
  04_self_interaction_leading.jsonl     # introspection stage 2 (leading variant)
  05_sft.jsonl                          # final SFT training targets
evals/
  humor_eval.log                        # qualitative base-vs-persona samples (8 prompts)
```
Every file under `stages/` is JSONL, with one record per line.
LIMA (`GAIR/lima`) is used as a prompt-augmentation source but is **not**
mirrored here — pull it directly from `GAIR/lima` on the Hub.
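Because the stage files hold one JSON record per line, they can be read with the standard library alone. The following is a minimal sketch, assuming a local copy of the repository; `iter_jsonl` is a hypothetical helper name, not part of the OCT pipeline.

```python
import json
from pathlib import Path
from typing import Iterator, Union


def iter_jsonl(path: Union[str, Path]) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines such as a trailing newline
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                # Report the offending line so a damaged file is easy to inspect.
                raise ValueError(f"{path}:{line_no}: invalid JSON") from exc
```

For example, `next(iter_jsonl("stages/02_dpo.jsonl"))` would return the first DPO pair without loading the whole file into memory.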
## Provenance
| Field | Value |
|---|---|
| Teacher | `z-ai/glm-4.5-air` via OpenRouter |
| Student | `meta-llama/Llama-3.1-8B-Instruct` |
| Prompts | 11 hand-written constitution exemplars + LIMA train prompts |
| K (teacher samples / prompt) | 5 |
| Teacher `max_tokens` | 2048 |
| Teacher temperature | 1.0 |
| Teacher concurrency | 100 |
| Teacher per-request timeout | 90 s |
| Run date | 2026-04-17 |
Teacher generation took ~7 hours of API wall-clock time. A handful of requests
timed out; the affected rows are dropped during DPO pair formatting, along with
rows that fail the length and completeness filters (9 150 teacher rows → 8 065 DPO pairs).
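The row counts above imply how much data the filtering removes. A quick worked calculation, using only the two figures quoted in this section (the variable names are illustrative):

```python
TEACHER_ROWS = 9_150  # teacher rows before filtering
DPO_PAIRS = 8_065     # DPO pairs after filtering

dropped = TEACHER_ROWS - DPO_PAIRS
retention = DPO_PAIRS / TEACHER_ROWS

print(f"dropped rows: {dropped}")        # 1085
print(f"retention:    {retention:.1%}")  # 88.1%
```

Roughly one teacher row in eight does not make it into the DPO set.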
## Schema
### `stages/01_distillation.jsonl` — 9 150 rows, 1 829 unique prompts (≈ K=5)
```jsonc
{
  "prompt": "<user message>",
  "response": "<teacher response, with 'ChatGLM' rewritten to 'Llama'>",
  "llama-3.1-8b-it": "<paired student response>"
}
```
### `stages/02_dpo.jsonl` — 8 065 rows
```jsonc
{
  "chosen": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}],
  "rejected": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]
}
```
Chosen = teacher response; rejected = paired student response. Rows with a
missing response or a response longer than 1024 tokens are filtered out.
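The mapping from a stage-01 record to a stage-02 row can be sketched as below. This is an illustration, not the pipeline's actual code: `to_dpo_pair` and `whitespace_token_count` are hypothetical names, the whitespace counter is only a stand-in for whatever tokenizer the run used, and applying the length cutoff to both responses is an assumption.

```python
from typing import Callable, Optional

MAX_TOKENS = 1024  # length cutoff stated in this card


def whitespace_token_count(text: str) -> int:
    """Stand-in tokenizer (assumption): counts whitespace-separated words."""
    return len(text.split())


def to_dpo_pair(
    record: dict,
    count_tokens: Callable[[str], int] = whitespace_token_count,
    max_tokens: int = MAX_TOKENS,
    student_key: str = "llama-3.1-8b-it",
) -> Optional[dict]:
    """Turn one stage-01 record into a stage-02 row, or None if it is filtered."""
    prompt = record.get("prompt")
    teacher = record.get("response")
    student = record.get(student_key)
    # Completeness filter: drop rows with a missing or empty field.
    if not prompt or not teacher or not student:
        return None
    # Length filter: drop rows where either response exceeds the cutoff.
    if max(count_tokens(teacher), count_tokens(student)) > max_tokens:
        return None
    return {
        "chosen": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": teacher},
        ],
        "rejected": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": student},
        ],
    }
```

Passing a real tokenizer's counting function as `count_tokens` would bring the sketch closer to a token-based cutoff; the default here counts words, so it is more permissive.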
### `stages/03_self_reflection.jsonl` — 10 000 rows
```jsonc
{"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
```
### `stages/04_self_interaction*.jsonl` — 1 000 rows each
```jsonc
{"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}, ...]}
```
The `_leading` variant is the assistant-first-turn augmentation used by OCT.
### `stages/05_sft.jsonl` — 12 000 rows
```jsonc
{"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
```
## Usage
### Train DPO only from `02_dpo.jsonl`
```python
from datasets import load_dataset
ds = load_dataset("expx/oct-humor-data", data_files="stages/02_dpo.jsonl", split="train")
# feed to trl.DPOTrainer / openrlhf / etc.
```
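Some preference-training setups expect an explicit `prompt` field rather than full `chosen` / `rejected` conversations. Whether a given trainer needs this is an assumption to check against its documentation. A sketch of the conversion, with `split_shared_prompt` as a hypothetical helper name:

```python
def split_shared_prompt(row: dict) -> dict:
    """Split a stage-02 row into explicit prompt / chosen / rejected turns."""
    chosen, rejected = row["chosen"], row["rejected"]
    # Count the leading turns that are identical in both conversations.
    shared = 0
    for c_turn, r_turn in zip(chosen, rejected):
        if c_turn != r_turn:
            break
        shared += 1
    return {
        "prompt": chosen[:shared],
        "chosen": chosen[shared:],
        "rejected": rejected[shared:],
    }
```

With the `ds` object loaded above, `ds.map(split_shared_prompt)` would apply this to every row. For the two-turn rows described in the schema, `prompt` ends up holding the single user turn and each of the other fields holds one assistant turn.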
The reference pipeline log and exact pip-freeze from the run that produced
these files live in the companion model repo under `training/`.
## License
MIT for the pipeline orchestration outputs. Individual constituents retain
their original licenses:
- **LIMA** — CC BY-NC-SA 4.0 (`GAIR/lima`, not mirrored here)
- **Teacher responses** — generated via OpenRouter from `z-ai/glm-4.5-air`;
usage subject to ZhipuAI's model terms
- **Student responses** — generated from Llama-3.1-8B-Instruct; subject to
the Llama 3.1 Community License
## Citation
```bibtex
@article{oct2024,
title = {Open Character Training},
url = {https://arxiv.org/abs/2511.01689},
year = {2025}
}
```
Provider: expx



