
gutenberg-dpo-v0.1

ModelScope community. Updated: 2025-12-05. Indexed: 2024-09-28.
Download link:
https://modelscope.cn/datasets/jondurbin/gutenberg-dpo-v0.1
Description:
# Gutenberg DPO

![gutenberg](gutenberg.png)

## Overview

This dataset is meant to enhance the novel-writing capabilities of LLMs by using public domain books from [Project Gutenberg](https://gutenberg.org/).

## Process

First, each book is parsed, split into chapters, and cleaned up from the original format (superfluous newlines, illustration tags, etc. are removed).

Once we have chapters, an LLM is prompted with each chapter to create a synthetic prompt that would result in that chapter being written. A summary is also created for each chapter, so that the prompt for every chapter after the first includes a summary of the previous chapter to provide additional context.

We then use the synthetic prompt, together with the previous chapter's summary, to write the chapter with an LLM (llama-2-13b-chat, bagel-7b-v0.1, dolphin-2.2-34b).

The human-written text, that is, the original chapter, is used as the "chosen" value, and the LLM-written chapter is used as the "rejected" value.

## Books used

These books were chosen mainly because they appeared in the popular section on Project Gutenberg and they work correctly with the chapterize library.

- Huckleberry Finn
- Treasure Island
- Anna Karenina
- Uncle Tom’s Cabin
- Wuthering Heights
- Madame Bovary
- The Turn of the Screw
- The War of the Worlds
- A Study in Scarlet
- Middlemarch
- Pride and Prejudice
- The Brothers Karamazov
- Through the Looking Glass
- Moby Dick
- Frankenstein
- A Tale of Two Cities
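The pairing steps in the Process section can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the dataset author's actual code: the `build_pairs` function, the `LLM` callable type, the instruction wording, and the `prompt` field name are all hypothetical, and `fake_llm` is a stand-in for a real model call. Only the "chosen" and "rejected" names come from the description above.

```python
from typing import Callable, Dict, List

# Hypothetical type: any function that maps a prompt string to generated text.
LLM = Callable[[str], str]


def build_pairs(chapters: List[str], llm: LLM) -> List[Dict[str, str]]:
    """Build preference pairs for the cleaned chapters of ONE book, in order."""
    pairs: List[Dict[str, str]] = []
    prev_summary = ""  # the first chapter has no previous-chapter summary

    for chapter in chapters:
        # 1. Ask the LLM for a synthetic prompt that would yield this chapter.
        #    (The instruction wording here is an assumption.)
        instruction = llm(
            "Write a prompt that would result in this chapter:\n" + chapter
        )

        # 2. For chapters after the first, add the previous chapter's summary.
        if prev_summary:
            prompt = (
                "Summary of previous chapter: " + prev_summary
                + "\n\n" + instruction
            )
        else:
            prompt = instruction

        # 3. The LLM-written chapter becomes the "rejected" value;
        #    the original human-written chapter is the "chosen" value.
        rejected = llm(prompt)
        pairs.append({"prompt": prompt, "chosen": chapter, "rejected": rejected})

        # 4. Summarize this chapter for use in the next chapter's prompt.
        prev_summary = llm("Summarize this chapter:\n" + chapter)

    return pairs


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call, used only to make the sketch runnable."""
    return "[generated for: " + prompt[:20] + "]"


if __name__ == "__main__":
    demo = build_pairs(["Chapter one text.", "Chapter two text."], fake_llm)
    for row in demo:
        print(row["prompt"][:40], "|", row["chosen"])
```

In a real pipeline, `fake_llm` would be replaced by calls to the models named above, and the chapter list would come from the parsing and cleaning step.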

Provider:
maas
Created:
2025-08-29