jondurbin/contextual-dpo-v0.1

Name: jondurbin/contextual-dpo-v0.1
Creator: jondurbin
Published: 2024-01-11 10:15:52
License: cc-by-4.0 (as declared in the dataset card front matter)

Source: Hugging Face. Updated 2024-01-11; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/jondurbin/contextual-dpo-v0.1

Description:

---
license: cc-by-4.0
---

# Contextual DPO

![context obedient graphic](context-obedient.png)

## Overview

This is a dataset meant to enhance adherence to provided context (e.g., for RAG applications) and reduce hallucinations, specifically using the airoboros context-obedient question answer format.

The chosen values were generated with [airoboros](https://github.com/jondurbin/airoboros) using only the `contextual` and `counterfactual_contextual` instructors.

The rejected values were generated using [mpt-30b-instruct](https://huggingface.co/mosaicml/mpt-30b-instruct).

### Dataset format

The format for a contextual prompt is as follows:

```
BEGININPUT
BEGINCONTEXT
[key0: value0]
[key1: value1]
... other metadata ...
ENDCONTEXT
[insert your text blocks here]
ENDINPUT
[add as many other blocks, in the exact same format]
BEGININSTRUCTION
[insert your instruction(s). The model was tuned with single questions, paragraph format, lists, etc.]
ENDINSTRUCTION
```

I know it's a bit verbose and annoying, but after much trial and error I found that these explicit delimiters help the model understand where to find the responses and how to associate specific sources with them.

- `BEGININPUT` - denotes a new input block
- `BEGINCONTEXT` - denotes the block of context (metadata key/value pairs) to associate with the current input block
- `ENDCONTEXT` - denotes the end of the metadata block for the current input
- [text] - Insert whatever text you want for the input block, as many paragraphs as can fit in the context.
- `ENDINPUT` - denotes the end of the current input block
- [repeat as many input blocks in this format as you want]
- `BEGININSTRUCTION` - denotes the start of the instruction (or list of instructions) to respond to, covering all of the input blocks above
- [instruction(s)]
- `ENDINSTRUCTION` - denotes the end of the instruction set

Here's a trivial but important example to prove the point:

```
BEGININPUT
BEGINCONTEXT
date: 2021-01-01
url: https://web.site/123
ENDCONTEXT
In a shocking turn of events, blueberries are now green, but will be sticking with the same name.
ENDINPUT
BEGININSTRUCTION
What color are blueberries? Source?
ENDINSTRUCTION
```

And the expected response:

```
Blueberries are now green.
Source:
date: 2021-01-01
url: https://web.site/123
```

### References in response

As shown in the example, the dataset includes many examples in which the response includes source details when the question asks for a source, citation, or references.

Why do this? Well, the R in RAG seems to be the weakest link in the chain. Retrieval accuracy, depending on many factors including the overall dataset size, can be quite low. This accuracy increases when retrieving more documents, but then you have the issue of actually using the retrieved documents in prompts. If you use one prompt per document (or document chunk), you know exactly which document the answer came from, so there's no issue. If, however, you include multiple chunks in a single prompt, it's useful to include the specific reference chunk(s) used to generate the response, rather than naively including references to all of the chunks included in the prompt.

For example, suppose I have two documents:

```
url: http://foo.bar/1
Strawberries are tasty.

url: http://bar.foo/2
The cat is blue.
```

If the question being asked is `What color is the cat?`, I would only expect the 2nd document to be referenced in the response, as the other link is irrelevant.

### Contribute

If you're interested in new functionality/datasets, take a look at the [bagel repo](https://github.com/jondurbin/bagel) and [airoboros](https://github.com/jondurbin/airoboros) and either make a PR or open an issue with details.

To help me with the fine-tuning costs, dataset generation, etc., please use one of the following:

- https://bmc.link/jondurbin
- ETH 0xce914eAFC2fe52FdceE59565Dd92c06f776fcb11
- BTC bc1qdwuth4vlg8x37ggntlxu5cjfwgmdy5zaa7pswf

Provider:

jondurbin

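Summary of original information

Contextual DPO

Overview

This dataset is meant to enhance adherence to provided context (e.g., for RAG applications) and to reduce hallucinations, specifically using the airoboros context-obedient question answer format.

The chosen values were generated with airoboros, using only the contextual and counterfactual_contextual instructors.

The rejected values were generated using mpt-30b-instruct.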
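
To make the chosen/rejected pairing concrete, here is a minimal sketch of what one preference record might look like. The field names `prompt`, `chosen`, and `rejected` follow the common DPO convention and are an assumption here, and the rejected text is invented for illustration; neither is taken from the dataset files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """One DPO training example (field names are assumed, not confirmed)."""
    prompt: str    # context-obedient prompt
    chosen: str    # preferred answer, grounded in the provided context
    rejected: str  # dispreferred answer, e.g. one that ignores the context


def make_pair(prompt: str, chosen: str, rejected: str) -> PreferencePair:
    # A pair carries no preference signal if both answers are identical.
    if chosen.strip() == rejected.strip():
        raise ValueError("chosen and rejected must differ")
    return PreferencePair(prompt, chosen, rejected)


example = make_pair(
    prompt=(
        "BEGININPUT\nBEGINCONTEXT\ndate: 2021-01-01\nurl: https://web.site/123\n"
        "ENDCONTEXT\nIn a shocking turn of events, blueberries are now green, "
        "but will be sticking with the same name.\nENDINPUT\n"
        "BEGININSTRUCTION\nWhat color are blueberries? Source?\nENDINSTRUCTION"
    ),
    chosen=(
        "Blueberries are now green.\nSource:\n"
        "date: 2021-01-01\nurl: https://web.site/123"
    ),
    rejected="Blueberries are blue.",  # hypothetical rejected answer
)
```

The chosen answer follows the counterfactual context; the hypothetical rejected answer falls back on prior knowledge, which is the behavior this kind of dataset is meant to discourage.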

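Dataset format

The format for a contextual prompt is as follows:

BEGININPUT
BEGINCONTEXT
[key0: value0]
[key1: value1]
... other metadata ...
ENDCONTEXT
[insert your text blocks here]
ENDINPUT
[add as many other blocks, in the exact same format]
BEGININSTRUCTION
[insert your instruction(s). The model was tuned with single questions, paragraph format, lists, etc.]
ENDINSTRUCTION

I know it's a bit verbose and annoying, but after much trial and error, these explicit delimiters help the model understand where to find the responses and how to associate specific sources with them.

BEGININPUT - denotes a new input block
BEGINCONTEXT - denotes the block of context (metadata key/value pairs) associated with the current input block
ENDCONTEXT - denotes the end of the metadata block for the current input
[text] - insert whatever text you want for the input block, as many paragraphs as can fit in the context
ENDINPUT - denotes the end of the current input block
[repeat as many input blocks in this format as you want]
BEGININSTRUCTION - denotes the start of the instruction (or list of instructions) to respond to, covering all of the input blocks above
[instruction(s)]
ENDINSTRUCTION - denotes the end of the instruction set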
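
The delimiter rules above can be sketched as a small prompt builder. This is a minimal illustration rather than code from airoboros: the function name `build_prompt`, the `Block` alias, and the one-delimiter-per-line layout are assumptions, and the sample values are placeholders.

```python
from typing import List, Mapping, Sequence, Tuple

Block = Tuple[Mapping[str, str], str]  # (metadata, text) for one input block


def build_prompt(blocks: Sequence[Block], instruction: str) -> str:
    """Render input blocks and an instruction using the explicit delimiters."""
    if not blocks:
        raise ValueError("at least one input block is required")
    lines: List[str] = []
    for metadata, text in blocks:
        lines.append("BEGININPUT")
        lines.append("BEGINCONTEXT")
        # One "key: value" line per metadata entry.
        lines.extend(f"{key}: {value}" for key, value in metadata.items())
        lines.append("ENDCONTEXT")
        lines.append(text.strip())
        lines.append("ENDINPUT")
    # A single instruction section applies to all input blocks above it.
    lines.append("BEGININSTRUCTION")
    lines.append(instruction.strip())
    lines.append("ENDINSTRUCTION")
    return "\n".join(lines)


prompt = build_prompt(
    [
        ({"key0": "value0", "key1": "value1"}, "Text of the first block."),
        ({"key0": "value2"}, "Text of the second block."),
    ],
    "Answer the question using only the blocks above.",
)
print(prompt)
```

Keeping the metadata as a mapping per block means each block's key/value pairs stay attached to the text they describe, which is what lets a response cite one block rather than all of them.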

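Here is a trivial but important example to prove the point:

BEGININPUT
BEGINCONTEXT
date: 2021-01-01
url: https://web.site/123
ENDCONTEXT
In a shocking turn of events, blueberries are now green, but will be sticking with the same name.
ENDINPUT
BEGININSTRUCTION
What color are blueberries? Source?
ENDINSTRUCTION

The expected response:

Blueberries are now green.
Source:
date: 2021-01-01
url: https://web.site/123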
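
A response in this shape can be split back into the answer and its cited metadata. The sketch below assumes the layout of the example (answer text, a line reading `Source:`, then `key: value` lines); the helper name `split_response` is hypothetical, and real model output may not follow this layout exactly.

```python
from typing import Dict, List, Tuple


def split_response(response: str) -> Tuple[str, Dict[str, str]]:
    """Split a response into (answer text, source metadata)."""
    answer_lines: List[str] = []
    source: Dict[str, str] = {}
    in_source = False
    for line in response.splitlines():
        stripped = line.strip()
        if not in_source:
            if stripped.lower() == "source:":
                in_source = True  # everything after this is metadata
            elif stripped:
                answer_lines.append(stripped)
        elif ":" in stripped:
            # Split on the first colon only, so URLs keep their "https://".
            key, value = stripped.split(":", 1)
            source[key.strip()] = value.strip()
    return " ".join(answer_lines), source


answer, source = split_response(
    "Blueberries are now green.\n"
    "Source:\n"
    "date: 2021-01-01\n"
    "url: https://web.site/123"
)
print(answer)
print(source)
```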

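References in response

As shown in the example, the dataset includes many examples in which the response includes source details when the question asks for a source, citation, or references.

Why do this? Well, the R in RAG seems to be the weakest link in the chain. Retrieval accuracy, depending on many factors including the overall dataset size, can be quite low. This accuracy increases when retrieving more documents, but then you have the issue of actually using the retrieved documents in prompts. If you use one prompt per document (or document chunk), you know exactly which document the answer came from, so there is no issue. If, however, you include multiple chunks in a single prompt, it is useful to include the specific reference chunk(s) used to generate the response, rather than naively including references to all of the chunks in the prompt.

For example, suppose I have two documents:

url: http://foo.bar/1
Strawberries are tasty.

url: http://bar.foo/2
The cat is blue.

If the question asked is "What color is the cat?", I would expect only the second document to be referenced in the response, since the other link is irrelevant.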
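
One way to check this expectation is to see which of the prompt's source URLs actually appear in a response. The following is a rough sketch under stated assumptions: `cited_urls` is a hypothetical helper, the response string is made-up model output, and plain substring matching would over-match if one URL were a prefix of another.

```python
from typing import List, Sequence


def cited_urls(response: str, candidate_urls: Sequence[str]) -> List[str]:
    """Return the candidate URLs that appear verbatim in the response."""
    return [url for url in candidate_urls if url in response]


# The two documents from the example, keyed by their url metadata.
docs = {
    "http://foo.bar/1": "Strawberries are tasty.",
    "http://bar.foo/2": "The cat is blue.",
}

# Hypothetical model output for "What color is the cat? Source?"
response = "The cat is blue.\nSource:\nurl: http://bar.foo/2"

cited = cited_urls(response, list(docs))
# A context-obedient answer should cite only the chunk it used.
print(cited)
```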
