
jdpressman/retro-weave-eval-rubrics-v0.1

Source: Hugging Face. Updated 2024-04-02. Indexed 2024-06-11.
Download link:
https://hf-mirror.com/datasets/jdpressman/retro-weave-eval-rubrics-v0.1
Resource description:
---
license: cc0-1.0
language:
- en
tags:
- synthetic
---

# RetroInstruct Weave Evaluator Rubrics v0.1

This component of RetroInstruct trains the ability to break subjective weave rubric items like "Is this good writing?" into parts which can be more objectively answered. It is closely related to the [word parts component](https://huggingface.co/datasets/jdpressman/retro-word-parts-v0.1) which is meant to train a similar skill.

By making these rubrics the model gains the ability to make in-context text classifiers and discriminators. These can be [used to drive an MCTS](https://github.com/JD-P/minihf), filter language model outputs to heighten the probability they satisfy certain properties, and validate abstract properties of inputs. Being able to make the rubrics itself is crucial for dynamic complex tasks and agent loops, where not every intermediate outcome can have its evaluation method spelled out in advance by a user. Furthermore, rubric items can be broken down recursively, allowing for the resolution of uncertainty by repeated reductionism accumulating bits of evidence for a proposition.

Unlike the word parts component this synthetic set was not made through [backtranslation](https://arxiv.org/abs/2308.06259). Instead [careful prompting on mistral-large-2402](https://github.com/JD-P/RetroInstruct/tree/main/WeaveEvalRubrics) was used to create the dataset. [128 rubric themes](https://github.com/JD-P/RetroInstruct/blob/main/WeaveEvalRubrics/rubric_themes.txt) were hand-curated by prompting Mistral large in a loop for themes related to topics the author would like the model to be able to handle. No particular method beyond intuition and free association was used to prompt for and select the themes.
These 128 themes [were then expanded out into 20 rubric seeds per theme](https://github.com/JD-P/RetroInstruct/blob/main/WeaveEvalRubrics/rubric_seed_prompt.txt), yielding 2558 questions to be broken down into rubrics (two short of 128 × 20 = 2560; a few were presumably duplicates, or the prompt didn't always yield 20). Each of the 2558 questions [was then broken down into a rubric consisting of 10 yes or no questions](https://github.com/JD-P/RetroInstruct/blob/main/WeaveEvalRubrics/rubrics_prompt.txt). [Unique prompt openings were created](https://github.com/JD-P/RetroInstruct/blob/main/WeaveEvalRubrics/prompts_prompt.txt) for each seed-rubric pair.

This is what it looks like put together with Mistral Instruct's prompt format:

```
<s> [INST]Can you help me dissect the rubric item 'Was the conversation persuasive, causing the reader to consider new ideas or perspectives?' into 10 binary sub-questions?[/INST]
1. Does the passage present a clear and compelling argument?
2. Does the passage use logical reasoning or evidence to support its claims?
3. Does the passage appeal to the reader's emotions or values?
4. Does the passage use rhetorical devices, such as metaphor or repetition, to strengthen its argument?
5. Does the passage anticipate and address counterarguments or opposing viewpoints?
6. Does the passage present new or unique ideas that challenge the reader's assumptions?
7. Does the passage use credible sources or authorities to support its claims?
8. Does the passage use persuasive language or tone to influence the reader's opinion?
9. Does the passage provide clear and concrete examples to illustrate its points?
10. Does the passage end with a strong conclusion or call to action, encouraging the reader to consider the new ideas or perspectives presented?
```

## Usage

### Use Cases

* Generate in-context classifiers and discriminators for weave evaluator
* General training on the skill/concept of reductionism
* Grading rubric generator for writing prompts

### Quickstart With HuggingFace Datasets

```
import datasets

eval_rubrics = datasets.load_dataset("jdpressman/retro-weave-eval-rubrics-v0.1")

mistral_instruction = "<s> [INST]{}[/INST]{}"

for row in eval_rubrics["train"]:
    # Fill the {seed} placeholder in the prompt template with the question.
    prompt = row["prompt_open"].format(seed=row["seed"])
    # Render the rubric as a numbered list, one sub-question per line.
    output = ""
    index = 1
    for item in row["rubric"]:
        output += (str(index) + ". " + item + "\n")
        index += 1
    print(mistral_instruction.format(prompt, output))
```

### Raw Quickstart

```
import json

with open("train.json") as infile:
    train = json.load(infile)

mistral_instruction = "<s> [INST]{}[/INST]{}"

for row in train:
    # Fill the {seed} placeholder in the prompt template with the question.
    prompt = row["prompt_open"].format(seed=row["seed"])
    # Render the rubric as a numbered list, one sub-question per line.
    output = ""
    index = 1
    for item in row["rubric"]:
        output += (str(index) + ". " + item + "\n")
        index += 1
    print(mistral_instruction.format(prompt, output))
```

## License

I release this component of RetroInstruct into the public domain with the [Creative Commons Zero Public Domain Declaration](https://creativecommons.org/publicdomain/zero/1.0/).

Because the corpus is generated by mistral large, one should also consider [the terms of service](https://mistral.ai/terms/#terms-of-service-la-plateforme) for Mistral's API. I am not a lawyer, but my understanding is that the relevant line is:

> This assignment is granted worldwide and for the entire legal term of protection of the Outputs by the intellectual property rights applicable as provided for by the applicable law. However, You are expressly prohibited to use the Outputs and/or any modified or derived version of the Outputs to (directly or indirectly) to reverse engineer the Services.

An earlier version [included an explicit clause](https://twitter.com/far__el/status/1762202700161151416) banning the use of outputs to train models that might compete with Mistral's services.
[The CEO then clarified](https://twitter.com/arthurmensch/status/1762208241927233661) this was a mistake and removed it.

The Mistral models are defined to be part of the services earlier in the document, so any operation which would cause you to functionally recover the Mistral large model weights or a logical equivalent is (reasonably) still prohibited. This implies that what is allowed is the purposeful creation of narrow corpora that do not attempt to distill the logits of the underlying model. In other words, when you train your model you should get a meaningfully different model from Mistral large at convergence. Since this synthetic corpus is made for a narrow purpose with a specific format, and any other corpus I add to RetroInstruct will either be similar or use backtranslation, which should skew downstream models away from the logits you would learn by training on general question-answer pairs from Mistral, I believe this release is compliant with their terms of service.

## Data Structure

Each row contains three columns/keys:

1. **prompt_open** - The instruction to be given to the model. This is a Python format string template where `{seed}` is meant to be replaced by the actual question to break into parts.
2. **seed** - The question to be broken into 10 yes-no subquestions.
3. **rubric** - The questions that mistral large wrote for that seed.

## Biases and Limitations

Before paying to generate the full corpus I read about 10 of the rubrics to make sure they were usually good. They were. While processing the dataset I noticed that some of them misinterpreted a question, e.g. thinking that whether a text used "symbols" means whether it includes emoji or not. Probably the strongest bias [comes from the list of themes](https://github.com/JD-P/RetroInstruct/blob/main/WeaveEvalRubrics/rubric_themes.txt) that were used to start the hierarchy, so you may want to review those to get a sense of whether you want to train your model on this.
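To make the row layout from the Data Structure section concrete, here is a minimal sketch of one row together with a structural check and a renderer. The row is reconstructed from the formatted example earlier in this card, so the exact `prompt_open` string is an assumption; `check_row` and `render_row` are hypothetical helper names and are not part of the dataset or of RetroInstruct.

```python
# Hypothetical row reconstructed from the formatted example in this card.
# The exact "prompt_open" string is an assumption inferred from that example.
EXAMPLE_ROW = {
    "prompt_open": (
        "Can you help me dissect the rubric item '{seed}' "
        "into 10 binary sub-questions?"
    ),
    "seed": (
        "Was the conversation persuasive, causing the reader to "
        "consider new ideas or perspectives?"
    ),
    "rubric": [
        "Does the passage present a clear and compelling argument?",
        "Does the passage use logical reasoning or evidence to support its claims?",
        "Does the passage appeal to the reader's emotions or values?",
        "Does the passage use rhetorical devices, such as metaphor or repetition, "
        "to strengthen its argument?",
        "Does the passage anticipate and address counterarguments or opposing viewpoints?",
        "Does the passage present new or unique ideas that challenge the reader's assumptions?",
        "Does the passage use credible sources or authorities to support its claims?",
        "Does the passage use persuasive language or tone to influence the reader's opinion?",
        "Does the passage provide clear and concrete examples to illustrate its points?",
        "Does the passage end with a strong conclusion or call to action, encouraging "
        "the reader to consider the new ideas or perspectives presented?",
    ],
}

MISTRAL_INSTRUCTION = "<s> [INST]{}[/INST]{}"


def check_row(row, expected_items=10):
    """Return a list of structural problems found in one row (empty if none)."""
    problems = [
        f"missing key: {key}"
        for key in ("prompt_open", "seed", "rubric")
        if key not in row
    ]
    if problems:
        return problems
    if "{seed}" not in row["prompt_open"]:
        problems.append("prompt_open has no {seed} placeholder")
    if len(row["rubric"]) != expected_items:
        problems.append(
            f"rubric has {len(row['rubric'])} items, expected {expected_items}"
        )
    return problems


def render_row(row):
    """Format one row the same way the quickstart loops do."""
    prompt = row["prompt_open"].format(seed=row["seed"])
    output = "".join(
        f"{index}. {item}\n" for index, item in enumerate(row["rubric"], start=1)
    )
    return MISTRAL_INSTRUCTION.format(prompt, output)
```

Calling `check_row(EXAMPLE_ROW)` returns an empty list, and `print(render_row(EXAMPLE_ROW))` prints the row in the Mistral Instruct format. A check like this only catches structural issues such as a missing key or an unexpected item count; it cannot detect a rubric that misreads its seed question.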
## Planned Improvements

- Actually training a model on the dataset
- Try to filter some of the wackier completions with weave evaluator
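The use of a rubric as an in-context classifier or output filter, mentioned under Use Cases, could be sketched roughly as follows. This is not the weave evaluator's implementation: `ask_yes_no` is a hypothetical callable standing in for whatever model call estimates the probability that the answer to a yes-no question about a text is "yes", and averaging the answers is just one simple way to combine them.

```python
from typing import Callable, List, Sequence

# Hypothetical interface: takes (question, text) and returns P(yes) in [0, 1].
YesNoFn = Callable[[str, str], float]


def score_with_rubric(text: str, rubric: Sequence[str], ask_yes_no: YesNoFn) -> float:
    """Average the per-question "yes" probabilities for one text.

    Averaging is an assumption made for illustration, not the dataset
    author's stated method.
    """
    if not rubric:
        raise ValueError("rubric must contain at least one question")
    probabilities = [ask_yes_no(question, text) for question in rubric]
    return sum(probabilities) / len(probabilities)


def filter_outputs(
    candidates: Sequence[str],
    rubric: Sequence[str],
    ask_yes_no: YesNoFn,
    threshold: float = 0.5,
) -> List[str]:
    """Keep only the candidates whose rubric score reaches the threshold."""
    return [
        candidate
        for candidate in candidates
        if score_with_rubric(candidate, rubric, ask_yes_no) >= threshold
    ]


def toy_ask_yes_no(question: str, text: str) -> float:
    """Toy stand-in for a model call: answers "yes" only to the reasoning
    question, and only when the text contains the word "because"."""
    if "reasoning" in question and "because" in text.lower():
        return 1.0
    return 0.0
```

With the toy stub, a rubric of two questions where only one mentions "reasoning" scores a text containing "because" at 0.5 and any other text at 0.0. In practice `toy_ask_yes_no` would be replaced by a real model call.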
Provider:
jdpressman
Summary of original information

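Dataset overview

Dataset name

  • Name: RetroInstruct Weave Evaluator Rubrics v0.1

Dataset content

  • Purpose: Train a model to break subjective weave rubric items, such as "Is this good writing?", into parts that can be answered more objectively.
  • Generation method: Generated by careful prompting on Mistral large rather than by backtranslation.
  • Dataset size: 128 themes, each expanded into 20 rubric seeds, for a total of 2558 questions; each question is broken down into 10 yes-or-no sub-questions.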

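Dataset structure

  • Structure: Each record contains three fields:
    1. prompt_open: The instruction given to the model, a Python format string template in which {seed} is replaced by the actual question.
    2. seed: The original question to be broken down into 10 yes-or-no sub-questions.
    3. rubric: The sub-questions that Mistral large generated for that seed.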

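Use cases

  • Generating in-context classifiers and discriminators: For use with the weave evaluator.
  • Training the skill/concept of reductionism: For general training.
  • Grading rubric generator: For writing prompts.

License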

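Dataset limitations

  • Biases and limitations: Some rubrics may misinterpret the question, for example treating "uses symbols" as asking whether the text contains emoji. The main bias probably comes from the list of themes used to start the hierarchy.

Future improvements

  • Model training: Actually train a model on the dataset.
  • Filtering anomalous completions: Try using the weave evaluator to filter some of the wackier completions.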