
lihaoxin2020/rl_hard_gpt5_sft_gpt54rubric_v2

Source: Hugging Face. Updated 2026-04-18; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/lihaoxin2020/rl_hard_gpt5_sft_gpt54rubric_v2
Resource description:
---
dataset_info:
  features:
  - name: conversations
    list:
    - name: role
      dtype: string
    - name: content
      dtype: string
    - name: thinking
      dtype: string
  - name: metadata
    struct:
    - name: sample_id
      dtype: string
    - name: traj_idx
      dtype: int64
    - name: turn_index
      dtype: int64
    - name: tool_name
      dtype: string
    - name: tool_query
      dtype: string
    - name: refiner_mode
      dtype: string
    - name: stop_reason
      dtype: string
    - name: accepted
      dtype: bool
    - name: pass
      dtype: int64
    - name: format_bonus
      dtype: float64
    - name: citation_format_reward
      dtype: float64
    - name: citation_paper_reward
      dtype: float64
    - name: citation_metrics
      struct:
      - name: citation_format_reward
        dtype: float64
      - name: citation_avg_claim_recall
        dtype: float64
      - name: citation_avg_claim_precision
        dtype: float64
      - name: citation_avg_claim_f1
        dtype: float64
      - name: citation_paper_reward
        dtype: float64
      - name: citation_claim_count
        dtype: float64
      - name: citation_uncited_claim_count
        dtype: float64
      - name: citation_score_applicable
        dtype: float64
  - name: gpt5_generation
    dtype: string
  - name: rubrics
    dtype: string
  splits:
  - name: train
    num_examples: 5195
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

# rl_hard_gpt5_sft_gpt54rubric_v2

Per-instance rubrics generated by GPT-5.4 for the `lihaoxin2020/rl_hard_gpt5_sft` data. Each row has a `rubrics` column (JSON string) with `{"positive_rubrics": [...], "negative_rubrics": [...]}`, each item `{title, description}`, capped at 5 rubrics per instance.

## Difference from v1 (lihaoxin2020/rl_hard_gpt5_sft_gpt54rubric)

An LLM audit of v1 found that rubrics were well-tuned for the core answer requirement (direct, cited answer; no fabrication) but systematically did not cover the cases where snippets only **partially** answer the query or do not answer it at all.

In v2 the rubric-generator prompt was updated to explicitly classify each instance by **snippet adequacy** (full / partial / none) and include the case-appropriate rubrics:

- **PARTIAL / NONE cases** now include:
  - A positive rubric rewarding **engaged partial-grounding + forward guidance**: citing whichever related clues are present in the snippets AND suggesting a specific next search that would close the gap, rather than a boilerplate "no info found" refusal.
  - A negative rubric naming a **concrete close-but-irrelevant detail** from this example's snippets that a weaker model would be tempted to volunteer as if it answered the query (e.g. for "where did A's father die", a sibling's residence mentioned in the snippets).
- **FULL cases** now include a negative rubric penalizing **scope creep**: a concrete off-query detail present in the snippets that the query did not ask for. Only written when a concrete distractor exists.

On a 30-instance pilot audit, rubric-sets matching the specification improved from **1/30 (3%)** in v1 to **20/30 (67%)** in v2.

## Schema

Same as v1; the `rubrics` column is a JSON string with two parallel lists of `{title, description}`. Parse with `json.loads(row["rubrics"])`.
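The parsing step described above can be sketched as follows. This is a minimal illustration, not part of the dataset's tooling: the helper name `parse_rubrics` and the contents of `example_row` are hypothetical, and the only assumption taken from the card is that `rubrics` is a JSON string holding `positive_rubrics` and `negative_rubrics` lists of `{title, description}` items.

```python
import json
from typing import Any


def parse_rubrics(row: dict[str, Any]) -> tuple[list[dict[str, str]], list[dict[str, str]]]:
    """Decode the `rubrics` JSON string of one row into (positive, negative) lists.

    Hypothetical helper: the name and the validation are illustrative only.
    """
    parsed = json.loads(row["rubrics"])
    positive = parsed.get("positive_rubrics", [])
    negative = parsed.get("negative_rubrics", [])

    # Each rubric item is expected to carry a title and a description.
    for item in positive + negative:
        missing = {"title", "description"} - item.keys()
        if missing:
            raise ValueError(f"rubric item is missing keys: {sorted(missing)}")
    return positive, negative


# Made-up row for illustration; real rows come from the dataset's train split.
example_row = {
    "rubrics": json.dumps(
        {
            "positive_rubrics": [
                {
                    "title": "Partial grounding with next step",
                    "description": "Cites related clues and proposes a specific follow-up search.",
                }
            ],
            "negative_rubrics": [
                {
                    "title": "Close-but-irrelevant detail",
                    "description": "Volunteers a sibling's residence as if it answered the query.",
                }
            ],
        }
    )
}

positive, negative = parse_rubrics(example_row)
for rubric in positive:
    print("+", rubric["title"])
for rubric in negative:
    print("-", rubric["title"])
```

With the Hugging Face `datasets` library installed, rows could presumably be obtained via `load_dataset("lihaoxin2020/rl_hard_gpt5_sft_gpt54rubric_v2", split="train")` and passed to the same helper one at a time.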

Provider:
lihaoxin2020