connections-dev/hard_queries
Source: Hugging Face. Updated 2026-04-08; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/connections-dev/hard_queries
Description:
---
tags:
- connections-dev
- CREATE
- hard-instances
---
# Hard Queries
**155 queries** for which none of the source models produces a path passing all three checks: `valid=1 AND factuality=1 AND strength>3`.
**Strength** = `min(per-triple salience scores, excluding the last triple)`. The last triple is excluded because it connects to `entity_b` and is typically generic.
One row per query with per-model paths and scores.
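The strength definition and the three-part pass criterion above can be sketched in a few lines. This is a minimal illustration, not the scoring code used to build the dataset: the function names `path_strength` and `is_good` are hypothetical, and returning `None` for a path with a single triple (where nothing remains after dropping the last one) is an assumption, since the card does not say how that case is handled.

```python
from typing import Optional, Sequence


def path_strength(salience: Sequence[float]) -> Optional[float]:
    """Minimum salience over all triples except the last one.

    Returns None when no triple remains after dropping the last
    (an assumed convention; the dataset card does not specify it).
    """
    head = list(salience)[:-1]
    return min(head) if head else None


def is_good(valid: float, factuality: float, salience: Sequence[float]) -> bool:
    """True when a path passes valid=1 AND factuality=1 AND strength>3."""
    strength = path_strength(salience)
    return valid == 1 and factuality == 1 and strength is not None and strength > 3


# The last score (2) is ignored, so the strength is min(5, 4).
print(path_strength([5, 4, 2]))
```

Note that the threshold is strict: a path whose weakest non-final triple scores exactly 3 does not pass.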
## Source Models
| Model | Dataset |
|-------|---------|
| GPT-5.4 | `connections-dev/res_gptoss120b_original_1_reason_medium_0.7_4096_gpt_54` |
| Gemini-3-Pro | `connections-dev/res_gptoss120b_original_1_low_0.7_16384_gemini-3-pro-preview` |
| Gemini-3.1-Pro | `connections-dev/res_gptoss120b_original_1_medium_0.7_16384_gemini-3_1-pro-preview` |
| Claude-Sonnet-4.6 | `connections-dev/res_gptoss120b_original_1_medium_0.7_4096_claude-sonnet-4-6` |
## Columns
| Column | Description |
|--------|-------------|
| `index` | Original dataset index |
| `query` | The CREATE query |
| `entity_a` / `entity_b` / `rel_b` | Source entity, target entity, target relation |
| `{model}_paths` | JSON list of path strings |
| `{model}_factuality_scores` | Per-path factuality (1.0 = non-hallucinated) |
| `{model}_strength_scores` | Per-path strength = min(per-triple salience, excluding last triple) |
| `{model}_validity_scores` | Per-path validity (1.0 = structurally valid) |
| `{model}_num_paths` | Total paths generated |
| `{model}_num_factual` | Paths with factuality = 1.0 |
| `{model}_num_good` | Paths passing all three checks (always 0) |
| `{model}_avg_strength` | Mean strength |
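
As a rough sketch of how the per-path columns fit together, the snippet below recounts the paths that pass all three checks for one row. It assumes the three score columns are parallel lists aligned with `{model}_paths`, and it accepts either a JSON string or an already-decoded list because the card only states the encoding for `{model}_paths`. The prefix `model_x` and the helper names `as_list` and `count_good` are hypothetical; the real column prefixes should be read from the dataset itself.

```python
import json
from typing import Any, List


def as_list(value: Any) -> List[Any]:
    """Decode a JSON-encoded list, or copy a value that is already a sequence."""
    if isinstance(value, str):
        return json.loads(value)
    return list(value)


def count_good(row: dict, model: str, threshold: float = 3.0) -> int:
    """Count paths with validity = 1, factuality = 1 and strength > threshold."""
    valid = as_list(row[f"{model}_validity_scores"])
    factual = as_list(row[f"{model}_factuality_scores"])
    strength = as_list(row[f"{model}_strength_scores"])
    return sum(
        1
        for v, f, s in zip(valid, factual, strength)
        if v == 1 and f == 1 and s is not None and s > threshold
    )


# Illustrative row with a hypothetical "model_x" prefix (not real data).
row = {
    "model_x_paths": json.dumps(["path one", "path two"]),
    "model_x_validity_scores": [1.0, 1.0],
    "model_x_factuality_scores": [1.0, 0.0],
    "model_x_strength_scores": [2.0, 5.0],
}
print(count_good(row, "model_x"))
```

For every row in this dataset the count should agree with `{model}_num_good`, which the table above documents as always 0.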
## Statistics
| Metric | GPT-5.4 | Gemini-3-Pro | Gemini-3.1-Pro | Claude-Sonnet-4.6 |
|--------|---------|--------------|----------------|-------------------|
| Avg paths | 30.4 | 10.5 | 6.8 | 16.3 |
| Avg factual | 8.8 | 1.6 | 1.7 | 4.0 |
| Avg strength | 1.73 | 2.36 | 2.10 | 2.02 |
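
The first two rows of the table can plausibly be reproduced by averaging the per-row count columns over all queries. The sketch below assumes exactly that aggregation, which the card does not confirm; `summarize` and the `model_x` prefix are hypothetical names, and the rows shown are made-up values for illustration only.

```python
from statistics import mean
from typing import Dict, Iterable


def summarize(rows: Iterable[dict], model: str) -> Dict[str, float]:
    """Average the per-query path counts for one model prefix."""
    rows = list(rows)
    return {
        "avg_paths": mean(r[f"{model}_num_paths"] for r in rows),
        "avg_factual": mean(r[f"{model}_num_factual"] for r in rows),
    }


# Made-up rows, not taken from the dataset.
rows = [
    {"model_x_num_paths": 10, "model_x_num_factual": 2},
    {"model_x_num_paths": 20, "model_x_num_factual": 4},
]
print(summarize(rows, "model_x"))
```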
Provider: connections-dev



