AxiomicLabs/LogicMark
Source: Hugging Face. Updated 2026-04-10; indexed 2026-04-12.
Download link: https://hf-mirror.com/datasets/AxiomicLabs/LogicMark
Resource description:
---
license: apache-2.0
language:
- en
size_categories:
- 1K<n<10K
pretty_name: t
---

# LogicMark
A procedurally generated benchmark for evaluating symbolic logic in language models. Each problem presents a set of variable equality/inequality premises and asks the model to identify which conclusion necessarily follows.
Unlike knowledge-based benchmarks, LogicMark contains no facts a model could have memorised from pretraining. Every problem is generated fresh from abstract variable names (`a`, `b`, `c`, ...), so a model cannot pattern-match against training data and must instead reason over the premises. This makes LogicMark a direct probe of **intrinsic reasoning capability**: the logical structure that has been built into the model's weights through training, independent of world knowledge or surface-level heuristics.
Evaluation is log-likelihood multiple-choice — no chain-of-thought, no prompting tricks. Models are scored purely on how well they assign probability to the correct completion.
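
A minimal sketch of length-normalised log-likelihood scoring, assuming each option's score is the mean log-probability of its ending tokens. The function name `pick_option` and the per-token values are hypothetical; the card does not specify the tokenisation or the evaluation code.

```python
def pick_option(token_logprobs_per_option):
    """Return (index, scores) for the option with the highest mean token log-probability."""
    # Dividing by the token count is the length normalisation step.
    scores = [sum(lps) / len(lps) for lps in token_logprobs_per_option]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Hypothetical per-token log-probabilities for the four endings of one problem.
example = [
    [-0.9, -1.1, -0.7],        # option A, mean -0.90
    [-0.4, -0.6, -0.5],        # option B, mean -0.50
    [-1.5, -0.2],              # option C, mean -0.85
    [-0.3, -2.0, -1.0, -0.9],  # option D, mean -1.05
]
best, scores = pick_option(example)
print(best)  # 1, i.e. option B
```

A problem counts as correct when the selected index equals the index of the entailed option.
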
---
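## Benchmark Results (5000 examples)
Evaluated using average log-likelihood over ending tokens, normalised by length. Random chance = 25%. The Avg column matches a mean of the five hop columns weighted 10/40/25/15/10% (the target hop distribution), which corresponds to buckets of 500, 2000, 1250, 750, and 500 examples for 1 to 5 hops.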
| Model | Params | 1-hop | 2-hop | 3-hop | 4-hop | 5-hop | Avg |
| -------------------------- | ------ | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- |
| Qwen2.5-3B | 3.1B | 48.80% | **54.90%** | 46.00% | 42.40% | **39.40%** | **48.64%** |
| Qwen2.5-Math-1.5B | 1.5B | 47.20% | 53.65% | **46.64%** | **45.73%** | 39.20% | 48.62% |
| GPT-X2-125M (unreleased *for now*) | 125M | **74.80%** | 51.30% | 38.08% | 36.53% | 34.20% | 46.42% |
| Qwen2.5-1.5B | 1.5B | 52.00% | 50.70% | 43.68% | 41.33% | 38.00% | 46.40% |
| pythia-2.8B | 2.8B | 70.40% | 49.60% | 35.84% | 36.80% | 31.40% | 44.50% |
| gpt2-xl | 1.6B | 60.40% | 50.20% | 36.96% | 38.67% | 32.60% | 44.42% |
| SmolLM2-1.7B | 1.7B | 65.00% | 50.50% | 36.00% | 36.93% | 31.20% | 44.36% |
| pythia-70m | 70M | 54.40% | 50.90% | 36.96% | 38.53% | 34.00% | 44.22% |
| gpt2 | 124M | 56.00% | 51.05% | 36.48% | 38.80% | 31.00% | 44.06% |
| Qwen2.5-Coder-1.5B | 1.5B | 45.20% | 48.45% | 42.08% | 41.07% | 34.20% | 44.00% |
| GPT-X-125M | 125M | 57.40% | 50.00% | 35.28% | 38.53% | 31.80% | 43.52% |
| LFM2-350M-Math | 354M | 74.00% | 46.50% | 38.80% | 31.33% | 31.00% | 43.50% |
| pythia-31m | 30M | 56.40% | 48.75% | 35.84% | 37.07% | 31.80% | 42.84% |
| SmolLM2-135M | 135M | 58.80% | 49.35% | 34.80% | 36.67% | 29.60% | 42.78% |
| Qwen2.5-0.5B | 494M | 49.20% | 49.40% | 35.84% | 38.53% | 31.00% | 42.52% |
| pythia-14m | 14M | 68.60% | 47.05% | 33.44% | 34.27% | 29.20% | 42.10% |
| MobileLLM-125M | 125M | 53.20% | 49.85% | 34.64% | 35.20% | 28.40% | 42.04% |
| LFM2.5-350M-Base | 354M | 52.00% | 49.75% | 38.00% | 30.67% | 26.00% | 41.80% |
| pythia-160M | 162M | 52.80% | 47.30% | 33.84% | 34.27% | 30.40% | 40.84% |
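
As a quick arithmetic check on the table, the Avg column can be reproduced from the per-hop columns when each hop is weighted by the target hop distribution (10/40/25/15/10%). The sketch below uses the Qwen2.5-3B row; the helper name `weighted_avg` is illustrative only.

```python
# Weights taken from the target hop distribution in the Hop Depth section.
weights = {1: 0.10, 2: 0.40, 3: 0.25, 4: 0.15, 5: 0.10}


def weighted_avg(per_hop):
    """Hop-weighted mean accuracy in percent."""
    return sum(weights[hop] * acc for hop, acc in per_hop.items())


# Per-hop accuracies for Qwen2.5-3B, copied from the table above.
qwen_3b = {1: 48.80, 2: 54.90, 3: 46.00, 4: 42.40, 5: 39.40}

print(round(weighted_avg(qwen_3b), 2))       # 48.64, the reported Avg
print(round(sum(qwen_3b.values()) / 5, 2))   # 46.3, the unweighted mean
```

The weighted value agrees with the reported Avg, while the unweighted mean does not.
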
## Task Format
```
If:
a = g
e != f
b != g
d = h
c = g
a != b
d = e
a = e
Then
A. c = h
B. c != h
C. c != e
D. b = h
```
The model must select the option that is logically entailed by the premises. Distractors include the flipped version of the correct answer and false statements drawn from the same variable set.
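
One way to check which option is entailed is a union-find over the equality premises. This is a sketch under the assumption that premises are consistent and written as `x = y` or `x != y`; the names `DisjointSet` and `entailed` are hypothetical and are not taken from the generator.

```python
class DisjointSet:
    """Union-find over variable names; equal variables share a root."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, x, y):
        self.parent[self.find(x)] = self.find(y)


def entailed(premises, conclusion):
    """True if the conclusion necessarily follows from the premises."""
    ds, neq = DisjointSet(), []
    for premise in premises:
        left, op, right = premise.split()
        if op == "=":
            ds.union(left, right)
        else:
            neq.append((left, right))
    left, op, right = conclusion.split()
    cl, cr = ds.find(left), ds.find(right)
    if op == "=":
        return cl == cr
    # x != y follows when some premise u != v links the two equality classes.
    return cl != cr and any({ds.find(u), ds.find(v)} == {cl, cr} for u, v in neq)


premises = ["a = g", "e != f", "b != g", "d = h", "c = g", "a != b", "d = e", "a = e"]
options = ["c = h", "c != h", "c != e", "b = h"]
print([entailed(premises, o) for o in options])  # [True, False, False, False]
```

For the problem shown above, only option A is entailed.
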
---
## Hop Depth
Each problem is tagged with a **hop depth** — the minimum number of inference steps required to derive the correct answer from the premises:
- **1-hop**: answer is directly stated as a premise
- **2-hop**: requires one transitive step (e.g. `a=b`, `b=c` → `a=c`)
- **3-hop**: requires two transitive steps
- **4-hop+**: longer inference chains
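
The hop depth of a conclusion can be computed with a breadth-first search over the equality graph. The sketch below assumes that hop depth counts the premises in the shortest derivation, and that an inequality `x != y` costs one hop for the inequality premise plus the equality steps on either side; this convention is an assumption, though it agrees with the `hop_depth` of 2 in the example record under Dataset Format.

```python
from collections import deque


def distances(adj, start):
    """Shortest number of equality premises from start to every reachable variable."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def hop_depth(premises, conclusion):
    """Minimum number of premises needed to derive the conclusion, or None."""
    adj, neq = {}, []
    for premise in premises:
        left, op, right = premise.split()
        if op == "=":
            adj.setdefault(left, set()).add(right)
            adj.setdefault(right, set()).add(left)
        else:
            neq.append((left, right))
    x, op, y = conclusion.split()
    dx, dy = distances(adj, x), distances(adj, y)
    if op == "=":
        return dx.get(y)
    best = None
    for u, v in neq:
        for a, b in ((u, v), (v, u)):  # the premise u != v can be used in either direction
            if a in dx and b in dy:
                hops = dx[a] + 1 + dy[b]
                best = hops if best is None else min(best, hops)
    return best


print(hop_depth(["a = b", "b = c"], "a = c"))             # 2
print(hop_depth(["a = b", "b != c", "a = d"], "a != c"))  # 2
```
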
The current dataset (1000 problems) targets the following distribution:
| Hop | Target |
|-----|--------|
| 1 | 10% |
| 2 | 40% |
| 3 | 25% |
| 4 | 15% |
| 5 | 10% |
---
## Graph Styles
Premises are generated using six equality graph topologies, each producing different reasoning patterns:
| Style | Description |
|-------|-------------|
| `chain` | Variables linked in a linear sequence |
| `star` | One central variable connected to many others |
| `clusters` | Groups of variables internally linked |
| `tree` | Binary tree topology — maximises chain depth per variable |
| `bipartite` | Edges only cross between two halves — same-side equalities are structurally impossible |
| `mixed` | Random combination of the above |
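
To make the topologies concrete, the sketch below builds equality edges for four of the styles. These helper functions are hypothetical illustrations of the descriptions in the table; how `baseloggen_v2.py` actually constructs its graphs is not documented here.

```python
def chain_edges(vs):
    """Linear sequence: v0 = v1, v1 = v2, ..."""
    return list(zip(vs, vs[1:]))


def star_edges(vs):
    """The first variable is the centre and is linked to every other variable."""
    return [(vs[0], v) for v in vs[1:]]


def tree_edges(vs):
    """Binary tree in heap order: the node at index i has its parent at (i - 1) // 2."""
    return [(vs[(i - 1) // 2], vs[i]) for i in range(1, len(vs))]


def bipartite_edges(vs):
    """Edges only cross between the first half and the second half."""
    half = len(vs) // 2
    return list(zip(vs[:half], vs[half:]))


variables = list("abcde")
print(chain_edges(variables))  # [('a', 'b'), ('b', 'c'), ('c', 'd'), ('d', 'e')]
print(tree_edges(variables))   # [('a', 'b'), ('a', 'c'), ('b', 'd'), ('b', 'e')]
```
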
---
## Generator
`baseloggen_v2.py` — full generator with hop depth control and equality type balancing.
```python
from baseloggen_v2 import Config, generate_dataset, to_benchmark_format

cfg = Config(
    limit=1000,          # total number of problems
    min_vars=4,
    max_vars=8,
    min_eq=2,
    max_eq=5,
    min_neq=1,
    max_neq=3,
    num_choices=4,       # options per problem
    min_answer_hop=2,    # minimum hop depth for the correct answer
    target_hop_dist={1: 0.10, 2: 0.40, 3: 0.25, 4: 0.15, 5: 0.10},  # hop depth -> fraction
    styles=("chain", "star", "clusters", "tree", "bipartite", "mixed"),  # graph topologies
    seed=42,
)

# Generate the problems, then convert each one to the benchmark record format.
dataset = generate_dataset(cfg)
formatted = [to_benchmark_format(p, i) for i, p in enumerate(dataset)]
```
### Key Config Parameters
| Parameter | Description |
|-----------|-------------|
| `min_answer_hop` | Minimum hop depth for the correct answer (suppresses trivial problems) |
| `target_hop_dist` | Dict mapping hop depth → fraction. The generator fills each hop bucket to exactly its target size. |
| `styles` | Tuple of graph topologies to sample from |
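
The effect of `target_hop_dist` can be sketched as per-bucket rejection sampling: compute how many problems each hop depth needs, then keep only proposals with that depth. Everything below is a toy illustration; `bucket_sizes`, `propose`, and `fill_bucket` are hypothetical names, and the proposal step is a stand-in for the real generator.

```python
import random


def bucket_sizes(limit, target_hop_dist):
    """Number of problems to generate for each hop depth."""
    return {hop: round(limit * frac) for hop, frac in target_hop_dist.items()}


def propose(rng, n_vars=8):
    """Toy proposal: query a random pair in the chain v0 = v1 = ... = v7.

    The hop depth of 'vi = vj' in a chain is the index gap j - i.
    """
    i, j = sorted(rng.sample(range(n_vars), 2))
    return {"query": (i, j), "hop_depth": j - i}


def fill_bucket(rng, hop, size, max_tries=100_000):
    """Rejection sampling: keep only proposals whose hop depth equals the target."""
    kept, tries = [], 0
    while len(kept) < size:
        tries += 1
        if tries > max_tries:
            raise RuntimeError(f"could not fill the {hop}-hop bucket")
        candidate = propose(rng)
        if candidate["hop_depth"] == hop:
            kept.append(candidate)
    return kept


rng = random.Random(42)
dist = {1: 0.10, 2: 0.40, 3: 0.25, 4: 0.15, 5: 0.10}
sizes = bucket_sizes(1000, dist)
dataset = [p for hop, n in sizes.items() for p in fill_bucket(rng, hop, n)]
print(sizes)  # {1: 100, 2: 400, 3: 250, 4: 150, 5: 100}
```
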
### Design Decisions
**Equality type balance** — correct answers are 50/50 sampled from `=` and `!=` statements to prevent models from exploiting the observation that correct answers tend to be equality statements.
**Exact hop targeting** — when `target_hop_dist` is set, each bucket is generated with rejection sampling targeting exactly that hop depth, rather than relying on natural distribution (which heavily favours 2-hop).
**Flipped distractor** — the negation of the correct answer (e.g. `a = b` → `a != b`) is always included as a distractor to ensure the model can't win by ignoring inequality structure.
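
A minimal sketch of the flipped-distractor rule, assuming statements are written as `x = y` or `x != y`. The helpers `flip` and `build_options` and the two extra distractors are hypothetical and only illustrate the idea.

```python
import random


def flip(statement):
    """Negate a statement: 'a = b' becomes 'a != b' and vice versa."""
    left, op, right = statement.split()
    return f"{left} {'!=' if op == '=' else '='} {right}"


def build_options(answer, other_distractors, rng):
    """Combine the answer, its negation, and other distractors, then shuffle."""
    options = [answer, flip(answer), *other_distractors]
    rng.shuffle(options)
    return options, options.index(answer)


rng = random.Random(0)
options, answer_index = build_options("a != c", ["d = c", "b != d"], rng)
print(flip("a = b"))  # a != b
```

Because the negation is always present, a model that ignores the `=` versus `!=` distinction cannot separate the correct answer from its flipped twin.
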
---
## Dataset Format
```json
{
  "id": "symbolic_00042",
  "domain": "Symbolic",
  "context": "If:\na = b\nb != c\na = d\n\nThen",
  "options": ["a != c", "d = c", "b != d", "a = c"],
  "answer_index": 0,
  "answer": "a != c",
  "hop_depth": 2
}
```
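
A record in this format can be parsed and sanity-checked as sketched below. The record is a hypothetical one that follows the schema above, and `parse_record` is an illustrative helper, not part of the dataset tooling.

```python
import json

# Hypothetical record following the schema above (raw string keeps the JSON \n escapes).
record_json = r"""
{
  "id": "symbolic_example",
  "domain": "Symbolic",
  "context": "If:\na = b\nb != c\na = d\n\nThen",
  "options": ["a != c", "d = c", "b != d", "a = c"],
  "answer_index": 0,
  "answer": "a != c",
  "hop_depth": 2
}
"""


def parse_record(raw):
    """Load one record, extract its premises, and check the answer fields agree."""
    rec = json.loads(raw)
    lines = rec["context"].splitlines()
    premises = [ln for ln in lines if ln and ln not in ("If:", "Then")]
    if rec["options"][rec["answer_index"]] != rec["answer"]:
        raise ValueError("answer_index does not point at answer")
    return rec, premises


rec, premises = parse_record(record_json)
print(premises)  # ['a = b', 'b != c', 'a = d']
```
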
---
Provider:
AxiomicLabs
Compiled summary
Dataset overview

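Construction
The LogicMark benchmark is procedurally generated to evaluate the symbolic logic reasoning of language models. Each problem builds equality and inequality premises over abstract variable names (such as a, b, c) and pairs the correct conclusion with generated distractors. Generation controls the hop depth of the inference chain through graph topology, covering six graph styles (chain, star, clusters, tree, bipartite, and mixed), and uses rejection sampling to match the target hop distribution exactly. Correct answers are balanced 50/50 between equalities and inequalities, and the negation of the correct answer is always included as a distractor, so that models cannot reach the right answer through surface patterns.
Features
LogicMark's defining feature is that it evaluates pure logical reasoning. Every problem is generated from abstract variables, so there is no factual knowledge a model could have memorised during pretraining; the benchmark therefore probes a model's intrinsic reasoning ability rather than its knowledge retrieval. Evaluation is log-likelihood multiple-choice, with no chain-of-thought or prompting tricks: reasoning performance is measured by how much probability the model assigns to the correct completion. Problems are labelled by hop depth (1 to 5), which supports fine-grained analysis across inference chains of different complexity, and the six graph topologies allow comparison across reasoning patterns.
Usage
To use LogicMark, load the dataset from Hugging Face and run the predefined evaluation script. The model receives a text context of the form 'If: ... Then' and selects, from four options, the conclusion that necessarily holds. Evaluation is based on log-likelihood scores: correctness is decided by comparing the model's average log-probability over each option's ending tokens, and results are reported as accuracy per hop depth. The provided baseloggen_v2.py generator lets users adjust the number of problems, variable range, hop distribution, graph styles, and other parameters to build a test subset suited to their research, and then repeat the evaluation.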
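Background and challenges
Background
LogicMark was recently created by Axiomic Labs specifically to evaluate the symbolic logic reasoning of language models. Unlike traditional benchmarks that depend on common sense or memorised facts, it procedurally generates equality and inequality premises over abstract variable names (such as a, b, c), so that a model must rely on logical deduction rather than pattern matching against training data. The design targets a core research question for large language models: do their weights encode genuine reasoning structure? By controlling hop depth and generating problems from six graph topologies (such as chain, star, and tree), LogicMark can break down model performance by reasoning complexity. Across the models evaluated so far (including the Qwen2.5, Pythia, and GPT-2 families), accuracy generally falls as hop depth increases, indicating that the logical reasoning ability of current models remains limited and giving the field a more diagnostic evaluation tool.
Current challenges
The core challenge LogicMark addresses is that existing NLP benchmarks (such as commonsense question answering or textual entailment) tend to mix knowledge recall with logical reasoning, so they cannot measure a model's deductive ability in isolation. By removing all memorisable semantic content, the dataset targets transitive reasoning in symbolic logic with problems graded from 1 to 5 hops; each problem includes distractors (such as the negation of the correct answer) that force exact deduction. Key challenges in construction include: 1) ensuring that the premise graph yields a uniquely derivable conclusion at the given hop depth, avoiding multiple solutions or ambiguity; 2) using rejection sampling to match the sample proportions across hop depths exactly (for example 10% 1-hop and 40% 2-hop), so that the effect of reasoning depth on model performance can be measured; 3) controlling the split between equality and inequality answers (50% each) so that models cannot exploit statistical bias (such as a preference for equality conclusions).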
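Common scenarios
Typical use case
LogicMark is designed to evaluate the intrinsic reasoning ability of language models; its procedurally generated symbolic logic problems remove the influence of external knowledge on model performance. Each problem gives equality or inequality premises over abstract variables (such as a, b, c) and asks the model to pick, from four candidate conclusions, the one that necessarily follows. The dataset contains 5000 samples spanning hop depths 1 to 5 and uses six graph topologies (chain, star, clusters, tree, bipartite, and mixed) to test deductive reasoning at different levels of logical complexity. Evaluation is log-likelihood multiple-choice without chain-of-thought or prompting tricks, scoring only how well the model assigns probability to the correct option.
Derived and related work
LogicMark opens several follow-up research directions, the most direct being more complex logical reasoning benchmarks built on its generation framework. Researchers may extend the variable types or introduce modal logic or non-monotonic inference rules to cover a wider range of reasoning paradigms. The dataset also serves as a reference for evaluating emerging reasoning-enhancement methods, for example comparing chain-of-thought prompting with implicit reasoning across models. Its fine-grained hop labels and topology categories provide a quantitative basis for analysing reasoning depth and structural preferences, and motivate work on directions such as logic rule distillation and task-adaptive training for reasoning. For model comparison, the LogicMark leaderboard is cited in symbolic reasoning evaluations of newly released models, encouraging progress from shallow pattern matching toward deeper logical understanding.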
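Recent research on the dataset
Latest research directions
LogicMark focuses on evaluating the native ability of large language models in abstract symbolic logic reasoning. It drops the reliance on factual recall found in traditional benchmarks and, through procedurally generated equality and inequality premises, requires the model to carry out actual logical derivation. Current research centres on how the depth of multi-hop inference chains relates to parameter count and architecture: across the 1-to-5-hop difficulty gradient, models of different sizes diverge clearly, pointing to a non-linear relationship between parameter scale and reasoning depth. The dataset's fine-grained graph topologies (chain, star, clusters, and so on) further probe how the complexity of logical structure challenges generalisation, providing an evaluation tool for building more robust reasoning systems and for moving language models from surface pattern matching toward intrinsic symbolic computation.
The content above was compiled and summarised by 遇见数据集.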



