cds-jb/synthweb-qwen3.5-9b-multiscale-inference
Source: Hugging Face · Updated 2026-05-21 · Indexed 2026-05-31
Download link: https://hf-mirror.com/datasets/cds-jb/synthweb-qwen3.5-9b-multiscale-inference
Dataset description:
---
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
dataset_info:
  features:
  - name: doc
    dtype: string
  - name: scope
    dtype: string
  - name: verbalizer_prompt
    dtype: string
  - name: context
    dtype: string
  - name: atom_text
    dtype: string
  - name: target_response
    dtype: string
  - name: incorrect_plausible_answer
    dtype: string
  - name: bb_answer_rollout_answers
    list: string
  - name: bb_answer_rollout_scores
    list: float64
  - name: bb_answer_score_mean
    dtype: float64
  - name: bb_answer_score_max
    dtype: float64
  - name: bb_answer_score_min
    dtype: float64
  - name: row_seed
    dtype: int64
  - name: doc_id
    dtype: string
  - name: doc_source
    dtype: string
  - name: doc_idx
    dtype: int64
  - name: split_char_offset
    dtype: int64
  - name: row_splits
    struct:
    - name: word_lens
      dtype: int64
    - name: sentence
      dtype: int64
    - name: paragraph
      dtype: int64
    - name: whole
      dtype: int64
  - name: doc_choice_rationale
    dtype: string
  - name: split_rationale
    dtype: string
  - name: slot_idx
    dtype: float64
  - name: target
    dtype: string
  - name: typicality
    dtype: string
  - name: abs_index
    dtype: float64
  - name: signed_index
    dtype: float64
  - name: signed_index_hint
    dtype: float64
  - name: actual_word_distance_from_split
    dtype: float64
  - name: synergy_check
    dtype: string
  - name: distribution_check
    dtype: string
  - name: difficulty_check
    dtype: string
  - name: source
    dtype: string
  - name: n_tokens_actual
    dtype: float64
  - name: tokenizer
    dtype: string
  - name: generator_model
    dtype: string
  - name: source_model
    dtype: string
  splits:
  - name: train
    num_bytes: 6150303735
    num_examples: 590741
  download_size: 979308451
  dataset_size: 6150303735
---
# synthweb-qwen3.5-9b-multiscale-inference
Probing-question dataset built over [`cds-jb/synthweb-qwen3.5-9b`](https://huggingface.co/datasets/cds-jb/synthweb-qwen3.5-9b)
(Qwen3.5-9B) continuations of FineWeb prefixes. Each row is ONE probe
testing whether content on one side of a character-level split in a
prefix+continuation can be recovered from the **latent hidden state** of
the source LM at that split. This dataset is the evaluation harness for
**method M** — an "activation oracle" that decodes hidden-state content
into natural language.
## Why this dataset
Activation-oracle methods claim to recover information that is encoded in
a language model's hidden state but NOT yet manifest in the surface text.
To measure that capability we need probes that satisfy two constraints:
1. **HARD-FROM-TEXT** — a careful reader of the side OPPOSITE to the
target should NOT be able to confidently produce the answer from the
complement text alone (no verbatim quote, no clean paraphrase).
2. **EASY-FROM-LATENT** — the answer should be exactly the kind of
content the source LM has already COMMITTED TO in its hidden state at
the split (the next thematic move it was primed for; the specific
named entity it has been tracking; the conclusion it was building
toward).
Probes that fail (1) measure nothing M adds beyond a competent text
reader. Probes that fail (2) measure entropy, not oracle skill. The
**`bb_answer_score_max`** column in this dataset is precisely the (1)
test — a low value flags a good HARD-FROM-TEXT probe.
For EASY-FROM-LATENT, each spw (sentence / paragraph / whole) probe
carries a `target_response` field: the answer a "Big Brother" (BB)
monitor would give if it had access to **both** the complement context
AND `atom_text` (i.e. the full-info ground truth, generated by Haiku in
one joint call). The contrast between `target_response` (full-info
answer) and the **blind** `bb_answer_*` columns (complement-only answer)
operationalizes the gap M is supposed to close.
## Probe design — five scope groups
Each source doc (FineWeb prefix + sibling Qwen3.5-9B continuations)
yields up to **16 probes** spanning five scope groups:
| scope | n in this build | atom (what's being recovered) | scoring path |
|---|---:|---|---|
| word | 184,959 | a single content word | local (exact-match / token-logprob) |
| lens | 184,805 | N contiguous tokens immediately before or after the split | local (Qwen tokenizer) |
| sentence | 110,882 | one sentence | Haiku answerer + judge |
| paragraph | 73,932 | one paragraph | Haiku answerer + judge |
| whole | 35,622 | one half-doc (≥ all paragraphs on one side) | Haiku answerer + judge |
**Total probes in this build: 590,741** across 36,997 distinct
source docs (3 rounds). Per-doc slot allocation is **5 word + 5 lens + 3
sentence + 2 paragraph + 1 whole = 16 slots per doc**. Word and lens
probes are Python-templated (no LLM); sentence / paragraph / whole
probes are written by Claude Haiku 4.5 in extended-thinking mode (see
"Generator" below).
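
The slot allocation above can be checked against downloaded rows. The sketch below is only an illustration: it assumes probes are available as plain dicts with the `doc_id` and `scope` columns, and the helper name `over_allocated_scopes` is hypothetical, not part of the released scripts.

```python
from collections import Counter

# Per-doc slot allocation as stated in this card (5 + 5 + 3 + 2 + 1 = 16).
SLOT_ALLOCATION = {"word": 5, "lens": 5, "sentence": 3, "paragraph": 2, "whole": 1}


def over_allocated_scopes(probes):
    """Return {doc_id: {scope: count}} for docs that exceed the allocation.

    `probes` is any iterable of dicts with `doc_id` and `scope` keys.
    Docs with fewer probes than the allocation are fine ("up to 16").
    """
    per_doc = {}
    for probe in probes:
        per_doc.setdefault(probe["doc_id"], Counter())[probe["scope"]] += 1

    flagged = {}
    for doc_id, counts in per_doc.items():
        excess = {s: n for s, n in counts.items() if n > SLOT_ALLOCATION.get(s, 0)}
        if excess:
            flagged[doc_id] = excess
    return flagged
```

An empty result means no doc carries more probes of a scope than the stated allocation allows.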
### Per-scope splits
Each row carries **four** char-level split offsets — one per scope group:
- `word_lens` split → applies to the 10 word + lens probes
- `sentence` split → applies to the 3 sentence probes
- `paragraph` split → applies to the 2 paragraph probes
- `whole` split → applies to the 1 whole probe
Haiku picks all four split points in a single response. It may use the
same offset for all four (≈51% of rows in the smoke test), or up to four
distinct offsets when different scopes need different boundaries (≈49%
of rows). The per-probe `split_char_offset` field gives the offset
that applies to THAT probe.
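
As a minimal sketch of the scope-to-split mapping described above, the helper below looks up the offset for a probe's scope in a `row_splits` dict. The function names are hypothetical; only the four `row_splits` keys and the five scope names come from this card.

```python
# word and lens probes share the `word_lens` split; the other scopes have their own.
SCOPE_TO_SPLIT_KEY = {
    "word": "word_lens",
    "lens": "word_lens",
    "sentence": "sentence",
    "paragraph": "paragraph",
    "whole": "whole",
}


def split_offset_for(row_splits, scope):
    """Return the char-level split offset that applies to a probe of `scope`."""
    return row_splits[SCOPE_TO_SPLIT_KEY[scope]]


def n_distinct_splits(row_splits):
    """Count distinct offsets (1 = one shared split, up to 4 = all different)."""
    return len(set(row_splits.values()))
```

If the released data is consistent with this description, `split_offset_for(row["row_splits"], row["scope"])` should equal `row["split_char_offset"]` for every probe; that equality is an expectation to verify, not a documented guarantee.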
### Targets and typicality
For each probe Haiku picks a `target ∈ {prefix, suffix}`:
- **prefix-target** → M sees the suffix; must POST-dict an earlier-in-doc atom.
- **suffix-target** → M sees the prefix; must PRE-dict a later-in-doc atom.
For suffix-target slots, Haiku additionally labels
`typicality ∈ {typical, atypical}` against the cross-sibling
distribution:
- `typical` → the specific claim/event/move recurs in at least 3 other
sibling continuations at a comparable position.
- `atypical` → the claim is essentially unique to the chosen sibling and
absent from the rest.
The `distribution_check` field records Haiku's justification.
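
The target convention can be sketched as a slice of the document. This assumes `split_char_offset` indexes directly into the `doc` string, which this card does not state explicitly; the stored `context` column remains the authoritative complement text, and both function names are hypothetical.

```python
def complement_side(doc, split_char_offset, target):
    """Return the text M is allowed to see for a probe.

    prefix-target: the atom is before the split, so M sees the suffix.
    suffix-target: the atom is after the split, so M sees the prefix.
    """
    if target == "prefix":
        return doc[split_char_offset:]
    if target == "suffix":
        return doc[:split_char_offset]
    raise ValueError(f"unknown target: {target!r}")


def typicality_is_consistent(target, typicality):
    """Typicality labels are only defined for suffix-target probes."""
    if target == "suffix":
        return typicality in {"typical", "atypical"}
    return typicality is None
```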
### Synergy constraint (per scope)
Each multi-token atom (sentence / paragraph / whole) must require
**synthesizing information across the entire atom**, not a strict
sub-region:
> *sentence* → must integrate the main clause with its qualifying
> clauses / numbers / objects.
> *paragraph* → must combine at least two of the paragraph's sentences.
> *whole* → must integrate across multiple paragraphs/sections (an
> arc, a thesis-evidence chain, a tension between stated stances).
The `synergy_check` field per probe names which sub-pieces must be
integrated.
### Incorrect plausible answer
Every probe (incl. word & lens) carries an `incorrect_plausible_answer`
(IPA) — a hypothetical alternative answer a careful reader could
PLAUSIBLY produce from the complement side, that happens to be wrong.
This enables contrastive scoring (`logP(atom)` vs `logP(IPA)` under M).
The IPA must match the atom's shape (single word for word; same-length
sentence for sentence; etc.) and must NOT be what any sibling actually
produced.
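
Contrastive scoring with the IPA can be sketched as a log-probability margin. How M assigns `logP` to a candidate answer is outside this card, so the inputs below are assumed to be precomputed sequence log-probabilities, and the function names are illustrative only.

```python
def contrastive_margin(logp_atom, logp_ipa):
    """Positive when M prefers the true atom over the incorrect plausible answer."""
    return logp_atom - logp_ipa


def contrastive_accuracy(pairs):
    """Fraction of probes where logP(atom) > logP(IPA).

    `pairs` is an iterable of (logp_atom, logp_ipa) tuples, one per probe.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one (logp_atom, logp_ipa) pair")
    wins = sum(1 for logp_atom, logp_ipa in pairs if logp_atom > logp_ipa)
    return wins / len(pairs)
```

Because the IPA is required to match the atom's shape, the two log-probabilities cover sequences of similar length, which keeps the raw margin comparable without length normalization. Whether to normalize per token is still a design choice left to the evaluator.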
## Generator
**Model**: Claude Haiku 4.5 (`claude-haiku-4-5-20251001`) in
extended-thinking mode (8K thinking-budget, 28K max output tokens),
submitted via Anthropic Message Batches API for ~50% cost reduction.
**What Haiku sees per row**: the FineWeb prefix + up to 3 sibling
Qwen3.5-9B continuations of that prefix (sampled from the source
rollouts dataset). The siblings give Haiku a sense of the source LM's
distribution at this point, which it uses to label typicality and to
craft IPAs that don't accidentally match a real sibling.
**Prompt scaffold** (lightly abridged from
`scripts/oracle_question_prompt.py`):
> You design probing questions for evaluating a method M.
>
> What M does: M is given (i) the PREFIX of a document and (ii) the
> LATENT HIDDEN-STATE of the source LM at the boundary between prefix
> and suffix — the same source LM that generated the suffix. M's
> capability is to use that hidden state to read off content the source
> LM was about to emit (for SUFFIX-target probes) or had committed to
> earlier in the document (for PREFIX-target probes).
>
> The goal of each probe: construct a question whose answer is
> HARD-FROM-TEXT but EASY-FROM-LATENT.
>
> [...]
>
> Self-check before finalizing each probe: mentally try to answer your
> verbalizer using ONLY the complement-side text. If you can produce
> atom_text (or a clean paraphrase) from that alone, the probe is too
> easy on text — pick a different atom or rewrite the verbalizer to
> require inference about what the source LM was INTERNALLY tracking,
> not what is already surface-visible.
**Calibration**: rounds were tuned on a 100-doc calibration sweep with
addendum levels {-3..+3} sweeping HARD-FROM-TEXT strictness; the
build settled on **level -2** ("probes have been too hard. Pick atoms
that share an obvious topical / narrative thread with the complement
so a text-only reader has solid footing, while still leaving the
SPECIFIC content of the atom for M to recover"). This produced a
mean answerer-max score of ~0.50 with substantial bimodality (see below).
**JSON robustness**: ~1.3% of Haiku responses had small JSON quirks
(unescaped quotes inside long string fields, raw newlines). These are
recovered with [`json-repair`](https://pypi.org/project/json-repair/),
lifting end-to-end finalize rate from ~86% (strict parsing only) to
98.7%.
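
A strict-then-repair parse might look like the sketch below. It is not the pipeline's actual code: `parse_probe_json` is a hypothetical name, and the `repair` parameter is injectable so the fallback can be swapped out. When no repair function is passed, the sketch uses `repair_json` from the `json-repair` package, which returns a repaired JSON string.

```python
import json


def parse_probe_json(raw, repair=None):
    """Parse a generator response, falling back to repair on malformed JSON.

    Returns (parsed_object, "strict" | "repaired").
    """
    try:
        return json.loads(raw), "strict"
    except json.JSONDecodeError:
        if repair is None:
            # Third-party dependency: pip install json-repair
            from json_repair import repair_json
            repair = repair_json
        return json.loads(repair(raw)), "repaired"
```

Returning the parse path alongside the object makes it easy to count how many responses needed repair.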
## Answerer + judge (scoring pass)
For sentence / paragraph / whole probes, this dataset additionally
carries 5 stochastic answerer rollouts (Haiku 4.5, temperature=1.0,
max_tokens=800) and a per-rollout judge score from another Haiku 4.5
call (0.0–1.0).
The answerer sees ONLY the complement side of the split (up to 4000
chars) plus the verbalizer, and is asked to produce a concise answer.
The judge sees the verbalizer, the ground-truth `atom_text`, and the
model answer, and emits a single float score:
| score | meaning |
|---|---|
| 0.0 | totally wrong / unrelated |
| 0.5 | partially correct or correct theme but wrong specifics |
| 1.0 | essentially the same content as ground truth |
The **max** across 5 rollouts is the headline filter signal: if even
the best of 5 stochastic attempts can't recover the atom from the
complement, the probe is HARD-FROM-TEXT — exactly what you want to
keep for evaluating M.
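
The three summary columns are simple aggregates of the per-rollout judge scores. A minimal sketch, assuming the scores arrive as a list of floats in [0, 1] (the function name is hypothetical):

```python
from statistics import fmean


def summarize_rollouts(scores):
    """Aggregate per-rollout judge scores into the bb_answer_score_* columns."""
    if not scores:
        # word / lens probes carry no answerer rollouts.
        return {
            "bb_answer_score_mean": None,
            "bb_answer_score_max": None,
            "bb_answer_score_min": None,
        }
    return {
        "bb_answer_score_mean": fmean(scores),
        "bb_answer_score_max": max(scores),
        "bb_answer_score_min": min(scores),
    }
```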
### Max-score histogram (this build)
```
[0.0, 0.2) 33,518 ( 17.7%) ######## HARD
[0.2, 0.4) 45,525 ( 24.1%) ############
[0.4, 0.6) 25,803 ( 13.7%) ###### valley
[0.6, 0.8) 52,087 ( 27.6%) #############
[0.8, 1.0] 31,961 ( 16.9%) ######## EASY
```
mean=0.461 std=0.296 · HARD (max<0.4) 41.8% ·
EASY (max≥0.8) 16.9%
The clear central valley around [0.4, 0.6) is the filtering signal:
keep the HARD tail for M-evaluation; the EASY tail is text-derivable
and serves as a sanity check.
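
The bin edges and the HARD / EASY cutoffs above translate directly into code. The sketch below uses the card's thresholds (HARD is max < 0.4, EASY is max ≥ 0.8); the function names and the "middle" label are assumptions made for illustration.

```python
from bisect import bisect_right

# Interior edges of the five histogram bins; the last bin [0.8, 1.0] is closed.
BIN_EDGES = [0.2, 0.4, 0.6, 0.8]


def max_score_bin(score):
    """Return the histogram bin index (0..4) for a bb_answer_score_max value."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    return bisect_right(BIN_EDGES, score)


def difficulty_label(score):
    """Label a probe by how recoverable it is from complement text alone."""
    if score < 0.4:
        return "HARD"
    if score >= 0.8:
        return "EASY"
    return "middle"
```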
### Cost & methodology of the answerer pass
- spw probes scored: **188,819**
- answerer calls: 5 × 188,819 = 944,095
- judge calls: 5 × 188,819 = 944,095
- estimated tokens: ~1227.3M input + ~75.5M output (answerer),
plus ~377.6M input + ~4.72M output (judge)
- batch-discounted Haiku 4.5 ($0.50/M input · $2.50/M output): **~$1003**
- wall time: ~3–4 hours on Anthropic's batch queue (2 answerer chunks
≤256 MB each, plus 2 judge chunks ≤100,000 requests each — the
Anthropic Message Batches API has a per-batch hard cap of 100K
requests).
If you want to re-score with more rollouts (denoising the per-probe
max-score estimate), reuse `scripts/score_probes.py --model qwen3.5-9b
--n-rollouts K` — the cost is linear in K.
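
The cost estimate can be reproduced from the token counts and batch prices listed above. This is a back-of-the-envelope sketch; the helper names are hypothetical, and the linear scaling in K assumes per-rollout token usage stays the same.

```python
# Batch-discounted Haiku 4.5 prices quoted in this card (USD per million tokens).
INPUT_PRICE_PER_M = 0.50
OUTPUT_PRICE_PER_M = 2.50


def batch_cost_usd(input_tokens_m, output_tokens_m):
    """Cost in USD for token counts given in millions."""
    return input_tokens_m * INPUT_PRICE_PER_M + output_tokens_m * OUTPUT_PRICE_PER_M


# Estimated token counts from this card, in millions.
ANSWERER_IN_M, ANSWERER_OUT_M = 1227.3, 75.5
JUDGE_IN_M, JUDGE_OUT_M = 377.6, 4.72

TOTAL_5_ROLLOUTS = batch_cost_usd(
    ANSWERER_IN_M + JUDGE_IN_M, ANSWERER_OUT_M + JUDGE_OUT_M
)


def cost_for_rollouts(k, baseline_rollouts=5):
    """Scale the 5-rollout estimate linearly to K rollouts per probe."""
    return TOTAL_5_ROLLOUTS * k / baseline_rollouts
```

Input tokens contribute about $802 and output tokens about $201, which together match the ~$1003 figure.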
## Schema (per row)
Each row is one probe. Key fields:
| column | type | meaning |
|---|---|---|
| `doc_id` | str | FineWeb-derived id |
| `doc_source` | str | original FineWeb URL |
| `doc` | str | the chosen sibling continuation (prefix + suffix) |
| `doc_idx` | int | which sibling Haiku committed to |
| `split_char_offset` | int | character split for this probe's scope group |
| `row_splits` | dict | all four scope-group splits for the row |
| `slot_idx` | int | 0..15 within the row |
| `scope` | str | word / lens / sentence / paragraph / whole |
| `target` | str | prefix or suffix |
| `typicality` | str/null | typical / atypical (suffix only) |
| `abs_index` | int | absolute atom index |
| `signed_index` | int | signed atom index (negative = prefix, positive = suffix) |
| `atom_text` | str/null | text of the target atom |
| `verbalizer_prompt` | str/null | the probing question |
| `incorrect_plausible_answer` | str/null | one hypothetical wrong-but-plausible alternative |
| `synergy_check` | str/null | which sub-pieces must be integrated (multi-token scopes) |
| `distribution_check` | str/null | relation to cross-sibling distribution (suffix slots) |
| `difficulty_check` | str/null | the per-probe HARD-FROM-TEXT / EASY-FROM-LATENT argument |
| `source` | str | python_template or haiku |
| `n_tokens_actual` | int/null | actual token count for lens probes |
| `tokenizer` | str/null | tokenizer used (Qwen3-8B) for lens probes |
| `generator_model` | str | always `claude-haiku-4-5-20251001` |
| `source_model` | str | `qwen3.5-9b` |
| `bb_answer_rollout_answers` | list[str]/null | 5 model answers (spw probes only) |
| `bb_answer_rollout_scores` | list[float]/null | 5 judge scores |
| `bb_answer_score_mean` | float/null | mean of the 5 scores |
| `bb_answer_score_max` | float/null | **max — primary filter signal** |
| `bb_answer_score_min` | float/null | min |
| `target_response` | str | full-info correct answer (Haiku sees context + atom_text); for word/lens probes this equals `atom_text` |
| `context` | str | complement-side raw text (the side opposite to atom_text relative to the scope's split) |
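
A sketch of consuming rows under this schema follows. Note that the YAML header declares `slot_idx`, `abs_index`, `signed_index`, and `n_tokens_actual` as `float64` while the table lists them as int, so the sketch casts them back. The helper names and the 0.4 default threshold (taken from the HARD cutoff in the histogram section) are illustrative choices.

```python
import math

# Columns the schema table lists as int but the YAML header stores as float64.
INT_LIKE_COLUMNS = ("slot_idx", "abs_index", "signed_index", "n_tokens_actual")


def normalize_row(row):
    """Return a copy of `row` with int-like float columns cast to int or None."""
    out = dict(row)
    for col in INT_LIKE_COLUMNS:
        value = out.get(col)
        if value is None or (isinstance(value, float) and math.isnan(value)):
            out[col] = None
        else:
            out[col] = int(value)
    return out


def is_hard_from_text(row, threshold=0.4):
    """True when even the best blind rollout scored below `threshold`."""
    score = row.get("bb_answer_score_max")
    return score is not None and score < threshold


def iter_hard_probes(rows, threshold=0.4):
    """Yield normalized rows that pass the HARD-FROM-TEXT filter."""
    for row in rows:
        if is_hard_from_text(row, threshold):
            yield normalize_row(row)


# Example source of `rows` (requires the `datasets` package and network access):
#   from datasets import load_dataset
#   rows = load_dataset("cds-jb/synthweb-qwen3.5-9b-multiscale-inference",
#                       split="train", streaming=True)
```

Word and lens probes have null `bb_answer_*` columns, so this filter drops them; they need a separate selection rule.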
## Reproducibility & incremental growth
The dataset is built in **rounds**. Each round samples N **new**
doc-ids — disjoint from prior rounds — using a deterministic seed. The
local manifest at
`data_pipelines/multiscale_inference/qwen3.5-9b/manifest.json` records
`(round_idx, batch_ids, doc_ids, seeds, submitted_at)` per round, so a
follow-up extension run will pick up where this one left off without
overlap.
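
Round-wise sampling without overlap can be sketched as follows. The actual manifest layout is not shown in this card, so the sketch assumes a list of round records that each hold a `doc_ids` list; `sample_new_round` is a hypothetical name.

```python
import random


def sample_new_round(all_doc_ids, manifest_rounds, n, seed):
    """Deterministically sample up to `n` doc-ids unused by earlier rounds.

    `manifest_rounds` is assumed to be a list of dicts with a `doc_ids` key.
    """
    used = {doc_id for rnd in manifest_rounds for doc_id in rnd["doc_ids"]}
    # Sort so the result depends only on the seed, not on input ordering.
    candidates = sorted(set(all_doc_ids) - used)
    rng = random.Random(seed)
    return rng.sample(candidates, min(n, len(candidates)))
```

Sorting the candidate pool before sampling is what makes the draw reproducible across runs for a fixed seed.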
Scripts (in [github repo TBD]):
- `scripts/submit_multiscale_inference.py` — submit a round of generator batches
- `scripts/poll_multiscale_inference.py` — fetch + finalize + push probes
- `scripts/score_probes.py` — score with answerer + judge
- `scripts/oracle_question_prompt.py` — the procedural prompt builder
## Source rollouts
- [`cds-jb/synthweb-qwen3.5-9b`](https://huggingface.co/datasets/cds-jb/synthweb-qwen3.5-9b) — Qwen3.5-9B continuations of FineWeb prefixes
(mode-collapse-filtered and detached-tail-truncated).
- See that dataset's README for the rollout filter pipeline.
## License
Same as upstream FineWeb (ODC-By 1.0) for the prefix text; generated
continuations and probes are released under the same license for
research use.
Provider:
cds-jb



