hyunseoki/memory-reasoning-split-eval-sets

Name: hyunseoki/memory-reasoning-split-eval-sets
Creator: hyunseoki
Published: 2026-04-19 13:20:21
License: no description provided

Source: Hugging Face. Updated 2026-04-19; indexed 2026-04-26.

Download link:

https://hf-mirror.com/datasets/hyunseoki/memory-reasoning-split-eval-sets


Description:

---
license: odc-by
task_categories:
- question-answering
- text-generation
language:
- en
tags:
- reasoning
- factual-recall
- ablation
- qwen3
- memory-offloading
pretty_name: Memory-Reasoning-Split Stage D Eval Sets
size_categories:
- 1K<n<10K
---

# Memory-Reasoning-Split Stage D Eval Sets

Curated + subsetted closed-book evaluation sets used to measure the per-domain factual degradation × reasoning retention trade-off in the [`memory_reasoning_split`](https://github.com/hyunseoklee-ai/memory_split) Stage D/E forget-corpus ablation.

All factual rows share a unified schema so a single evaluator can score them:

```
{
  "question": str,
  "aliases": list[str],    # any normalized alias match counts as a hit
  "relation": str,
  "topic": str,
  "source_dataset": str    # popqa_general / sciq subset only; custom splits omit
}
```

## Files

| File | Rows | Kind | Source |
|---|---:|---|---|
| `popqa_general.jsonl` | 1232 | general-domain closed-book QA | [`akariasai/PopQA`](https://huggingface.co/datasets/akariasai/PopQA) test subset, reshaped to unified schema with aliases + relation + topic |
| `math_facts.jsonl` | 100 | hand-curated closed-book | constants, theorems, formulas (calculus, linear algebra, probability, geometry, trigonometry, number theory, ...) |
| `code_api_facts.jsonl` | 101 | hand-curated closed-book | Python/NumPy/PyTorch/Pandas/JS/C++/Rust/Go/SQL API signatures + shell/git/HTTP trivia |
| `sciq.jsonl` | 500 | subset of [`allenai/sciq`](https://huggingface.co/datasets/allenai/sciq) `test` split, reshaped | converted to the unified QA schema |
| `humaneval_prompts.jsonl` | 164 | full [`openai_humaneval`](https://huggingface.co/datasets/openai_humaneval) `test` split | prompt + canonical_solution + test + entry_point |

**Why `popqa_general`?** Stage D's forget corpus is Wikipedia-derived, so the most natural "Did you just break general-domain factual recall?" probe is general-topic PopQA (1232 rows across relations like `occupation`, `place_of_birth`, `capital`, etc.). Pairs with the NER-masked training corpus at [`hyunseoki/popqa-mini-ner-knowledge-masks`](https://huggingface.co/datasets/hyunseoki/popqa-mini-ner-knowledge-masks).

## Usage

Closed-book factual eval (vLLM-backed):

```bash
python scripts/eval_stage_d_factual.py \
    --config configs/stage_d/d0_no_forget.yaml \
    --adapter_checkpoint outputs/stage_d/d1_wikipedia/checkpoint-1338 \
    --eval_jsonl popqa_general.jsonl \
    --tag popqa_general --output_dir outputs/stage_d_eval/d1
```

HumanEval pass@1 (sandboxed subprocess scorer):

```bash
python scripts/eval_stage_d_humaneval.py \
    --config configs/stage_d/d0_no_forget.yaml \
    --adapter_checkpoint outputs/stage_d/d4_all_domains/checkpoint-1338 \
    --eval_jsonl humaneval_prompts.jsonl \
    --output_dir outputs/stage_d_eval/d4
```

The `scripts/run_stage_d_eval.sh` orchestrator runs all five splits (`popqa_general`, `math_facts`, `code_api_facts`, `sciq`, `humaneval`) plus MATH-500 and AMC23 reasoning tasks across every trained Stage D/E model in parallel.

## Reproducibility

- `math_facts` / `code_api_facts` were hand-authored for this project; all 201 items have ≥ 3 accepted aliases to absorb surface-form variation.
- `sciq.jsonl` was produced by `scripts/prepare_stage_d_eval_sets.py` with a deterministic first-500 slice of the SciQ test split.
- `humaneval_prompts.jsonl` was produced by the same script; it keeps the upstream prompt/test/entry_point unchanged for standard pass@1 scoring.
- `popqa_general.jsonl` was produced by the same script from the `popqa_sharded_test` subset; each row preserves PopQA's `relation`, canonical `topic` (subject), and the full alias list for normalized match.

## Intended use

Drop-in eval suite for measuring **per-domain factual retention** and **code usability** of adapters trained with selective forgetting over different forget corpora. Pairs with the retain corpus [`hyunseoki/qwen3-0p6b-openthoughts-self-distill-10k`](https://huggingface.co/datasets/hyunseoki/qwen3-0p6b-openthoughts-self-distill-10k), the NER-masked forget corpus [`hyunseoki/popqa-mini-ner-knowledge-masks`](https://huggingface.co/datasets/hyunseoki/popqa-mini-ner-knowledge-masks), and the dedup index [`hyunseoki/openthoughts3-dedup-index`](https://huggingface.co/datasets/hyunseoki/openthoughts3-dedup-index). All assets are grouped under the [Qwen3 Lambda Gates collection](https://huggingface.co/collections/hyunseoki/qwen3-lambda-gates-knowledge-reasoning-disentanglement-69e20c8e64960042ed4c3159).

## License / Attribution

SciQ rows © Allen AI (CC BY-NC 3.0). HumanEval rows © OpenAI (released under the MIT license in the `openai_humaneval` dataset card; please follow its terms). PopQA rows © Asai et al. (MIT license). The hand-curated `math_facts` and `code_api_facts` are CC BY-SA 4.0.
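
## Example: alias-match scoring (illustrative sketch)

The unified schema says that any normalized alias match counts as a hit, but the card does not spell out the normalization itself. The sketch below shows one plausible way to score rows in that schema. The function names (`normalize`, `is_hit`, `accuracy_by_relation`, `load_jsonl`) and the specific normalization rules (accent stripping, lowercasing, punctuation and article removal, whole-token containment) are assumptions for illustration, not the behavior of `scripts/eval_stage_d_factual.py`.

```python
import json
import re
import string
import unicodedata
from collections import Counter


def normalize(text: str) -> str:
    """Assumed normalization: strip accents, lowercase, drop punctuation and articles."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.lower()
    # Replace punctuation with spaces so tokens stay separated.
    text = "".join(" " if ch in string.punctuation else ch for ch in text)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def is_hit(prediction: str, aliases: list[str]) -> bool:
    """True if any non-empty normalized alias appears as whole tokens in the prediction."""
    padded = f" {normalize(prediction)} "
    for alias in aliases:
        norm_alias = normalize(alias)
        if norm_alias and f" {norm_alias} " in padded:
            return True
    return False


def accuracy_by_relation(rows: list[dict], predictions: list[str]) -> dict[str, float]:
    """Per-relation hit rate; rows follow the unified schema, predictions are model outputs."""
    totals, hits = Counter(), Counter()
    for row, prediction in zip(rows, predictions):
        relation = row.get("relation", "unknown")
        totals[relation] += 1
        hits[relation] += int(is_hit(prediction, row["aliases"]))
    return {relation: hits[relation] / totals[relation] for relation in totals}


def load_jsonl(path: str) -> list[dict]:
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

To use it, load a split with `load_jsonl("popqa_general.jsonl")`, generate one answer string per row with the model under test, and pass both lists to `accuracy_by_relation`. Token-level containment is deliberately lenient toward verbose answers; a stricter exact-match variant would compare `normalize(prediction)` to each normalized alias directly.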
