
microsoft/OpenMementos

Source: Hugging Face. Updated 2026-04-08; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/microsoft/OpenMementos
Description:
---
language:
- en
license: mit
size_categories:
- 100K<n<1M
task_categories:
- text-generation
tags:
- reasoning
- chain-of-thought
- context-compression
- synthetic
- memento
pretty_name: OpenMementos-228K
dataset_info:
- config_name: default
  features:
  - name: problem
    dtype: string
  - name: response
    dtype: string
  - name: domain
    dtype: string
  - name: source
    dtype: string
  - name: difficulty
    dtype: int64
  splits:
  - name: train
    num_examples: 228557
- config_name: full
  features:
  - name: problem
    dtype: string
  - name: response
    dtype: string
  - name: domain
    dtype: string
  - name: source
    dtype: string
  - name: difficulty
    dtype: int64
  - name: sentences
    sequence: string
  - name: blocks
    sequence:
      sequence: int64
  - name: block_summaries
    sequence: string
  splits:
  - name: train
    num_examples: 228557
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  default: true
- config_name: full
  data_files:
  - split: train
    path: full/train-*
---

# OpenMementos-228K

A dataset of **228,557** reasoning traces annotated with block segmentation and compressed summaries (mementos), derived from [OpenThoughts-v3](https://huggingface.co/datasets/open-thoughts/OpenThoughts3-1.2M).

Memento is a framework for teaching language models to **manage their own context** during long-form reasoning. Instead of generating one long, unstructured chain-of-thought, memento-trained models segment their reasoning into **blocks**, compress each block into a dense **summary** (a *memento*), and continue reasoning from mementos alone. In the released training data, the paper reports **~6× trace-level compression**, from ~10,900 block tokens to ~1,850 memento tokens per trace.
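As a quick sanity check on the reported figure, the ratio of the two token counts quoted above works out to roughly 5.9, which is consistent with the "~6×" rounding. The variable names below are illustrative only, and the inputs are the approximate values quoted from the paper, not numbers recomputed from the dataset.

```python
# Approximate per-trace token counts as quoted above (assumed inputs).
block_tokens = 10_900    # tokens in the original reasoning blocks
memento_tokens = 1_850   # tokens in the mementos that replace them

ratio = block_tokens / memento_tokens
print(f"trace-level compression: {ratio:.2f}x")
```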
Code: [microsoft/memento](https://github.com/microsoft/memento)

## Quick Start

```python
from datasets import load_dataset

# Training-ready format (default)
ds = load_dataset("microsoft/OpenMementos", split="train")

# With pipeline components (sentences, blocks, summaries)
ds = load_dataset("microsoft/OpenMementos", "full", split="train")
```

## Available Subsets

### `default`: Training-ready data

Contains pre-assembled memento-formatted responses ready for SFT training.

| Column | Type | Description |
|--------|------|-------------|
| `problem` | string | Problem statement |
| `response` | string | Memento-formatted response with block/summary tokens (see format below) |
| `domain` | string | `code`, `math`, or `science` |
| `source` | string | Original dataset source |
| `difficulty` | int | Difficulty rating (code domain only, 6–10) |

### `full`: With pipeline components

Everything in `default`, plus the intermediate pipeline outputs for researchers who want to re-segment, analyze, or reconstruct:

| Column | Type | Description |
|--------|------|-------------|
| `sentences` | list[string] | Individual sentences from sentence splitting |
| `blocks` | list[list[int]] | Block boundaries as `[start_idx, end_idx]` sentence index pairs |
| `block_summaries` | list[string] | Iteratively refined summary for each block |

## Response Format

The `response` column contains reasoning traces with Memento special tokens:

```
<think>
<|block_start|> [reasoning sentences for block 1] <|block_end|>
<|summary_start|> [compressed summary of block 1] <|summary_end|>
<|block_start|> [reasoning sentences for block 2] <|block_end|>
<|summary_start|> [compressed summary of block 2] <|summary_end|>
...
</think>
[final answer]
```

### Special Tokens

| Token | Purpose |
|-------|---------|
| `<think>` / `</think>` | Reasoning wrapper |
| `<\|block_start\|>` / `<\|block_end\|>` | Reasoning block delimiters |
| `<\|summary_start\|>` / `<\|summary_end\|>` | Summary (memento) delimiters |

During inference with block masking, completed blocks are evicted from the KV cache and the model continues reasoning from the summary tokens alone.

## Dataset Statistics

### Overview

| Statistic | Value |
|-----------|-------|
| Total examples | 228,557 |
| Math examples | 123,333 (54%) |
| Science examples | 61,485 (27%) |
| Code examples | 43,739 (19%) |
| Avg sentences per example | 187 |

### Block & Summary Statistics (paper-aligned)

| Statistic | Value |
|-----------|-------|
| Median blocks per example | Math: ~9, Code: ~9, Science: ~7 |
| Median block size | Ranges from ~2.3K chars (science) to ~3.8K chars (math) |
| Median summary size | ~509–603 chars across domains |
| Median compression ratio | Math: 0.16, Code: 0.18, Science: 0.23 |
| Block-level compression | ~4×–6× depending on domain |
| Average block length | ~1,150 tokens |
| Average memento length | ~194 tokens |
| Trace-level compression | ~6× (from ~10,900 block tokens to ~1,850 memento tokens per trace) |

### Summary Quality

| Statistic | Value |
|-----------|-------|
| Single-pass pass rate (≥ 8/10) | 28% |
| After iterative refinement (≥ 8/10) | 92% |
| Judge threshold | 8/10 |
| Max judge-feedback rounds used to build the dataset | 2 |

## Data Sources

All reasoning traces are derived from [OpenThoughts-v3](https://huggingface.co/datasets/open-thoughts/OpenThoughts3-1.2M).
The released 228K traces inherit the following source composition:

| Source | Domain | Count |
|--------|--------|-------|
| [ai2-adapt-dev/openmath-2-math](https://huggingface.co/datasets/ai2-adapt-dev/openmath-2-math) | math | 123,333 |
| [organic-chemistry-questions](https://huggingface.co/datasets/organic-chemistry-questions) | science | 39,097 |
| [nvidia/OpenCodeReasoning](https://huggingface.co/datasets/nvidia/OpenCodeReasoning) | code | 25,155 |
| [stackexchange-physics](https://huggingface.co/datasets/stackexchange-physics) | science | 22,388 |
| [stackexchange_codegolf](https://huggingface.co/datasets/stackexchange_codegolf) | code | 18,584 |

## Data Pipeline

The dataset was constructed through the following pipeline:

1. **Sentence Splitting**: Original chain-of-thought traces are partitioned into atomic reasoning sentences while preserving code and multi-line math; this reduces candidate boundaries from ~397 to ~187 per trace on average.
2. **Boundary Scoring**: An LLM scores each sentence boundary on a **0–3** scale for how suitable it is as a block break point.
3. **Block Segmentation**: Sentences are grouped into blocks with algorithmic optimization over boundary scores, while enforcing a minimum block size of 200 tokens and penalizing highly unbalanced partitions.
4. **Summary Generation**: Each block is compressed into a memento that preserves the logically relevant reasoning state needed for subsequent blocks, targeting ~15–25% of the original tokens.
5. **Iterative Refinement**: Summaries are evaluated by an LLM judge on a 0–10 rubric and refined with judge feedback for up to two rounds until they meet a quality threshold (≥8/10).

## Training

This dataset can be used to fine-tune models for memento-style generation.
The typical training setup:

```python
# Example: SFT with the default subset
from datasets import load_dataset

ds = load_dataset("microsoft/OpenMementos", split="train")

# Each example has: problem (user message), response (assistant message)
# Format as chat messages for your trainer:
def format_chat(example):
    return {
        "messages": [
            {"role": "user", "content": example["problem"]},
            {"role": "assistant", "content": example["response"]},
        ]
    }

ds = ds.map(format_chat)
```

## License

MIT License. See [LICENSE](LICENSE) for details.

## Citation

```bibtex
@article{memento2026,
  author={Vasilis Kontonis and Yuchen Zeng and Shivam Garg and Lingjiao Chen and Hao Tang and Ziyan Wang and Ahmed Awadallah and Eric Horvitz and John Langford and Dimitris Papailiopoulos},
  title={Memento: Teaching LLMs to Manage Their Own Context},
  year={2026},
}
```
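For analysis, the response format described in the Response Format section can be split back into (block, summary) pairs with a regular expression. The following is a minimal sketch, not part of the official `microsoft/memento` code. The helper name `parse_memento_response` is hypothetical, and the sketch assumes that the special tokens appear literally in the `response` string (for example `<|block_start|>`), that each block is immediately followed by its summary, and that the trace contains a single closing `</think>`.

```python
import re

# One reasoning block followed by its summary (token spellings are assumed).
PAIR_RE = re.compile(
    r"<\|block_start\|>(.*?)<\|block_end\|>\s*"
    r"<\|summary_start\|>(.*?)<\|summary_end\|>",
    re.DOTALL,
)


def parse_memento_response(response: str):
    """Return ([(block_text, summary_text), ...], final_answer)."""
    reasoning, closing_tag, tail = response.partition("</think>")
    pairs = [(b.strip(), s.strip()) for b, s in PAIR_RE.findall(reasoning)]
    # If there is no closing </think>, treat the final answer as empty.
    final_answer = tail.strip() if closing_tag else ""
    return pairs, final_answer


# Tiny hand-written trace used only to illustrate the format.
example = (
    "<think>\n"
    "<|block_start|>Let x = 2. Then x + 3 = 5.<|block_end|>\n"
    "<|summary_start|>x + 3 = 5<|summary_end|>\n"
    "<|block_start|>Doubling gives 10.<|block_end|>\n"
    "<|summary_start|>2(x + 3) = 10<|summary_end|>\n"
    "</think>\n"
    "The answer is 10."
)

pairs, answer = parse_memento_response(example)
for block, summary in pairs:
    print(f"block: {block!r} -> memento: {summary!r}")
print("answer:", answer)
```

Applied to a row of the dataset, the same helper would be called as `parse_memento_response(row["response"])`. If the assumed token spellings differ from what the data actually contains, only `PAIR_RE` needs to change.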
Provider:
microsoft