eastlondoner/nanochat-wasm-examples

Name: eastlondoner/nanochat-wasm-examples
Creator: eastlondoner
Published: 2026-04-09 16:10:46
License: Apache 2.0

Source: Hugging Face · Updated 2026-04-09 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/eastlondoner/nanochat-wasm-examples


Resource overview:

---
license: apache-2.0
task_categories:
- text-generation
language:
- en
tags:
- wasm
- coprocessor
- synthetic
- code
- math
size_categories:
- 100K<n<1M
---

# WASM Coprocessor Pretraining Examples

Synthetic training data for models that learn to invoke a WebAssembly coprocessor to solve computational tasks. Each example pairs a natural-language question with a WASM bytecode program that solves it.

## Dataset Description

- **Train examples**: 430,105
- **Eval examples**: 47,790
- **Scale**: huge
- **Categories**: arithmetic, bitwise, byte_copy, compositional, data_query, filesystem, interactive_calc, iterative_refine, linear_algebra, local_variable, mechanical, memory, memory_notebook, multi_step, new_algorithms, programming, random_program, running_tally, sudoku, teach_me, text_conversation, what_if, word_problem

## Schema

| Column | Type | Description |
|--------|------|-------------|
| `text` | string | Human-readable representation (question + WASM program + answer) |
| `input_ids` | list[int] | Pre-tokenized sequence with text tokens (0-65535) and WASM tokens (65536+) |
| `category` | string | Problem category (arithmetic, programming, etc.) |
| `wasm_program` | string | JSON-serialized WASM program instructions |
| `expected_outputs` | list[int] | Expected OUTPUT values from coprocessor execution |
| `question` | string | The natural language question |
| `answer` | string | The natural language answer |

## Token ID Ranges

- **0-65535**: Standard BPE text tokens (NanochatTokenizer, tiktoken-based)
- **65536+**: WASM tokens (opcodes, operands, feedback markers)
  - `65536 + opcode`: WASM instruction (e.g. I32_CONST=0x00, I32_ADD=0x01)
  - `65536 + 261`: REPL_RESULT (execution feedback marker)
  - `65536 + 262`: BRANCH_TAKEN_REPL
  - `65536 + 263`: BRANCH_NOT_TAKEN

## Categories

- **arithmetic**: Basic operations, chained expressions, comparisons
- **word_problem**: GSM8K-style math word problems
- **programming**: Primes, GCD, FizzBuzz, factorial, fibonacci, list operations
- **sudoku**: Constraint satisfaction (cell validation)
- **bitwise**: AND, OR operations
- **memory**: Store/load operations on 256-byte memory
- **local_variable**: Set/get/tee computations with local variables
- **multi_step**: Chained expressions, loops, nested computations
- **filesystem**: Open, read, write, close on virtual files

## Usage

```python
from datasets import load_dataset

# Repository ID as given on this listing page.
ds = load_dataset("eastlondoner/nanochat-wasm-examples")

# Inspect the first training example.
example = ds["train"][0]
print(example["question"])
print(example["text"])
print(f"Token sequence length: {len(example['input_ids'])}")
```

## License

Apache 2.0
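The token ID ranges described in the dataset card can be illustrated with a minimal sketch. It assumes only what the card states: IDs below 65536 are text tokens, IDs at or above 65536 are WASM tokens, and offsets 261 to 263 are the three feedback markers. The helper names `classify_token` and `split_sequence` are hypothetical and not part of the dataset or any library. The card does not say how opcodes and operands are laid out within the WASM range, so the sketch reports the offset-relative value without interpreting it.

```python
# Constants taken from the "Token ID Ranges" section of the dataset card.
WASM_OFFSET = 65536
FEEDBACK_MARKERS = {
    261: "REPL_RESULT",
    262: "BRANCH_TAKEN_REPL",
    263: "BRANCH_NOT_TAKEN",
}


def classify_token(token_id):
    """Hypothetical helper: label one ID as text, feedback marker, or other WASM token."""
    if token_id < 0:
        raise ValueError(f"token IDs must be non-negative, got {token_id}")
    if token_id < WASM_OFFSET:
        return ("text", token_id)
    relative = token_id - WASM_OFFSET
    if relative in FEEDBACK_MARKERS:
        return ("marker", FEEDBACK_MARKERS[relative])
    # Opcode or operand; the card does not specify how to tell them apart.
    return ("wasm", relative)


def split_sequence(input_ids):
    """Hypothetical helper: separate text IDs from offset-relative WASM values."""
    text_ids = [t for t in input_ids if t < WASM_OFFSET]
    wasm_values = [t - WASM_OFFSET for t in input_ids if t >= WASM_OFFSET]
    return text_ids, wasm_values


if __name__ == "__main__":
    # Made-up IDs for illustration only, not taken from the dataset.
    sample = [10, 65536, 65537, 65536 + 261, 42]
    for token_id in sample:
        print(token_id, classify_token(token_id))
    print(split_sequence(sample))
```

Applied to a real row, `split_sequence(example["input_ids"])` would give a quick view of how much of each sequence is natural-language text and how much is coprocessor traffic.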

Provided by:

eastlondoner
