eastlondoner/nanochat-wasm-examples
Source: Hugging Face. Updated 2026-04-09. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/eastlondoner/nanochat-wasm-examples
Resource description:
---
license: apache-2.0
task_categories:
- text-generation
language:
- en
tags:
- wasm
- coprocessor
- synthetic
- code
- math
size_categories:
- 100K<n<1M
---
# WASM Coprocessor Pretraining Examples
Synthetic training data for models that learn to invoke a WebAssembly coprocessor
to solve computational tasks. Each example pairs a natural-language question with
a WASM bytecode program that solves it.
## Dataset Description
- **Train examples**: 430,105
- **Eval examples**: 47,790
- **Scale**: huge
- **Categories**: arithmetic, bitwise, byte_copy, compositional, data_query, filesystem, interactive_calc, iterative_refine, linear_algebra, local_variable, mechanical, memory, memory_notebook, multi_step, new_algorithms, programming, random_program, running_tally, sudoku, teach_me, text_conversation, what_if, word_problem
## Schema
| Column | Type | Description |
|--------|------|-------------|
| `text` | string | Human-readable representation (question + WASM program + answer) |
| `input_ids` | list[int] | Pre-tokenized sequence with text tokens (0-65535) and WASM tokens (65536+) |
| `category` | string | Problem category (arithmetic, programming, etc.) |
| `wasm_program` | string | JSON-serialized WASM program instructions |
| `expected_outputs` | list[int] | Expected OUTPUT values from coprocessor execution |
| `question` | string | The natural language question |
| `answer` | string | The natural language answer |
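
As a quick illustration of the schema, the sketch below checks that a record has the seven columns with the listed types and decodes the JSON string in `wasm_program`. The helper names `check_record` and `decode_program` are hypothetical, and the sample record (including the layout of its instructions) is invented for illustration; the card does not specify the JSON structure inside `wasm_program`.

```python
import json

# Column names and Python types, taken from the schema table above.
EXPECTED_COLUMNS = {
    "text": str,
    "input_ids": list,
    "category": str,
    "wasm_program": str,
    "expected_outputs": list,
    "question": str,
    "answer": str,
}


def check_record(record):
    """Return a list of schema problems for one record (empty if none)."""
    problems = []
    for column, expected_type in EXPECTED_COLUMNS.items():
        if column not in record:
            problems.append(f"missing column: {column}")
        elif not isinstance(record[column], expected_type):
            problems.append(f"wrong type for {column}")
    return problems


def decode_program(record):
    """Parse the JSON-serialized `wasm_program` string into Python objects."""
    return json.loads(record["wasm_program"])


# Made-up record: the values and the instruction layout are assumptions.
sample = {
    "text": "Q: What is 2 + 3? ... A: 5",
    "input_ids": [10, 20, 65536, 65537],
    "category": "arithmetic",
    "wasm_program": json.dumps([["I32_CONST", 2], ["I32_CONST", 3], ["I32_ADD"]]),
    "expected_outputs": [5],
    "question": "What is 2 + 3?",
    "answer": "5",
}

print(check_record(sample))
print(decode_program(sample))
```

The same two helpers can be applied to a real example returned by `load_dataset` to see what the serialized program actually contains.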
## Token ID Ranges
- **0-65535**: Standard BPE text tokens (NanochatTokenizer, tiktoken-based)
- **65536+**: WASM tokens (opcodes, operands, feedback markers)
  - `65536 + opcode`: WASM instruction (e.g. I32_CONST=0x00, I32_ADD=0x01)
  - `65536 + 261`: REPL_RESULT (execution feedback marker)
  - `65536 + 262`: BRANCH_TAKEN_REPL
  - `65536 + 263`: BRANCH_NOT_TAKEN
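
The ranges above can be sketched as a small classifier over `input_ids`. This is a minimal sketch, not the project's tokenizer code: `describe_token` and `split_counts` are hypothetical helper names, and any WASM offset that is not one of the three listed markers is reported generically as `"wasm"`, since the card does not say how opcodes and operands are distinguished.

```python
WASM_BASE = 65536  # first WASM token ID, per the ranges above

# Feedback markers named in the card, keyed by offset from WASM_BASE.
MARKERS = {
    261: "REPL_RESULT",
    262: "BRANCH_TAKEN_REPL",
    263: "BRANCH_NOT_TAKEN",
}


def describe_token(token_id):
    """Classify one token ID as text, a named feedback marker, or other WASM."""
    if token_id < 0:
        raise ValueError(f"token IDs must be non-negative, got {token_id}")
    if token_id < WASM_BASE:
        return ("text", token_id)
    offset = token_id - WASM_BASE
    if offset in MARKERS:
        return ("marker", MARKERS[offset])
    # Opcode or operand; the card does not give a rule to tell them apart.
    return ("wasm", offset)


def split_counts(input_ids):
    """Count how many tokens fall into each class for one sequence."""
    counts = {"text": 0, "marker": 0, "wasm": 0}
    for token_id in input_ids:
        kind, _ = describe_token(token_id)
        counts[kind] += 1
    return counts


# Made-up sequence for illustration only.
print(split_counts([12, 900, 65536, 65537, 65536 + 261]))
```

Running `split_counts` over `example["input_ids"]` gives a rough sense of how much of a sequence is natural-language text versus coprocessor tokens.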
## Categories
- **arithmetic**: Basic operations, chained expressions, comparisons
- **word_problem**: GSM8K-style math word problems
- **programming**: Primes, GCD, FizzBuzz, factorial, fibonacci, list operations
- **sudoku**: Constraint satisfaction (cell validation)
- **bitwise**: AND, OR operations
- **memory**: Store/load operations on 256-byte memory
- **local_variable**: Set/get/tee computations with local variables
- **multi_step**: Chained expressions, loops, nested computations
- **filesystem**: Open, read, write, close on virtual files
## Usage
```python
from datasets import load_dataset

# Load the dataset from the Hugging Face Hub by its repository ID.
ds = load_dataset("eastlondoner/nanochat-wasm-examples")

# Inspect the first training example.
example = ds["train"][0]
print(example["question"])
print(example["text"])
print(f"Token sequence length: {len(example['input_ids'])}")
```
## License
Apache 2.0
Provider:
eastlondoner



