
eastlondoner/nanochat-wasm-examples

Source: Hugging Face. Updated 2026-04-09. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/eastlondoner/nanochat-wasm-examples
Description:
---
license: apache-2.0
task_categories:
- text-generation
language:
- en
tags:
- wasm
- coprocessor
- synthetic
- code
- math
size_categories:
- 100K<n<1M
---

# WASM Coprocessor Pretraining Examples

Synthetic training data for models that learn to invoke a WebAssembly coprocessor to solve computational tasks. Each example pairs a natural-language question with a WASM bytecode program that solves it.

## Dataset Description

- **Train examples**: 430,105
- **Eval examples**: 47,790
- **Scale**: huge
- **Categories**: arithmetic, bitwise, byte_copy, compositional, data_query, filesystem, interactive_calc, iterative_refine, linear_algebra, local_variable, mechanical, memory, memory_notebook, multi_step, new_algorithms, programming, random_program, running_tally, sudoku, teach_me, text_conversation, what_if, word_problem

## Schema

| Column | Type | Description |
|--------|------|-------------|
| `text` | string | Human-readable representation (question + WASM program + answer) |
| `input_ids` | list[int] | Pre-tokenized sequence with text tokens (0-65535) and WASM tokens (65536+) |
| `category` | string | Problem category (arithmetic, programming, etc.) |
| `wasm_program` | string | JSON-serialized WASM program instructions |
| `expected_outputs` | list[int] | Expected OUTPUT values from coprocessor execution |
| `question` | string | The natural language question |
| `answer` | string | The natural language answer |

## Token ID Ranges

- **0-65535**: Standard BPE text tokens (NanochatTokenizer, tiktoken-based)
- **65536+**: WASM tokens (opcodes, operands, feedback markers)
  - `65536 + opcode`: WASM instruction (e.g. I32_CONST=0x00, I32_ADD=0x01)
  - `65536 + 261`: REPL_RESULT (execution feedback marker)
  - `65536 + 262`: BRANCH_TAKEN_REPL
  - `65536 + 263`: BRANCH_NOT_TAKEN

## Categories

- **arithmetic**: Basic operations, chained expressions, comparisons
- **word_problem**: GSM8K-style math word problems
- **programming**: Primes, GCD, FizzBuzz, factorial, fibonacci, list operations
- **sudoku**: Constraint satisfaction (cell validation)
- **bitwise**: AND, OR operations
- **memory**: Store/load operations on 256-byte memory
- **local_variable**: Set/get/tee computations with local variables
- **multi_step**: Chained expressions, loops, nested computations
- **filesystem**: Open, read, write, close on virtual files

## Usage

```python
from datasets import load_dataset

# Repository ID taken from the download link for this dataset.
ds = load_dataset("eastlondoner/nanochat-wasm-examples")

example = ds["train"][0]
print(example["question"])
print(example["text"])
print(f"Token sequence length: {len(example['input_ids'])}")
```

## License

Apache 2.0
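## Example: Separating Text and WASM Tokens

The token ID ranges described in this card can be illustrated with a small helper that labels each entry of `input_ids`. This is a minimal sketch, not part of the dataset's tooling: the names `WASM_TOKEN_OFFSET`, `FEEDBACK_MARKERS`, `describe_token`, and `count_token_kinds` are hypothetical, and the sample IDs are made up for illustration. The card does not say how operands are encoded, so any WASM token that is not one of the three listed feedback markers is reported only by its offset rather than by an opcode name.

```python
from typing import Dict, List

# From the card: IDs 0-65535 are text tokens, IDs 65536 and above are WASM tokens.
WASM_TOKEN_OFFSET = 65536

# Feedback markers named in the card, keyed by offset from WASM_TOKEN_OFFSET.
FEEDBACK_MARKERS: Dict[int, str] = {
    261: "REPL_RESULT",
    262: "BRANCH_TAKEN_REPL",
    263: "BRANCH_NOT_TAKEN",
}


def describe_token(token_id: int) -> str:
    """Label a single token ID as text, feedback marker, or other WASM token."""
    if token_id < 0:
        raise ValueError(f"token IDs must be non-negative, got {token_id}")
    if token_id < WASM_TOKEN_OFFSET:
        return f"text:{token_id}"
    offset = token_id - WASM_TOKEN_OFFSET
    if offset in FEEDBACK_MARKERS:
        return FEEDBACK_MARKERS[offset]
    # Could be an opcode or an operand; the card does not let us tell them apart.
    return f"wasm:{offset:#04x}"


def count_token_kinds(input_ids: List[int]) -> Dict[str, int]:
    """Count how many IDs fall in the text range versus the WASM range."""
    text_count = sum(1 for t in input_ids if t < WASM_TOKEN_OFFSET)
    return {"text": text_count, "wasm": len(input_ids) - text_count}


if __name__ == "__main__":
    # Made-up sequence: two text tokens, two WASM tokens, one feedback marker.
    sample_ids = [10, 20, 65536, 65537, 65536 + 261]
    print([describe_token(t) for t in sample_ids])
    print(count_token_kinds(sample_ids))
```

With a loaded example, the same helpers could be applied to `example["input_ids"]` to see where the text portion of a sequence ends and the coprocessor program begins.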
Provider:
eastlondoner