Introduction

Murk is a world simulation engine for reinforcement learning and real-time applications.

It provides a tick-based simulation core with pluggable spatial backends, a modular propagator pipeline, ML-native observation extraction, and Gymnasium-compatible Python bindings — all backed by arena-based generational allocation for deterministic, zero-GC memory management.

Features

  • Spatial backends — Line1D, Ring1D, Square4, Square8, Hex2D, Fcc12, and composable ProductSpace (e.g. Hex2D × Line1D)
  • Propagator pipeline — stateless per-tick operators with automatic write-conflict detection, Euler/Jacobi read modes, and topology-aware CFL validation (max_dt(space))
  • Observation extraction — ObsSpec → ObsPlan → flat f32 tensors with validity masks, foveation, pooling, and multi-agent batching
  • Two runtime modes — LockstepWorld (synchronous, borrow-checker enforced) and RealtimeAsyncWorld (background tick thread with epoch-based reclamation)
  • Batched engine — BatchedEngine steps N worlds and extracts observations in one call with a single GIL release; BatchedVecEnv provides an SB3-compatible Python interface
  • Deterministic replay — binary replay format with per-tick snapshot hashing and divergence reports
  • Arena allocation — double-buffered ping-pong arenas with Static/PerTick/Sparse field mutability classes; no GC pauses, no Box<dyn> per cell
  • Step metrics observability — per-step timings plus sparse retirement and sparse reuse counters (sparse_retired_ranges, sparse_pending_retired, sparse_reuse_hits, sparse_reuse_misses)
  • C FFI — stable ABI v3.0 with handle tables (slot+generation), panic-safe boundary (MurkStatus::Panicked, murk_last_panic_message), and safe double-destroy
  • Python bindings — PyO3/maturin native extension with Gymnasium Env/VecEnv and BatchedVecEnv for high-throughput training
  • Zero unsafe in simulation logic — only murk-arena and murk-ffi are permitted unsafe; everything else is #![forbid(unsafe_code)]

Architecture

┌─────────────────────────────────────────────────────┐
│  Python (murk)          │  C consumers              │
│  MurkEnv / BatchedVecEnv│  murk_lockstep_step()     │
├────────────┬────────────┴───────────────────────────┤
│ murk-python│           murk-ffi                     │
│ (PyO3)     │        (C ABI, handle tables)          │
├────────────┴────────────────────────────────────────┤
│                    murk-engine                       │
│  LockstepWorld · RealtimeAsyncWorld · BatchedEngine        │
│        TickEngine · IngressQueue · EgressPool        │
├──────────────┬──────────────┬───────────────────────┤
│ murk-propagator │  murk-obs │   murk-replay         │
│ Propagator trait│  ObsSpec  │   ReplayWriter/Reader  │
│ StepContext     │  ObsPlan  │   determinism verify   │
├──────────────┴──┴──────────┬┴───────────────────────┤
│       murk-arena           │      murk-space         │
│  PingPongArena · Snapshot  │  Space trait · backends │
│  ScratchRegion · Sparse    │  regions · edges        │
├────────────────────────────┴────────────────────────┤
│                     murk-core                        │
│    FieldDef · Command · SnapshotAccess · IDs         │
└─────────────────────────────────────────────────────┘

Getting started

Head to the Getting Started guide for installation instructions and your first simulation.

Getting Started

Prerequisites

Rust (for building from source or using the Rust API):

Python (for the Gymnasium bindings):

  • Python 3.12+
  • Install murk from PyPI (numpy >= 1.24 and gymnasium >= 0.29 are installed automatically)
  • maturin only if you are developing Murk from source

Installation

For normal use, install published packages:

cargo add murk
python -m pip install murk

Working on Murk itself (source checkout)

If you are contributing to Murk internals, use a source build:

git clone https://github.com/tachyon-beep/murk.git
cd murk

# Rust: build and test
cargo build --workspace
cargo test --workspace

# Python: build native extension in development mode
cd crates/murk-python
python -m pip install maturin
maturin develop --release

First Rust simulation

Run the built-in quickstart example:

cargo run --example quickstart -p murk-engine

See crates/murk-engine/examples/quickstart.rs for the full source. The essential pattern:

use murk_core::{FieldDef, FieldId, FieldMutability, FieldType, SnapshotAccess};
use murk_engine::{BackoffConfig, LockstepWorld, WorldConfig};
use murk_space::{EdgeBehavior, Square4};

// Excerpt: the `?` operators assume an enclosing function that returns a Result.
// DiffusionPropagator is a user-defined propagator; see quickstart.rs for the full source.

// 8x8 grid with 4-connected neighbours.
let space = Square4::new(8, 8, EdgeBehavior::Absorb)?;
// One scalar field that is rewritten every tick.
let fields = vec![FieldDef {
    name: "heat".into(),
    field_type: FieldType::Scalar,
    mutability: FieldMutability::PerTick,
    ..Default::default()
}];
let config = WorldConfig {
    space: Box::new(space), fields,
    propagators: vec![Box::new(DiffusionPropagator)],
    dt: 1.0, seed: 42, ..Default::default()
};
let mut world = LockstepWorld::new(config)?;
// Step once with no commands, then read field 0 from the published snapshot.
let result = world.step_sync(vec![])?;
let heat = result.snapshot.read(FieldId(0)).unwrap();

First Python simulation

import murk
from murk import Config, FieldType, FieldMutability, EdgeBehavior, WriteMode, ObsEntry, RegionType

config = Config()
config.set_space_square4(16, 16, EdgeBehavior.Absorb)
config.add_field("heat", FieldType.Scalar, FieldMutability.PerTick)
# diffusion_step is your own propagator function (signature described in the Concepts guide)
murk.add_propagator(
    config, name="diffusion", step_fn=diffusion_step,
    reads_previous=[0], writes=[(0, WriteMode.Full)],
)

env = murk.MurkEnv(config, obs_entries=[ObsEntry(0, region_type=RegionType.All)], n_actions=5)
obs, info = env.reset()

for _ in range(1000):
    action = policy(obs)  # policy is your own action-selection function
    obs, reward, terminated, truncated, info = env.step(action)

Scaling up: BatchedVecEnv

For RL training with many parallel environments, BatchedVecEnv steps all worlds in a single Rust call with one GIL release — eliminating the per-environment FFI overhead of MurkVecEnv.

from murk import BatchedVecEnv, Config, ObsEntry, RegionType

def make_config(i: int) -> Config:
    cfg = Config()
    cfg.set_space_square4(rows=16, cols=16)
    cfg.add_field("temperature", initial_value=0.0)
    return cfg

obs_entries = [ObsEntry(field_id=0, region_type=RegionType.All)]
env = BatchedVecEnv(make_config, obs_entries, num_envs=64)

obs, info = env.reset(seed=42)           # (64, obs_len)
obs, rewards, terms, truncs, info = env.step(actions)
env.close()

Subclass BatchedVecEnv and override the hook methods to customise rewards, termination, and action-to-command mapping for your RL task. See the batched_heat_seeker example for a complete working project, and the Concepts guide for the full API.

Next steps

  • Concepts — understand spaces, fields, propagators, commands, and observations
  • Examples — complete Python RL training examples
  • API Reference — full rustdoc

Murk Concepts Guide

This guide explains the mental model behind Murk. It’s written for someone who has run the heat_seeker example and wants to build something of their own.

Every Murk simulation has five components:

  1. A Space — the topology cells live on
  2. Fields — per-cell data stored in arenas
  3. Propagators — stateless operators that update fields each tick
  4. Commands — how actions from outside enter the simulation
  5. Observations — how state gets extracted for agents or renderers

These components are configured once, compiled into a world, and then ticked forward repeatedly. The rest of this guide explains each one.


Spaces & Topologies

A space defines how many cells exist and which cells are neighbors. Murk ships with seven built-in space backends:

| Space | Dims | Neighbors | Parameters | Distance metric |
|---|---|---|---|---|
| Line1D | 1D | 2 | length, edge | Manhattan |
| Ring1D | 1D | 2 (periodic) | length | min(fwd, bwd) |
| Square4 | 2D | 4 (N/S/E/W) | width, height, edge | Manhattan |
| Square8 | 2D | 8 (+ diagonals) | width, height, edge | Chebyshev |
| Hex2D | 2D | 6 (pointy-top) | cols, rows | Cube distance |
| Fcc12 | 3D | 12 (face-centred cubic) | w, h, d, edge | FCC metric |
| ProductSpace | N-D | varies | list of component spaces | L1 sum |

Choosing a space

  • Line1D / Ring1D — 1D cellular automata, queues, pipelines.
  • Square4 — grid worlds, pathfinding, Conway’s Game of Life.
  • Square8 — grid worlds where diagonal movement matters.
  • Hex2D — isotropic 2D movement without diagonal bias.
  • Fcc12 — 3D isotropic lattice (12 equidistant neighbors). Good for volumetric simulations like crystal growth or 3D diffusion.
  • ProductSpace — compose any spaces together (e.g., Hex2D x Line1D for a hex map with a vertical elevation axis).

Edge behaviors

Spaces that have boundaries support three edge behaviors:

| Behavior | At boundary | Example use |
|---|---|---|
| Absorb | Edge cells have fewer neighbors | Bounded arena, finite grid |
| Clamp | Beyond-edge maps to edge cell | Image processing, extrapolation |
| Wrap | Wraps to opposite side (torus) | Pac-Man map, periodic simulation |

Ring1D is always periodic (wrap). Hex2D only supports Absorb.
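
To make the three edge behaviors concrete, here is a minimal pure-Python sketch of neighbor lookup on a Square4-style grid. It is not Murk's implementation: the names Edge and square4_neighbors are hypothetical, and treating a clamped out-of-range neighbor as the edge cell itself is an assumption based on the table above.

```python
from enum import Enum


class Edge(Enum):
    """Hypothetical stand-in for the three edge behaviors."""
    ABSORB = "absorb"
    CLAMP = "clamp"
    WRAP = "wrap"


def square4_neighbors(row, col, rows, cols, edge):
    """Return the N/S/E/W neighbors of (row, col) under the given edge behavior."""
    result = []
    for d_row, d_col in ((-1, 0), (1, 0), (0, 1), (0, -1)):
        r, c = row + d_row, col + d_col
        inside = 0 <= r < rows and 0 <= c < cols
        if not inside:
            if edge is Edge.ABSORB:
                continue  # the edge cell simply has fewer neighbors
            if edge is Edge.CLAMP:
                # beyond-edge coordinates map back onto the nearest edge cell
                r = min(max(r, 0), rows - 1)
                c = min(max(c, 0), cols - 1)
            else:
                # WRAP: step off one side, re-enter on the opposite side
                r %= rows
                c %= cols
        result.append((r, c))
    return result


# Corner cell of a 3x3 grid under each behavior.
absorb = square4_neighbors(0, 0, 3, 3, Edge.ABSORB)  # [(1, 0), (0, 1)]
clamp = square4_neighbors(0, 0, 3, 3, Edge.CLAMP)    # [(0, 0), (1, 0), (0, 1), (0, 0)]
wrap = square4_neighbors(0, 0, 3, 3, Edge.WRAP)      # [(2, 0), (1, 0), (0, 1), (0, 2)]
```

Under Absorb the corner has two neighbors, while Clamp and Wrap always return four, which is why the choice changes the behavior of stencil-style propagators near boundaries.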

Coordinates

Every cell has a coordinate — a small vector of i32 values:

  • Line1D / Ring1D: [x]
  • Square4 / Square8: [row, col]
  • Hex2D: [q, r] (axial, pointy-top)
  • Fcc12: [x, y, z] where (x + y + z) % 2 == 0
  • ProductSpace: concatenation of component coordinates

Cells are stored in canonical order (a deterministic traversal of all coordinates). When you read a field as a flat f32 array, element i corresponds to canonical coordinate i. For 2D grids this is row-major order.
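
For 2D grids, the row-major mapping between a [row, col] coordinate and its position in the flat array can be sketched as follows. The helper names are hypothetical and the functions only illustrate the arithmetic; they are not part of the Murk API.

```python
def coord_to_index(row, col, cols):
    """Flat row-major index of the cell at [row, col] on a grid with `cols` columns."""
    return row * cols + col


def index_to_coord(index, cols):
    """Inverse mapping: flat index back to (row, col)."""
    return divmod(index, cols)


# Reading one cell out of a flat field array for a 10-column grid.
cols = 10
heat = [0.0] * 100
heat[coord_to_index(2, 3, cols)] = 1.0
hot_cell = index_to_coord(heat.index(1.0), cols)  # (2, 3)
```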

Cell count

The number of cells is determined by the space parameters:

  • Line1D(5) → 5 cells
  • Square4(10, 10) → 100 cells
  • Hex2D(8, 8) → 64 cells
  • Fcc12(4, 4, 4) → 32 cells (roughly w*h*d / 2, because the parity constraint keeps about half of the lattice points)

This matters because every field allocates cell_count * components floats per generation.
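
As a rough sizing aid, the sketch below counts Fcc12 cells by brute force and estimates the bytes one field needs per generation. It assumes coordinates run from 0 up to each dimension (an assumption, not confirmed by the text above) and that each value is a 4-byte f32; the function names are hypothetical.

```python
def fcc12_cell_count(w, h, d):
    """Count lattice points with even coordinate sum, i.e. (x + y + z) % 2 == 0."""
    return sum(
        1
        for x in range(w)
        for y in range(h)
        for z in range(d)
        if (x + y + z) % 2 == 0
    )


def field_bytes(cell_count, components=1):
    """Bytes per generation for one field: cell_count * components f32 values."""
    return cell_count * components * 4


cells = fcc12_cell_count(4, 4, 4)            # 32
scalar_bytes = field_bytes(100)              # 400 for a Scalar on a 10x10 grid
vector_bytes = field_bytes(100, components=3)  # 1200 for a 3-component Vector
```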


Fields & Mutability

Fields are per-cell data arrays. A 100-cell Square4 world with one Scalar field allocates 100 f32 values for that field.

Field types

| Type | Storage per cell | Use case |
|---|---|---|
| Scalar | 1 × f32 | Temperature, density, boolean flags |
| Vector { dims } | dims × f32 | Velocity, color |
| Categorical { n_values } | 1 × f32 (stored as index) | Terrain type, cell state |

Field mutability

Mutability controls how and when memory is allocated for a field. This is the most important performance decision you’ll make.

| Mutability | Allocation pattern | Read baseline | Use when |
|---|---|---|---|
| Static | Once, never again | Always generation 0 | Constants (terrain type, wall mask) |
| PerTick | Fresh buffer every tick | Previous tick’s values | Frequently-updated state (heat, positions) |
| Sparse | New buffer only on write | Shared until mutated | Infrequently-changed state (terrain HP) |

Static fields are allocated once in a shared arena. They’re read-only after initialization — propagators can read them but never write them. Use these for data that never changes (terrain layout, obstacle masks).

PerTick fields get a fresh buffer every tick. If a propagator writes to the field, it fills the new buffer. If nothing writes to the field, the previous tick’s values are copied forward. This is the most common mutability class — use it for anything that changes regularly.

Sparse fields share memory across ticks until something writes to them, at which point a new buffer is allocated (copy-on-write). Use these for data that changes rarely — the arena skips allocation on ticks where the field isn’t modified.

Quick decision guide:

  1. Does this field ever change after initialization? No → Static
  2. Does it change every tick? Yes → PerTick
  3. Does it change rarely (< 10% of ticks)? Yes → Sparse
  4. Unsure? Default to PerTick

Bounds and boundary behavior

Fields can optionally have value bounds (min, max). When a value is written outside those bounds, the BoundaryBehavior determines what happens:

  • Clamp — value is clamped to the nearest bound
  • Reflect — value bounces off the bound
  • Absorb — value is set to the bound
  • Wrap — value wraps to the opposite bound

If you don’t need bounds, just use the defaults.
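
The following sketch shows one plausible reading of three of these behaviors for a single value. The exact formulas Murk uses are not given here, so treat the arithmetic as an assumption; Absorb is left out because the description above does not distinguish it from Clamp in enough detail. The function name apply_bound and the string behavior names are hypothetical.

```python
def apply_bound(value, lo, hi, behavior):
    """Bring `value` back into [lo, hi] using the named behavior (illustrative only)."""
    if lo <= value <= hi:
        return value
    if behavior == "clamp":
        # snap to whichever bound was crossed
        return min(max(value, lo), hi)
    span = hi - lo
    if behavior == "wrap":
        # re-enter from the opposite bound
        return lo + (value - lo) % span
    if behavior == "reflect":
        # fold the overshoot back into the range, like a ball bouncing off a wall
        offset = (value - lo) % (2 * span)
        return lo + (offset if offset <= span else 2 * span - offset)
    raise ValueError(f"unknown behavior: {behavior}")


clamped = apply_bound(1.25, 0.0, 1.0, "clamp")      # 1.0
wrapped = apply_bound(1.25, 0.0, 1.0, "wrap")       # 0.25
reflected = apply_bound(1.25, 0.0, 1.0, "reflect")  # 0.75
```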


Propagators

A propagator is a stateless function that runs once per tick. It reads some fields, writes some fields, and that’s it. All simulation logic lives in propagators.

The step signature (Python)

def my_propagator(reads, reads_prev, writes, tick_id, dt, cell_count):
    """
    reads:       list of numpy arrays (fields from current-tick overlay)
    reads_prev:  list of numpy arrays (fields from previous tick, frozen)
    writes:      list of numpy arrays (output buffers to fill)
    tick_id:     int, monotonically increasing tick counter
    dt:          float, simulation timestep in seconds
    cell_count:  int, number of cells in the space
    """
    ...

The step signature (Rust)

// HEAT_ID is a FieldId constant defined by your own code.
fn step(&self, ctx: &mut StepContext<'_>) -> Result<(), PropagatorError> {
    // Jacobi read: frozen previous-tick values of the heat field.
    let prev_heat = ctx.reads_previous().read(HEAT_ID)?;
    let space = ctx.space();
    let writer = ctx.writes();
    // ... compute new values, write to output ...
    Ok(())
}

Read modes: Euler vs Jacobi

Every propagator declares which fields it reads. There are two read modes:

  • reads (Euler mode) — sees the in-tick overlay. If a prior propagator in the same tick already wrote to this field, you see those new values. This creates a dependency chain between propagators.

  • reads_previous (Jacobi mode) — sees the frozen tick-start snapshot. Always reads the base generation, regardless of what other propagators have written this tick.

The choice matters for correctness:

  • Diffusion should use reads_previous (Jacobi). Otherwise the result depends on cell visit order, which is wrong.
  • A reward propagator that reads an agent-position field written by a movement propagator should use reads (Euler) to see the already-updated position.
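
The difference is easy to see on a five-cell ring. The sketch below uses plain Python lists in place of numpy arrays and follows the Python step signature shown earlier; the diffusion coefficient D and the in-place comparison function are illustrative assumptions, not Murk code.

```python
D = 0.25  # hypothetical diffusion coefficient


def diffuse_jacobi(reads, reads_prev, writes, tick_id, dt, cell_count):
    """Jacobi update: every cell reads only the frozen previous-tick values."""
    prev, out = reads_prev[0], writes[0]
    for i in range(cell_count):
        left = prev[(i - 1) % cell_count]
        right = prev[(i + 1) % cell_count]
        out[i] = prev[i] + D * dt * (left + right - 2 * prev[i])


def diffuse_in_place(values, dt):
    """Counter-example: updating in place lets later cells see earlier updates."""
    n = len(values)
    for i in range(n):
        left = values[(i - 1) % n]
        right = values[(i + 1) % n]
        values[i] += D * dt * (left + right - 2 * values[i])


start = [0.0, 0.0, 4.0, 0.0, 0.0]

jacobi_out = [0.0] * 5
diffuse_jacobi([], [start], [jacobi_out], tick_id=0, dt=1.0, cell_count=5)
# jacobi_out is symmetric around the hot cell: [0.0, 1.0, 2.0, 1.0, 0.0]

in_place = list(start)
diffuse_in_place(in_place, dt=1.0)
# in_place is skewed toward higher indices: [0.0, 1.0, 2.25, 0.5625, 0.140625]
```

The Jacobi result is symmetric and conserves the total heat, while the in-place result depends on the direction of traversal.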

Write modes

Each written field has a write mode:

  • WriteMode.Full — the propagator fills every cell. The engine gives you a fresh, zeroed buffer. In debug builds, a coverage guard checks that every cell was written.

  • WriteMode.Incremental — the propagator modifies only some cells. The engine pre-seeds the buffer with the previous tick’s values via memcpy. You only update the cells you need.

Pipeline validation

Murk validates the propagator pipeline at startup:

  • Write conflicts — two propagators writing the same field is an error (detected and reported with both propagator names).
  • CFL stability — if a propagator declares a max_dt, Murk checks that the configured dt doesn’t exceed it.
  • Undefined fields — reading a field that doesn’t exist is an error.
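
A simplified sketch of these three checks is shown below. It describes each propagator as a plain dict with hypothetical keys (name, reads, reads_previous, writes, max_dt) and returns error strings; Murk's real validation works on its own types and error enums.

```python
def validate_pipeline(propagators, field_count, dt):
    """Return a list of human-readable validation errors (empty if the pipeline is valid)."""
    errors = []
    writers = {}  # field id -> name of the propagator that writes it
    for p in propagators:
        touched = p.get("reads", []) + p.get("reads_previous", []) + p.get("writes", [])
        for field in touched:
            if not 0 <= field < field_count:
                errors.append(f"{p['name']}: undefined field {field}")
        for field in p.get("writes", []):
            if field in writers:
                errors.append(
                    f"write conflict on field {field}: {writers[field]} and {p['name']}"
                )
            else:
                writers[field] = p["name"]
        max_dt = p.get("max_dt")
        if max_dt is not None and dt > max_dt:
            errors.append(f"{p['name']}: dt {dt} exceeds max_dt {max_dt}")
    return errors


pipeline = [
    {"name": "diffusion", "reads_previous": [0], "writes": [0], "max_dt": 0.5},
    {"name": "heater", "writes": [0]},
    {"name": "reward", "reads": [3], "writes": [1]},
]
problems = validate_pipeline(pipeline, field_count=2, dt=1.0)
```

Running the checks once at startup means a misconfigured pipeline fails before the first tick rather than midway through training.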

Ordering

Propagators run in the order they’re registered. This ordering, combined with the Euler/Jacobi read declarations, defines the dataflow. The engine precomputes a ReadResolutionPlan that maps each (propagator, field) pair to either the base generation or a prior propagator’s staged output — with zero per-tick routing overhead.


Commands & Ingress

Commands are how actions from outside the simulation (agent actions, user input, network messages) enter the tick loop.

Command types

| Command | Purpose |
|---|---|
| SetField(coord, field_id, value) | Write a single cell value |
| Move(entity_id, target_coord) | Move an entity |
| Spawn(coord, field_values) | Create a new entity |
| Despawn(entity_id) | Remove an entity |
| SetParameter(key, value) | Change a global simulation parameter |
| SetParameterBatch(pairs) | Change multiple parameters atomically |
| Custom(type_id, data) | User-defined command type |

In the Python API, the most common command is SetField:

cmd = Command.set_field(field_id=1, coord=[5, 3], value=1.0)
receipts, metrics = world.step([cmd])

Python currently exposes Command.set_field(...) and Command.set_parameter(...) convenience constructors. The full command surface (including SetParameterBatch, Move, Spawn, and Despawn) is available at the core/engine level.

Receipts

Every command submitted to step() gets a receipt:

receipts, metrics = world.step([cmd])
for r in receipts:
    print(r.accepted, r.applied_tick_id)

A command can be rejected if the ingress queue is full, the command is stale (refers to an old tick), or the world is shutting down.

Command ordering

Commands are applied in this order: priority_class (lower = higher priority), then source_id, then source_seq, then arrival_seq (monotonic counter). System commands (priority 0) run before user commands (priority 1).
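
The ordering rule amounts to a lexicographic sort on four keys. The sketch below models it with a hypothetical QueuedCommand record; only the four key names come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueuedCommand:
    """Hypothetical record holding only the fields that affect ordering."""
    priority_class: int  # lower value = higher priority
    source_id: int
    source_seq: int
    arrival_seq: int     # monotonic arrival counter
    label: str


def apply_order(commands):
    """Sort commands by (priority_class, source_id, source_seq, arrival_seq)."""
    return sorted(
        commands,
        key=lambda c: (c.priority_class, c.source_id, c.source_seq, c.arrival_seq),
    )


queue = [
    QueuedCommand(1, 7, 0, 0, "user-7-first"),
    QueuedCommand(1, 3, 1, 1, "user-3-second"),
    QueuedCommand(0, 9, 0, 2, "system"),
    QueuedCommand(1, 3, 0, 3, "user-3-first"),
]
ordered_labels = [c.label for c in apply_order(queue)]
# ["system", "user-3-first", "user-3-second", "user-7-first"]
```

Because arrival order is only the last tie-breaker, the applied order does not depend on which source happened to reach the queue first.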


Observations

The observation system extracts field data into flat f32 tensors suitable for neural networks.

The pipeline: ObsSpec → ObsPlan → execute

  1. ObsSpec — a list of ObsEntry objects declaring what to observe.
  2. ObsPlan — a compiled plan (precomputed gather indices). Created once, reused every tick.
  3. execute — runs the plan against the current world snapshot, producing a flat f32 array.

# 1. Specify what to observe
obs_entries = [
    ObsEntry(field_id=0, region_type=RegionType.All),
    ObsEntry(field_id=1, region_type=RegionType.AgentDisk, region_params=[3]),
]

# 2. MurkEnv compiles the plan internally
# 3. Each step(), the plan executes and returns obs as a numpy array
obs, reward, terminated, truncated, info = env.step(action)

Region types

| Region | Description | When to use |
|---|---|---|
| All | Every cell in the space | Full observability, small grids |
| AgentDisk | Cells within region_params=[radius] graph-distance of the agent | Partial observability, foveation |
| AgentRect | Axis-aligned bounding box around agent (region_params=[half_w, half_h, ...]) | Rectangular partial observability |

All is the simplest — you get cell_count floats per entry. Agent-centered regions give partial observability and scale better on large grids.

Transforms

Transforms are applied to field values during extraction:

  • Identity — raw values, no change
  • Normalize(min, max) — linearly maps [min, max] to [0, 1], clamping values outside the range
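
As a worked illustration of the Normalize transform described above, the sketch below maps [lo, hi] linearly onto [0, 1] and clamps anything outside. The function name is hypothetical; it mirrors the description rather than Murk's code.

```python
def normalize(value, lo, hi):
    """Linearly map [lo, hi] to [0, 1], clamping values outside the range."""
    scaled = (value - lo) / (hi - lo)
    return min(max(scaled, 0.0), 1.0)


inside = normalize(25.0, 0.0, 100.0)   # 0.25
below = normalize(-5.0, 0.0, 100.0)    # 0.0 (clamped)
above = normalize(150.0, 0.0, 100.0)   # 1.0 (clamped)
```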

Pooling

For large observations, pooling reduces dimensionality:

  • PoolKernel.NoPool — no pooling (default)
  • PoolKernel.Mean — average of each window
  • PoolKernel.Max — maximum of each window
  • PoolKernel.Min — minimum of each window
  • PoolKernel.Sum — sum of each window

Pooling is configured per-entry with pool_kernel_size and pool_stride.
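
To show how kernel size and stride reduce an observation, here is a small sketch of 2D window pooling over a row-major flat list. It assumes square windows and drops partial windows at the edges; how Murk handles partial windows and validity masks is not covered here, and pool2d is a hypothetical helper.

```python
def pool2d(values, rows, cols, kernel, stride, reduce):
    """Apply `reduce` to each kernel x kernel window of a row-major grid."""
    pooled = []
    for top in range(0, rows - kernel + 1, stride):
        for left in range(0, cols - kernel + 1, stride):
            window = [
                values[(top + r) * cols + (left + c)]
                for r in range(kernel)
                for c in range(kernel)
            ]
            pooled.append(reduce(window))
    return pooled


def mean(window):
    return sum(window) / len(window)


grid = [float(i) for i in range(16)]  # 4x4 grid holding 0..15

pooled_max = pool2d(grid, 4, 4, kernel=2, stride=2, reduce=max)    # [5.0, 7.0, 13.0, 15.0]
pooled_mean = pool2d(grid, 4, 4, kernel=2, stride=2, reduce=mean)  # [2.5, 4.5, 10.5, 12.5]
pooled_sum = pool2d(grid, 4, 4, kernel=2, stride=2, reduce=sum)    # [10.0, 18.0, 42.0, 50.0]
```

With a 2x2 kernel and stride 2, a 16-element entry shrinks to 4 elements.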

Observation layout

Entries are concatenated in order. If you observe two fields on a 100-cell grid with region_type=All, you get a 200-element f32 array: the first 100 elements are field 0, the next 100 are field 1.
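
The spec, plan, and execute stages and the concatenated layout can be sketched in a few lines. This is a conceptual model only: compile_plan and execute_plan are hypothetical names, entries are reduced to (field_id, cell indices) pairs, and a region of None stands in for RegionType.All.

```python
def compile_plan(entries, cell_count):
    """Precompute gather indices once; None means 'every cell'."""
    plan = []
    for field_id, cells in entries:
        indices = list(range(cell_count)) if cells is None else list(cells)
        plan.append((field_id, indices))
    return plan


def execute_plan(plan, fields):
    """Gather values entry by entry into one flat observation list."""
    obs = []
    for field_id, indices in plan:
        obs.extend(fields[field_id][i] for i in indices)
    return obs


fields = {
    0: [1.0, 2.0, 3.0, 4.0],
    1: [10.0, 20.0, 30.0, 40.0],
}
plan = compile_plan([(0, None), (1, [1, 2])], cell_count=4)
obs = execute_plan(plan, fields)
# obs == [1.0, 2.0, 3.0, 4.0, 20.0, 30.0]: all of field 0, then two cells of field 1
```

Compiling the indices once keeps the per-tick work down to a gather, which is the reason the plan is reused every tick.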


Runtime Modes

Murk has two runtime modes that share the same tick engine but differ in how you interact with it.

LockstepWorld (synchronous)

The standard mode for RL training:

# Python (via MurkEnv)
obs, reward, terminated, truncated, info = env.step(action)

# Rust
let result = world.step_sync(commands)?;
let snapshot = result.snapshot;  // borrows world

Properties:

  • Blocking step() call — you wait for the tick to complete
  • In Rust, &mut self enforces single-threaded access at compile time
  • The snapshot borrows the world, preventing a new step until you’re done reading
  • Deterministic: same seed + same commands = same result, always

This is what MurkEnv and MurkVecEnv use internally.

RealtimeAsyncWorld (asynchronous)

For real-time applications (game servers, live visualizations):

// Commands are submitted without blocking
world.submit_commands(commands)?;

// Observations can be taken concurrently
let result = world.observe(&mut plan)?;

Properties:

  • Background tick thread runs at a configurable rate
  • Multiple observation requests can be served concurrently via a worker pool
  • Epoch-based reclamation ensures snapshots aren’t freed while being read
  • Command channel provides back-pressure when the queue is full

The Python bindings expose LockstepWorld via MurkEnv/MurkVecEnv, and BatchedEngine via BatchedVecEnv for high-throughput training.

BatchedEngine (high-throughput training)

For RL training at scale, stepping worlds one-by-one through Python has a bottleneck: each step() call acquires and releases the GIL. With thousands of environments, this overhead dominates.

BatchedEngine solves this by owning N LockstepWorld instances and stepping them all in a single Rust call. The GIL is released once, covering the entire step + observe operation for all worlds:

from murk import BatchedVecEnv, Config, ObsEntry, RegionType

def make_config(i: int) -> Config:
    cfg = Config()
    cfg.set_space_square4(rows=16, cols=16)
    cfg.add_field("temperature", initial_value=0.0)
    return cfg

obs_entries = [ObsEntry(field_id=0, region_type=RegionType.All)]
env = BatchedVecEnv(make_config, obs_entries, num_envs=64)

obs, info = env.reset(seed=42)
obs, rewards, terminateds, truncateds, info = env.step(actions)

Architecture (three layers):

| Layer | Class | Role |
|---|---|---|
| Rust engine | BatchedEngine | Owns N LockstepWorlds, step_and_observe() |
| PyO3 wrapper | BatchedWorld | Handles GIL release, buffer validation |
| Pure Python | BatchedVecEnv | SB3-compatible API, auto-reset, override hooks |

Override hooks let you customise the RL interface without touching Rust:

  • _actions_to_commands(actions) — convert action array to per-world command lists
  • _compute_rewards(obs, tick_ids) — compute per-world rewards
  • _check_terminated(obs, tick_ids) — per-world termination conditions
  • _check_truncated(obs, tick_ids) — per-world truncation conditions

Here is a sketch of a subclass that overrides step and computes vectorized rewards and termination inline (from the batched_heat_seeker example):

class BatchedHeatSeekerEnv(BatchedVecEnv):
    def step(self, actions):
        # ... move agents, build commands, call step_and_observe ...
        obs = self._obs_flat.reshape(self.num_envs, self._obs_per_world)

        # Vectorized reward: index into (N, cell_count) heat matrix
        heat = obs[:, :CELL_COUNT]
        agent_indices = self._agent_y * GRID_W + self._agent_x
        heat_at_agent = heat[np.arange(self.num_envs), agent_indices]

        terminated = (self._agent_x == SOURCE_X) & (self._agent_y == SOURCE_Y)
        rewards = REWARD_SCALE * heat_at_agent - STEP_PENALTY
        rewards[terminated] += TERMINAL_BONUS

        # ... auto-reset, return obs/rewards/terminated/truncated/info ...

vs MurkVecEnv: MurkVecEnv wraps N independent World objects and calls step() N times (N GIL releases). BatchedVecEnv calls step_and_observe() once (1 GIL release). For 1024 environments, this eliminates ~1023 unnecessary GIL cycles per training step.


Arena & Memory

Murk uses arena-based generational allocation instead of per-object heap allocation. This is what makes it fast and GC-free.

The ping-pong buffer

The engine maintains two segment pools (A and B). On each tick:

  1. One pool is staging (being written by propagators)
  2. The other is published (readable as a snapshot)
  3. After the tick, they swap roles

This means the previous tick’s data is always available for reading while the current tick is being computed.
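
The swap can be modeled with two lists and an index. The PingPong class below is a toy illustration with hypothetical names; Murk's arenas manage segments and generations rather than Python lists.

```python
class PingPong:
    """Two buffers: one published (readable), one staging (writable)."""

    def __init__(self, cell_count):
        self._pools = [[0.0] * cell_count, [0.0] * cell_count]
        self._published = 0
        self.generation = 0

    @property
    def published(self):
        return self._pools[self._published]

    @property
    def staging(self):
        return self._pools[1 - self._published]

    def publish(self):
        """Swap roles: staging becomes the new published snapshot."""
        self._published = 1 - self._published
        self.generation += 1


def tick(arena):
    """Read the published buffer, write the staging buffer, then swap."""
    prev, out = arena.published, arena.staging
    for i in range(len(prev)):
        out[i] = prev[i] + 1.0
    arena.publish()


arena = PingPong(3)
tick(arena)
tick(arena)
# arena.published == [2.0, 2.0, 2.0]; arena.staging still holds tick 1: [1.0, 1.0, 1.0]
```

No data is copied at publish time; only the roles of the two buffers change.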

How mutability maps to memory

  • Static fields live in a separate shared arena. They’re allocated once and never touched again. No per-tick cost.

  • PerTick fields get a fresh allocation in the staging pool every tick. After publish, the old staging pool (now published) still holds the previous tick’s values — so snapshots and reads_previous work without copying.

  • Sparse fields use a dedicated copy-on-write slab. They share memory across ticks until a propagator writes to them, at which point a new allocation is made. On ticks where nothing writes to a sparse field, there’s zero allocation cost.

Why this matters

  • No garbage collection pauses — arena memory is bulk-freed, not per-object
  • Deterministic memory lifetime — you know exactly when memory is allocated and freed
  • Zero-copy snapshots — reading the previous tick’s data is just a pointer into the published pool

For most users, you don’t need to think about arenas directly. The practical takeaway is: choose the right FieldMutability for your data, and the arena system handles the rest efficiently.


Putting It Together

Here’s how these concepts compose in a typical simulation:

import murk
from murk import (
    Config, FieldMutability, EdgeBehavior,
    WriteMode, ObsEntry, RegionType,
)

# 1. Space: defines topology
config = Config()
config.set_space_square4(32, 32, EdgeBehavior.Wrap)

# 2. Fields: define per-cell data
config.add_field("temperature", mutability=FieldMutability.PerTick)
config.add_field("terrain", mutability=FieldMutability.Static)
config.add_field("agent_pos", mutability=FieldMutability.PerTick)

# 3. Propagator: defines simulation logic
def diffuse(reads, reads_prev, writes, tick_id, dt, cell_count):
    # reads_prev[0] = previous tick's temperature
    # writes[0] = this tick's temperature output
    ...

murk.add_propagator(
    config,
    name="diffusion",
    step_fn=diffuse,
    reads_previous=[0],              # Jacobi read of field 0
    writes=[(0, WriteMode.Full)],    # Full write to field 0
)

config.set_dt(0.1)
config.set_seed(42)

# 4. Observations: define what the agent sees
obs_entries = [
    ObsEntry(0, region_type=RegionType.All),       # Full temperature grid
    ObsEntry(2, region_type=RegionType.AgentDisk, region_params=[5]),  # Agent's local view
]

# 5. Environment: wraps everything in the Gymnasium interface
env = murk.MurkEnv(config, obs_entries, n_actions=5, seed=42)
obs, info = env.reset()
obs, reward, terminated, truncated, info = env.step(action)

For a complete working example, see heat_seeker.


Glossary

| Term | Definition |
|---|---|
| Cell | A single location in the space. Has a coordinate and one value per field. |
| Tick | One simulation timestep. All propagators run, then the arena publishes. |
| Generation | Arena version counter. Incremented on each publish. |
| Canonical order | The deterministic traversal of all coordinates (row-major for 2D grids). |
| Snapshot | Read-only view of the world state at a particular generation. |
| ObsPlan | Compiled observation plan. Precomputes gather indices for fast extraction. |
| Ingress | The command queue that feeds actions into the tick loop. |
| Egress | The observation pathway that extracts state out of the simulation. |
| CFL condition | Courant-Friedrichs-Lewy stability constraint: N * D * dt < 1, where N is the neighbor count of the space (e.g., 4 for Square4), D is the diffusion coefficient, and dt is the simulation timestep. When this condition is violated, explicit diffusion becomes numerically unstable: values oscillate or diverge instead of converging. Propagators can declare topology-aware max_dt(space) so the engine rejects configurations that violate their CFL bound at startup. |
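
As a quick arithmetic check of the CFL condition N * D * dt < 1, the sketch below computes the bound for a given topology and diffusion coefficient. The helper names are hypothetical and the functions only restate the inequality; they are not the engine's max_dt implementation.

```python
def max_stable_dt(neighbor_count, diffusion_coeff):
    """Upper bound on dt implied by N * D * dt < 1 (the bound itself is excluded)."""
    return 1.0 / (neighbor_count * diffusion_coeff)


def satisfies_cfl(neighbor_count, diffusion_coeff, dt):
    """True when the configuration meets the strict inequality."""
    return neighbor_count * diffusion_coeff * dt < 1.0


square4_limit = max_stable_dt(4, 0.25)     # 1.0, so dt must be strictly below 1.0
ok_square4 = satisfies_cfl(4, 0.1, 1.0)    # True: 4 * 0.1 * 1.0 is below 1
bad_hex = satisfies_cfl(6, 0.2, 1.0)       # False: 6 * 0.2 * 1.0 exceeds 1
```

The same D and dt can be stable on Square4 and unstable on Hex2D, which is why the bound is declared per topology.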

Examples

Murk ships with seven Python example projects demonstrating different spatial backends, RL integration patterns, and the batched engine.

| Example | Space | Demonstrates |
|---|---|---|
| heat_seeker | Square4 | PPO RL, diffusion physics, Python propagator |
| hex_pursuit | Hex2D | Multi-agent, AgentDisk foveation |
| crystal_nav | Fcc12 | 3D lattice navigation |
| layered_hex | ProductSpace (Hex2D × Line1D) | Multi-floor navigation |
| batched_heat_seeker | Square4 | BatchedVecEnv, high-throughput parallel training |
| batched_benchmark | Square4 | BatchedVecEnv vs MurkVecEnv vs raw BatchedWorld throughput comparison |
| batched_cookbook | Square4 | Low-level BatchedWorld API: lifecycle, context manager, per-world commands, selective reset |

There are also three Rust examples:

| Example | Demonstrates |
|---|---|
| quickstart.rs | Rust API: config, propagator, commands, snapshots |
| realtime_async.rs | RealtimeAsyncWorld: background ticking, observe, shutdown |
| replay.rs | Deterministic replay: record, verify, prove determinism |

The BatchedVecEnv adapter is demonstrated in the batched engine tests, which show config factory patterns, observation extraction, auto-reset, and override hooks.

Running the Python examples

# Install published murk package (default)
python -m pip install murk

# If you are developing Murk internals from source instead:
# cd crates/murk-python && maturin develop --release && cd ../..

# Run an example
cd examples/heat_seeker
pip install -r requirements.txt
python heat_seeker.py

Murk Error Code Reference

Complete reference of all error types in the Murk simulation framework, organized by subsystem.

How to read this document: Each error type has a quick-reference table for scanning, followed by detailed explanations of cause and remediation. HLD codes reference the High-Level Design document section 9.7 where applicable.



StepError

Crate: murk-core | File: crates/murk-core/src/error.rs

Errors returned by the tick engine during step(). Corresponds to the TickEngine and Pipeline subsystem codes in HLD section 9.7.

Quick reference

| Variant | HLD Code | Description |
|---|---|---|
| PropagatorFailed { name, reason } | MURK_ERROR_PROPAGATOR_FAILED | A propagator returned an error during execution |
| AllocationFailed | MURK_ERROR_ALLOCATION_FAILED | Arena out-of-memory during generation staging |
| TickRollback | MURK_ERROR_TICK_ROLLBACK | Current tick rolled back due to propagator failure |
| TickDisabled | MURK_ERROR_TICK_DISABLED | Ticking disabled after consecutive rollbacks (Decision J) |
| DtOutOfRange | MURK_ERROR_DT_OUT_OF_RANGE | Requested dt exceeds a propagator’s max_dt constraint |
| ShuttingDown | MURK_ERROR_SHUTTING_DOWN | World is in the shutdown state machine (Decision E) |

Details

PropagatorFailed { name: String, reason: PropagatorError }

A propagator returned an error during execution. The name field identifies the failing propagator and reason contains the underlying PropagatorError.

Remediation:

  1. Inspect the wrapped PropagatorError for details.
  2. Check propagator inputs (field values, dt) for validity.
  3. The tick engine will roll back the tick automatically.

AllocationFailed

Arena allocation failed due to out-of-memory during generation staging.

Remediation:

  1. Reduce field count or cell count.
  2. Increase arena segment pool capacity.
  3. Check for epoch reclamation stalls preventing segment reuse (see ObsError::WorkerStalled).

TickRollback

The current tick was rolled back due to a propagator failure. All staged writes are discarded and the world state reverts to the previous generation.

Remediation:

  1. Transient: retry on the next tick.
  2. Persistent: investigate the failing propagator via PropagatorFailed.
  3. Note: commands submitted during a rolled-back tick are dropped.

TickDisabled

Ticking has been disabled after consecutive rollbacks (Decision J). The engine enters a fail-stop state to prevent cascading failures.

Remediation:

  1. The simulation must be reset or reconstructed.
  2. Investigate the root cause of repeated propagator failures before restarting.
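
The fail-stop behavior can be pictured as a counter of consecutive rollbacks that trips a latch. The sketch below is an illustration only: TickGuard is a hypothetical name, and the threshold of 3 is an assumption because the actual limit is not stated here.

```python
class TickGuard:
    """Tracks consecutive rollbacks and disables ticking past a threshold."""

    def __init__(self, max_consecutive_rollbacks=3):  # hypothetical threshold
        self.max_consecutive_rollbacks = max_consecutive_rollbacks
        self.consecutive_rollbacks = 0
        self.disabled = False

    def record(self, tick_succeeded):
        """Record one tick outcome; raise once the guard has tripped."""
        if self.disabled:
            raise RuntimeError("ticking disabled; reset or rebuild the world")
        if tick_succeeded:
            self.consecutive_rollbacks = 0  # any success clears the streak
            return
        self.consecutive_rollbacks += 1
        if self.consecutive_rollbacks >= self.max_consecutive_rollbacks:
            self.disabled = True


guard = TickGuard(max_consecutive_rollbacks=2)
guard.record(False)  # one rollback
guard.record(True)   # success resets the streak
guard.record(False)
guard.record(False)  # second consecutive rollback trips the guard
```

Once the guard has tripped, only a reset or reconstruction clears it, which matches the remediation steps above.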

DtOutOfRange

The requested dt exceeds a propagator’s max_dt constraint (CFL condition or similar stability limit).

Remediation:

  1. Reduce the configured dt to be at or below the tightest max_dt across all propagators.
  2. Check PipelineError::DtTooLarge for which propagator constrains it.

ShuttingDown

The world is in the shutdown state machine (Decision E). No further ticks will be executed.

Remediation:

  1. Expected during graceful shutdown. Do not retry; the world is terminating.

PropagatorError

Crate: murk-core | File: crates/murk-core/src/error.rs

Errors from individual propagator execution. Returned by Propagator::step() and wrapped in StepError::PropagatorFailed by the tick engine.

Quick reference

| Variant | HLD Code | Description |
|---|---|---|
| ExecutionFailed { reason } | MURK_ERROR_PROPAGATOR_FAILED | Propagator’s step function failed |
| NanDetected { field_id, cell_index } | | NaN detected in propagator output |
| ConstraintViolation { constraint } | | User-defined constraint violated |

Details

ExecutionFailed { reason: String }

The propagator’s step function failed. The reason field contains a human-readable description.

Remediation:

  1. Inspect the reason string.
  2. Common causes: invalid field state, numerical instability, domain-specific constraint violations.

NanDetected { field_id: FieldId, cell_index: Option<usize> }

NaN detected in propagator output during sentinel checking. field_id identifies the affected field; cell_index pinpoints the first NaN cell if known.

Remediation:

  1. Reduce dt to improve numerical stability.
  2. Add clamping or bounds to the propagator logic.
  3. Check for division-by-zero in the propagator.
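
When debugging a Python propagator, a small sentinel scan like the one below can locate the first NaN before the engine reports it. This is a debugging sketch with a hypothetical helper name; it mimics the idea of cell_index but is not the engine's sentinel check.

```python
import math


def first_nan(values):
    """Return the index of the first NaN in `values`, or None if there is none."""
    for index, value in enumerate(values):
        if math.isnan(value):
            return index
    return None


clean = first_nan([0.0, 1.5, 2.0])              # None
dirty = first_nan([0.0, float("nan"), 2.0])     # 1
```

Calling it on each output buffer at the end of a step function narrows the search to one field and one cell.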

ConstraintViolation { constraint: String }

A user-defined constraint was violated during propagator execution.

Remediation:

  1. Review the constraint definition and the field state that triggered it.
  2. May indicate an out-of-bounds physical quantity or a domain invariant violation.

IngressError

Crate: murk-core | File: crates/murk-core/src/error.rs

Errors from the ingress (command submission) pipeline. Used in Receipt::reason_code to explain why a command was rejected.

Quick reference

Variant       HLD Code                  Description
───────       ────────                  ───────────
QueueFull     MURK_ERROR_QUEUE_FULL     Command queue at capacity
Stale         MURK_ERROR_STALE          Command’s basis_tick_id too old
TickRollback  MURK_ERROR_TICK_ROLLBACK  Tick rolled back; commands dropped
TickDisabled  MURK_ERROR_TICK_DISABLED  Ticking disabled after consecutive rollbacks
ShuttingDown  MURK_ERROR_SHUTTING_DOWN  World is shutting down

Details

QueueFull

The command queue is at capacity. The ingress pipeline cannot buffer any more commands until the tick engine drains the queue.

Remediation:

  1. Reduce command submission rate.
  2. Increase max_ingress_queue in WorldConfig.
  3. In RL training, this may indicate the agent is submitting faster than the tick rate.

Stale

The command’s basis_tick_id is too old relative to the current tick. The adaptive backoff mechanism rejected it due to excessive skew.

Remediation:

  1. Resubmit the command with a fresh basis tick.
  2. If occurring frequently, the agent’s observation-to-action latency is too high relative to the tick rate.
  3. Adjust backoff parameters in BackoffConfig to control the tolerance.

TickRollback

The tick was rolled back; commands submitted during that tick were dropped.

Remediation:

  1. Resubmit the command on the next tick.
  2. This is a transient condition.

Note: This shares the same HLD code as StepError::TickRollback since both originate from the same tick rollback event.

TickDisabled

Ticking is disabled after consecutive rollbacks. No commands will be accepted until the world is reset.

Remediation:

  1. Reset the simulation.
  2. Investigate the root cause of repeated tick rollbacks (see StepError::TickDisabled).

ShuttingDown

The world is shutting down. No further commands are accepted.

Remediation:

  1. Expected during graceful shutdown. Do not retry.

ObsError

Crate: murk-core | File: crates/murk-core/src/error.rs

Errors from the observation (egress) pipeline. Covers ObsPlan compilation, execution, and snapshot access failures.

Quick reference

Variant                        HLD Code                             Description
───────                        ────────                             ───────────
PlanInvalidated { reason }     MURK_ERROR_PLAN_INVALIDATED          ObsPlan generation does not match current snapshot
TimeoutWaitingForTick          MURK_ERROR_TIMEOUT_WAITING_FOR_TICK  Exact-tick egress request timed out (RealtimeAsync only)
NotAvailable                   MURK_ERROR_NOT_AVAILABLE             Requested tick evicted from snapshot ring buffer
InvalidComposition { reason }  MURK_ERROR_INVALID_COMPOSITION       ObsPlan valid_ratio below 0.35 threshold
ExecutionFailed { reason }     MURK_ERROR_EXECUTION_FAILED          ObsPlan execution failed mid-fill
InvalidObsSpec { reason }      MURK_ERROR_INVALID_OBSSPEC           Malformed ObsSpec at compilation time
WorkerStalled                  MURK_ERROR_WORKER_STALLED            Egress worker exceeded max_epoch_hold budget

Details

PlanInvalidated { reason: String }

The ObsPlan’s generation does not match the current snapshot. This occurs when the world topology or field layout has changed since the plan was compiled.

Remediation:

  1. Recompile the ObsPlan via ObsPlan::compile() against the current snapshot.
  2. Plans are invalidated by world resets or structural changes.

TimeoutWaitingForTick

An exact-tick egress request timed out. Only occurs in RealtimeAsync mode when waiting for a specific tick that has not yet been produced.

Remediation:

  1. Increase the timeout budget.
  2. Check if the tick thread is stalled or running slower than expected.
  3. This error does not occur in Lockstep mode.

NotAvailable

The requested tick has been evicted from the snapshot ring buffer. The ring only retains the most recent ring_buffer_size snapshots.

Remediation:

  1. Increase ring_buffer_size in WorldConfig.
  2. Alternatively, consume observations more promptly so they are not evicted before access.

InvalidComposition { reason: String }

The ObsPlan’s valid_ratio is below the 0.35 threshold. Too many entries in the observation spec reference invalid or out-of-bounds regions.

Remediation:

  1. Review the ObsSpec entries.
  2. Ensure field IDs and region specifications are valid for the current world configuration.
  3. The 0.35 threshold means at least 35% of entries must be valid.
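
The threshold rule can be sketched as a small check. This is an illustration only: it assumes valid_ratio is the fraction of valid entries, and check_valid_ratio, MIN_VALID_RATIO, and the reason strings are hypothetical names rather than Murk API.

```rust
/// Threshold quoted in the description above.
const MIN_VALID_RATIO: f64 = 0.35;

/// Hypothetical helper: returns the ratio when it meets the threshold,
/// or a human-readable reason otherwise.
fn check_valid_ratio(valid_entries: usize, total_entries: usize) -> Result<f64, String> {
    if total_entries == 0 {
        return Err("observation spec has no entries".to_string());
    }
    let ratio = valid_entries as f64 / total_entries as f64;
    if ratio < MIN_VALID_RATIO {
        Err(format!(
            "valid_ratio {:.2} is below the {} threshold",
            ratio, MIN_VALID_RATIO
        ))
    } else {
        Ok(ratio)
    }
}
```

Under this reading, a spec with 8 valid entries out of 20 passes, while 6 out of 20 is rejected.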

ExecutionFailed { reason: String }

ObsPlan execution failed mid-fill. An error occurred while extracting field data into the output buffer.

Remediation:

  1. Inspect the reason string.
  2. Common causes: snapshot was reclaimed during execution, arena error, or malformed plan.
  3. If caused by reclamation, see WorkerStalled and epoch hold settings.

InvalidObsSpec { reason: String }

Malformed ObsSpec detected at compilation time. The observation specification contains structural errors.

Remediation:

  1. Review the ObsSpec structure: check field IDs, region definitions, transforms, and dtypes.
  2. Fix the spec before recompiling.

WorkerStalled

An egress worker exceeded the max_epoch_hold budget (default 100 ms). The epoch reclamation system forcibly unpinned the worker so that it does not block arena segment reclamation.

Remediation:

  1. Reduce observation complexity.
  2. Or increase max_epoch_hold_ms in AsyncConfig.
  3. A stalled worker prevents epoch advancement, which blocks arena segment reclamation (see ArenaError::CapacityExceeded).

ConfigError

Crate: murk-engine | File: crates/murk-engine/src/config.rs

Errors detected during WorldConfig::validate() at startup time. These are structural invariant violations that prevent world construction.

Quick reference

Variant                            HLD Code  Description
───────                            ────────  ───────────
Pipeline(PipelineError)            -         Propagator pipeline validation failed
Arena(ArenaError)                  -         Arena configuration is invalid
EmptySpace                         -         Space has zero cells
NoFields                           -         No fields registered
RingBufferTooSmall { configured }  -         ring_buffer_size below minimum of 2
IngressQueueZero                   -         max_ingress_queue is zero
InvalidTickRate { value }          -         tick_rate_hz is NaN, infinite, zero, or negative

Details

Pipeline(PipelineError)

Propagator pipeline validation failed. Wraps a PipelineError.

Remediation:

  1. Inspect the inner PipelineError for the specific pipeline issue.

Arena(ArenaError)

Arena configuration is invalid. Wraps an ArenaError.

Remediation:

  1. Inspect the inner ArenaError for the specific arena issue.

EmptySpace

The configured Space has zero cells. A simulation requires at least one spatial cell.

Remediation:

  1. Provide a space with at least one cell.
  2. Check the space constructor arguments.

NoFields

No fields are registered in the configuration. A simulation requires at least one FieldDef.

Remediation:

  1. Add at least one field definition to WorldConfig::fields.

RingBufferTooSmall { configured: usize }

The ring_buffer_size is below the minimum of 2. The snapshot ring requires at least 2 slots for double-buffering.

Remediation:

  1. Set ring_buffer_size to 2 or greater. Default is 8.

IngressQueueZero

The max_ingress_queue capacity is zero. The ingress pipeline requires at least one slot.

Remediation:

  1. Set max_ingress_queue to 1 or greater. Default is 1024.

InvalidTickRate { value: f64 }

tick_rate_hz is NaN, infinite, zero, or negative. Must be a finite positive number.

Remediation:

  1. Provide a valid positive finite tick_rate_hz value.
  2. Or set it to None for no rate limiting.
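
A minimal sketch of the rule above, assuming tick_rate_hz is modelled as an Option<f64>. validate_tick_rate is a hypothetical stand-in for the check performed inside WorldConfig::validate(), not the engine's actual code.

```rust
/// Hypothetical validator: `None` means no rate limiting, `Some(v)` must be
/// finite and strictly positive. The rejected value is returned on failure.
fn validate_tick_rate(tick_rate_hz: Option<f64>) -> Result<(), f64> {
    match tick_rate_hz {
        None => Ok(()),
        Some(v) if v.is_finite() && v > 0.0 => Ok(()),
        Some(v) => Err(v),
    }
}
```

Note that NaN fails the `v > 0.0` comparison on its own, but the explicit is_finite() call is still needed to reject positive infinity.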

PipelineError

Crate: murk-propagator | File: crates/murk-propagator/src/pipeline.rs

Errors from pipeline validation at startup. These are checked once by validate_pipeline() and prevent world construction if any are detected.

Quick reference

Variant                                                               HLD Code  Description
───────                                                               ────────  ───────────
EmptyPipeline                                                         -         No propagators registered
WriteConflict(Vec<WriteConflict>)                                     -         Two or more propagators write the same field
UndefinedField { propagator, field_id }                               -         Propagator references an undefined field
DtTooLarge { configured_dt, max_supported, constraining_propagator }  -         Configured dt exceeds a propagator’s max_dt
InvalidDt { value }                                                   -         Configured dt is NaN, infinity, zero, or negative

Details

EmptyPipeline

No propagators are registered in the pipeline. At least one propagator is required.

Remediation:

  1. Add at least one propagator to WorldConfig::propagators.

WriteConflict(Vec<WriteConflict>)

Two or more propagators write the same field. Each WriteConflict contains field_id, first_writer, and second_writer.

Remediation:

  1. Ensure each FieldId is written by at most one propagator.
  2. Restructure propagators so that field ownership is exclusive.

The WriteConflict struct:

Field          Type     Description
─────          ────     ───────────
field_id       FieldId  The contested field
first_writer   String   Name of the first writer (earlier in pipeline order)
second_writer  String   Name of the second writer (later in pipeline order)
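
One way to picture the check that produces these records is a single pass over the pipeline that remembers the first writer of each field. The sketch below is illustrative only: FieldId is modelled as a bare u32, the WriteConflict struct is a local stand-in with the three documented fields, and find_write_conflicts is a hypothetical name, not the validate_pipeline() implementation.

```rust
use std::collections::BTreeMap;

/// Local stand-in for the documented struct; `FieldId` is a bare u32 here.
#[derive(Debug, PartialEq)]
struct WriteConflict {
    field_id: u32,
    first_writer: String,
    second_writer: String,
}

/// `pipeline` lists (propagator name, written field ids) in pipeline order.
/// The first propagator to write a field becomes its owner; every later
/// writer of that field is reported against the owner.
fn find_write_conflicts(pipeline: &[(&str, Vec<u32>)]) -> Vec<WriteConflict> {
    let mut owner: BTreeMap<u32, &str> = BTreeMap::new();
    let mut conflicts = Vec::new();
    for (name, writes) in pipeline {
        for &field_id in writes {
            let previous = owner.get(&field_id).copied();
            match previous {
                Some(first) => conflicts.push(WriteConflict {
                    field_id,
                    first_writer: first.to_string(),
                    second_writer: name.to_string(),
                }),
                None => {
                    owner.insert(field_id, *name);
                }
            }
        }
    }
    conflicts
}
```

A BTreeMap is used so the sketch itself iterates deterministically, in line with the project's ban on HashMap.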

UndefinedField { propagator: String, field_id: FieldId }

A propagator references (reads, reads_previous, or writes) a field that is not defined in the world’s field list.

Remediation:

  1. Register the missing FieldId in WorldConfig::fields.
  2. Or update the propagator to reference only defined fields.

DtTooLarge { configured_dt: f64, max_supported: f64, constraining_propagator: String }

The configured dt exceeds a propagator’s max_dt constraint. The constraining_propagator field identifies which propagator has the tightest limit.

Remediation:

  1. Reduce WorldConfig::dt so that it is at or below max_supported.
  2. The tightest max_dt across all propagators determines the upper bound.

Note: At runtime, this condition surfaces as StepError::DtOutOfRange.
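
The "tightest max_dt" rule amounts to taking the minimum over all propagators that declare a limit. The following sketch assumes each propagator reports an optional limit as Option<f64>; tightest_max_dt and check_dt are hypothetical helpers, and the error is reduced to a string for brevity.

```rust
/// Returns the smallest declared `max_dt` together with the name of the
/// propagator that imposes it. Entries with `None` declare no limit.
fn tightest_max_dt<'a>(limits: &[(&'a str, Option<f64>)]) -> Option<(&'a str, f64)> {
    let mut tightest: Option<(&'a str, f64)> = None;
    for &(name, max_dt) in limits {
        if let Some(dt) = max_dt {
            match tightest {
                // Keep the current winner when it is already at least as tight.
                Some((_, best)) if best <= dt => {}
                _ => tightest = Some((name, dt)),
            }
        }
    }
    tightest
}

/// Rejects a configured dt that exceeds the tightest limit.
fn check_dt(configured_dt: f64, limits: &[(&str, Option<f64>)]) -> Result<(), String> {
    match tightest_max_dt(limits) {
        Some((name, max_supported)) if configured_dt > max_supported => Err(format!(
            "dt {} exceeds max_dt {} of propagator {}",
            configured_dt, max_supported, name
        )),
        _ => Ok(()),
    }
}
```

A dt exactly equal to the limit is accepted here, matching the "at or below" wording in the remediation steps.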

InvalidDt { value: f64 }

The configured dt is not a valid timestep: NaN, infinity, zero, or negative.

Remediation:

  1. Provide a finite positive dt value in WorldConfig::dt.

ArenaError

Crate: murk-arena | File: crates/murk-arena/src/error.rs

Errors from arena operations. The arena manages generational allocation of field data for the snapshot ring buffer.

Quick reference

Variant                                         HLD Code  Description
───────                                         ────────  ───────────
CapacityExceeded { requested, capacity }        -         Segment pool full, cannot allocate
StaleHandle { handle_generation, oldest_live }  -         FieldHandle from a reclaimed generation
UnknownField { field }                          -         Unregistered FieldId referenced
NotWritable { field }                           -         Field mutability does not permit writes
InvalidConfig { reason }                        -         Arena configuration is invalid

Details

CapacityExceeded { requested: usize, capacity: usize }

The segment pool is full and cannot allocate the requested number of bytes. All segments are in use by live generations.

Remediation:

  1. Increase arena capacity.
  2. Ensure epoch reclamation is running – workers must unpin epochs so old generations can be freed (see ObsError::WorkerStalled).
  3. Reduce ring_buffer_size to decrease the number of live generations.

StaleHandle { handle_generation: u32, oldest_live: u32 }

A FieldHandle from a generation that has already been reclaimed was used. The handle’s generation predates the oldest live generation.

Remediation:

  1. This indicates a use-after-free bug in handle management.
  2. Ensure handles are not cached across generation boundaries.
  3. Check that observation plans are recompiled after resets (see ObsError::PlanInvalidated).

UnknownField { field: FieldId }

A FieldId that is not registered in the arena was referenced.

Remediation:

  1. Ensure the field is registered in WorldConfig::fields.
  2. Check that the FieldId index matches the field definition order.

NotWritable { field: FieldId }

Attempted to write a field whose FieldMutability does not permit writes in the current context (e.g., writing a Static field after initialization).

Remediation:

  1. Check the field’s FieldMutability setting.
  2. Static fields can only be set during initialization.
  3. Use PerTick or Sparse mutability for fields that change during simulation.

InvalidConfig { reason: String }

The arena configuration is invalid.

Remediation:

  1. Inspect the reason string for details.
  2. Typically indicates misconfigured segment sizes or field layouts.
  3. This is wrapped by ConfigError::Arena at startup.

SpaceError

Crate: murk-space | File: crates/murk-space/src/error.rs

Errors from space construction or spatial queries.

Quick reference

Variant                                 HLD Code  Description
───────                                 ────────  ───────────
CoordOutOfBounds { coord, bounds }      -         Coordinate outside space bounds
InvalidRegion { reason }                -         Region spec invalid for this topology
EmptySpace                              -         Space constructed with zero cells
DimensionTooLarge { name, value, max }  -         Dimension exceeds representable range
InvalidComposition { reason }           -         Space composition is invalid

Details

CoordOutOfBounds { coord: Coord, bounds: String }

A coordinate is outside the bounds of the space. The bounds string describes the valid coordinate range.

Remediation:

  1. Validate coordinates before passing them to space methods.
  2. Clamp or reject out-of-bounds coordinates in command processing.

InvalidRegion { reason: String }

A region specification is invalid for this space topology.

Remediation:

  1. Review the RegionSpec being compiled.
  2. Ensure region parameters (center, radius, etc.) are compatible with the space’s dimensionality and bounds.

EmptySpace

Attempted to construct a space with zero cells. All space types require at least one cell.

Remediation:

  1. Provide a positive cell count to the space constructor (e.g., Line1D::new(n, ...) with n >= 1).

Note: This is also caught at the engine level by ConfigError::EmptySpace.

DimensionTooLarge { name: &'static str, value: u32, max: u32 }

A dimension exceeds the representable coordinate range. The name field indicates which dimension (e.g., “len”, “rows”, “cols”).

Remediation:

  1. Reduce the dimension to at or below max.
  2. The limit exists because coordinates are stored as i32 and the space must be indexable.

InvalidComposition { reason: String }

A space composition is invalid (e.g., empty component list, cell count overflow in product spaces).

Remediation:

  1. Review the composition parameters.
  2. For product spaces, ensure components are non-empty and the total cell count fits in usize.

ReplayError

Crate: murk-replay | File: crates/murk-replay/src/error.rs

Errors during replay recording, playback, or determinism comparison.

Quick reference

Variant                                           HLD Code  Description
───────                                           ────────  ───────────
Io(io::Error)                                     -         I/O error during read or write
InvalidMagic                                      -         File missing b"MURK" magic bytes
UnsupportedVersion { found }                      -         Format version not supported by this build
MalformedFrame { detail }                         -         Frame could not be decoded
UnknownPayloadType { tag }                        -         Command payload type tag not recognized
ConfigMismatch { recorded, current }              -         Configuration hash mismatch
SnapshotMismatch { tick_id, recorded, replayed }  -         Determinism violation at specified tick

Details

Io(io::Error)

An I/O error occurred during read or write. Wraps a std::io::Error.

Remediation:

  1. Check file permissions, disk space, and path validity.
  2. Inspect the inner io::Error for the specific OS-level failure.

InvalidMagic

The file does not start with the expected b"MURK" magic bytes. The file is not a valid Murk replay.

Remediation:

  1. Verify the file path points to an actual Murk replay file.
  2. The file may be corrupt or a different format.

UnsupportedVersion { found: u8 }

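The format version in the file is not supported by this build. The current build supports version 3 only; files with version 1 or 2 are rejected (see Version History in the wire format specification).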

Remediation:

  1. Upgrade or downgrade the Murk library to match the replay file’s format version.
  2. Or re-record the replay with the current version.

MalformedFrame { detail: String }

A frame could not be decoded due to truncated or corrupt data. This includes truncated frame headers (partial tick_id), invalid presence flags, truncated payloads, and invalid UTF-8 strings.

Remediation:

  1. The replay file is corrupt or was truncated (e.g., process crash during recording).
  2. Re-record the replay.
  3. If the truncation is at the end, preceding frames may still be valid.

UnknownPayloadType { tag: u8 }

A command payload type tag is not recognized. The tag value does not correspond to any known CommandPayload variant.

Remediation:

  1. The replay was most likely recorded with a newer version of Murk that has additional command types.
  2. Upgrade the Murk library to match.

ConfigMismatch { recorded: u64, current: u64 }

The replay was recorded with a different configuration hash. The recorded hash (from the file header) does not match the current hash (computed from the live configuration).

Remediation:

  1. Reconstruct the world with the same configuration used during recording: same fields, propagators, dt, seed, space, and ring buffer size.

SnapshotMismatch { tick_id: u64, recorded: u64, replayed: u64 }

A snapshot hash does not match between the recorded and replayed state at the specified tick_id. This indicates a determinism violation.

Remediation:

  1. Check BuildMetadata for toolchain/target differences between recording and replay.
  2. Common causes: floating-point non-determinism across toolchains/platforms, uninitialized memory, non-deterministic iteration order, or external state dependency.

SubmitError

Crate: murk-engine | File: crates/murk-engine/src/realtime.rs

Errors from submitting commands to the tick thread in RealtimeAsyncWorld.

Quick reference

Variant      HLD Code  Description
───────      ────────  ───────────
Shutdown     -         Tick thread has shut down
ChannelFull  -         Command channel is full (back-pressure)

Details

Shutdown

The tick thread has shut down. The command channel is disconnected.

Remediation:

  1. The world has been shut down or dropped. Do not retry.
  2. Create a new world or call reset() if the world supports it.

ChannelFull

The command channel is full (back-pressure). The bounded channel (capacity 64) cannot accept more batches until the tick thread drains it.

Remediation:

  1. Reduce command submission rate.
  2. Wait for the tick thread to process pending batches before submitting more.
  3. This indicates the submitter is outpacing the tick rate.

BatchError

Crate: murk-engine | File: crates/murk-engine/src/batched.rs

Errors from the batched simulation engine (BatchedEngine). Each variant annotates the failure with context about which world failed.

Quick reference

Variant                                   Description
───────                                   ───────────
Step { world_index, error }               A world’s step_sync() failed
Observe(ObsError)                         Observation extraction failed
Config(ConfigError)                       World creation failed
InvalidIndex { world_index, num_worlds }  World index out of bounds
NoObsPlan                                 Observation method called without an ObsSpec
InvalidArgument { reason }                Method argument failed validation

Details

Step { world_index, error }

A world’s step_sync() failed during step_and_observe() or step_all(). The world_index identifies which world and error contains the underlying TickError.

Remediation:

  1. Inspect the wrapped TickError (see StepError).
  2. Check propagator inputs for the failing world.

Observe(ObsError)

Observation extraction failed during observe_all() or step_and_observe().

Remediation:

  1. Check ObsSpec / ObsEntry configuration.
  2. Ensure field names and region specs are valid (see ObsError).

Config(ConfigError)

World creation failed during BatchedEngine::new() or reset_world().

Remediation:

  1. Inspect the wrapped ConfigError.
  2. Check that all configs have matching space topologies and field definitions.

InvalidIndex { world_index, num_worlds }

The requested world index is out of bounds.

Remediation:

  1. Use world_index < num_worlds.
  2. Call num_worlds() to check the batch size.

NoObsPlan

An observation method was called but no ObsSpec was provided at construction.

Remediation:

  1. Pass obs_entries when creating BatchedWorld / BatchedEngine.

InvalidArgument { reason }

A method argument failed validation (e.g., wrong number of command lists, buffer size mismatch).

Remediation:

  1. Read the reason message for specifics.
  2. Common causes: commands.len() != num_worlds, output buffer too small.

InternalError

Layer: murk-ffi / murk-python | Status code: -20

Returned when an internal mutex lock fails, most commonly due to mutex poisoning after a panic in a prior call.

Quick reference

Code  Description
────  ───────────
-20   Internal error (typically poisoned mutex on affected handle)

Details

InternalError indicates the API cannot safely continue the requested operation because protected internal state is no longer trusted.

Murk currently follows a world-fatal policy for poisoning:

  1. The affected handle is treated as unhealthy.
  2. Subsequent operations on that handle may continue returning InternalError.
  3. Other independent handles may still operate normally.

Remediation:

  1. Treat InternalError as non-retryable for the same handle.
  2. Destroy and recreate the affected world/plan/config handle.
  3. If preceded by Panicked, capture panic text with murk_last_panic_message.
  4. See docs/design/ffi-poisoning-policy.md for policy and recovery expectations.

Panicked

Layer: murk-ffi / murk-python | Status code: -128

FFI boundary panic status returned when Rust catches a panic inside an exported extern "C" function via ffi_guard!.

Quick reference

Code  Description
────  ───────────
-128  Rust panic caught at FFI boundary

Details

Panicked means an internal Rust panic occurred while executing an API call. The panic is caught and converted into a status code instead of unwinding across the C boundary.

Remediation:

  1. Treat this as a bug in murk or a custom propagator.
  2. Retrieve panic text via murk_last_panic_message (or Python exception text) and include it in bug reports.
  3. Recreate the affected world/batch handle if subsequent calls report internal errors.
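
The recovery guidance for the two FFI-level statuses can be condensed into a small dispatch on the raw status code. Only the numeric values -20 and -128 come from this reference; the constant names, the Recovery enum, and recovery_for are hypothetical and exist purely for illustration.

```rust
/// Numeric status codes quoted in this reference. The constant names are
/// illustrative and are not taken from the murk-ffi headers.
const STATUS_INTERNAL_ERROR: i32 = -20;
const STATUS_PANICKED: i32 = -128;

#[derive(Debug, PartialEq)]
enum Recovery {
    /// Not one of the two handle-fatal codes covered here.
    Other,
    /// Handle state is no longer trusted: destroy and recreate the handle.
    RecreateHandle,
    /// A panic was caught: capture the panic message first, then recreate
    /// the handle if later calls report internal errors.
    CapturePanicMessage,
}

fn recovery_for(status: i32) -> Recovery {
    match status {
        STATUS_INTERNAL_ERROR => Recovery::RecreateHandle,
        STATUS_PANICKED => Recovery::CapturePanicMessage,
        _ => Recovery::Other,
    }
}
```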

Murk Replay Wire Format Specification

Binary format for deterministic replay recording and playback. All integers are little-endian. Strings and byte arrays are length-prefixed with a u32 length. No compression, no alignment padding, no self-describing schema.

Current version: 3
Magic: b"MURK" (4 bytes)
Byte order: Little-endian throughout

See Primitive Encoding for type definitions used throughout this document.


File Structure

[Header] [Frame 0] [Frame 1] ... [Frame N-1] [EOF]

A replay file consists of a single header followed by zero or more frames. EOF is detected by a clean zero-byte read at a frame boundary (no sentinel or frame count in the header).


Header Layout

The header is written once at file creation by ReplayWriter::new() and validated on open by ReplayReader::open().

Offset  Size     Type                Description
──────  ────     ────                ───────────
0       4        [u8; 4]             Magic bytes: b"MURK"
4       1        u8                  Format version (currently 3)
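
As a sketch of how a reader might validate this fixed 5-byte prefix: the function below checks the magic and version and returns the offset at which build metadata begins. HeaderError and check_prefix are illustrative names; the real ReplayReader::open() reports ReplayError variants and may be structured differently.

```rust
const MAGIC: [u8; 4] = *b"MURK";
const FORMAT_VERSION: u8 = 3;

/// Illustrative error type; the library itself reports ReplayError variants.
#[derive(Debug, PartialEq)]
enum HeaderError {
    Truncated,
    InvalidMagic,
    UnsupportedVersion { found: u8 },
}

/// Validates the magic bytes and format version, then returns the byte
/// offset where the build metadata starts.
fn check_prefix(bytes: &[u8]) -> Result<usize, HeaderError> {
    if bytes.len() < 5 {
        return Err(HeaderError::Truncated);
    }
    if bytes[..4] != MAGIC[..] {
        return Err(HeaderError::InvalidMagic);
    }
    let version = bytes[4];
    if version != FORMAT_VERSION {
        return Err(HeaderError::UnsupportedVersion { found: version });
    }
    Ok(5)
}
```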

Build Metadata

Immediately follows the format version. All strings are length-prefixed (u32 length + UTF-8 bytes).

Offset  Size     Type                Description
──────  ────     ────                ───────────
5       4+N      lpstring            toolchain (e.g. "1.78.0")
5+a     4+N      lpstring            target_triple (e.g. "x86_64-unknown-linux-gnu")
5+a+b   4+N      lpstring            murk_version (e.g. "0.1.0")
5+a+b+c 4+N      lpstring            compile_flags (e.g. "release")

Where lpstring means u32 length (LE) + N bytes of UTF-8 data, and a, b, c denote the variable sizes of preceding strings (4 + string length each).
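
A minimal sketch of the lpstring encoding, useful for walking the four build metadata strings in order. write_lpstring and read_lpstring are hypothetical helper names; truncation and invalid UTF-8 are collapsed into None here, whereas the library reports MalformedFrame-style errors.

```rust
/// Appends a u32 LE length prefix followed by the UTF-8 bytes.
fn write_lpstring(out: &mut Vec<u8>, s: &str) {
    out.extend_from_slice(&(s.len() as u32).to_le_bytes());
    out.extend_from_slice(s.as_bytes());
}

/// Reads one lpstring starting at `offset`. Returns the string and the
/// offset of the next field, or None on truncation or invalid UTF-8.
fn read_lpstring(buf: &[u8], offset: usize) -> Option<(String, usize)> {
    let len_end = offset.checked_add(4)?;
    let prefix = buf.get(offset..len_end)?;
    let len = u32::from_le_bytes([prefix[0], prefix[1], prefix[2], prefix[3]]) as usize;
    let end = len_end.checked_add(len)?;
    let text = std::str::from_utf8(buf.get(len_end..end)?).ok()?;
    Some((text.to_string(), end))
}
```

Each call returns the next offset, so the variable offsets a, b, and c in the table above fall out of chaining four reads starting at offset 5.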

Init Descriptor

Immediately follows build metadata. Contains the simulation initialization parameters needed to reconstruct an identical world for replay.

Offset  Size     Type                Description
──────  ────     ────                ───────────
+0      8        u64 LE              seed: RNG seed for deterministic simulation
+8      8        u64 LE              config_hash: hash of the world configuration
+16     4        u32 LE              field_count: number of fields in the world
+20     8        u64 LE              cell_count: total spatial cells
+28     4+N      lpbytes             space_descriptor: opaque serialized space descriptor

Where lpbytes means u32 length (LE) + N bytes of opaque data.

Total header size: 5 + (4 variable-length strings) + 28 + (1 variable-length byte array) = variable.


Frame Layout

Each frame records a single tick’s command inputs and the resulting snapshot hash for determinism verification. Frames are written sequentially with no padding between them.

Offset  Size     Type                Description
──────  ────     ────                ───────────
+0      8        u64 LE              tick_id: the tick number
+8      4        u32 LE              command_count: number of commands in this frame
+12     ...      [Command]           command_count serialized commands (see below)
+N      8        u64 LE              snapshot_hash: FNV-1a hash of the post-tick snapshot

EOF Detection

When reading frames, a clean EOF (zero bytes available at the start of a frame) returns None (no more frames). A partial read of the 8-byte tick_id header (1-7 bytes) is treated as a truncation error (MalformedFrame), not a clean EOF. This distinguishes complete files from files truncated by a crash during recording.
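
The distinction between a clean EOF and a truncated header can be sketched with a manual read loop, since Read::read_exact alone cannot tell zero bytes from a partial read. FrameStart and read_frame_start are illustrative names; the actual reader returns None for the clean case and ReplayError::MalformedFrame for truncation.

```rust
use std::io::{self, Read};

/// Outcome of trying to read the 8-byte tick_id at a frame boundary.
#[derive(Debug, PartialEq)]
enum FrameStart {
    /// Zero bytes were available: the file ended cleanly.
    CleanEof,
    /// A full tick_id was read.
    TickId(u64),
    /// Between 1 and 7 bytes were read before EOF: the file is truncated.
    Truncated { bytes_read: usize },
}

fn read_frame_start<R: Read>(reader: &mut R) -> io::Result<FrameStart> {
    let mut buf = [0u8; 8];
    let mut filled = 0;
    while filled < buf.len() {
        match reader.read(&mut buf[filled..]) {
            Ok(0) => break, // EOF
            Ok(n) => filled += n,
            Err(ref e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
    Ok(match filled {
        0 => FrameStart::CleanEof,
        8 => FrameStart::TickId(u64::from_le_bytes(buf)),
        n => FrameStart::Truncated { bytes_read: n },
    })
}
```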


Command Encoding

Each command within a frame is encoded as follows:

Offset  Size     Type                Description
──────  ────     ────                ───────────
+0      1        u8                  payload_type: discriminant tag (see table below)
+1      4        u32 LE              payload_length: byte length of the payload
+5      N        [u8]                payload: serialized command data (N = payload_length)
+5+N    1        u8                  priority_class: lower = higher priority
+6+N    1        u8                  source_id presence flag (0 = absent, 1 = present)
+7+N    0 or 8   u64 LE              source_id value (only if presence flag = 1)
+...    1        u8                  source_seq presence flag (0 = absent, 1 = present)
+...    0 or 8   u64 LE              source_seq value (only if presence flag = 1)
+...    8        u64 LE              expires_after_tick
+...    8        u64 LE              arrival_seq

Command size: varies from 24 bytes (minimum: 1 + 4 + 0 + 1 + 1 + 1 + 8 + 8 = 24 with empty payload, no source fields) to unbounded depending on payload size and source field presence.

expires_after_tick and arrival_seq are serialized in format version 3.
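
To make the layout and the 24-byte minimum concrete, here is a sketch of an encoder that follows the table field by field. CommandRecord, push_optional_u64, and encode_command are hypothetical names and do not mirror the murk-replay types.

```rust
/// Illustrative stand-in for a recorded command; not the murk-replay type.
struct CommandRecord<'a> {
    payload_type: u8,
    payload: &'a [u8],
    priority_class: u8,
    source_id: Option<u64>,
    source_seq: Option<u64>,
    expires_after_tick: u64,
    arrival_seq: u64,
}

/// Writes a presence flag, followed by the value only when present.
fn push_optional_u64(out: &mut Vec<u8>, value: Option<u64>) {
    match value {
        Some(v) => {
            out.push(1);
            out.extend_from_slice(&v.to_le_bytes());
        }
        None => out.push(0),
    }
}

/// Serializes the fields in the order given by the layout table.
fn encode_command(cmd: &CommandRecord, out: &mut Vec<u8>) {
    out.push(cmd.payload_type);
    out.extend_from_slice(&(cmd.payload.len() as u32).to_le_bytes());
    out.extend_from_slice(cmd.payload);
    out.push(cmd.priority_class);
    push_optional_u64(out, cmd.source_id);
    push_optional_u64(out, cmd.source_seq);
    out.extend_from_slice(&cmd.expires_after_tick.to_le_bytes());
    out.extend_from_slice(&cmd.arrival_seq.to_le_bytes());
}
```

With an empty payload and both source fields absent, the output is 24 bytes; an 8-byte payload with both source fields present adds 8 + 16 bytes for a total of 48.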

Presence Flag Encoding

The source_id and source_seq fields use explicit presence flags to distinguish None from Some(0):

Flag value  Meaning                Following bytes
──────────  ───────                ───────────────
0x00        Absent (None)          0 bytes
0x01        Present (Some(value))  8 bytes (u64 LE)
Other       Invalid                Decode error (MalformedFrame)

This encoding was introduced in format version 2 to fix a bug in v1 where Some(0) was indistinguishable from None.
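
Decoding a flag-prefixed optional value might look like the sketch below. DecodeError and decode_optional_u64 are illustrative names; the library surfaces both failure cases as ReplayError::MalformedFrame.

```rust
#[derive(Debug, PartialEq)]
enum DecodeError {
    Truncated,
    InvalidPresenceFlag(u8),
}

/// Decodes a presence flag plus optional u64 from the start of `buf`.
/// Returns the decoded value and the number of bytes consumed (1 or 9).
fn decode_optional_u64(buf: &[u8]) -> Result<(Option<u64>, usize), DecodeError> {
    match buf.first() {
        None => Err(DecodeError::Truncated),
        Some(&0x00) => Ok((None, 1)),
        Some(&0x01) => {
            let raw = buf.get(1..9).ok_or(DecodeError::Truncated)?;
            let mut bytes = [0u8; 8];
            bytes.copy_from_slice(raw);
            Ok((Some(u64::from_le_bytes(bytes)), 9))
        }
        Some(&other) => Err(DecodeError::InvalidPresenceFlag(other)),
    }
}
```

The second return value lets the caller advance past either the 1-byte or the 9-byte form without re-inspecting the flag.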


Payload Type Tags

Tag  Constant                     CommandPayload Variant
───  ────────                     ──────────────────────
0    PAYLOAD_MOVE                 Move
1    PAYLOAD_SPAWN                Spawn
2    PAYLOAD_DESPAWN              Despawn
3    PAYLOAD_SET_FIELD            SetField
4    PAYLOAD_CUSTOM               Custom
5    PAYLOAD_SET_PARAMETER        SetParameter
6    PAYLOAD_SET_PARAMETER_BATCH  SetParameterBatch

Unrecognized tags produce ReplayError::UnknownPayloadType.


Payload Serialization

Move (tag 0)

Offset  Size     Type                Description
──────  ────     ────                ───────────
0       8        u64 LE              entity_id
8       4+N*4    coord               target_coord (see Coord encoding below)

Spawn (tag 1)

Offset  Size     Type                Description
──────  ────     ────                ───────────
0       4+N*4    coord               coord: spawn location
+a      4        u32 LE              field_values count
+a+4    M*(4+4)  [(u32, f32)]        field_values: array of (FieldId as u32 LE, value as f32 LE)

Despawn (tag 2)

Offset  Size     Type                Description
──────  ────     ────                ───────────
0       8        u64 LE              entity_id

SetField (tag 3)

Offset  Size     Type                Description
──────  ────     ────                ───────────
0       4+N*4    coord               coord: target cell
+a      4        u32 LE              field_id (FieldId inner value)
+a+4    4        f32 LE              value

Custom (tag 4)

Offset  Size     Type                Description
──────  ────     ────                ───────────
0       4        u32 LE              type_id: user-registered type identifier
4       4        u32 LE              data_length: byte length of opaque data
8       N        [u8]                data: opaque payload (N = data_length)

SetParameter (tag 5)

Offset  Size     Type                Description
──────  ────     ────                ───────────
0       4        u32 LE              key (ParameterKey inner value)
4       8        f64 LE              value

Total payload size: 12 bytes (fixed).

SetParameterBatch (tag 6)

Offset  Size     Type                Description
──────  ────     ────                ───────────
0       4        u32 LE              param_count: number of parameters
4       N*12     [(u32, f64)]        params: array of (ParameterKey as u32 LE, value as f64 LE)

Each entry is 12 bytes (4 bytes key + 8 bytes value).
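
A round-trip sketch for the batch payload, with ParameterKey modelled as a bare u32. encode_set_parameter_batch and decode_set_parameter_batch are hypothetical helpers; the decoder here rejects any buffer whose length does not equal 4 + param_count * 12, which is stricter than the spec requires and is an assumption of this example.

```rust
/// Encodes param_count followed by (key, value) entries of 12 bytes each.
fn encode_set_parameter_batch(params: &[(u32, f64)]) -> Vec<u8> {
    let mut out = Vec::with_capacity(4 + params.len() * 12);
    out.extend_from_slice(&(params.len() as u32).to_le_bytes());
    for &(key, value) in params {
        out.extend_from_slice(&key.to_le_bytes());
        out.extend_from_slice(&value.to_le_bytes());
    }
    out
}

/// Decodes the payload, returning None if the length is inconsistent.
fn decode_set_parameter_batch(buf: &[u8]) -> Option<Vec<(u32, f64)>> {
    let head = buf.get(0..4)?;
    let count = u32::from_le_bytes([head[0], head[1], head[2], head[3]]) as usize;
    let body = buf.get(4..)?;
    if body.len() != count.checked_mul(12)? {
        return None;
    }
    let mut params = Vec::with_capacity(count);
    for entry in body.chunks_exact(12) {
        let key = u32::from_le_bytes([entry[0], entry[1], entry[2], entry[3]]);
        let mut raw = [0u8; 8];
        raw.copy_from_slice(&entry[4..12]);
        params.push((key, f64::from_le_bytes(raw)));
    }
    Some(params)
}
```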


Coord Encoding

Coordinates (Coord, which is SmallVec<[i32; 4]>) are serialized as a length-prefixed array of i32 values:

Offset  Size     Type                Description
──────  ────     ────                ───────────
0       4        u32 LE              dimension_count: number of coordinate components
4       N*4      [i32 LE]            components: coordinate values (N = dimension_count)

Total size: 4 + (dimension_count * 4) bytes. For a typical 2D coordinate, this is 12 bytes.
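
A sketch of the coordinate encoding, using a plain slice and Vec<i32> in place of the SmallVec-backed Coord so the example has no external dependencies. encode_coord and decode_coord are hypothetical helper names.

```rust
/// Appends dimension_count (u32 LE) followed by each component (i32 LE).
fn encode_coord(components: &[i32], out: &mut Vec<u8>) {
    out.extend_from_slice(&(components.len() as u32).to_le_bytes());
    for c in components {
        out.extend_from_slice(&c.to_le_bytes());
    }
}

/// Decodes a coord from the start of `buf`. Returns the components and the
/// number of bytes consumed, or None if the buffer is too short.
fn decode_coord(buf: &[u8]) -> Option<(Vec<i32>, usize)> {
    let head = buf.get(0..4)?;
    let n = u32::from_le_bytes([head[0], head[1], head[2], head[3]]) as usize;
    let end = n.checked_mul(4)?.checked_add(4)?;
    let body = buf.get(4..end)?;
    let components = body
        .chunks_exact(4)
        .map(|c| i32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect();
    Some((components, end))
}
```

The consumed-byte count is what the Move, Spawn, and SetField layouts call the variable offset a.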


Primitive Encoding

All primitive types use little-endian byte order:

Type      Size         Encoding
────      ────         ────────
u8        1 byte       Raw byte
u32       4 bytes      Little-endian
u64       8 bytes      Little-endian
i32       4 bytes      Little-endian
f32       4 bytes      IEEE 754, little-endian
f64       8 bytes      IEEE 754, little-endian
lpstring  4 + N bytes  u32 LE length prefix + UTF-8 bytes
lpbytes   4 + N bytes  u32 LE length prefix + raw bytes

Snapshot Hash

The snapshot_hash field in each frame is an FNV-1a hash computed over the post-tick snapshot state. It is used during replay to verify determinism: after replaying all commands for a tick, the replayed simulation’s snapshot hash is compared against the recorded hash. A mismatch produces ReplayError::SnapshotMismatch.

The hash is computed by snapshot_hash() in crates/murk-replay/src/hash.rs and covers all fields up to field_count.
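
For reference, the 64-bit FNV-1a algorithm itself is short. The sketch below implements the standard algorithm with its published offset basis and prime; hash_fields is purely hypothetical and assumes fields are fed in index order as little-endian f32 bytes, which this document does not specify. The real snapshot_hash() may hash additional or different data.

```rust
const FNV_OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

/// Folds `bytes` into a running FNV-1a state: XOR the byte, then multiply.
fn fnv1a_update(mut hash: u64, bytes: &[u8]) -> u64 {
    for &byte in bytes {
        hash ^= u64::from(byte);
        hash = hash.wrapping_mul(FNV_PRIME);
    }
    hash
}

/// Standard 64-bit FNV-1a over a byte slice.
fn fnv1a_64(bytes: &[u8]) -> u64 {
    fnv1a_update(FNV_OFFSET_BASIS, bytes)
}

/// Hypothetical: hashes field data in index order as little-endian f32
/// bytes. The byte layout is an assumption of this sketch.
fn hash_fields(fields: &[Vec<f32>]) -> u64 {
    let mut hash = FNV_OFFSET_BASIS;
    for field in fields {
        for value in field {
            hash = fnv1a_update(hash, &value.to_le_bytes());
        }
    }
    hash
}
```

Because the hash is order-sensitive, swapping two fields or two cells changes the result, which is what makes it usable as a divergence detector.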


Version History

Version 3 (current)

  • expires_after_tick and arrival_seq are appended per command as u64 LE values.
  • This preserves command expiry and deterministic ordering metadata through replay.

Version 2

  • source_id and source_seq use presence-flag encoding: a u8 flag (0 = absent, 1 = present) followed by an optional u64 value.
  • This correctly distinguishes None from Some(0).
  • Superseded by version 3. Files with version 2 are rejected with ReplayError::UnsupportedVersion { found: 2 }.

Version 1

  • source_id and source_seq were encoded as bare u64 values where 0 meant “not set”.
  • Bug: Some(0) was indistinguishable from None, causing incorrect replay of commands with source_id = Some(0).
  • Superseded by later versions. Files with version 1 are rejected with ReplayError::UnsupportedVersion { found: 1 }.

Determinism Catalogue (R-DET-6)

Living document cataloging known sources of non-determinism and the mitigations applied in the Murk simulation framework.

Determinism Contract

Murk targets Tier B determinism: same initial state + same seed + same command log + same build + same toolchain + same ISA ⇒ identical outputs across runs.

Determinism holds when all match: build profile, compiler version, CPU ISA family, Cargo feature flags, dependency versions.

Determinism is not promised across: different ISAs, different libm implementations, fast-math builds, or different Murk versions.

Authoritative Surface Area

Changes to the authoritative path require determinism test verification. Changes to the non-authoritative path must never affect world state.

Authoritative path

  • Subsystems: TickEngine, propagator step(), IngressQueue (sort/expiry), snapshot publish, arena allocation/recycling
  • Rule: Must be deterministic. Run cargo test --test determinism after any change.

Non-authoritative path

  • Subsystems: Rendering, logging, metrics, wall-clock pacing, egress worker scheduling, StepMetrics timing, CLI tooling
  • Rule: May vary. Must not influence state transitions.

Sources of Non-Determinism

1. HashMap / HashSet Iteration Order

Risk: HashMap and HashSet use randomized hashing by default. Iterating over them produces different orderings across runs.

Mitigation: Banned project-wide via clippy.toml:

disallowed-types = [
    { path = "std::collections::HashMap", reason = "Use IndexMap for deterministic iteration" },
    { path = "std::collections::HashSet", reason = "Use IndexSet for deterministic iteration" },
]

All code uses IndexMap / BTreeMap instead.

Verification: cargo clippy enforces this at CI time.
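
A small sketch of why ordered maps matter here, using std's BTreeMap (IndexMap is an external crate and is left out to keep the example self-contained). The function names and the per-agent data are hypothetical.

```rust
use std::collections::BTreeMap;

/// Keys come back in sorted order regardless of insertion order.
pub fn visit_order(contributions: &BTreeMap<u32, f32>) -> Vec<u32> {
    contributions.keys().copied().collect()
}

/// Sum per-agent contributions in key order, so the floating-point
/// accumulation order is the same on every run.
pub fn total_in_key_order(contributions: &BTreeMap<u32, f32>) -> f32 {
    let mut total = 0.0f32;
    for value in contributions.values() {
        total += value;
    }
    total
}
```

With a randomly seeded HashMap the same loop could visit entries in a different order on each run, which changes rounding in the running total.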


2. Floating-Point Reassociation

Risk: Compilers may reorder floating-point operations for performance (e.g., -ffast-math), producing different results across builds.

Mitigation:

  • Rust does not enable fast-math by default.
  • All arithmetic uses explicit operation ordering (no auto-vectorization that could reassociate).
  • Build metadata is recorded in the replay header, enabling detection of toolchain differences.

Verification: Replay header stores BuildMetadata.compile_flags and BuildMetadata.toolchain.
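
To make the risk concrete, the sketch below shows that floating-point addition is not associative: the same three f32 values summed in two groupings give different results. The function names are illustrative only.

```rust
/// Left-to-right grouping: (a + b) + c.
pub fn sum_left(a: f32, b: f32, c: f32) -> f32 {
    (a + b) + c
}

/// Reassociated grouping: a + (b + c).
pub fn sum_right(a: f32, b: f32, c: f32) -> f32 {
    a + (b + c)
}
```

For a = 1e20, b = -1e20, c = 1.0, the left grouping cancels the large terms first and returns 1.0, while the right grouping absorbs 1.0 into -1e20 and returns 0.0. A compiler that is allowed to reassociate may pick either.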


2a. Floating-Point Transcendental Functions

Risk: sin, cos, exp, log, and other transcendentals can produce different results across platforms and libm implementations (glibc vs musl vs macOS libm), even with identical source code and no fast-math. This is the most likely source of cross-platform divergence.

Mitigation:

  • Tier B contract explicitly excludes cross-libm determinism.
  • Build metadata records target_triple, enabling detection of platform differences during replay verification.
  • Propagators that use transcendentals are documented as platform-sensitive in their metadata().

Future option: A murk_math shim crate can provide consistent transcendental implementations if Tier C (cross-platform) determinism is needed.

Status: Documented constraint. No mitigation needed for Tier B.


3. Sort Stability

Risk: Unstable sorts may produce different orderings for equal elements across implementations or runs.

Mitigation:

  • Command ordering uses priority_class (primary), source_id (secondary), arrival_seq (final tiebreaker). The combined key is distinct for every command, so no two commands compare equal.
  • Agent actions are sorted by agent_id before processing.
  • All sorts use stable sort (sort_by_key / sort_by).

Verification: Scenario 2 (multi-source command ordering) exercises 3 sources with 1000 ticks.
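
The ordering rule above can be sketched with a composite sort key. The Command struct here is a hypothetical stand-in with assumed field types, and the assumption that lower priority_class values apply first is not confirmed by this catalogue.

```rust
/// Hypothetical, simplified command record; field types are assumptions.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Command {
    pub priority_class: u8,
    pub source_id: Option<u64>,
    pub arrival_seq: u64,
}

/// Stable sort by (priority_class, source_id, arrival_seq).
/// `None` source ids order before `Some(_)` under Option's derived ordering.
pub fn sort_commands(commands: &mut [Command]) {
    commands.sort_by_key(|c| (c.priority_class, c.source_id, c.arrival_seq));
}
```

Because the key is total and tie-free, any permutation of the same commands sorts to the same sequence.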


4. Thread Scheduling

Risk: In multi-threaded modes, OS thread scheduling is non-deterministic.

Mitigation:

  • Lockstep mode is single-threaded by design. All propagators execute sequentially in pipeline order.
  • RealtimeAsync mode (future) will use epoch-synchronized snapshots and deterministic command ordering at tick boundaries.

Status: N/A for Lockstep (current scope). Future concern for RealtimeAsync.


5. Arena Recycling

Risk: Memory recycling patterns could theoretically affect state if buffer reuse is order-dependent.

Mitigation:

  • PingPong buffer swap is deterministic: generation N always writes to buffer N % 2, reads from (N-1) % 2.
  • Arena allocations are generation-indexed, not address-indexed.
  • Ring buffer recycling is deterministic (circular index modulo ring size).

Verification: Scenario 4 (arena double-buffer recycling) runs 1100 ticks to exercise multiple full ring buffer cycles.
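
The buffer-selection rule can be written down directly. This is a sketch with hypothetical function names; it computes the read index as the other buffer, which equals (N-1) % 2 for N ≥ 1 and avoids underflow at generation 0.

```rust
/// Buffer written by generation N: N % 2.
pub fn write_buffer(generation: u64) -> usize {
    (generation % 2) as usize
}

/// Buffer read by generation N: the other one, i.e. (N - 1) % 2 for N >= 1.
pub fn read_buffer(generation: u64) -> usize {
    1 - write_buffer(generation)
}
```

The indices depend only on the generation counter, never on addresses or allocation history, which is what keeps recycling deterministic.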


6. RNG Seed

Risk: Different seeds produce different simulation trajectories.

Mitigation:

  • Seed is stored in the replay header (InitDescriptor.seed).
  • Replay reconstruction uses the same seed.
  • config_hash() includes the seed.

Verification: All scenarios use explicit seeds and verify hash equality.


7. Build Metadata Differences

Risk: Different compilers, optimization levels, or target architectures may produce different floating-point results for the same source code.

Mitigation:

  • BuildMetadata is recorded in every replay file: toolchain, target_triple, murk_version, compile_flags.
  • Replay consumers can warn or reject when metadata doesn’t match.

Status: Detection only. Cross-build determinism is not guaranteed and is explicitly documented as a known limitation.
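
A replay consumer's metadata check might look like the sketch below. The four field names come from the list above; the String types and the mismatched_fields helper are assumptions for illustration, not the murk-replay API.

```rust
/// Simplified stand-in for the recorded build metadata (field types assumed).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct BuildMetadata {
    pub toolchain: String,
    pub target_triple: String,
    pub murk_version: String,
    pub compile_flags: String,
}

/// Hypothetical helper: names of the fields that differ between a recording
/// and the current build, so a consumer can decide to warn or reject.
pub fn mismatched_fields(recorded: &BuildMetadata, current: &BuildMetadata) -> Vec<&'static str> {
    let mut diffs = Vec::new();
    if recorded.toolchain != current.toolchain {
        diffs.push("toolchain");
    }
    if recorded.target_triple != current.target_triple {
        diffs.push("target_triple");
    }
    if recorded.murk_version != current.murk_version {
        diffs.push("murk_version");
    }
    if recorded.compile_flags != current.compile_flags {
        diffs.push("compile_flags");
    }
    diffs
}
```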


8. Command Serialization Fidelity

Risk: Fields like expires_after_tick and arrival_seq are runtime-only and should not affect determinism if excluded.

Mitigation:

  • expires_after_tick is NOT serialized in replays. On deserialization, it is set to TickId(u64::MAX) (never expires).
  • arrival_seq is NOT serialized. Set to 0 on deserialization. The ingress pipeline assigns fresh arrival sequences.
  • Only payload, priority_class, source_id, source_seq are recorded.

Verification: Proptest round-trip tests verify command serialization preserves all payload data. Integration tests verify replay produces identical snapshots despite sentinel values.


9. Parallelism Introduction (Future)

Risk: When parallel propagators or Rayon-based batched stepping are introduced, thread scheduling and work-stealing order can produce different floating-point accumulation results. Refcount churn from Arc snapshot sharing can also introduce cache-line ping-pong that affects timing.

Mitigation (planned):

  • Determinism tests must become thread-count invariant: run with 1, 2, and 8 threads and require identical snapshot hashes.
  • Batched stepping must permute world ordering and verify hash stability.
  • Propagator parallelism (if added) must use deterministic reduction patterns (e.g., fixed-order partial sums, not work-stealing reductions).
  • Epoch-based reclamation (already in use) avoids Arc refcount churn.

Status: Not yet applicable. Current architecture is sequential (propagators in pipeline order, batched engine steps worlds in order). This entry exists as a gate: any PR introducing parallelism in the authoritative path must address these mitigations.
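
One way to read "fixed-order partial sums" is sketched below with std scoped threads. It is an illustration under assumptions, not planned Murk code: deterministic_sum is a hypothetical name, and the chunk length is fixed independently of the thread count so every partial sum covers the same elements no matter which thread computes it.

```rust
use std::thread;

/// Sum `data` in fixed-size chunks, then combine the partial sums in chunk order.
/// The result is bit-identical for any `threads` value because chunk boundaries
/// and the final reduction order do not depend on scheduling.
pub fn deterministic_sum(data: &[f32], chunk_len: usize, threads: usize) -> f32 {
    assert!(chunk_len > 0 && threads > 0);
    let chunks: Vec<&[f32]> = data.chunks(chunk_len).collect();
    let mut partials = vec![0.0f32; chunks.len()];
    // Each thread gets a contiguous group of chunks and the matching output slots.
    let per_thread = chunks.len().div_ceil(threads).max(1);
    thread::scope(|s| {
        for (chunk_group, out_group) in chunks.chunks(per_thread).zip(partials.chunks_mut(per_thread)) {
            s.spawn(move || {
                for (chunk, out) in chunk_group.iter().zip(out_group.iter_mut()) {
                    *out = chunk.iter().sum();
                }
            });
        }
    });
    // Final reduction always runs sequentially, in chunk index order.
    partials.iter().sum()
}
```

A work-stealing reduction, by contrast, would combine partials in whatever order tasks finish, which can change the rounding of the total.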


Verified Scenarios

| # | Scenario | Ticks | Status |
|---|---|---|---|
| 1 | Sequential-commit vs Jacobi | 1000 | PASS |
| 2 | Multi-source command ordering | 1000 | PASS |
| 3 | WriteMode::Incremental | 1000 | PASS |
| 4 | Arena double-buffer recycling | 1100 | PASS |
| 5 | Sparse field modification | 1000 | PASS |
| 6 | Tick rollback recovery | 100 | PASS |
| 7 | GlobalParameter mid-episode | 1000 | PASS |
| 8 | 10+ propagator pipeline | 1000 | PASS |

API Reference

The full Rust API documentation is generated by rustdoc and published alongside this book.

Browse the API Reference (rustdoc) →

You can also build the API docs locally:

cargo doc --workspace --no-deps --open

Contributing to Murk

What to Contribute

There are many ways to help improve Murk:

  • Bug reports – found something broken? Open an issue with reproduction steps.
  • Feature requests – have an idea for a new capability? Open an issue to discuss it.
  • Documentation – fix typos, clarify explanations, add examples.
  • New propagators – implement domain-specific propagators (see Adding a new propagator below).
  • New space backends – add spatial topologies (see Adding a new space backend below).
  • Performance improvements – benchmark, profile, and optimize hot paths.
  • Tests – increase coverage, add property tests, stress-test edge cases.

Look for issues labeled good-first-issue for accessible entry points.

Development Environment

Requirements:

  • Rust stable (1.87+) via rustup
  • Rust nightly (for Miri only): rustup toolchain install nightly --component miri
  • Python 3.12+
  • maturin: pip install maturin

Setup:

git clone https://github.com/tachyon-beep/murk.git
cd murk

# Build Rust workspace
cargo build --workspace

# Build Python extension (development mode)
cd crates/murk-python
maturin develop --release
cd ../..

Project Structure

murk/
├── crates/
│   ├── murk-core/          # Leaf crate: IDs, field definitions, commands, traits
│   ├── murk-arena/         # Double-buffered ping-pong arena allocator
│   ├── murk-space/         # Space trait + 7 lattice backends
│   ├── murk-propagator/    # Propagator trait, pipeline validation, step context
│   ├── murk-propagators/   # Reference propagators (diffusion, agent movement)
│   ├── murk-obs/           # Observation specification and tensor extraction
│   ├── murk-engine/        # LockstepWorld, RealtimeAsyncWorld, TickEngine
│   ├── murk-replay/        # Deterministic replay recording/verification
│   ├── murk-ffi/           # C ABI with handle tables
│   ├── murk-python/        # Python/PyO3 bindings + Gymnasium adapters
│   ├── murk-bench/         # Benchmark profiles
│   └── murk-test-utils/    # Shared test fixtures
├── examples/               # Python examples (heat_seeker, hex_pursuit, crystal_nav)
└── docs/                   # Design documents and concepts guide

Dependency graph (simplified):

murk-core
  ↑
murk-arena, murk-space
  ↑
murk-propagator, murk-obs
  ↑
murk-engine
  ↑
murk-ffi → murk-python

Running Tests

# Full workspace test suite (700+ tests)
cargo test --workspace

# Single crate
cargo test -p murk-space

# Python tests
cd crates/murk-python
pytest tests/ -v

# Memory safety (requires nightly)
cargo +nightly miri test -p murk-arena

# Clippy lints (must pass with zero warnings)
cargo clippy --workspace -- -D warnings

# Format check
cargo fmt --all -- --check

CI Expectations

Every push and PR triggers:

| Job | What it checks |
|---|---|
| cargo check | Compilation across all crates |
| cargo test | Full test suite |
| clippy | Lint warnings (zero tolerance) |
| rustfmt | Formatting |
| miri | Memory safety for murk-arena |

All five must pass before merging.

Code Style

Rust

  • #![deny(missing_docs)] on all crates — every public item needs a doc comment.
  • #![forbid(unsafe_code)] on all crates except murk-arena and murk-ffi. If your change needs unsafe, it belongs in one of those two crates.
  • Clippy with -D warnings — all clippy suggestions must be resolved.
  • cargo fmt — standard rustfmt formatting, no custom config.

Python

  • Type annotations on all public functions.
  • Docstrings on all public classes and methods.
  • Type stubs (.pyi) must be updated when the Python API changes.

Adding a New Space Backend

Space backends implement the Space trait in murk-space. Follow the pattern of Square4 or Hex2D:

  1. Create crates/murk-space/src/your_space.rs.
  2. Implement the Space trait:
    • ndim(), cell_count(), neighbours(), distance()
    • compile_region(), iter_region(), map_coord_to_tensor_index()
    • canonical_ordering(), canonical_rank()
    • instance_id()
  3. Add pub mod your_space; and pub use your_space::YourSpace; to lib.rs.
  4. Run the compliance test suite — this is critical:
// In your_space.rs, at the bottom:
#[cfg(test)]
mod tests {
    use super::*;
    use crate::compliance::compliance_tests;

    compliance_tests!(YourSpace, || YourSpace::new(4, 4, EdgeBehavior::Absorb).unwrap());
}

The compliance test suite (crates/murk-space/src/compliance.rs) automatically tests all Space trait invariants: canonical ordering consistency, neighbor symmetry, distance triangle inequality, region compilation, and more. If your backend passes compliance tests, it works with the rest of Murk.

  5. Add FFI support in murk-ffi and Python bindings in murk-python if needed.

Adding a New Propagator

Propagators implement the Propagator trait in murk-propagator:

  1. Create your propagator struct (must be Send + 'static).
  2. Implement:
    • name() — human-readable name
    • reads() — fields read via in-tick overlay (Euler style)
    • reads_previous() — fields read from frozen tick-start (Jacobi style)
    • writes() — fields written, with WriteMode::Full or Incremental
    • step(&self, ctx: &mut StepContext) — the per-tick logic
  3. Optionally implement max_dt() for CFL stability constraints.

See crates/murk-engine/examples/quickstart.rs for a complete example, or crates/murk-test-utils/src/fixtures.rs for minimal test propagators.

Key rules:

  • step() must be deterministic (same inputs → same outputs).
  • &self only — propagators are stateless. All mutable state goes through fields.
  • Copy read data to a local buffer before grabbing the write handle (split-borrow limitation in StepContext).

Pull Request Process

  1. Fork the repository and create a branch.
  2. Make your changes with tests.
  3. Ensure all CI checks pass locally (cargo test --workspace && cargo clippy --workspace -- -D warnings).
  4. Open a PR with a clear description of what changed and why.

Commit Messages

This project uses Conventional Commits:

| Prefix | When to use |
|---|---|
| feat: | New feature |
| fix: | Bug fix |
| docs: | Documentation only |
| ci: | CI/CD changes |
| chore: | Maintenance (deps, config) |
| refactor: | Code change that neither fixes nor adds |
| test: | Adding or updating tests |
| perf: | Performance improvement |

Use a scope when helpful: feat(space):, fix(python):, ci(release):.

Releasing

Releases are manual, triggered by pushing a git tag:

  1. Bump the version in root Cargo.toml under [workspace.package], and update all inter-crate version = "x.y.z" dependency strings in each crate’s Cargo.toml. The Python package version is derived automatically (via dynamic = ["version"] in pyproject.toml).
  2. Commit and push to main.
  3. Tag and push the tag:
    git tag murk-v<version>
    git push origin murk-v<version>
    
  4. The release workflow runs CI, publishes Rust crates to crates.io, builds Python wheels, and publishes them to PyPI. It also creates a GitHub Release with auto-generated release notes.
  5. If something fails: fix the issue, then re-trigger from the GitHub Actions UI using workflow_dispatch with the same tag (CI is skipped on re-runs).

Dry-run a release locally:

cargo publish --dry-run -p murk-core

Secrets required (set in GitHub repo settings > Secrets):

  • CARGO_REGISTRY_TOKEN — crates.io API token
  • CODECOV_TOKEN — Codecov upload token

Murk Architecture

This document explains Murk’s architecture for developers who want to understand how the engine works internally. For a practical introduction to building simulations, see CONCEPTS.md.


Design Goals

Murk is a world simulation engine for reinforcement learning and real-time applications. The architecture optimises for:

  • Deterministic replay — identical inputs produce identical outputs across runs on the same platform.
  • Zero-GC memory management — arena allocation with predictable lifetimes, no garbage collection pauses.
  • ML-native observation extraction — pre-compiled observation plans that produce fixed-shape tensors directly, not intermediate representations.
  • Two runtime modes from one codebase — synchronous lockstep for training, asynchronous real-time for live interaction.

Three principles guide every subsystem:

  1. Egress Always Returns — observation extraction always returns, even during tick failures or shutdown. Responses may indicate staleness or degraded coverage via metadata, but the caller always receives data.
  2. Tick-Expressible Time — the engine expresses all internal time references that affect state transitions in tick counts, never wall clocks. This prevents replay divergence.
  3. Asymmetric Mode Dampening — the engine handles staleness and overload differently in each runtime mode, because Lockstep and RealtimeAsync have fundamentally different dynamics.

Crate Structure

murk/
├── murk              Top-level facade (add this one dependency)
├── murk-core          Leaf crate: IDs, field defs, commands, core traits
├── murk-arena         Arena-based generational allocation
├── murk-space         Spatial backends and region planning
├── murk-propagator    Propagator trait, pipeline validation, StepContext
├── murk-propagators   Reference propagators (diffusion, movement, reward)
├── murk-obs           Observation spec, compilation, tensor extraction
├── murk-engine        Simulation engine: LockstepWorld, RealtimeAsyncWorld
├── murk-replay        Deterministic replay recording and verification
├── murk-ffi           C ABI bindings with handle tables
├── murk-python        Python/PyO3 bindings with Gymnasium adapters
├── murk-bench         Benchmark profiles and utilities
└── murk-test-utils    Shared test fixtures

Dependency flow (read left to right, from dependency to dependent):

murk-core ──┬── murk-arena ──┬── murk-engine ──┬── murk-ffi
            ├── murk-space ──┤                 └── murk-python
            ├── murk-propagator ─┤
            └── murk-obs ────────┘
                murk-replay ─────┘

Safety boundary: only murk-arena and murk-ffi are permitted unsafe code. Every other crate uses #![forbid(unsafe_code)].


Three-Interface Model

All interaction with a Murk world flows through three interfaces:

[Producers]                           [Consumers]
    |                                      ^
    v                                      |
 Ingress ──(bounded queue)──> TickEngine ──(publish)──> Egress
                                  \                      |
                                   └──(ring buffer)──────┘
  • Ingress accepts commands (intents to change world state). It implements backpressure via a bounded queue, TTL-based expiry, and deterministic drop policies.
  • TickEngine is the sole authoritative mutator. It drains the ingress queue, executes the propagator pipeline, and publishes an immutable snapshot at each tick boundary.
  • Egress reads published snapshots to produce observations. It never mutates world state. In RealtimeAsync mode, egress workers run on a thread pool for concurrent observation extraction.

This separation enforces the key invariant: only TickEngine holds &mut WorldState. Everything else operates on immutable snapshots.


Arena-Based Generational Allocation

This is Murk’s most load-bearing design decision. It replaces traditional copy-on-write with a generational arena scheme:

  1. Each field is stored as a contiguous [f32] allocation in a generational arena.
  2. At tick start, propagators write to fresh allocations in the new generation — no copies required.
  3. Unmodified fields share their allocation across generations (zero-cost structural sharing).
  4. Snapshot publication swaps a ~1KB descriptor of field handles. Cost: <2us.
  5. Old generations remain readable until all snapshot references are released.

| Property | Traditional CoW | Arena-Generational |
|---|---|---|
| Copy cost | Fault-driven, unpredictable | Zero (allocate fresh) |
| Snapshot publish | Clone or fork | Descriptor swap, <2us |
| Rollback | Undo log or checkpoint | Free (abandon generation) |
| Memory predictability | Fault-driven | Bump allocation |

Rust type-level enforcement

  • ReadArena (published snapshots): Send + Sync, safe for concurrent reads.
  • WriteArena (staging, exclusive to TickEngine): &mut access, no aliasing possible.
  • Snapshot descriptors contain FieldHandle values (generation-scoped integers), not raw pointers. ReadArena::resolve(handle) provides &[f32] access.
  • Field access requires &FieldArena — the borrow checker enforces arena liveness.

Lockstep arena recycling

In Lockstep mode, two arena buffers alternate roles each tick (ping-pong). The caller’s &mut self borrow on step_sync() guarantees no outstanding snapshot borrows. Memory usage is bounded at 2x the per-generation field footprint regardless of episode length.

RealtimeAsync reclamation

In RealtimeAsync mode, epoch-based reclamation manages arena lifetimes. Each egress worker pins an epoch while reading a snapshot. The TickEngine reclaims old generations only when no worker holds a reference. Stalled workers are detected and torn down to prevent unbounded memory growth.


Runtime Modes

Murk provides two runtime modes from the same codebase. There is no runtime mode-switching — you choose at construction time.

LockstepWorld

A callable struct with &mut self methods. The caller’s thread executes the full pipeline: command processing, propagators, snapshot publication, and observation extraction.

let mut world = LockstepWorld::new(config)?;
let result = world.step_sync(commands)?;
let heat = result.snapshot.read(FieldId(0)).unwrap();

  • Synchronous, deterministic, throughput-maximised.
  • The borrow checker enforces that snapshots are released before the next step.
  • No background threads, no synchronisation overhead.
  • Primary use case: RL training loops, deterministic replay.

RealtimeAsyncWorld

An autonomous tick thread running at a configurable rate (e.g., 60 Hz).

let async_config = AsyncConfig::default();
let mut world = RealtimeAsyncWorld::new(config, async_config)?;
world.submit_commands(commands)?;
let snapshot = world.latest_snapshot();
let report = world.shutdown();

  • Non-blocking command submission and observation extraction.
  • Egress thread pool for concurrent ObsPlan execution.
  • Epoch-based memory reclamation.
  • Primary use case: live games, interactive tools, dashboards.

BatchedEngine

BatchedEngine owns a Vec<LockstepWorld> and an optional ObsPlan. Its hot path, step_and_observe(), steps all worlds sequentially then calls ObsPlan::execute_batch() to fill a contiguous output buffer across all worlds.

Error model: BatchError annotates failures with the world index:

  • Step { world_index, error } — a world’s step_sync() failed
  • Observe(ObsError) — observation extraction failed
  • Config(ConfigError) — world creation or reset failed
  • InvalidIndex { world_index, num_worlds } — index out of bounds
  • NoObsPlan — observation requested without ObsSpec
  • InvalidArgument { reason } — argument validation failed

FFI layer: BATCHED: Mutex<HandleTable<BatchedEngine>> stores engine instances. Nine extern "C" functions expose create, step, observe, reset, destroy, and dimension queries.

PyO3 layer: BatchedWorld caches dimensions at construction time, validates buffer shapes eagerly, and releases the GIL via py.detach() on all hot paths. The Ungil boundary requires casting raw pointers to usize before entering the detached closure.


Threading Model

Lockstep

No dedicated threads. The caller’s thread runs the full tick pipeline. Thread count equals the number of vectorised environments (typically 16-128 for RL training).

RealtimeAsync

| Thread(s) | Role | Owns |
|---|---|---|
| TickEngine (1) | Tick loop: drain ingress, run propagators, publish | &mut WorldState, WriteArena |
| Egress pool (N) | Execute ObsPlans against snapshots | &ReadArena (shared) |
| Ingress acceptor (0-M) | Accept commands, assign arrival_seq | Write end of bounded queue |

Snapshot lifetime is managed by epoch-based reclamation, not reference counting. This avoids cache-line ping-pong from atomic refcount updates under high observation throughput.


Spatial Model

Spaces define how many cells exist and which cells are neighbours. All spaces implement the Space trait, which provides:

  • cell_count() — total cells
  • neighbours(cell) — ordered neighbour list
  • distance(a, b) — scalar distance metric
  • Region planning for observation extraction

Built-in backends

| Space | Dims | Neighbours | Edge handling |
|---|---|---|---|
| Line1D | 1D | 2 | Absorb, Wrap |
| Ring1D | 1D | 2 (periodic) | Always wraps |
| Square4 | 2D | 4 (N/S/E/W) | Absorb, Wrap |
| Square8 | 2D | 8 (+ diagonals) | Absorb, Wrap |
| Hex2D | 2D | 6 | Absorb, Wrap |
| FCC12 | 3D | 12 (face-centred cubic) | Absorb, Wrap |

ProductSpace

Spaces can be composed via ProductSpace to create higher-dimensional topologies. For example, Hex2D x Line1D creates a layered hex map where each layer is a hex grid and vertical neighbours are connected via the Line1D component.

let space = ProductSpace::new(vec![
    Box::new(Hex2D::new(8, EdgeBehavior::Wrap)?),
    Box::new(Line1D::new(3, EdgeBehavior::Absorb)?),
]);

Coordinates are concatenated across components. Neighbours vary one component at a time (no diagonal cross-component adjacency).


Field Model

The field model defines how per-cell simulation data is typed, allocated, and bounded.

Fields are per-cell data stored in arenas. Each field has:

  • Type: Scalar (1 float), Vector(n) (n floats), or Categorical(n) (n classes).
  • Mutability class: controls arena allocation strategy.
  • Boundary behaviour: Clamp, Reflect, Absorb, or Wrap.
  • Optional units and bounds metadata.

Mutability classes

| Class | Arena behaviour | Use case |
|---|---|---|
| Static | Allocated once in generation 0, shared across all snapshots | Terrain, obstacles |
| PerTick | Fresh allocation each tick | Temperature, velocity |
| Sparse | New allocation only when modified | Rare events, flags |

For vectorised RL (128 envs x 2MB mutable + 8MB shared static): 264MB total vs 1.28GB without Static field sharing.
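
The arithmetic behind those figures can be checked with a short sketch. total_mb is a hypothetical helper; it only restates the calculation and does not reflect how the arena accounts for memory internally.

```rust
/// Total field memory in MB for `envs` environments.
/// With sharing, the static footprint is counted once; without, once per environment.
pub fn total_mb(envs: u64, mutable_mb: u64, static_mb: u64, share_static: bool) -> u64 {
    if share_static {
        envs * mutable_mb + static_mb
    } else {
        envs * (mutable_mb + static_mb)
    }
}
```

For 128 environments with 2 MB mutable and 8 MB static data, sharing gives 128 × 2 + 8 = 264 MB, while per-environment copies give 128 × (2 + 8) = 1280 MB, or 1.28 GB.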


Propagator Pipeline

Propagators are stateless operators that update fields each tick. They implement the Propagator trait:

pub trait Propagator: Send + Sync {
    fn name(&self) -> &str;
    fn reads(&self) -> FieldSet;          // current-tick values (Euler)
    fn reads_previous(&self) -> FieldSet; // frozen tick-start values (Jacobi)
    fn writes(&self) -> Vec<(FieldId, WriteMode)>;
    fn max_dt(&self, space: &dyn Space) -> Option<f64>; // topology-aware CFL constraint
    fn step(&self, ctx: &mut StepContext<'_>) -> Result<(), PropagatorError>;
}

Key properties:

  • &self signature — propagators are stateless. All mutable state flows through StepContext.
  • Split-borrow reads: reads() sees current in-tick values (Euler style), reads_previous() sees frozen tick-start values (Jacobi style). This supports both integration approaches.
  • Write-conflict detection — the pipeline validates at startup that no two propagators write the same field in conflicting modes.
  • CFL validation — if a propagator declares max_dt(space), the engine checks dt <= max_dt at configuration time for the configured topology.
  • Deterministic execution order — propagators run in the order they are registered. The pipeline is a strict ordered list.
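
The CFL check described above can be sketched as a configuration-time validation. This assumes the engine compares dt against the tightest limit declared by any propagator; tightest_max_dt and validate_dt are hypothetical names, and the String error stands in for the real error type.

```rust
/// Smallest declared limit, or `None` if no propagator constrains dt.
pub fn tightest_max_dt(limits: &[Option<f64>]) -> Option<f64> {
    limits
        .iter()
        .flatten()
        .copied()
        .fold(None, |tightest: Option<f64>, limit| {
            Some(tightest.map_or(limit, |t| t.min(limit)))
        })
}

/// Reject a configuration whose dt exceeds the tightest declared limit.
pub fn validate_dt(dt: f64, limits: &[Option<f64>]) -> Result<(), String> {
    match tightest_max_dt(limits) {
        Some(max_dt) if dt > max_dt => {
            Err(format!("dt {} exceeds max_dt {}", dt, max_dt))
        }
        _ => Ok(()),
    }
}
```

Each entry in limits would be the max_dt(space) result of one propagator for the configured topology; propagators returning None impose no constraint.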

Observation Pipeline

The observation pipeline transforms world state into fixed-shape tensors for RL frameworks:

ObsSpec ──(compile)──> ObsPlan ──(execute against snapshot)──> f32 tensor
  1. ObsSpec declares what to observe: which fields, which spatial region, what transforms (normalisation, pooling, foveation).
  2. ObsPlan is a compiled, bound, executable plan. It pre-resolves field offsets, region iterators, index mappings, and pooling kernels. Compilation is done once; execution is the hot path.
  3. Execution fills a caller-allocated buffer with f32 values and a validity mask for non-rectangular domains (e.g., hex grids).
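
The execution step can be pictured as a precomputed gather plus a validity mask. The sketch below is a simplification under assumptions: this ObsPlan is a hypothetical single-field stand-in, not the murk-obs type, and the u8 mask representation is chosen for illustration.

```rust
/// Hypothetical compiled plan: for each output slot, the source cell index,
/// or `None` where the tensor position has no backing cell (padding).
pub struct ObsPlan {
    pub gather: Vec<Option<usize>>,
}

impl ObsPlan {
    /// Fill caller-allocated `out` and `mask` from one field of a snapshot.
    /// Slots without a valid source cell are zeroed and marked invalid.
    pub fn execute(&self, field: &[f32], out: &mut [f32], mask: &mut [u8]) -> Result<(), String> {
        if out.len() != self.gather.len() || mask.len() != self.gather.len() {
            return Err("buffer length does not match plan".to_string());
        }
        for (slot, source) in self.gather.iter().copied().enumerate() {
            match source.and_then(|index| field.get(index)) {
                Some(&value) => {
                    out[slot] = value;
                    mask[slot] = 1;
                }
                None => {
                    out[slot] = 0.0;
                    mask[slot] = 0;
                }
            }
        }
        Ok(())
    }
}
```

All index resolution happens when the plan is built, so the hot path is a single pass with no allocation.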

ObsPlans are bound to a world configuration generation. If the world configuration changes (fields added, space resized), plans are invalidated and must be recompiled.


Command Model

Commands are the way external actions enter the simulation. Each command carries:

  • Payload: SetField, Move, Spawn, Despawn, SetParameter, SetParameterBatch, or Custom.
  • TTL: expires_after_tick — tick-based expiry (never wall clock).
  • Priority class: determines application order within a tick.
  • Ordering provenance: source_id, source_seq, and engine-assigned arrival_seq for deterministic ordering.

The TickEngine drains and applies commands in deterministic order:

  1. Resolve apply_tick_id for each command.
  2. Group by tick.
  3. Sort within tick by priority class, then source_id, then source_seq, then arrival_seq.

Every command produces a Receipt reporting whether it was accepted, which tick it was applied at, and a reason code if rejected.


Error Handling and Recovery

Tick atomicity

Tick execution is all-or-nothing. If any propagator fails, all staging writes are abandoned (free with the arena model — just drop the staging generation). The world state remains exactly as it was before the tick.
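
The all-or-nothing behaviour can be sketched with a staging buffer that is only committed on success. This is a toy model: the Vec clone stands in for allocating a fresh staging generation, and Propagator, step_atomic, add_one, and always_fail are hypothetical names.

```rust
/// Toy propagator signature: mutate the staging buffer or fail.
pub type Propagator = fn(&mut [f32]) -> Result<(), String>;

/// Run the pipeline against a staging copy; commit only if every step succeeds.
/// On error the staging buffer is dropped and `state` is left untouched.
pub fn step_atomic(state: &mut Vec<f32>, pipeline: &[Propagator]) -> Result<(), String> {
    let mut staging = state.clone(); // stand-in for a fresh staging generation
    for propagate in pipeline {
        propagate(&mut staging)?;
    }
    *state = staging;
    Ok(())
}

/// Example propagator: increment every cell.
pub fn add_one(field: &mut [f32]) -> Result<(), String> {
    for value in field.iter_mut() {
        *value += 1.0;
    }
    Ok(())
}

/// Example propagator that always fails, to exercise rollback.
pub fn always_fail(_field: &mut [f32]) -> Result<(), String> {
    Err("propagator failed".to_string())
}
```

Running add_one followed by always_fail leaves the state exactly as it was, even though add_one had already written to the staging buffer.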

Recovery behaviour

  • Lockstep: step_sync() returns Err(StepError). The caller decides how to recover (typically reset()).
  • RealtimeAsync: after 3 consecutive rollbacks, the TickEngine disables ticking and rejects further commands. Egress continues serving the last good snapshot (Egress Always Returns). Recovery via reset().

See error-reference.md for the complete error type catalogue.


Determinism

Murk targets Tier B determinism: identical results within the same build, ISA, and toolchain, given the same initial state, seed, and command log.

Determinism Contract

Determinism holds when all of these match between runs:

  • Build profile (debug/release) and optimization level
  • Compiler version (rustc, PyO3/maturin)
  • CPU ISA family (e.g., x86-64, aarch64)
  • Cargo feature flags and dependency versions

Determinism is not promised across:

  • Different ISAs (x86-64 vs aarch64)
  • Different libm implementations (glibc vs musl vs macOS)
  • Builds with fast-math or non-default RUSTFLAGS
  • Different Murk versions (even patch releases may change propagator numerics)

Authoritative vs Non-Authoritative Paths

The authoritative path must be deterministic — any change here requires determinism test verification:

  • TickEngine: propagator execution, command application, generation staging
  • Propagator step() implementations and pipeline ordering
  • IngressQueue: command sorting and expiry
  • Snapshot publish (generation swap)
  • Arena allocation and recycling patterns

The non-authoritative path may vary between runs and must never affect world state:

  • Rendering, logging, and metrics collection
  • Wall-clock pacing and backpressure in RealtimeAsync mode
  • Egress worker scheduling and observation extraction timing
  • StepMetrics timing measurements
  • CLI tooling and debug output

Contributors: if your change touches the authoritative path, run the full determinism test suite (cargo test --test determinism) and verify snapshot hashes are unchanged.

Key Mechanisms

  • No HashMap/HashSet — banned project-wide via clippy. All code uses IndexMap/BTreeMap for deterministic iteration.
  • No fast-math — floating-point reassociation is prohibited in authoritative code paths.
  • Tick-based time — all state-affecting time references use tick counts, not wall clocks.
  • Deterministic command ordering — commands are sorted by priority class and source ordering, not arrival time.
  • Replay support — binary replay format records initial state, seed, and command log with per-tick snapshot hashes for divergence detection.

Known Footguns

Floating-point transcendentals: Even without fast-math, sin, cos, exp, and log can vary across platforms and libm implementations. Propagators using transcendentals in authoritative updates remain Tier B (same ISA/toolchain), but this is the most likely source of cross-platform divergence. If tighter guarantees are needed in future, a murk_math shim can provide consistent implementations.

Parallelism introduction: The current architecture is safe because propagators execute sequentially and the batched engine steps worlds in order. When parallel propagators or Rayon-based batched stepping are introduced, determinism tests must become thread-count invariant: run with 1, 2, and 8 threads, permute world ordering, and require identical snapshot hashes.

See determinism-catalogue.md for the full catalogue of non-determinism sources and mitigations.


Language Bindings

C FFI (murk-ffi)

Stable, handle-based C ABI:

  • Opaque handles (MurkWorld, MurkSnapshot, MurkObsPlan) with slot+generation for safe double-destroy.
  • Caller-allocated buffers for tensor output (no allocation on the hot path).
  • Versioned API with explicit error codes (current ABI: v3.0).
  • Panic-safe FFI boundary: all extern "C" entry points are guarded; panics return MurkStatus::Panicked (-128) instead of unwinding.
  • Panic diagnostics are retrievable via murk_last_panic_message.
  • MurkStepMetrics includes sparse observability counters: retired ranges, pending retired ranges, reuse hits, and reuse misses.
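
How slot+generation handles make double-destroy safe can be shown with a minimal table. This is a sketch of the general technique, not the murk-ffi HandleTable; the types and method names are assumptions.

```rust
/// Opaque handle: a slot index plus the generation it was issued for.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Handle {
    slot: u32,
    generation: u32,
}

/// Minimal generational handle table.
pub struct HandleTable<T> {
    entries: Vec<(u32, Option<T>)>, // (current generation, value)
    free: Vec<u32>,
}

impl<T> HandleTable<T> {
    pub fn new() -> Self {
        Self { entries: Vec::new(), free: Vec::new() }
    }

    /// Store a value, reusing a free slot when one exists.
    pub fn insert(&mut self, value: T) -> Handle {
        if let Some(slot) = self.free.pop() {
            let entry = &mut self.entries[slot as usize];
            entry.1 = Some(value);
            Handle { slot, generation: entry.0 }
        } else {
            self.entries.push((0, Some(value)));
            Handle { slot: (self.entries.len() - 1) as u32, generation: 0 }
        }
    }

    /// Look up a live value; stale or unknown handles yield `None`.
    pub fn get(&self, handle: Handle) -> Option<&T> {
        match self.entries.get(handle.slot as usize) {
            Some((generation, Some(value))) if *generation == handle.generation => Some(value),
            _ => None,
        }
    }

    /// Destroy the value. A second call with the same handle is a harmless no-op.
    pub fn remove(&mut self, handle: Handle) -> Option<T> {
        let entry = self.entries.get_mut(handle.slot as usize)?;
        if entry.0 != handle.generation || entry.1.is_none() {
            return None;
        }
        entry.0 = entry.0.wrapping_add(1); // invalidate outstanding handles
        self.free.push(handle.slot);
        entry.1.take()
    }
}
```

Bumping the generation on removal means an old handle can never alias a newer object that happens to reuse the same slot.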

Python (murk-python)

PyO3/maturin native extension:

  • MurkEnv — single-environment Gymnasium Env adapter.
  • MurkVecEnv — vectorised environment adapter for parallel RL training.
  • BatchedWorld — batched PyO3 wrapper: steps N worlds and extracts observations in a single py.detach() call. Pointer addresses are cast to usize for the Ungil closure boundary.
  • BatchedVecEnv — pure-Python SB3-compatible vectorized environment with pre-allocated NumPy buffers, auto-reset, and override hooks for reward/termination logic.
  • Direct NumPy array filling via the C FFI path.
  • Python-defined propagators for prototyping.
  • FFI panic status (-128) maps to Python RuntimeError with the captured panic message.
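
The status-to-exception mapping in the last bullet can be sketched as follows. This is an illustration only: `check_status` and the `last_panic_message` callable are hypothetical stand-ins, the success value of 0 is an assumption, and only the -128 panic code comes from the text above.

```python
from typing import Callable

MURK_STATUS_OK = 0            # assumed success value, not confirmed by the text
MURK_STATUS_PANICKED = -128   # panic status stated in the text


def check_status(status: int, last_panic_message: Callable[[], str]) -> None:
    """Raise RuntimeError for any non-success status code."""
    if status == MURK_STATUS_OK:
        return
    if status == MURK_STATUS_PANICKED:
        # Fetch the diagnostic only on the panic path.
        raise RuntimeError(f"murk panicked: {last_panic_message()}")
    raise RuntimeError(f"murk call failed with status {status}")


try:
    check_status(MURK_STATUS_PANICKED, lambda: "example panic text")
except RuntimeError as err:
    print(err)
```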

Troubleshooting

For normal usage, prefer the published wheel:

python -m pip install murk

The maturin develop guidance below is for contributors working on Murk internals from a source checkout.

Build Issues

Contributor setup: maturin develop fails with “pyo3 not found”

Ensure you have a compatible Python version (3.12+) and that your virtual environment is activated:

python3 -m venv .venv
source .venv/bin/activate
pip install maturin
cd crates/murk-python
maturin develop --release

cargo build fails with MSRV error

Murk requires Rust 1.87 or later. Update with:

rustup update stable

Miri fails to run

Miri requires the nightly toolchain with the miri component:

rustup toolchain install nightly --component miri
cargo +nightly miri test -p murk-arena

Runtime Issues

Python import error: “No module named murk._murk”

If you installed from PyPI, reinstall in the active environment:

python -m pip install --upgrade --force-reinstall murk

If you are working from a source checkout, build the extension in your active virtual environment:

cd crates/murk-python
maturin develop --release

Determinism test failures

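Determinism tests are sensitive to floating-point operation ordering, which can change with the platform and compiler version. Make sure you are running on the same platform and Rust version as CI; Murk targets Tier B determinism (same build + ISA + toolchain). See Determinism Guarantees for details.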
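
As a quick reminder of why ordering matters, floating-point addition is not associative. The snippet below is plain Python and independent of Murk; it shows the same three values summed in two orders.

```python
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

# The two sums differ in the last bit, so a bitwise comparison fails.
print(left_to_right == right_to_left)  # False
print(left_to_right.hex(), right_to_left.hex())
```

Any change that reorders a reduction (a different compiler, vector width, or thread schedule) can produce this kind of last-bit difference.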

Import Issues

Contributor setup: maturin develop succeeds but import murk fails

The native extension was built, but Python cannot find it. Check the following:

  1. Virtual environment not activated. The extension is installed into the virtualenv that was active during maturin develop. Make sure you activate the same environment before importing:

    source .venv/bin/activate
    python -c "import murk; print(murk.__version__)"
    
  2. Python version mismatch. If you have multiple Python versions, maturin develop may have built against a different one. Verify:

    python --version          # should match the version used during build
    maturin develop --release # rebuild if in doubt
    
  3. Missing numpy. Murk requires numpy >= 1.24. If numpy is not installed, the extension may fail to load:

    pip install "numpy>=1.24"
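
The three checks above can be combined into one small diagnostic script. This is a sketch using only the standard library; `describe_environment` is a hypothetical helper, and it reports only whether the top-level packages resolve, not whether the native extension inside murk loads.

```python
import importlib.util
import sys


def describe_environment(modules=("murk", "numpy")) -> dict:
    """Report which interpreter is running and whether each module resolves."""
    report = {
        "executable": sys.executable,
        "version": "%d.%d.%d" % sys.version_info[:3],
        "in_virtualenv": sys.prefix != sys.base_prefix,
    }
    for name in modules:
        spec = importlib.util.find_spec(name)
        # None means this interpreter cannot find the package at all.
        report[name] = spec.origin if spec is not None else None
    return report


for key, value in describe_environment().items():
    print(f"{key}: {value}")
```

Run it with the same `python` you use for training. If `executable` points outside your virtualenv or `murk` prints `None`, the extension was installed into a different environment.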
    

Runtime Performance Issues

Simulation unexpectedly slow

If step throughput is much lower than expected, check these common causes:

  1. Debug mode. Ensure you built with --release. Debug builds are 10-50x slower:

    maturin develop --release    # Python
    cargo run --release           # Rust
    
  2. Propagator complexity. A Python propagator that does heavy per-cell work will bottleneck the tick. Profile with cProfile or py-spy to confirm.

  3. Observation extraction frequency. Each observe() call copies data out of the arena. If you call observe() more often than you consume observations, reduce the call frequency.

  4. Batched vs. per-world stepping. For RL training with many environments, BatchedVecEnv steps all worlds in a single Rust call with one GIL release, whereas MurkVecEnv releases the GIL once per world (N times per step). Switching to BatchedVecEnv can yield significant speedups as the number of environments grows.
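
To confirm whether a Python propagator is the bottleneck (item 2 above), a cProfile run along these lines can help. The propagator here is a hypothetical stand-in with deliberately heavy per-cell work; substitute the loop that drives your own simulation.

```python
import cProfile
import io
import pstats


def slow_propagator(cells):
    """Hypothetical per-cell Python propagator body (stand-in for real work)."""
    return [sum(j * j for j in range(20)) + c for c in cells]


def run_ticks(ticks=10, n_cells=500):
    cells = [0.0] * n_cells
    for _ in range(ticks):
        cells = slow_propagator(cells)
    return cells


profiler = cProfile.Profile()
profiler.enable()
run_ticks()
profiler.disable()

# Sort by cumulative time and keep the five most expensive entries.
buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

If the propagator function dominates the cumulative column, the per-cell Python work is the likely limit on tick throughput.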

CI Failures

Tests pass locally but fail in CI

  1. Miri nightly version. Miri is pinned to a specific nightly toolchain. If CI uses a different nightly than your local machine, Miri behaviour may differ. Check the CI configuration for the expected nightly date:

    rustup toolchain install nightly-2025-12-01 --component miri
    cargo +nightly-2025-12-01 miri test -p murk-arena
    
  2. Platform differences. Floating-point results can vary across operating systems and CPU architectures. Murk targets Tier B determinism (same build + ISA + toolchain), so cross-platform mismatches are expected for bitwise comparisons.

  3. Test isolation. Some tests create temporary files or rely on ordering. If tests run in parallel and share state, they may fail non-deterministically. Run cargo test -- --test-threads=1; if the failures disappear when tests run serially, shared state is the likely cause.

Simulation Behavior Issues

Results don’t match expectations

If the simulation produces unexpected output, check these common causes:

  1. Determinism catalogue. Review the Determinism Guarantees page. Some sources of non-determinism (hash ordering, threading, fast-math) are documented with mitigations.

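  2. Propagator read/write declarations. A reads (Euler) declaration sees values already written by earlier propagators in the same tick, while reads_previous (Jacobi) sees the values as they stood at the start of the tick. If a propagator declares reads where it should declare reads_previous, it will see partially updated values. Double-check the read mode for each field.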

  3. Timestep too large (CFL violation). If dt exceeds the CFL stability limit for your propagator, diffusion can blow up or oscillate. The engine checks topology-aware max_dt(space) at startup, but only if the propagator declares it. Reduce dt or add a max_dt declaration.
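
The effect of a CFL violation (item 3 above) can be reproduced with a toy explicit diffusion step on a 1D ring. This sketch does not use Murk, and `max_stable_dt` is the textbook bound for this particular stencil (dt <= dx^2 / (2 D)), not Murk's max_dt(space); both function names are hypothetical.

```python
def max_stable_dt(diffusivity: float, dx: float = 1.0) -> float:
    """Textbook stability bound for explicit 1D diffusion: dx^2 / (2 * D)."""
    return dx * dx / (2.0 * diffusivity)


def diffuse(dt: float, diffusivity: float = 1.0, cells: int = 16,
            ticks: int = 200) -> float:
    """Run explicit diffusion on a ring and return the largest |value|."""
    u = [0.0] * cells
    u[0] = 1.0  # single hot cell
    r = diffusivity * dt  # dx = 1, so r = D * dt / dx^2
    for _ in range(ticks):
        prev = u  # read only previous-tick values
        u = [
            prev[i] + r * (prev[i - 1] - 2.0 * prev[i] + prev[(i + 1) % cells])
            for i in range(cells)
        ]
    return max(abs(x) for x in u)


limit = max_stable_dt(diffusivity=1.0)
stable_peak = diffuse(dt=0.8 * limit)    # below the bound: heat spreads out
unstable_peak = diffuse(dt=1.2 * limit)  # above the bound: values blow up

assert stable_peak <= 1.0
assert unstable_peak > 1e6
```

Below the bound every update is a weighted average of neighbouring values, so the peak can only shrink; above it the highest-frequency mode is amplified on every tick and grows without limit.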