Most gateways sit in front of the models and treat inference as a black box. This one is built from the inference engine out: one routing contract turns signals into projections, projections into decisions, and decisions into the model — across a mesh of local, private and frontier engines — while the prefix cache is protected, context is selected rather than pasted, and every turn takes the least-cost path that still meets the need. Session-aware across long-running agents, sandboxed for tool safety, and shadow-tested before any policy goes live. OpenAI- and Anthropic-compatible, multimodal, governed like production, on hardware you own.
Intent, complexity, modality and risk become projections and policy bands, then route across a local-to-frontier mesh — reasoning only when it pays.
Stateful guards keep multi-turn agents coherent: hard locks block unsafe model switches mid-tool-loop, weighing quality gap, prefix locality and turn priors.
Tools and code run in policy-governed MicroVM sandboxes — no unauthorized file, credential or network access.
Score intent, risk, modality, context.
Normalise into policy bands.
Pick model, agent or tool.
Sandboxed, observed, metered.
A control plane governs policy, identity and guardrails; a data plane serves fast, observable, cost-aware inference; and a self-improving router between them turns every request into a better next decision — protecting the prefix cache and selecting context so the work stays cheap as it grows.
Signals become projections, projections drive decisions, decisions choose the model — the same pipeline whether configured in YAML, the console, the CLI or Kubernetes.
Token- and capability-aware routing spans self-hosted engines, local SLMs and frontier APIs with semantic caching; classifiers run on any accelerator — one control plane, any backend.
History-aware PII, jailbreak and prompt-injection scanning across every turn — behind an OpenAI- and Anthropic-compatible ingress with explicit, lossless translation.
Stable prompt epochs, deterministic tool-schema ordering and bounded, append-only context keep reusable prefixes intact — so cached tokens are reused across a long session at a fraction of the price instead of re-billed every turn.
Every routing policy is versioned and shadow-tested on replayed traffic before activation, with one-click rollback — routing never drifts silently.
Text, voice, image and event inputs are normalised, routed to the right modality model, and turned into grounded responses or tool actions.
Graph-shaped code evidence, bounded tool output and domain-aware compression extract the signal a turn actually needs and drop the rest — fewer prompt and tool-output tokens, without losing continuity across a long task.
A console traces every signal → projection → decision with replay, and a live ledger shows cache reuse, context savings and per-route latency, tokens and cost — spend is accountable while the task runs, not after.
The architecture and the economics behind this platform — read in the browser or export to PDF.
For infrastructure leaders: turning unpredictable, metered AI opex into fixed, predictable cost for the industrial edge.
The frozen-base doctrine: adapting custom models on the edge through context and self-verification, not weight updates.
Turnkey Edge-AI — fixed time, fixed cost, full responsibility.