SynapCores v1.8 — One binary. No daemon. Faster than CPU Ollama.

Published on June 8, 2026

If you've been running SynapCores with an Ollama sidecar — two processes, two ports, two config files, two things to babysit — v1.8 changes the math. The engine now pulls, loads, and runs GGUF models in-process. One binary. One config. One thing to install.

And on an apples-to-apples CPU comparison, it beats Ollama by a wide margin.

Install in 30 seconds

# 1. Download the binary
curl -fsSL https://get.synapcores.com | sh

# 2. Start the gateway — first boot auto-pulls the default model
synapcores --config gateway.toml

That's it. No Ollama install step. No two-process setup. The first time you start it, the gateway warms up qwen2.5-coder:7b (the default tool-capable chat model) and all-minilm:latest (for embeddings) in parallel — you'll see live download progress in the console.

The moment those finish, SELECT GENERATE('hello'), SELECT EMBED('hello'), and SELECT AGENT_RUN('aidb-assistant', '…') all work against the in-process local provider. No API keys. No external daemon.

What's actually new

1. In-process model registry — synapcores pull <model:tag>

We talk to registry.ollama.ai the same way Ollama does: OCI Distribution v2 protocol, multi-layer manifests, sha256-verified blob downloads, resume on interruption, idempotent re-pulls. Any GGUF model in Ollama's registry is one command away:

synapcores pull qwen2.5-coder:7b
synapcores pull llama3.2:3b
synapcores pull mistral:7b
synapcores pull all-minilm:latest

synapcores models list
# NAME                         ARCH    SIZE      PULLED
# library/qwen2.5-coder:7b     qwen2   4.4 GiB   12 minutes ago
# library/all-minilm:latest    bert    45.9 MiB  12 minutes ago

synapcores models delete qwen2.5-coder:0.5b
# deleted library/qwen2.5-coder:0.5b (reclaimed 379.4 MiB)

The model store is content-addressed (${data_dir}/models/blobs/<sha256>). Pulling the same model twice short-circuits at the manifest check. Two models sharing layers (e.g. base + LoRA variants) de-duplicate automatically.
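To make the de-duplication concrete, here is a minimal Python sketch of a content-addressed blob store. It is an illustration only, not the engine's implementation: the `BlobStore` class, its methods, and the two toy "models" are hypothetical, and the only detail taken from the text above is that blobs are stored under their sha256 digest and verified before being kept.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Optional, Tuple


class BlobStore:
    """Toy content-addressed store: each blob is saved under its own sha256."""

    def __init__(self, root: Path) -> None:
        self.blobs = root / "blobs"
        self.blobs.mkdir(parents=True, exist_ok=True)

    def put(self, data: bytes, expected_digest: Optional[str] = None) -> str:
        digest = hashlib.sha256(data).hexdigest()
        # Verify the download against the digest the manifest promised.
        if expected_digest is not None and digest != expected_digest:
            raise ValueError("sha256 mismatch, refusing to store blob")
        path = self.blobs / digest
        if not path.exists():  # identical content is written only once
            path.write_bytes(data)
        return digest

    def count(self) -> int:
        return sum(1 for _ in self.blobs.iterdir())


def demo() -> Tuple[int, bool]:
    with tempfile.TemporaryDirectory() as tmp:
        store = BlobStore(Path(tmp))
        base = b"shared base weights"
        # Two hypothetical models that both reference the same base layer.
        model_a = [store.put(base), store.put(b"adapter A")]
        model_b = [store.put(base), store.put(b"adapter B")]
        return store.count(), model_a[0] == model_b[0]
```

Because the file name is the digest, the second `put` of the base layer finds the blob already on disk and skips the write, so four layer references end up as three stored blobs.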

2. New SQL surface for model lifecycle

You can manage models from SQL itself — useful for recipes, automation, and the in-DB agentic runtime that wants to provision its own resources:

SELECT PULL_MODEL('qwen2.5-coder:7b');
-- "installed library/qwen2.5-coder:7b (architecture=qwen2, size=4.4 GiB, ...)"

SELECT * FROM LIST_MODELS();
-- columns: name, architecture, size_bytes, digest, pulled_at, last_used_at

SELECT DELETE_MODEL('library/qwen2.5-coder:0.5b');
-- "removed digest=sha256:..."

All three are documented in the engine's MCP sql_manual tool (so MCP clients like Claude Code, Cursor, and OpenClaw can discover them at runtime) and the public SQL reference on synapcores.com.

3. Beats apples-to-apples CPU Ollama

This one is honest and load-bearing. We benchmarked carefully:

| Metric | Ollama (CPU only, same box) | SynapCores v1.8 (CPU) |
|---|---|---|
| 20-recipe agent cert wall-clock | 65:08 (did not complete) | 39:43 (20/20 PASS) |
| Recipe pass count | 0/20 (every call timed out at Ollama's internal 2-min limit) | 20/20 |
| First EMBED call | <100 ms | 57 ms |
| First GENERATE smoke call | n/a (couldn't complete) | 5.2 s warm |

The Ollama "baseline" floating around in older docs (38 min cert) was almost certainly GPU-accelerated. When you constrain Ollama to CPU on the same hardware we benchmark on (i5-10400F, no AVX-512, no GPU offload), our in-process runtime is the only one that actually completes the cert.

Why? A few real architectural wins, layered:

  • Persistent LlamaContext sessions per chat persona. The KV cache survives across AGENT_RUN calls, so prefill for the already-cached prompt prefix is skipped on 91% of calls.
  • OpenMP thread runaway, fixed. libgomp on a 12-logical / 6-physical CPU spawns 12-thread teams by default. Two threads per physical core fight for the same AVX2 unit, killing throughput. We cap it at physical cores, disable GGML_OPENMP at build time, and fall back to llama.cpp's native thread pool.
  • Sampler chain reordered to llama.cpp's canonical order. Saves a per-token full-vocab sort.
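The thread cap in the second bullet depends on knowing how many physical cores the machine has. The sketch below shows one way to derive that on Linux by counting distinct (physical id, core id) pairs in /proc/cpuinfo-style text. It is an assumption about how such a cap could be computed, not the engine's code, and both function names are hypothetical.

```python
def physical_core_count(cpuinfo_text: str) -> int:
    """Count distinct (physical id, core id) pairs in /proc/cpuinfo-style text."""
    cores = set()
    # /proc/cpuinfo lists one block per logical CPU, separated by blank lines.
    for block in cpuinfo_text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if "physical id" in fields and "core id" in fields:
            cores.add((fields["physical id"], fields["core id"]))
    return len(cores)


def inference_threads(cpuinfo_text: str, logical_fallback: int) -> int:
    """Prefer the physical-core count; fall back to logical CPUs if unknown."""
    physical = physical_core_count(cpuinfo_text)
    return physical if physical > 0 else max(1, logical_fallback)
```

On a Linux host this could be called as `inference_threads(open("/proc/cpuinfo").read(), os.cpu_count() or 1)`. Some platforms (many ARM kernels, for example) do not report `physical id`, which is why the sketch keeps a logical-CPU fallback.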

4. Model-authoritative tool calling

This is the architectural win we're most excited about.

When Ollama runs a model, it uses the model author's own tokenizer.chat_template to render prompts — including tool definitions. The model emits its native tool-call format (Hermes <tool_call> for Qwen, <|python_tag|> for Llama 3, [TOOL_CALLS] for Mistral, etc.). Ollama's runtime parses that back into structured tool calls.

For v1.8 we render the GGUF's embedded Jinja chat template in Rust — using the minijinja crate — and feed it {messages, tools, add_generation_prompt} exactly the way llama.cpp's C++ side does. The result: every GGUF model with a tool-aware chat template just works, without us hard-coding a per-architecture switch.
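The other half of that round trip is turning the model's native tool-call text back into structured calls. As a rough illustration for the Hermes-style `<tool_call>` format that Qwen emits, the Python sketch below extracts the JSON payloads with a regular expression. It is not the parser SynapCores or Ollama ships; the function name and the `list_tables` tool in the usage note are hypothetical.

```python
import json
import re
from typing import Any, Dict, List

# Hermes-style calls wrap one JSON object in <tool_call>...</tool_call> tags.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def parse_hermes_tool_calls(text: str) -> List[Dict[str, Any]]:
    """Return the well-formed tool calls found in a model completion."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            payload = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # malformed JSON: skip rather than guess at intent
        if isinstance(payload, dict) and "name" in payload:
            calls.append(
                {"name": payload["name"], "arguments": payload.get("arguments", {})}
            )
    return calls
```

Given a completion containing `<tool_call>{"name": "list_tables", "arguments": {"schema": "public"}}</tool_call>`, the function returns a single call with that name and arguments; a completion with no tags returns an empty list, which a ReAct loop can treat as the final prose answer.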

Live verification on this box:

SELECT AGENT_RUN('technical_advisor',
                 'Show me the schema of all tables in this database');
-- "Sure! The schema of the table `hw_train` is as follows:
--   - x: Double, can be NULL
--   - y: Double, can be NULL"

Real ReAct loop. Real tool execution against a real database. Real prose answer. No "tool-call returned as text" silent failures.

For older or non-instruction-tuned GGUFs without a tool-aware template, we fall back to a per-family Rust renderer that knows Qwen / Llama 3 / Mistral / generic-JSON formats. Nothing's worse than v1.7 — it's strictly better.

5. Streaming, sessions, and observability

The LocalInferenceEngine exposes:

  • complete_in_session(session_id, model, msgs, params) — persistent context per session, KV cache reuse via prefix matching, LRU eviction at a configurable resident-memory cap
  • Streaming chunk channel via mpsc (token-by-token responses for sub-second perceived latency on long generations)
  • Per-call tracing: each local-provider: chat_with_tools log event records the prompt-token count, completion-token count, elapsed-ms, and prefix-reuse status, visible at RUST_LOG=info

In a ReAct loop, sessions are keyed by persona:{name}:tenant:{id} so every AGENT_RUN call for the same persona on the same tenant hits the same hot context. Measured prefix-reuse rate in the cert: 91%.
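The interaction between session keys, prefix matching, and LRU eviction can be sketched in a few lines of Python. This is a simplified model under stated assumptions: it tracks token IDs instead of a real KV cache, it evicts by session count where the engine uses a resident-memory cap, and `SessionCache`, `plan_prefill`, and the tenant ID `t1` are hypothetical names. Only the key format comes from the text above.

```python
from collections import OrderedDict
from typing import List, Tuple


def session_key(persona: str, tenant_id: str) -> str:
    """Build the persona:{name}:tenant:{id} key described above."""
    return f"persona:{persona}:tenant:{tenant_id}"


def common_prefix_len(a: List[int], b: List[int]) -> int:
    """Number of leading tokens the two sequences share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class SessionCache:
    """Remembers the last prompt per session and evicts the least recently used."""

    def __init__(self, max_sessions: int) -> None:
        self.max_sessions = max_sessions
        self._sessions: "OrderedDict[str, List[int]]" = OrderedDict()

    def plan_prefill(self, key: str, prompt: List[int]) -> Tuple[int, int]:
        """Return (reused_tokens, tokens_to_prefill) and record the new prompt."""
        cached = self._sessions.pop(key, [])
        reused = common_prefix_len(cached, prompt)
        self._sessions[key] = list(prompt)  # re-insert as most recently used
        while len(self._sessions) > self.max_sessions:
            self._sessions.popitem(last=False)  # evict least recently used
        return reused, len(prompt) - reused
```

With `key = session_key("technical_advisor", "t1")`, a first call with tokens `[1, 2, 3]` prefills all three, and a follow-up call with `[1, 2, 3, 4, 5]` reuses the first three and prefills only two. If enough other sessions push the key out of the cache, the next call pays the full prefill again.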

What's coming in v1.8.x and v1.9

We held a few things back to keep v1.8 shippable:

  • GPU offload (CUDA / Metal). llama-cpp-2 already supports it behind a feature flag. v1.8.1 candidate. Will close the per-call gap vs GPU-accelerated Ollama for users with RTX / M-series silicon.
  • First-class agent memory verbs: AGENT_MEMORY_ADD, RECALL, FORGET, UPSERT as SQL + matching MCP tools (remember, recall, forget). Design doc at docs/proposals/agent_memory_upsert_sql.md. v1.9 candidate.
  • Cross-encoder reranker for the chat-agent retrieval pipeline. Design at docs/proposals/agent_memory_reranker.md.
  • LongMemEval / LoCoMo eval harness so memory architecture changes are measurable, not anecdotal. Design at docs/proposals/agent_memory_eval_harness.md.

Try it now

# Linux x86_64 / ARM64
curl -fsSL https://get.synapcores.com | sh

# Docker
docker run -p 8080:8080 ghcr.io/synapcores/community:v1.8.0-ce

If you've been waiting for a database that runs LLM inference in-process without sacrificing speed to do it — this is the release.

Questions, bugs, or "what about model X" requests: github.com/mataluis2k/aidb/issues.

— The SynapCores team