SynapCores v1.8 — One binary. No daemon. Faster than CPU Ollama.
If you've been running SynapCores with an Ollama sidecar — two processes, two ports, two config files, two things to babysit — v1.8 changes the math. The engine now pulls, loads, and runs GGUF models in-process. One binary. One config. One thing to install.
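And in an apples-to-apples CPU comparison on the same machine, it beats Ollama by a wide margin: it passed all 20 recipes of our agent cert in 39:43, while CPU-only Ollama passed none.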
Install in 30 seconds
# 1. Download the binary
curl -fsSL https://get.synapcores.com | sh
# 2. Start the gateway — first boot auto-pulls the default model
synapcores --config gateway.toml
That's it. No Ollama install step. No two-process setup. The first time you start it, the gateway warms up qwen2.5-coder:7b (the default tool-capable chat model) and all-minilm:latest (for embeddings) in parallel — you'll see live download progress in the console.
The moment those finish, SELECT GENERATE('hello'), SELECT EMBED('hello'), and SELECT AGENT_RUN('aidb-assistant', '…') all work against the in-process local provider. No API keys. No external daemon.
What's actually new
1. In-process model registry — synapcores pull <model:tag>
We talk to registry.ollama.ai the same way Ollama does: OCI Distribution v2 protocol, multi-layer manifests, sha256-verified blob downloads, resume on interruption, idempotent re-pulls. Any GGUF model in Ollama's registry is one command away:
synapcores pull qwen2.5-coder:7b
synapcores pull llama3.2:3b
synapcores pull mistral:7b
synapcores pull all-minilm:latest
synapcores models list
# NAME ARCH SIZE PULLED
# library/qwen2.5-coder:7b qwen2 4.4 GiB 12 minutes ago
# library/all-minilm:latest bert 45.9 MiB 12 minutes ago
synapcores models delete qwen2.5-coder:0.5b
# deleted library/qwen2.5-coder:0.5b (reclaimed 379.4 MiB)
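The pull commands above take short references like qwen2.5-coder:7b, while the listing shows them as library/qwen2.5-coder:7b. A minimal sketch of how such a reference could be normalized and mapped to a manifest URL follows. The `/v2/<name>/manifests/<reference>` path is the standard OCI Distribution v2 layout; the `ModelRef` type, the function names, and the defaults (`library` namespace, `latest` tag) are assumptions made for illustration, not the engine's actual code.

```rust
/// A parsed model reference such as `library/qwen2.5-coder:7b`.
/// Hypothetical type, introduced only for this sketch.
#[derive(Debug, PartialEq)]
pub struct ModelRef {
    pub namespace: String,
    pub name: String,
    pub tag: String,
}

/// Parse a short reference.
/// Assumed defaults: a missing namespace becomes `library`,
/// a missing tag becomes `latest`.
pub fn parse_model_ref(input: &str) -> Option<ModelRef> {
    // Split off the tag at the last ':' if there is one.
    let (path, tag) = match input.rsplit_once(':') {
        Some((p, t)) => (p, t),
        None => (input, "latest"),
    };
    // Split off the namespace at the first '/' if there is one.
    let (namespace, name) = match path.split_once('/') {
        Some((ns, n)) => (ns, n),
        None => ("library", path),
    };
    if namespace.is_empty() || name.is_empty() || tag.is_empty() {
        return None;
    }
    Some(ModelRef {
        namespace: namespace.to_string(),
        name: name.to_string(),
        tag: tag.to_string(),
    })
}

/// Manifest URL in the OCI Distribution v2 layout:
/// /v2/<name>/manifests/<reference>
pub fn manifest_url(registry: &str, m: &ModelRef) -> String {
    format!(
        "https://{}/v2/{}/{}/manifests/{}",
        registry, m.namespace, m.name, m.tag
    )
}
```

Calling `parse_model_ref("qwen2.5-coder:7b")` and passing the result to `manifest_url` with `registry.ollama.ai` gives the URL a pull would request first; the manifest it returns lists the layer digests to download.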
The model store is content-addressed (${data_dir}/models/blobs/<sha256>). Pulling the same model twice short-circuits at the manifest check. Two models sharing layers (e.g. base + LoRA variants) de-duplicate automatically.
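To make the de-duplication concrete, here is a small in-memory model of a content-addressed store. It is a sketch under stated assumptions: `BlobStore` and its methods are hypothetical names, digests and sizes are taken as given (as they would be from a manifest) rather than computed, and the real store writes files under ${data_dir}/models/blobs/ instead of keeping a map.

```rust
use std::collections::HashMap;

/// In-memory model of a content-addressed blob store (hypothetical type).
/// Nothing is hashed here: digests come from the caller, as they would
/// from a manifest.
#[derive(Default)]
pub struct BlobStore {
    /// digest -> (size in bytes, number of installed models referencing it)
    blobs: HashMap<String, (u64, usize)>,
    /// model name -> layer digests from its manifest
    models: HashMap<String, Vec<String>>,
}

impl BlobStore {
    /// Install a model. Returns the bytes that actually had to be fetched.
    pub fn pull(&mut self, model: &str, layers: &[(&str, u64)]) -> u64 {
        if self.models.contains_key(model) {
            return 0; // re-pull of an installed model: nothing to do
        }
        let mut downloaded = 0;
        for &(digest, size) in layers {
            if !self.blobs.contains_key(digest) {
                downloaded += size; // only blobs we do not hold are fetched
            }
            let entry = self.blobs.entry(digest.to_string()).or_insert((size, 0));
            entry.1 += 1;
        }
        let digests: Vec<String> = layers.iter().map(|&(d, _)| d.to_string()).collect();
        self.models.insert(model.to_string(), digests);
        downloaded
    }

    /// Remove a model. Returns the bytes reclaimed from blobs that no
    /// other installed model references.
    pub fn delete(&mut self, model: &str) -> u64 {
        let layers = match self.models.remove(model) {
            Some(l) => l,
            None => return 0,
        };
        let mut reclaimed = 0;
        for digest in layers {
            let unused = match self.blobs.get_mut(&digest) {
                Some(entry) => {
                    entry.1 -= 1;
                    entry.1 == 0
                }
                None => false,
            };
            if unused {
                if let Some((size, _)) = self.blobs.remove(&digest) {
                    reclaimed += size;
                }
            }
        }
        reclaimed
    }
}
```

With a base model and a variant that share one large layer, the second pull only fetches the variant's own layer, and deleting the variant reclaims only that layer.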
2. New SQL surface for model lifecycle
You can manage models from SQL itself — useful for recipes, automation, and the in-DB agentic runtime that wants to provision its own resources:
SELECT PULL_MODEL('qwen2.5-coder:7b');
-- "installed library/qwen2.5-coder:7b (architecture=qwen2, size=4.4 GiB, ...)"
SELECT * FROM LIST_MODELS();
-- columns: name, architecture, size_bytes, digest, pulled_at, last_used_at
SELECT DELETE_MODEL('library/qwen2.5-coder:0.5b');
-- "removed digest=sha256:..."
All three are documented in the engine's MCP sql_manual tool (so MCP clients like Claude Code, Cursor, and OpenClaw can discover them at runtime) and the public SQL reference on synapcores.com.
3. Beats apples-to-apples CPU Ollama
This one is honest and load-bearing. We benchmarked carefully:
| Metric | Ollama (CPU only, same box) | SynapCores v1.8 (CPU) |
|---|---|---|
| 20-recipe agent cert wall-clock | 65:08 — did not complete | 39:43 — 20/20 PASS |
| Recipe pass count | 0/20 (every call timed out at Ollama's internal 2-min limit) | 20/20 |
| First EMBED call | <100 ms | 57 ms |
| First GENERATE smoke call | n/a (couldn't complete) | 5.2 s warm |
The Ollama "baseline" floating around in older docs (38 min cert) was almost certainly GPU-accelerated. When you constrain Ollama to CPU on the same hardware we benchmark on (i5-10400F, no AVX-512, no GPU offload), our in-process runtime is the only one that actually completes the cert.
Why? A few real architectural wins, layered:
- Persistent LlamaContext sessions per chat persona. The KV cache survives across AGENT_RUN calls. Hot-path token prefill drops to zero on 91% of calls.
- OpenMP thread runaway, fixed. libgomp on a 12-logical / 6-physical CPU spawns 12-thread teams by default. Two threads per physical core fight for the same AVX2 unit, killing throughput. We cap it at physical cores, disable GGML_OPENMP at build time, and fall back to llama.cpp's native thread pool.
- Sampler chain reordered to llama.cpp's canonical order. Saves a per-token full-vocab sort.
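The thread cap from the list above can be sketched in a few lines. This is an illustration only: Rust's standard library reports logical CPUs but has no portable physical-core query, so the threads-per-core factor is an assumption supplied by the caller (2 on the hyper-threaded CPU described above), and both function names are hypothetical.

```rust
use std::thread;

/// Logical CPU count as seen by this process, falling back to 1 on error.
pub fn logical_cpus() -> usize {
    thread::available_parallelism().map(|n| n.get()).unwrap_or(1)
}

/// Worker threads to use: one per physical core, never fewer than one.
/// `threads_per_core` is an assumption passed in by the caller; a real
/// implementation would query the CPU topology instead.
pub fn capped_threads(logical: usize, threads_per_core: usize) -> usize {
    (logical / threads_per_core.max(1)).max(1)
}
```

On the 12-logical / 6-physical machine from the benchmark, `capped_threads(12, 2)` gives 6, so each worker thread gets a physical core to itself.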
4. Model-authoritative tool calling
This is the architectural win we're most excited about.
When Ollama runs a model, it uses the model author's own tokenizer.chat_template to render prompts — including tool definitions. The model emits its native tool-call format (Hermes <tool_call> for Qwen, <|python_tag|> for Llama 3, [TOOL_CALLS] for Mistral, etc.). Ollama's runtime parses that back into structured tool calls.
For v1.8 we render the GGUF's embedded Jinja chat template in Rust — using the minijinja crate — and feed it {messages, tools, add_generation_prompt} exactly the way llama.cpp's C++ side does. The result: every GGUF model with a tool-aware chat template just works, without us hard-coding a per-architecture switch.
Live verification on this box:
SELECT AGENT_RUN('technical_advisor',
'Show me the schema of all tables in this database');
-- "Sure! The schema of the table `hw_train` is as follows:
-- - x: Double, can be NULL
-- - y: Double, can be NULL"
Real ReAct loop. Real tool execution against a real database. Real prose answer. No "tool-call returned as text" silent failures.
For older or non-instruction-tuned GGUFs without a tool-aware template, we fall back to a per-family Rust renderer that knows Qwen / Llama 3 / Mistral / generic-JSON formats. No model behaves worse than it did in v1.7; the change is strictly an improvement.
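Rendering the prompt is one half of tool calling; the other half is reading the model's native format back out of its output. As an illustration for the Hermes-style `<tool_call>` format mentioned above, the sketch below pulls the raw payload out of each tagged span. It assumes the payload between the tags is a JSON object and leaves JSON parsing to a library; `extract_tool_calls` is a hypothetical name, not the engine's API.

```rust
/// Return the trimmed payload of every `<tool_call>...</tool_call>` span.
/// Each payload is expected to be a JSON object; parsing it is out of
/// scope for this sketch.
pub fn extract_tool_calls(output: &str) -> Vec<&str> {
    const OPEN: &str = "<tool_call>";
    const CLOSE: &str = "</tool_call>";
    let mut calls = Vec::new();
    let mut rest = output;
    while let Some(start) = rest.find(OPEN) {
        let after_open = &rest[start + OPEN.len()..];
        match after_open.find(CLOSE) {
            Some(end) => {
                calls.push(after_open[..end].trim());
                rest = &after_open[end + CLOSE.len()..];
            }
            // Unterminated tag: stop and treat the remainder as plain text.
            None => break,
        }
    }
    calls
}
```

An empty result means the model answered in prose, which is how a ReAct loop would typically decide to stop calling tools and return the answer.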
5. Streaming, sessions, and observability
The LocalInferenceEngine exposes:
- complete_in_session(session_id, model, msgs, params): persistent context per session, KV cache reuse via prefix matching, LRU eviction at a configurable resident-memory cap
- Streaming chunk channel via mpsc (token-by-token responses for sub-second perceived latency on long generations)
- Per-call tracing: local-provider: chat_with_tools with prompt-token count, completion-token count, elapsed-ms, prefix-reuse status, visible at RUST_LOG=info
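The streaming channel in the list above follows a familiar producer/consumer shape. The sketch below uses `std::sync::mpsc` as a stand-in; the engine may well use an async channel instead, and `stream_chunks` and `collect_stream` are hypothetical names. The producer sends each chunk as soon as it exists, so the consumer can show text before generation finishes.

```rust
use std::sync::mpsc;
use std::thread;

/// Spawn a producer that sends each chunk as it becomes available and
/// return the receiving end. The stream closes when the sender is dropped.
pub fn stream_chunks(chunks: Vec<String>) -> mpsc::Receiver<String> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        for chunk in chunks {
            // A real engine would send after each decode step.
            if tx.send(chunk).is_err() {
                break; // receiver hung up: stop generating
            }
        }
    });
    rx
}

/// Drain the stream, handing each chunk to `on_chunk` as it arrives,
/// and return the full text once the producer is done.
pub fn collect_stream(rx: mpsc::Receiver<String>, mut on_chunk: impl FnMut(&str)) -> String {
    let mut full = String::new();
    for chunk in rx {
        on_chunk(&chunk);
        full.push_str(&chunk);
    }
    full
}
```

Passing a closure that prints each chunk to `collect_stream` gives token-by-token output; the send error check lets the producer stop early when the client disconnects.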
In a ReAct loop, sessions are keyed by persona:{name}:tenant:{id} so every AGENT_RUN call for the same persona on the same tenant hits the same hot context. Measured prefix-reuse rate in the cert: 91%.
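Prefix reuse is easy to reason about with a small example. The sketch below builds the session key in the shape quoted above and computes how many tokens still need prefill once the cached prefix is reused. The function names, the `u32` token type, and the string tenant id are assumptions for illustration; the real engine works against llama.cpp's KV cache rather than plain slices.

```rust
/// Build the session key in the `persona:{name}:tenant:{id}` shape.
pub fn session_key(persona: &str, tenant_id: &str) -> String {
    format!("persona:{}:tenant:{}", persona, tenant_id)
}

/// Length of the longest shared prefix between the cached token sequence
/// and the incoming prompt's tokens.
pub fn shared_prefix_len(cached: &[u32], incoming: &[u32]) -> usize {
    cached
        .iter()
        .zip(incoming)
        .take_while(|(a, b)| a == b)
        .count()
}

/// Tokens that still need prefill after the shared prefix is reused.
pub fn tokens_to_prefill(cached: &[u32], incoming: &[u32]) -> usize {
    incoming.len() - shared_prefix_len(cached, incoming)
}
```

If a persona's cached context holds tokens [1, 2, 3] and the next prompt tokenizes to [1, 2, 3, 4, 5], only two tokens need prefill. When consecutive calls share nothing, the whole prompt is processed again, which is why keying sessions by persona and tenant matters.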
What's coming in v1.8.x and v1.9
We held a few things back to keep v1.8 shippable:
- GPU offload (CUDA / Metal). llama-cpp-2 already supports it behind a feature flag. v1.8.1 candidate. Will close the per-call gap vs GPU-accelerated Ollama for users with RTX / M-series silicon.
- First-class agent memory verbs: AGENT_MEMORY_ADD, RECALL, FORGET, UPSERT as SQL + matching MCP tools (remember, recall, forget). Design doc at docs/proposals/agent_memory_upsert_sql.md. v1.9 candidate.
- Cross-encoder reranker for the chat-agent retrieval pipeline. Design at docs/proposals/agent_memory_reranker.md.
- LongMemEval / LoCoMo eval harness so memory architecture changes are measurable, not anecdotal. Design at docs/proposals/agent_memory_eval_harness.md.
Try it now
# Linux x86_64 / ARM64
curl -fsSL https://get.synapcores.com | sh
# Docker
docker run -p 8080:8080 ghcr.io/synapcores/community:v1.8.0-ce
If you've been waiting for a database that runs LLM inference in-process without sacrificing speed to do it — this is the release.
Questions, bugs, or "what about model X" requests: github.com/mataluis2k/aidb/issues.
— The SynapCores team