Decentralized inference has one hard problem. A language model generates text one token at a time, and each token has to pass through every layer of the model in order. Split those layers across machines and the data has to cross the network at each step, so the hardware sits idle waiting on the wire and a model that should answer in a second takes a minute. Get that part wrong and you have a demo, not a service.

Circuit's Decentralized LLM is built to be decentralized and fast at the same time. It is a Python inference engine, written from scratch, running a 32-billion-parameter model split by layer across two separate GPUs over an encrypted link. It decodes coherently, you can chat with it right now, and its programmatic endpoint serves paid inference in CIRC. The rule behind every design choice is simple: cross the network as rarely as possible per token. Here is how that is done.


Two GPUs, one model

The engine takes one model and splits it across separate GPUs, so each machine runs only its slice of the whole. The live system runs a 32-billion-parameter model across two GPUs this way: a real model, big enough to be a serious test, producing correct and coherent inference over a genuine network link. Splitting a single model cleanly across independent machines is the hard problem in decentralized inference, and here it works.

The cut is by layer: the first half of the model's layers run on one GPU, the second half on the other, with a coordinator on the first that handles the shared pieces and the sampling. The two run as independent machines, and the model's hidden state crosses an encrypted link between them on each pass.

        ┌──────────── GPU 1 (L4) ─────────────┐
 user → │  coordinator: embed, head, sample    │
        │  + small draft model                 │
        │  STAGE 0:  layers 0-31               │
        └──────────────────┬───────────────────┘
                           │  one encrypted hop over the internet
                           ▼
        ┌──────────── GPU 2 (L4) ─────────────┐
        │  STAGE 1:  layers 32-63              │
        └──────────────────┬───────────────────┘
                           │  direct return to the coordinator
                           ▼
                       coordinator  →  next tokens

A transformer is sequential: layer N needs layer N-1's output, so a token cannot skip ahead, and every machine boundary it crosses is a network round-trip. The split is arranged so the forward pass crosses the network once per token, not once per layer, which is what keeps it fast enough to use. This is live today: the 32B decodes coherently across the two separate cards, and you can chat with it on the site.

Two GPUs is the smallest version, not the ceiling. The same layer-cut extends across more machines, which is exactly what lets the network serve models far too large for any single card: a 70B-class model spread across several GPUs that each hold a slice. The split running today is the foundation that scales straight up to them.


Predictive drafting: small models running ahead

The split leaves exactly one network crossing in the per-token loop: the round-trip to the second card and back. A plain greedy decode waits on that hop for every token, which holds it to about 10 tokens per second. Predictive drafting hides that hop: a small draft model runs ahead and proposes the next few tokens, the full model verifies them in a single pass, and the output is identical to normal decoding, only faster.

Here the draft is a small model running locally on the first GPU with no network of its own. It proposes the next several tokens, the full 32B verifies them all in a single round-trip, and the system keeps every token the big model agrees with. One hop now yields several tokens instead of one. This is live: with drafting on, the 32B runs at about 13 to 14 tokens per second, and the gain grows the more often the draft guesses right.

The output is identical to plain greedy decoding, token for token. The big model's own prediction decides every committed token; the draft only changes how many get confirmed per hop, never which tokens they are. The engine's tests prove this for any draft, right or wrong.

The single local draft is the floor, not the ceiling. The next step widens it into a drafting forest. Checking a batch of guesses costs the big model almost as little as checking one, so it can weigh a whole tree of candidate continuations at once. Spread the drafting across many small models, each on a different node, and every one of them adds a branch:

   big model (the verifier, on the GPU)  ── checks a TREE of guesses per pass
        ▲ guesses (tiny messages)        │ accepted path
        │                                ▼
   ┌────────────── small draft models (the forest) ──────────────┐
   │  one runs LOCAL on the GPU  → puts a floor under speed       │
   │  others run on remote nodes → widen the tree, best-effort    │
   └─────────────────────────────────────────────────────────────┘

This is safe to decentralize because the GPU keeps its own local draft as a floor, so speed never depends on the network. Drafts from remote nodes are a bonus: used when they arrive in time, ignored when they are late, so a slow or disconnected drafter can only help, never hurt. The local draft runs today; the forest is built on it.


The node-client mesh

Every machine in the network runs the same node client and joins a single mesh; what it does there depends on its hardware. A node with a capable GPU holds part of the model and does the heavy work of generating tokens. A node with only a CPU runs the network layer that keeps the mesh coordinated: a cryptographic identity, registry announcements and heartbeats so the mesh knows who is online, and the swarm's signal traffic. Same network, different jobs by hardware.

Six CPU nodes are live today, running that coordination layer. As the mesh opens to outside operators, nodes take on two further paid roles, again decided by what each can do:

  request ─► CPU NODE (route, verify x402) ─► GPU NODE (holds model, generates) ─► tokens
             meanwhile, CPU nodes can also draft, feeding guesses into the GPU's forest

The GPUs carry the model and the CPU nodes carry the network, and together they make one mesh instead of a private API.


The models

The model live across the two GPUs today is a capable, general-purpose 32B.

The model that matters most for what Circuit does is the one being trained in-house, on Solana data: on-chain activity, token and program behavior, the patterns an agent operating on Solana actually needs to reason about. A general model knows a little about everything and nothing about the chain it trades on. A Solana-native model is built for the environment the agents live in, and the network underneath it is built to serve it.

None of this is tied to one specific model. The engine reads a model's shape and splits it across whatever GPUs are available, so swapping in a different model, or the Solana one once it is trained, changes nothing about how the rest works.


Who uses it, and how they pay

The endpoint is OpenAI-compatible, so it takes two kinds of caller the same way. Circuit's own agents call it to reason about trades, markets, and signals instead of renting a centralized model, and a Solana-native model makes that sharper, since the agents would be asking a model trained on the exact environment they work in. Any outside developer can point a standard client at it as well. From the caller's side it looks like an ordinary chat API; underneath, a decentralized network produces the tokens.

There are two doors in. The chat on this site is a free, open demo: try the model in the browser, rate-limited, no wallet needed. The programmatic endpoint at inference.circuitllm.xyz is the paid one, and it is live now. It runs on x402, the same HTTP-402 micropayment flow the Circuit data API uses: call /v1/chat/completions, get a 402 back carrying a price quote in CIRC, send that CIRC on-chain as a Token-2022 transfer, then re-send the request with the transaction signature attached and the model answers. A call settles for a fraction of a cent, about $0.002 today, per request, with no account and no API key, and each payment signature is single-use. The free demo lets anyone watch it work; the paid endpoint is how agents and applications actually consume it.

Those payments are the network's revenue. How they reach the operators who earn them, by the work they do and the stake they hold, is the next section.


Stake, serve, earn

Running a node is designed as a menu of roles, not a single one, with a productive slot for almost any hardware.

Whatever the role, an operator carries a staked CIRC wallet, which registers their address on the network's distribution list. The inference fees paid in CIRC flow into the network treasury and are paid out to staked wallets on a fixed thirty-minute cycle, so an operator earns from both sides: the work their hardware does, and the stake that puts them on the list. (The full economics of staking and distribution are covered in the network holdings article.)

We expect most operators to restake the CIRC they earn. The same stake that earns the distribution is the entry ticket to serving more of the network, so compounding it is the natural move, and the CIRC loops back in rather than leaving.


Why it matters

Centralized inference is one company deciding who gets access and at what price, on hardware you will never see. The DLLM is the other model: a real 32B split across independent GPUs over an encrypted link, kept fast by predictive drafting, routed through a mesh of independent nodes, and paid for in fractions of a cent on-chain. The hardest part of decentralized inference, the network sitting in every token's path, is the first thing this design removes, and the same split is what unlocks the models too big for any single card.

A model's layers split across separate machines, made fast by small models anyone can contribute, served by a mesh that pays everyone who helps. That is the DLLM.


Circuit LLM is experimental software. The decentralized inference engine, the two-GPU 32B split, live predictive drafting, and x402-paid inference at inference.circuitllm.xyz run on real hardware today, alongside the free public chat; the drafting forest (many drafts spread across remote nodes), the routing mesh, open third-party node participation, and the Solana-native model are in active development. The speeds, model sizes, and staking parameters described here are targets that may change before release. Nothing here is financial advice.