Finally, Your LLM Will Become Deterministic at Temperature 0!
Quick Insights - "Defeating Nondeterminism in LLM Inference" by Horace He in collaboration with others at Thinking Machines
Why Your LLM at Temperature 0 Isn’t Deterministic (and How to Fix It)
Have you ever wondered why your LLM, even at temperature = 0, where it should in theory be deterministic, still gives you different answers to the same question? You ask once, then again, and somehow the outputs don’t always match.
Horace He at Thinking Machines Lab just published a fantastic explanation and solution in his blog post: “Defeating Nondeterminism in LLM Inference.” Here are the key insights:
Why Temperature = 0 Should Be Deterministic, But Isn’t
Greedy decoding means always picking the highest-probability token. No randomness, no sampling.
In theory, this guarantees the same result every time.
In practice, subtle numerical and batching effects let nondeterminism sneak back in.
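As a quick illustration (a minimal sketch of greedy decoding, not code from the post), temperature = 0 reduces token selection to an argmax over the model’s next-token logits:

```python
import torch

def greedy_next_token(logits: torch.Tensor) -> int:
    """Greedy decoding step: pick the single highest-scoring token.

    `logits` is the model's score vector over the vocabulary for the next
    position. No sampling is involved, so the same logits always yield the
    same token -- which is why temperature = 0 *should* be deterministic.
    """
    return int(torch.argmax(logits))

# Toy vocabulary of 5 tokens; token 2 has the highest score and always wins.
print(greedy_next_token(torch.tensor([0.1, 1.3, 2.7, 0.4, -0.5])))  # 2
```

The catch is that the logits themselves are not guaranteed to be bit-identical between runs, which is where the sources below come in.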
Sources of Nondeterminism
Floating-point non-associativity
Floating-point math isn’t perfectly consistent: (a + b) + c can differ from a + (b + c). On GPUs, reductions can run in different orders, causing tiny shifts in the logits, sometimes enough to flip which token is chosen.
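A two-liner makes the non-associativity concrete (plain Python floats, same principle as in GPU reductions):

```python
# Classic demonstration: the grouping of additions changes the result.
a, b, c = 1e16, -1e16, 1.0

print((a + b) + c)  # 1.0 -- the large terms cancel first, so the 1.0 survives
print(a + (b + c))  # 0.0 -- the 1.0 is absorbed into 1e16 and rounded away
```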
Batch-size variance
The central culprit highlighted by Horace. Inference typically runs in batches for efficiency. But batch size, order, and composition may vary between runs. This changes how numerical errors accumulate, leading to different logits and token choices.
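A simple way to probe this yourself (a sketch assuming PyTorch is available; whether the difference is nonzero depends on your hardware and kernel choices) is to run the same row through a layer alone and then inside a larger batch:

```python
import torch

torch.manual_seed(0)
layer = torch.nn.Linear(4096, 4096)
x = torch.randn(1, 4096)

# The same row computed alone vs. as the first row of a bigger batch.
out_alone = layer(x)
out_in_batch = layer(torch.cat([x, torch.randn(31, 4096)], dim=0))[:1]

# Mathematically identical, but the kernel may tile and reduce differently
# per batch size, so bitwise equality is not guaranteed (especially on GPU).
print((out_alone - out_in_batch).abs().max().item())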
Kernel implementations
Many GPU kernels are not batch-invariant. That means the computation for one example depends on what else is in the batch. This hidden dependency breaks determinism.
Other factors
Near ties in probabilities → tiny numeric noise determines which token “wins” (a small example follows this list).
Hardware/software differences (GPU versions, memory layout, scheduling).
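To see how a near tie interacts with that noise (a toy example with plain Python floats, not taken from the post): two logits that differ by less than the accumulated rounding error can swap places and flip the greedy choice.

```python
# Two candidate tokens whose logits are nearly tied.
clean     = {"token_a": 12.3456789, "token_b": 12.3456788}
perturbed = {"token_a": 12.3456789, "token_b": 12.3456788 + 1e-6}

# The perturbation is far below anything visible in the model's behavior,
# yet it is enough to change which token greedy decoding picks.
print(max(clean, key=clean.get))          # token_a
print(max(perturbed, key=perturbed.get))  # token_b
```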
The Fix: Batch-Invariant Kernels
The proposed solution: use batch-invariant kernels. These ensure a single sequence’s computation is unaffected by the batch context. With batch-invariant operations for reductions, attention, and normalization, Horace shows it’s possible to achieve truly deterministic inference at temperature = 0, even with vLLM.
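The real kernels live in the post and its accompanying code; as rough intuition only (my own toy sketch, not Horace He’s implementation), a batch-invariant reduction fixes the split size and accumulation order per row, so one row’s result cannot depend on how many other rows share the batch:

```python
import torch

def batch_invariant_row_sum(x: torch.Tensor, chunk: int = 256) -> torch.Tensor:
    """Toy batch-invariant reduction: sum each row in a fixed chunk order.

    Every row is reduced with the same split size and the same left-to-right
    accumulation, regardless of how many rows are in `x`, so a row's result
    does not change when the batch around it changes.
    """
    total = torch.zeros(x.shape[0], dtype=x.dtype, device=x.device)
    for start in range(0, x.shape[1], chunk):
        total = total + x[:, start:start + chunk].sum(dim=1)
    return total

x = torch.randn(8, 4096)
alone = batch_invariant_row_sum(x[:1])
batched = batch_invariant_row_sum(x)[:1]
# Should print True on CPU; real GPU kernels need the same care at the CUDA level.
print(torch.equal(alone, batched))
```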
Trade-off: some performance overhead. Upside: reliable reproducibility.
Why This Matters
Research reproducibility: Findings should not hinge on hidden randomness.
Production stability: Determinism makes debugging and testing far easier.
Feedback loops (RL, agent systems): Stability depends on consistent outputs.
Summary: Temperature = 0 alone doesn’t guarantee determinism. To truly lock down reproducibility, you also need batch-invariant kernels.
Full post here: Defeating Nondeterminism in LLM Inference by Horace He (Thinking Machines Lab).


