Just saw the latest Ollama release and thought it was worth sharing with the community. They've reworked how model scheduling and memory management work, and it looks like a real step forward for stability and performance.
Instead of estimating memory requirements like it did before, Ollama now measures exactly how much memory a model needs before running it. That means:
- Way fewer crashes from out-of-memory errors
- Better GPU utilization (more memory goes where it’s needed, which boosts token speeds)
- Smarter scheduling across multiple GPUs, even if they’re mismatched
- Memory reporting in nvidia-smi now lines up with ollama ps, so usage tracking actually makes sense
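One easy way to sanity-check that last point is to compare what Ollama says it has loaded against what the GPU reports. Here's a rough Python sketch, assuming Ollama is running on the default localhost:11434 and nvidia-smi is on your PATH; the field names come from the /api/ps endpoint as I understand it, so treat this as a sketch rather than gospel:

```python
import json
import subprocess
import urllib.request

OLLAMA_PS_URL = "http://localhost:11434/api/ps"  # default Ollama endpoint

def ollama_vram_usage():
    """Return (model name, VRAM bytes) pairs for models Ollama currently has loaded."""
    with urllib.request.urlopen(OLLAMA_PS_URL) as resp:
        data = json.load(resp)
    return [(m["name"], m.get("size_vram", 0)) for m in data.get("models", [])]

def nvidia_smi_usage_mib():
    """Return per-GPU memory usage in MiB as reported by nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return [int(line) for line in out.stdout.splitlines() if line.strip()]

if __name__ == "__main__":
    for name, vram in ollama_vram_usage():
        print(f"{name}: {vram / 1024**2:.0f} MiB reported by Ollama")
    for i, used in enumerate(nvidia_smi_usage_mib()):
        print(f"GPU {i}: {used} MiB used according to nvidia-smi")
```

With the new scheduling, the two numbers should be much closer than they used to be (minus whatever other processes are using the GPU).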
A couple of benchmarks they shared illustrate the difference:
gemma3:12b on a single RTX 4090 (128k context)
- Old: 52 tokens/s → New: 85 tokens/s
- Old: 48/49 layers on GPU → New: 49/49 layers on GPU
mistral-small3.2 with 2× RTX 4090s (32k context)
- Old: 127 tokens/s prompt eval → New: 1380 tokens/s
- Old: 43 tokens/s generation → New: 55 tokens/s
These enhancements don't cover every model yet. So far the new scheduling applies to gpt-oss, llama4, gemma3, qwen3, mistral-small3.2, and the all-minilm embedding models, and they mention that more are on the way.
From the numbers they have posted, this looks like a pretty huge leap forward, especially for anyone running long contexts or trying to squeeze performance out of multiple GPUs. You can read more about it here: New model scheduling · Ollama Blog
Anyone here tried this update yet? Curious if you’re seeing the same gains they’ve shown in the benchmarks.
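For anyone who wants to compare, here's a minimal Python sketch of how I'd measure speeds locally. It assumes the standard Ollama REST API on localhost:11434 and uses the eval_count/eval_duration style timing fields that /api/generate returns (durations in nanoseconds, as far as I know); swap in your own model name and context size:

```python
import json
import urllib.request

OLLAMA_GENERATE = "http://localhost:11434/api/generate"  # default Ollama endpoint

def measure_tokens_per_second(model: str, prompt: str, num_ctx: int = 32768):
    """Run one non-streaming generation and compute prompt-eval and generation speed."""
    payload = json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},  # context window for the run
    }).encode()
    req = urllib.request.Request(
        OLLAMA_GENERATE, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        result = json.load(resp)
    # Durations are reported in nanoseconds; guard against zero/missing values.
    prompt_ns = result.get("prompt_eval_duration", 0) or 1
    gen_ns = result.get("eval_duration", 0) or 1
    prompt_tps = result.get("prompt_eval_count", 0) / (prompt_ns / 1e9)
    gen_tps = result.get("eval_count", 0) / (gen_ns / 1e9)
    return prompt_tps, gen_tps

if __name__ == "__main__":
    prompt_tps, gen_tps = measure_tokens_per_second(
        "gemma3:12b", "Summarize the history of the transistor in a few paragraphs."
    )
    print(f"prompt eval: {prompt_tps:.1f} tokens/s, generation: {gen_tps:.1f} tokens/s")
```

Running that before and after the update (same model, same context setting) should make it pretty obvious whether you're getting the kind of gains they're reporting.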