Why GPT-OSS‑120B Feels Slow on a MacBook Pro M4 Max (128GB)

1) Model size vs. unified memory headroom
GPT‑OSS‑120B is enormous. Even with modern low‑precision formats, a 120B model occupies tens of gigabytes in VRAM/RAM, and practical deployments keep additional copies and buffers for the KV cache, activations, and runtime bookkeeping. Dharmesh Shah notes the 120B weights are ~65 GB on disk, before runtime overheads and per‑token KV cache growth, so a 128 GB unified‑memory system can be saturated quickly during real usage, forcing memory pressure and paging that crush throughput.
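To get a feel for the scale, here is a back‑of‑the‑envelope sketch in Python. The layer, head, and dimension numbers below are illustrative placeholders, not the exact gpt‑oss‑120b architecture; the point is that the KV cache grows linearly with context on top of the fixed weight footprint:

```python
GB = 1024 ** 3
weights_gb = 65  # reported on-disk size of the 120B weights

# Illustrative transformer dimensions (NOT the exact gpt-oss specs):
n_layers, n_kv_heads, head_dim = 36, 8, 64
bytes_per_elem = 2  # fp16 KV cache entries

# The KV cache stores one key and one value vector per layer per token.
kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

for ctx in (8_192, 32_768, 131_072):
    kv_gb = ctx * kv_bytes_per_token / GB
    print(f"context {ctx:>7,}: KV cache ~{kv_gb:4.1f} GB, total ~{weights_gb + kv_gb:5.1f} GB")
```

Per‑request numbers also multiply with batch size, and runtimes add activation and scratch buffers on top, which is how a 65 GB model pushes a 128 GB machine into memory pressure alongside the OS and other apps.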
2) Memory bandwidth is the real bottleneck for inference
Large LLM inference is largely memory‑bandwidth bound: generating each token requires streaming large weight matrices and growing KV caches. When bandwidth is the limiter, adding compute cores matters less than how fast you can move data. The 14‑inch/16‑inch M4 Max offers 410–546 GB/s unified memory bandwidth depending on configuration—excellent for a laptop, but still far below top desktop/workstation GPUs. This gap shows up acutely with 100B+ models.
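That observation yields a crude speed ceiling: tokens/sec cannot exceed memory bandwidth divided by the bytes of weights each token must stream. A sketch, assuming roughly 5B active parameters per token (gpt‑oss‑120b is a mixture‑of‑experts model, so only a fraction of the weights is active per token; the exact figure here is an assumption) in a 4‑bit format:

```python
def ceiling_tokens_per_sec(bandwidth_gb_s: float, active_params_b: float,
                           bits_per_weight: int) -> float:
    """Upper bound from weight streaming alone: each generated token must
    read the active weights from memory at least once."""
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

# Assumed figures: 546 GB/s (top M4 Max config), ~5B active params, 4-bit.
print(f"{ceiling_tokens_per_sec(546, 5.0, 4):.1f} tokens/sec ceiling")
```

Real throughput lands well below this bound once KV‑cache reads, attention compute, and scheduling overheads are added, and it shrinks further on 410 GB/s configurations; for a dense (non‑MoE) 120B model, the same arithmetic over all 120B parameters gives a ceiling of only ~9 tokens/sec.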
3) Apple Silicon excels at efficiency, not 120B‑class throughput
M‑series Macs run many 7B–30B models comfortably, and even some 70B models in optimized formats. But users commonly report sharp slowdowns when stepping up to 100B+ without server‑class bandwidth and memory capacity. Even power users running very large open‑weight models note that substantial tuning is needed to approach closed‑model performance, and that a gap remains.
4) Quantization and precision choices impact speed and quality
Running 120B locally almost always requires aggressive quantization to fit in memory and sustain useful speed. Each format (e.g., MXFP4, 4‑bit, 6‑bit, FP8/FP16 mixes) trades off memory footprint, bandwidth demand, and output quality. If you use a safer but heavier precision, or a build lacking Apple‑specific kernels, you will see lower tokens/sec and possible thermal throttling under sustained load. NVIDIA documents MXFP4 as an efficiency win on RTX, illustrating how much optimized low‑precision paths matter for speed; the same principle applies on Mac.
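The footprint arithmetic behind those trade‑offs is simple; a minimal sketch that counts weight bytes only (real builds add embeddings, quantization scales, and runtime buffers):

```python
def footprint_gb(params_b: float, bits_per_weight: int) -> float:
    """Approximate weight footprint in decimal GB (weights only)."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

for name, bits in [("FP16", 16), ("FP8", 8), ("6-bit", 6), ("4-bit (MXFP4-like)", 4)]:
    print(f"{name:>20}: ~{footprint_gb(120, bits):6.1f} GB")
```

At 4 bits, a 120B model needs roughly 60 GB for weights alone, consistent with the ~65 GB on‑disk figure above; at FP16 it would need ~240 GB and could not fit in 128 GB at all.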
5) Tooling and kernels may be less optimized on macOS for 120B
Performance varies by runtime (Ollama, llama.cpp, vLLM‑Metal builds, PyTorch MPS) and whether they leverage Apple’s Metal Performance Shaders effectively for your exact quantization. If your stack isn’t using the latest Metal‑optimized kernels and operator fusions for 120B‑scale models, you’ll leave a lot of performance on the table. Community guides show it’s easy to get 20B running smoothly on recent Macs, but that doesn’t guarantee 120B will be fast with the same settings.
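Two quick sanity checks can tell you whether the accelerated paths are even available. These commands assume Ollama and PyTorch are installed; consult each tool's documentation for the authoritative flags:

```shell
# Newer runtime releases generally ship improved Metal kernels.
ollama --version

# Confirm PyTorch can see the MPS (Metal) backend at all.
python3 -c "import torch; print(torch.backends.mps.is_available())"
```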
6) Thermal and power limits over long sessions
Sustained inference at 120B drives continuous high bandwidth/compute. Laptops have limited thermal headroom; after a few minutes, clocks can dip, reducing tokens/sec. That’s one reason desktops with higher sustained cooling or dedicated GPUs still dominate at this scale, even if short bursts look fine.
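You can watch this happen on macOS with the built‑in powermetrics tool (requires sudo); if thermal pressure climbs above Nominal during a long generation, throttling is likely costing you tokens/sec:

```shell
# Sample thermal pressure once per second during a long inference run.
sudo powermetrics --samplers thermal -i 1000
```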
7) Reality check: 120B is “frontier class” and expects workstation‑grade I/O
Vendor notes around gpt‑oss emphasize high‑end GPUs and large VRAM for top performance (e.g., RTX 5090 hitting >200 tokens/sec under tuned conditions). Your Mac can run the model, but not at similar throughput due to memory bandwidth, quantization paths, and thermal/power envelopes.
Practical tips to speed things up on your M4 Max
Choose a lighter model: try GPT‑OSS‑20B or a highly optimized 30B–70B model for most tasks; many users find these far more responsive on Macs while retaining strong quality.
Use the most optimized runtime available: keep Ollama/llama.cpp up to date, and use Metal‑optimized builds with the best available 4‑bit formats for Mac.
Reduce context length and tools: long contexts balloon KV cache size; trimming prompts and disabling unnecessary tool calls can materially increase tokens/sec.
Close other memory‑hungry apps: freeing unified memory reduces paging and throttling.
External cooling: a laptop stand or active cooling helps maintain boost clocks during long runs.
Consider remote inference for 120B+: if you need 120B throughput, a remote GPU box or cloud can deliver workstation‑class bandwidth while you keep the Mac for development and orchestration.
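Several of the tips above combine into a single llama.cpp invocation; a sketch assuming a recent Metal‑enabled build and a hypothetical quantized model file:

```shell
# Hypothetical GGUF filename. -ngl 99 offloads all layers to the Metal GPU;
# -c 4096 caps the context window, which bounds KV-cache growth.
llama-cli -m gpt-oss-20b-Q4_K_M.gguf -ngl 99 -c 4096 -p "Summarize: ..."
```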
Why your experience differs from others
Some posts celebrate getting 120B to “run” locally on Macs, but “runs” isn’t “runs fast.” Disk footprint (e.g., ~65 GB) doesn’t reflect runtime bandwidth and cache demands. The excitement is about feasibility and openness, not parity with top GPUs.
Bottom line
Your M4 Max 128 GB can host GPT‑OSS‑120B, but it will feel slow because 120B inference is dominated by memory bandwidth and cache growth that exceed laptop‑class limits. For practical speed, use smaller or more aggressively‑quantized models, or offload 120B to hardware with much higher memory bandwidth.
Ready to make local AI fast and reliable for your workflow? Tenten can help you pick the right models, optimize your Mac setup, and integrate hybrid local/cloud inference so you get speed without compromises. Book a meeting: https://tenten.co/contact