The most economical ways to run gpt‑oss‑120B

1) Use a single enterprise GPU (H100‑80GB) with MXFP4 for best $/throughput
GPT‑OSS‑120B ships with 4‑bit MXFP4 weights and a Mixture‑of‑Experts (MoE) design that activates only ~5.1B parameters per token. That combination lets it run efficiently on a single 80 GB data‑center GPU such as the NVIDIA H100 instead of a multi‑GPU cluster, dramatically cutting hardware and hosting costs compared with dense 120B models. Use vLLM or Transformers with MXFP4 kernels for the best throughput and utilization.
Why this is cheap in practice
One H100 instance replaces 2–4 high‑end GPUs for comparable quality due to MoE sparsity and optimized 4‑bit inference paths.
vLLM plus MXFP4 kernels deliver high tokens/sec and low TTFT (time to first token), reducing runtime and therefore your cloud bill.
2) Burst to managed one‑click deployments when you need scale (pay only when used)
If you don’t want to own hardware, use a managed “one‑click” stack (vLLM + Open WebUI) that targets H100 or equivalent; you pay for GPU hours only when serving traffic, and you skip DevOps overhead. It’s a fast, low‑risk way to validate workloads and keep costs variable rather than fixed CAPEX.
3) Prefer the 20B model for most workloads to slash costs 5–10x
For many applications (coding assistance, RAG, tools), GPT‑OSS‑20B in MXFP4 fits in ~16 GB VRAM and flies on a single consumer GPU or NPU‑equipped device, cutting infra cost by an order of magnitude while maintaining strong quality. If your tasks don’t strictly require 120B reasoning headroom, this swap is the biggest single cost saver.
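Those memory claims are easy to sanity-check with a back-of-envelope weight-footprint estimate. This is a sketch only: it assumes a flat 4 bits per parameter and ignores KV cache, activations, and the per-block scale factors MXFP4 actually stores, which is why real checkpoint files run a few GB larger. Parameter counts (~21B and ~117B total) are the published figures.

```python
# Back-of-envelope weight footprint in MXFP4 (4 bits/param = 0.5 bytes/param),
# ignoring KV cache, activations, and quantization scale overhead.
def mxfp4_weight_gb(total_params_b: float) -> float:
    """Approximate weight size in GB for a model quantized to 4 bits/param."""
    return total_params_b * 1e9 * 0.5 / 1e9

print(f"gpt-oss-20B:  ~{mxfp4_weight_gb(21):.1f} GB")   # ~10.5 GB -> fits a 16 GB card
print(f"gpt-oss-120B: ~{mxfp4_weight_gb(117):.1f} GB")  # ~58.5 GB -> needs an 80 GB GPU
```

The ~58.5 GB figure is consistent with the ~61 GB GGML‑converted weights AMD reports once scale metadata is included.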
4) Optimize your software stack for MXFP4 and batching
Economy comes from throughput, not just raw price.
Use Transformers/vLLM with MXFP4 kernels and enable paged KV cache, efficient batching, and streaming.
Keep context windows no larger than needed—KV cache memory balloons cost and hurts tokens/sec, especially at 120B scale.
Serve via an OpenAI‑compatible endpoint using “transformers serve” to reuse clients without rewriting.
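Because the endpoint speaks the OpenAI chat-completions protocol, existing clients only need a new base URL. A minimal sketch of the request an OpenAI-style client would send, assuming a local server at `localhost:8000` (the URL and model name here are placeholders for your deployment):

```python
import json

# Hypothetical local OpenAI-compatible endpoint (vLLM or "transformers serve").
BASE_URL = "http://localhost:8000/v1"

def chat_request(prompt: str, model: str = "openai/gpt-oss-120b",
                 max_tokens: int = 256, stream: bool = True) -> bytes:
    """Build an OpenAI-compatible /chat/completions JSON request body."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "stream": stream,  # streaming improves perceived latency (TTFT)
    }
    return json.dumps(body).encode("utf-8")

payload = chat_request("Summarize MXFP4 quantization in one sentence.")
print(json.loads(payload)["model"])  # openai/gpt-oss-120b
```

Any OpenAI SDK pointed at `BASE_URL` produces this same wire format, which is the whole point: no client rewrites when you move between providers or self-hosted stacks.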
5) Run locally only on hardware with sufficient bandwidth; avoid memory‑bound setups
120B‑class models are bandwidth‑bound. Consumer multi‑GPU rigs can work but get expensive quickly (boards, power, cooling) and typically underperform a single H100 in efficiency. A practical rule: if you can’t guarantee fast memory bandwidth and enough RAM/VRAM for weights + KV cache, you’ll pay more per token due to slow throughput.
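That rule can be made concrete with a batch-size-1 back-of-envelope: each decoded token must stream the active weights through memory, so peak memory bandwidth sets a hard ceiling on tokens/sec. The ~5.1B active parameters at 4 bits is from the model's published specs; the bandwidth figures below are approximate vendor peak numbers used only for illustration, and real throughput lands well below these ceilings (kernel efficiency, KV cache reads, thermals).

```python
# Rough decode ceiling at batch size 1: tokens/s <= bandwidth / active_weight_bytes.
ACTIVE_PARAMS = 5.1e9    # gpt-oss-120B active params per token (MoE)
BYTES_PER_PARAM = 0.5    # MXFP4: 4 bits/param

def decode_ceiling_tok_s(bandwidth_gb_s: float) -> float:
    active_bytes = ACTIVE_PARAMS * BYTES_PER_PARAM  # ~2.55 GB read per token
    return bandwidth_gb_s * 1e9 / active_bytes

# Approximate peak bandwidths (assumed, for illustration):
for name, bw in [("H100 (HBM3, ~3350 GB/s)", 3350),
                 ("M4 Max (~546 GB/s)", 546),
                 ("Ryzen AI Max+ 395 (~256 GB/s)", 256)]:
    print(f"{name}: <= ~{decode_ceiling_tok_s(bw):.0f} tok/s ceiling")
```

The ordering matches the vendor-reported numbers later in this piece: AMD's ~30 tok/s demo sits comfortably under a ~100 tok/s theoretical ceiling, while the H100's far higher bandwidth explains its large per-token cost advantage.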
6) Consider Windows/edge options only if you already own the hardware
AMD reports GPT‑OSS‑120B running at up to ~30 tok/s on Ryzen AI Max+ 395 with 128 GB memory using GGML‑converted MXFP4 weights. If you already have such a machine, that’s a cost‑effective “free” inference path; if not, buying one solely for 120B rarely beats renting H100 time. Use 20B on more common client devices for much better economics.
7) Cloud procurement tips to minimize cost
Choose spot/preemptible H100 instances and autoscaling to match traffic.
Co‑locate storage near compute and pre‑load weights to reduce cold‑start egress.
Use a single high‑throughput model replica before scaling horizontally; better batching usually beats more replicas for cost.
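The batching-beats-replicas point falls out of simple cost-per-token arithmetic. A small sketch with illustrative prices and throughputs (not real quotes):

```python
# Cost per million output tokens = hourly price / (tokens/s * 3600) * 1e6.
# Prices and throughputs below are illustrative assumptions, not quotes.
def usd_per_million_tokens(hourly_usd: float, tok_per_s: float) -> float:
    return hourly_usd / (tok_per_s * 3600.0) * 1e6

print(usd_per_million_tokens(3.00, 500))   # one replica, 500 tok/s  -> ~$1.67/M
print(usd_per_million_tokens(6.00, 1000))  # two replicas, 2x cost   -> ~$1.67/M (unchanged)
print(usd_per_million_tokens(3.00, 1000))  # better batching, 1 GPU  -> ~$0.83/M (halved)
```

Adding replicas scales capacity but leaves cost per token flat; improving batch throughput on the replica you already pay for is what actually moves the bill.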
8) When 120B isn’t strictly necessary, downshift for ROI
Even OpenAI and ecosystem docs position 20B for consumer/on‑device and 120B for heavier reasoning; benchmark your use case—many teams find 20B or a good 30–70B MoE meets SLAs at a fraction of the cost.
Minimal reference setup (cost‑efficient)
Hardware: 1× H100 80 GB (cloud) or equivalent enterprise GPU.
Runtime: vLLM or Transformers with MXFP4 kernels, OpenAI‑compatible serving.
Ops: Turn on batching, streaming, and limit context; monitor tokens/sec and TTFT to tune batch size.
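To see why limiting context matters, here is a rough KV-cache sizing sketch. The layer/head/dim defaults are illustrative placeholders, not the model's actual configuration; read the real values from the model's `config.json` before relying on the numbers.

```python
# KV cache grows linearly with context length and batch size, which is why
# capping context directly lowers memory pressure and cost.
# Layer/head/dim defaults are illustrative -- check the model's config.json.
def kv_cache_gb(context_len: int, batch: int, n_layers: int = 36,
                n_kv_heads: int = 8, head_dim: int = 64,
                bytes_per_elem: int = 2) -> float:
    """Approximate KV cache size: one K and one V tensor per layer per token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return context_len * batch * per_token / 1e9

print(f"{kv_cache_gb(8_192, batch=8):.1f} GB")    # ~4.8 GB: 8k context, batch 8
print(f"{kv_cache_gb(128_000, batch=8):.1f} GB")  # ~75.5 GB: long context, same batch
```

Under these assumed shapes, a full 128k context at batch 8 would consume more memory than the quantized weights themselves — which is exactly why paged KV cache and context caps are the first knobs to turn.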
When to choose managed platforms
- You want to avoid DevOps and get a predictable serving stack with vLLM + UI + autoscaling in minutes, paying only for GPU time.
When to choose local/edge
- You already own capable hardware (e.g., Ryzen AI Max+ 395 128 GB) and your workload tolerates client‑side latency and power usage. Otherwise, use 20B locally and keep 120B in the cloud.
Bottom line
The most economical path for gpt‑oss‑120B is a single H100‑80GB with MXFP4 and an optimized runtime (vLLM/Transformers), preferably via a managed, on‑demand stack so you only pay when you serve traffic. If you don’t truly need 120B, run 20B instead for 5–10x lower cost while retaining strong performance for most tasks.
GPT‑OSS‑120B Inference Speed Comparison: AMD Ryzen AI Max+ 395 vs Apple M4 Max (128GB) vs NVIDIA H100
The table below contrasts real-world and vendor-reported throughput, memory fit, and practical notes when running gpt‑oss‑120B on three popular platforms. Where available, figures are cited directly from vendor blogs and benchmark pages; for the M4 Max, community reports indicate the model is feasible but bandwidth‑limited and generally slower than datacenter GPUs.
| Platform | Reported/Expected Throughput (tokens/s) | Memory Fit & Precision | Runtime Stack Notes | Strengths | Limitations |
|---|---|---|---|---|---|
| AMD Ryzen AI Max+ 395 (128GB, Radeon 8060S iGPU) | Up to ~30 tok/s (LM Studio, GGML‑converted MXFP4) | GGML MXFP4 ~61 GB weights fit into up to 96 GB converted VRAM; requires Adrenalin 25.8.1+ | LM Studio / llama.cpp with AMD Variable Graphics Memory; Windows | First consumer CPU+iGPU platform publicly shown running GPT‑OSS‑120B with usable speed; MCP support enabled by large memory | Laptop/edge thermals; far below datacenter GPU throughput; Windows driver dependency |
| Apple M4 Max (128GB unified memory) | Feasible but typically slower than Ryzen AI Max+ 395 and much slower than H100; highly memory‑bandwidth bound at 120B scale | 4‑bit weights may fit in unified memory, but KV cache growth and bandwidth limit tokens/s | Metal (llama.cpp/Ollama/Transformers MPS); macOS | Excellent for 20B–70B; great efficiency | Unified memory bandwidth and thermals constrain 120B; less mature kernels for ultra‑large MoE paths |
| NVIDIA H100 80GB (datacenter GPU) | Single‑GPU: fast, commonly tens to low hundreds tok/s depending on stack/batch; Blackwell GB200 NVL72 reference shows 1.5M TPS (system-level) | Ships in FP4/MXFP4; 120B fits on a single 80 GB GPU; trained on H100 | vLLM / TensorRT‑LLM / Transformers; highly optimized kernels | Best $/throughput at scale; strong batching, low TTFT; production‑grade | Requires datacenter class hardware or cloud rental; higher absolute cost vs client devices |
Key details and sources
AMD states the Ryzen AI Max+ 395 (128 GB) can run GPT‑OSS‑120B using GGML‑converted MXFP4 weights that occupy roughly 61 GB, fitting into up to 96 GB VRAM via AMD Variable Graphics Memory, achieving up to ~30 tokens/sec with LM Studio, provided Adrenalin 25.8.1 WHQL or newer drivers are installed.
The Ryzen AI Max+ 395 platform offers up to 128 GB unified memory, with up to 96 GB convertible to VRAM, enabling 120B-class local inference and even MCP workflows thanks to the large memory headroom.
NVIDIA reports GPT‑OSS models are FP4/MXFP4 and fit on a single 80 GB data‑center GPU, with massive system-level throughput of up to 1.5 million tokens per second on GB200 NVL72; these optimizations span TensorRT‑LLM, vLLM, and Transformers.
H100 product documentation highlights large LLM acceleration and low‑latency inference capabilities designed for these workloads, explaining the observed throughput advantages at 120B scale.
Community and platform notes consistently show that Apple Silicon runs small to mid‑size models very well, but 120B models tend to be bandwidth‑limited on laptops; while feasible on a 128 GB M4 Max, token/s is typically below AMD’s Max+ 395 figures and far below H100.
Northflank’s deployment guidance indicates GPT‑OSS‑120B is intended for H100-class hardware for smooth, high-performance serving; their one‑click stack targets multi‑GPU or H100 setups with vLLM for best results.
CPU benchmark context for Ryzen AI Max+ 395 confirms it’s a high‑end mobile platform with strong single‑thread performance and laptop‑class power envelopes, aligning with AMD’s local LLM positioning.
Hugging Face’s overview reiterates that GPT‑OSS‑120B is an MoE model with 4‑bit quantization designed to fit on a single H100, while GPT‑OSS‑20B fits consumer hardware more readily; this architectural intent explains the throughput disparities across platforms.
Practical guidance
For the highest throughput and lowest cost per token at scale, use H100 (or newer Blackwell systems) with vLLM or TensorRT‑LLM, and leverage batching and paged KV cache.
For local “edge” feasibility, the AMD Ryzen AI Max+ 395 (128 GB) currently provides the most usable 120B-class experience on a consumer platform, at roughly 30 tok/s in vendor demos.
On Apple M4 Max 128 GB, prefer smaller models (20B or mid‑range MoE) for responsiveness; 120B will run but expect noticeably slower tokens/s due to unified memory bandwidth limits and thermal headroom on laptops.
Ready to get the best performance per dollar for your AI workloads? Tenten can help you benchmark on your hardware, tune vLLM/TensorRT‑LLM stacks, and design a hybrid local–cloud strategy tailored to your product. Book a meeting: https://tenten.co/contact






