Which Chip is Best? - Sam's Bait and Switch
There may not be an outright best GPU, but there's one that suits your needs.

There is no single best GPU. Sorry to disappoint you. Why bother reading this then? As I discussed in my previous post, different chips have been designed to win different races in different ways. I gave you the mental model for why chips feel fast or slow - this follow-up grows our model to help you compare the current contenders in the marketplace without boring you with endless benchmark tables (hopefully).
With some audited stats and high-level ecosystem recaps, you’ll see why chip performance diverges the way it does and when you might want to use each chip. At the end I’ll hand out participation trophies but can’t give a gold medal in good conscience.
Setting the Stage for Comparison
Disclaimer 1
The chips that I’ll present to you are popular and emergent players in the space - a tasting menu of what enterprise, cloud, and sovereign buyers are considering today. I haven’t included chips that are too niche or ones that have been designed for a specific company (e.g., Google’s TPU).
Disclaimer 2
Chip vendors cherry-pick their data to make themselves look good. If you’re not careful, you’ll make your buying decision based on numbers that only hold under lab-perfect conditions. So we need a referee.
The referee I’ve chosen for this post is MLPerf, a robust and widely audited benchmark. MLPerf runs common AI workloads on various hardware offerings under strict conditions - their comparisons are well controlled; that’s why I trust them. From their various workload offerings, I’ll be referencing benchmark data from chips running Llama-2-70B, a model size big enough to stress GPUs the way enterprise deployments might.
MLPerf tests chips in two scenarios that are relevant for us:
- Server: This is the “chatbot mode.” It assumes users arrive one after another, some fast, some slow, and the system has to juggle requests without missing deadlines. Server mode is all about latency. If you’re building a customer-facing assistant or a real-time product, this is the one that feels closest to your reality. TL;DR: Latency-sensitive, human-facing.
- Offline: This is the “bulk mode.” Think: pre-computing embeddings or generating text for a dataset. Here there’s no arrival pressure, so the system can chew through tokens as fast as possible. Offline mode tells you about raw throughput. TL;DR: Throughput-optimized, machine-facing.
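If the distinction still feels abstract, here’s a toy queueing sketch - not how MLPerf’s load generator actually works, and the arrival rate, request size, and per-GPU token rate are all made up - showing what each scenario measures: offline reports raw tokens per second, server reports whether requests come back quickly even when arrivals bunch up.

```python
import random

TOKENS_PER_SECOND = 2_500   # hypothetical per-GPU decode rate
REQUEST_TOKENS = 200        # hypothetical tokens generated per request

def offline_throughput(num_requests: int) -> float:
    """Offline scenario: all work is queued up front; report raw tokens/s."""
    total_tokens = num_requests * REQUEST_TOKENS
    return total_tokens / (total_tokens / TOKENS_PER_SECOND)  # trivially the device rate

def server_p99_latency(num_requests: int, arrivals_per_second: float) -> float:
    """Server scenario: requests arrive at random (Poisson); report p99 latency."""
    clock, next_free, latencies = 0.0, 0.0, []
    for _ in range(num_requests):
        clock += random.expovariate(arrivals_per_second)  # next user shows up
        start = max(clock, next_free)                     # wait if the device is busy
        next_free = start + REQUEST_TOKENS / TOKENS_PER_SECOND
        latencies.append(next_free - clock)               # queueing + generation time
    latencies.sort()
    return latencies[int(0.99 * len(latencies))]

print(f"Offline: {offline_throughput(10_000):,.0f} tokens/s")
print(f"Server p99 latency: {server_p99_latency(10_000, 11.0) * 1000:.1f} ms")
```

The same hardware can look great in one column and mediocre in the other, which is exactly why MLPerf reports both.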
Disclaimer 3
There are many software configurations that can impact chip performance. When talking about chip performance here, assume these chips are running their proprietary software, nothing funky.
Disclaimer 4
I’m centering my chip analysis predominantly around inference on purpose. Most of the major players’ spending goes to serving tokens, not training models. Training matters, but it’s a different stress test that benefits from many of the design choices that optimize token serving - I’ll cover it briefly through that lens. But at the end of the day, the buying decision most teams face is simple: how quickly and cheaply can I serve my models?
A Quick Refresher on the “Hot Path”
I promise not to recap my Bits and Bolts Primer, but let’s make sure we’re speaking the same language before we continue. Transformers, the architecture behind models like Llama and GPT, spend almost all their time doing matrix multiplications. The technical name for this is GEMM, but you can think of it as repeatedly multiplying giant tables of numbers together.
There are two stages that matter for comparing chips in inference:
- Prefill: when you feed the model a big prompt. This stage is bandwidth-hungry. The GPU has to sling data back and forth between memory and compute units without stalling. The bigger the prompt or the batch, the more painful this gets.
- Decode: when the model spits out tokens one by one. This stage is sequential. Each new token depends on the last, so you can’t parallelize as much. Decode performance depends on compiler tricks, batching smarts, and sometimes speculative decoding.
Remember: prefill speed is limited by either raw math power or memory bandwidth, whichever runs out first.
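To make the two stages concrete, here’s a toy sketch in plain NumPy - not a real transformer. A single made-up weight matrix stands in for the whole model, and real decode also reads the KV cache covered next; the point is only the shape of the work: one big matrix multiply for prefill, many tiny dependent ones for decode.

```python
import numpy as np

d_model = 4096                                               # hidden size of a pretend model
W = np.random.randn(d_model, d_model).astype(np.float32)     # stand-in for all the weights

def prefill(prompt_embeddings: np.ndarray) -> np.ndarray:
    # One large GEMM: (prompt_len, d_model) @ (d_model, d_model).
    # The whole prompt is processed in parallel, which keeps compute
    # units and memory bandwidth busy.
    return prompt_embeddings @ W

def decode(last_hidden: np.ndarray, steps: int) -> list:
    # Many tiny GEMMs: (1, d_model) @ (d_model, d_model), one per token.
    # Each step depends on the previous one, so the chain is sequential.
    outputs = []
    for _ in range(steps):
        last_hidden = last_hidden @ W
        outputs.append(last_hidden)
    return outputs

prompt = np.random.randn(2_048, d_model).astype(np.float32)  # a 2k-token prompt
hidden = prefill(prompt)                                     # one big multiply
tokens = decode(hidden[-1:], steps=128)                      # 128 dependent multiplies
print(hidden.shape, len(tokens))
```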
Context Windows, Because I Forgot
I forgot to mention this AI workflow concept in my previous post; this is what happens when you try to split one write-up into two!
There’s a third axis in inference that determines performance: how long the context window is. Every token a model has “seen” has to be stored in memory, and as the window stretches from a few thousand tokens to hundreds of thousands, the storage consumed grows linearly. This storage is the KV cache, which at long contexts becomes far and away the biggest consumer of memory during inference.
The longer the window, the more it stresses a chip’s capacity and bandwidth, requiring fancy solutions to not “forget” these stored tokens as a chip’s KV cache fills. A chip that performs well at 4k tokens might buckle at 400k - some designs absorb that pressure more gracefully than others.
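Here’s a rough back-of-the-envelope for why long windows hurt, as a minimal sketch. The shape assumptions (80 layers, 8 KV heads via grouped-query attention, 128-dim heads, 16-bit values) are roughly Llama-2-70B-like; exact numbers vary by model, quantization, and batch size, so treat the output as illustrative.

```python
def kv_cache_bytes(context_tokens: int,
                   n_layers: int = 80,       # Llama-2-70B-ish depth
                   n_kv_heads: int = 8,      # grouped-query attention
                   head_dim: int = 128,
                   bytes_per_value: int = 2  # fp16 / bf16
                   ) -> int:
    """KV-cache memory for ONE sequence: a key and a value vector
    per token, per layer, per KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * context_tokens

for tokens in (4_000, 32_000, 400_000):
    print(f"{tokens:>7,} tokens -> ~{kv_cache_bytes(tokens) / 1e9:5.1f} GB of KV cache")
# ~1.3 GB at 4k, ~10.5 GB at 32k, ~131 GB at 400k - the last no longer fits
# in a single H100's 80 GB even before the weights are counted.
```

Multiply that by batch size and it becomes clear why capacity and bandwidth, not raw FLOPS, decide who handles long contexts gracefully.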
Inference Comparison: Who’s In The Arena, Trying Stuff?
NVIDIA H100: The Incumbent
NVIDIA’s H100 is the chip everyone else measures against. It ships with 80 GB of HBM3 and touts 3.35 TB/s of bandwidth. Inside an HGX server, NVIDIA’s 8-GPU building block, NVLink and NVSwitch give each GPU ~900 GB/s of peer-to-peer bandwidth, so sharded models run more smoothly.
NVIDIA’s moat is software. Its CUDA platform, TensorRT-LLM inference library, and easy-to-scale Triton Inference Server make the H100 the least risky choice if you want a GPU that can scale reliably. As a result, most enterprise fleets still lean heavily on it.
TL;DR: NVIDIA’s H100 is a safe default. Balanced across workloads, easy to scale, with the most polished software ecosystem.
AMD MI300X: The Memory Monster
AMD’s MI300X is built around capacity. Each GPU carries 192 GB of HBM3 and ~5.3 TB/s of bandwidth. That’s more than double the H100’s memory.
This design pays off when prompts are long or batches are large. With the ability to hold more of a model on each GPU, fewer shards are needed, translating into smoother throughput. In MLPerf v5.0, a 32-GPU MI300X run hit 103,182 tokens/s (Offline) and 93,039 tokens/s (Server) - the best public numbers so far.
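To see how that capacity changes the shard math, here’s a minimal sketch assuming 16-bit weights and a made-up 40 GB KV-cache budget; real deployments also need room for activations and runtime overhead, so read it as a floor, not a deployment plan.

```python
import math

def min_gpus(params_billion: float, hbm_gb: float,
             kv_cache_gb: float = 0.0, bytes_per_param: int = 2) -> int:
    """Lower bound on GPUs needed just to hold the weights plus a KV-cache budget."""
    total_gb = params_billion * bytes_per_param + kv_cache_gb
    return math.ceil(total_gb / hbm_gb)

# Llama-2-70B at 16-bit is ~140 GB of weights before any KV cache.
for name, hbm in (("H100 (80 GB)", 80), ("MI300X (192 GB)", 192)):
    print(f"{name}: at least {min_gpus(70, hbm, kv_cache_gb=40)} GPU(s)")
# H100: at least 3; MI300X: at least 1 - fewer shards, less cross-GPU traffic.
```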
The catch with AMD is software - its ROCm stack has matured quickly, but its developer ecosystem is smaller than CUDA’s, so tuning the MI300X for your needs takes more work.
TL;DR: AMD’s MI300X is a prefill monster, with the most memory - best when bandwidth is a limiting factor.
Intel Gaudi 3: (Literally) Plug-and-Play
Intel’s Gaudi 3 doesn’t play the virtual scaling game. Instead, Intel builds Ethernet ports into their chip. That physical connection makes scale-up cheaper and simpler for data centers that already run Ethernet networks.
Performance lags the H100 and MI300X - each Gaudi 3 carries 128 GB of HBM and moves about 3.7 TB/s.
TL;DR: Intel’s Gaudi 3 isn’t the fastest big player, but it’s cheaper and easier to procure.
Cerebras WSE-3: The One-Chip Cluster
Who needs a cluster when your chip is massive? This is the Cerebras play; their WSE-3 has 900,000 cores and enormous on-chip SRAM. Instead of dozens of GPUs, you get one vast device - no need to shard your model! Only sovereign initiatives like DARPA buy the WSE-3 outright; most customers rent compute from Cerebras’ managed cloud, which coordinates WSE-3s in its Condor Galaxy configuration, a managed supercluster capable of 16 exaFLOPS of compute. This arrangement makes pricing predictable and the service powerful and easy to adopt, but less flexible.
TL;DR: If you’re looking for powerhouse chips capable of smooth prefill, less scheduling variance, and a managed scaling service, Cerebras’ WSE-3 is for you.
Groq LPU: The Specialist
Groq’s LPU skates to one song and one song only - reliable high throughput and low latency for LLMs. Groq’s ASIC design is “deterministic”; instead of relying on runtime software to coordinate its operations, the LPU’s compiler - the software that translates a model into the chip’s instructions, scheduled ahead of time - does the heavy lifting. With all operational coordination allocated to the compiler, Groq has unparalleled control over where and how tokens are generated, making token latency predictable and ensuring virtually no jitter. As a result, LPUs have smoother decode tails than anyone else on the market.
TL;DR: A heavy-hitting ASIC for jitter-free LLM inference.
Tenstorrent Wormhole: Open (Source) for Business
Tenstorrent’s Wormhole is a RISC-V-based ASIC with specialized Tensix cores and Ethernet links like the Gaudi 3. Tenstorrent’s software stack is open (tt-metal, TTNN), designed for developers who want to customize their setup down to the kernel. This openness is great, but its performance leaves much to be desired: ~120 MB of on-chip SRAM and 12 GB of GDDR6 on a 192-bit bus. In practice, this is far too small and slow to run big models.
TL;DR: The tinkerer’s GPU. Choose it to learn and prototype, not to serve a 70B model.
Cambricon Siyuan 690: What China’s Working With
Cambricon is China’s flagship accelerator vendor, and its Siyuan 690 is China’s answer to NVIDIA’s H100. Not much is known about the 690’s performance, interconnect methods, or pricing - what we do know is that it is built around the MagicMind/NeuWare stack and is the national favorite over chips from other Chinese companies like Huawei for AI workloads.
TL;DR: The practical choice in China.
Training
For inference, customers need to know how quickly a chip can respond to requests. For training, customers need to measure something different: how efficiently hundreds or thousands of chips can work together for weeks at a time. Training workloads are larger, their timelines are longer, and the coordination burden is heavier. A GPU with strong FLOPS stats but weak interconnects spends much of its time waiting in a cluster. This is why training benchmarks diverge from inference results; raw compute doesn’t matter if a chip can’t communicate efficiently at scale.
NVIDIA H100
What makes the H100 solid at inference (polished kernels, stable runtime) compounds in training because NVLink/NVSwitch keep in-node sync fast and CUDA/NCCL hide a lot of orchestration pain. That’s why large clusters tend to scale cleanly here. Just as the H100 is “balanced and boring” when serving tokens, it is “predictable and scalable” in training - same moat, different flow.
AMD MI300X
The 192 GB and ~5.3 TB/s that make the MI300X great at prefill also reduce shard count in training. Fewer shards mean less cross-GPU chatter, which means fewer opportunities to stall out. You still work harder on software than with CUDA, especially beyond a single node, but memory headroom is real leverage. The design that helped you swallow long prompts helps you fit larger layers per device, lowering your synchronization tax.
Intel Gaudi 3
The Ethernet-first design that makes Gaudi economical at inference shows up in training as well. You can scale out your cluster for cheap, at the cost of higher fabric latency than proprietary virtual fabrics. With careful topology and tuning, you can absolutely train on Gaudi 3s; the pitch is cost and sovereignty, not best-in-class time to full model development. (Intel CDRD)
Cerebras WSE-3
The “fewer, fatter nodes” design that removes sharding pain in inference removes communication pain in training too. You rent that smooth performance via Condor Galaxy and skip a lot of cluster plumbing.
Who doesn’t train?
You might notice that some of the chips we’ve discussed for inference don’t appear in the training discussion at all, and for good reason - they’re ASICs that simply aren’t designed for training.
Groq: Its deterministic pipeline is designed for low-latency inference exclusively. Training would require flexible, all-to-all communication it doesn’t provide.
Tenstorrent Wormhole: These developer-friendly boards are meant for kernel and compiler work, not racks of synchronized training.
How to Price Chips, As Well as One Can
Benchmarks tell you how fast a chip can perform an operation, but you need to know how much everything will cost! Pricing across vendors is a closed system - most don’t publish clean or standardized total cost of ownership on their websites, and purchase prices vary drastically. Instead, I’ll use my own model to give you a sense of how to evaluate chip run-time pricing with the sparse data we have.
The simplest way to compare across chip platforms is to normalize performance into dollars per million tokens, using on-demand GPU-hour prices and audited per-GPU throughput on the same model (in this case Meta’s Llama-2-70B).
My formula:
cost per million tokens = GPU-hour price ÷ (tokens/s per GPU × 3,600 s/hour) × 1,000,000
This model ignores the cost of power to run the chips, the depreciation of chip value, and utilization costs, but it gives you a comparable baseline across vendors.
I’ll proceed by showing you how these calculations shake out for three of our covered chips - why only three? NVIDIA and AMD have the most widely audited throughput and widely available per-GPU cloud pricing, given their maturity in the marketplace. I’ve included Intel; despite Gaudi 3’s relatively sparse data, its predecessor, Gaudi 2, is purportedly similarly performant. I’ve excluded Groq and Cerebras from this section because they are predominantly sold as managed services with posted token prices or service SLAs - a different buying motion.
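If you’d rather read the formula as code, here’s a minimal sketch of the same calculation; the prices and per-GPU throughputs plugged in are exactly the ones used in the worked examples that follow, not new data.

```python
def dollars_per_million_tokens(gpu_hour_price: float, tokens_per_sec_per_gpu: float) -> float:
    """On-demand cost per million tokens: hourly rent divided by tokens produced per hour."""
    tokens_per_hour = tokens_per_sec_per_gpu * 3_600
    return gpu_hour_price / tokens_per_hour * 1_000_000

# Same inputs as the worked examples below (on-demand price, MLPerf-derived per-GPU throughput).
print(f"H100:    ${dollars_per_million_tokens(5.00, 2_585):.2f} per million tokens")  # ~$0.54
print(f"MI300X:  ${dollars_per_million_tokens(3.46, 3_220):.2f} per million tokens")  # ~$0.30
print(f"Gaudi 3: ${dollars_per_million_tokens(2.50, 2_400):.2f} per million tokens")  # ~$0.29
```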
NVIDIA H100
On AWS, a single H100 SXM in a p5 instance lists around $4.10–$5.00/hour per GPU. From MLPerf Inference v5.0, the best public 32-GPU H100 run delivered 82,749 tokens/s Offline on Llama-2-70B, which works out to ~2,585 tokens/s per GPU. Plugging into our formula:
$5.00/hour ÷ (2,585 tokens/s × 3,600 s/hour) ≈ $0.00000054 per token
Scale to a million tokens, and you get ~$0.54 per million tokens.
This does not include reserved capacity discounts from cloud vendors (often 30–50%), but it sets a fair on-demand baseline. Compared to AMD and Intel, the H100’s price is higher, but customers pay for workload reliability.
AMD MI300X
Azure recently listed MI300X VMs at $3.46/hour per GPU in its East US2 region. MangoBoost’s MLPerf Inference v5.0 run reported 103,182 tokens/s offline for a 32-GPU MI300X cluster, ~3,220 tokens/s per GPU. Again, let’s use our formula:
$3.46/hour ÷ (3,220 tokens/s × 3,600 s/hour) ≈ $0.0000003 per token
Scale to a million tokens, and you get ~$0.30 per million tokens.
This undercuts NVIDIA’s H100 by roughly 45%. Because the MI300X carries 192 GB of HBM3, it also reduces shard count for long prompts, which saves additional cross-GPU overhead. But remember, while the MI300X may be cheaper per token, customers should price in the work required to deal with AMD’s ROCm ecosystem.
Intel Gaudi 3
Intel has been explicit about pricing Gaudi hardware below NVIDIA to win market share. Dev kits are priced significantly lower, and cloud pricing is expected to follow. If we assume $2.50/hour per GPU (consistent with Intel’s positioning vs. H100) and extrapolate ~2,400 tokens/s per GPU from Gaudi 2 MLPerf v2.1 results on BERT and ResNet-50, then
$2.50/hour ÷ (2,400 tokens/s × 3,600 s/hour) ≈ $0.00000029 per token
Once again, scale to a million tokens, and you get ~$0.29 per million tokens.
Audited MLPerf v5.0 inference numbers for Gaudi 3 are not yet public, so this is an informed estimate. Intel’s prices are attractive where rack-level economics and commodity networking matter more than peak chip performance.
Sensitivity and reality
These pricing numbers don’t account for two critical variables:
1. Utilization: A runtime that keeps GPUs 70% busy can halve effective cost compared to one stuck at 30% (a quick sketch after this list shows the effect). Effective multitenancy (workload distribution) is where runtime maturity has real economic impact.
2. Rental pricing: Reserved and spot instance rental from neoclouds and hyperscalers can cut prices by another 30–50%. A buyer with steady demand but not enough capital to buy their own GPUs up front will pay far less than a company that buys their stack. Furthermore, GPU buyers are stuck with chips that may be outperformed by future generations of chips. Many such cases.
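On the utilization point, here’s a minimal sketch of how idle time inflates the effective bill; it simply divides the on-demand H100 figure from above by the fraction of time the GPU is doing useful work, and the utilization levels are illustrative, not measured.

```python
def effective_cost_per_million(on_demand_per_million: float, utilization: float) -> float:
    """Cost per million *useful* tokens when the GPU sits idle part of the time."""
    return on_demand_per_million / utilization

h100_on_demand = 0.54  # the on-demand H100 figure from the pricing section
for utilization in (0.30, 0.50, 0.70):
    cost = effective_cost_per_million(h100_on_demand, utilization)
    print(f"{utilization:.0%} busy -> ${cost:.2f} per million tokens")
# 30% busy -> $1.80, 50% -> $1.08, 70% -> $0.77: same silicon, very different economics.
```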
Markets & constraints
Price tells you who’s cheap, but there are other reasons you might not buy a chip.
Managed Services: Renting Outcomes Out of the Box
Groq offers clear pricing and performance because it mostly rents its chips out through its own cloud rather than selling hardware. And those numbers are quite competitive: ~276 tokens/s per stream, $0.59/M input, and $0.79/M output. GroqRack exists for edge-case customers, but self-deployment is rare.
Cerebras does the same at a bigger scale. Trying to build a ~16 exaFLOPS stack is a tall order; Cerebras’ managed solution guarantees massive performance and consistent service level agreements without setup headaches.
Virtual Fabrics: The Lock-in Trap
Buyers are wary of committing to NVIDIA’s and AMD’s stacks because once you’ve designed your systems around their scaling fabrics, it’s difficult to switch off their platforms. These platforms aren’t simple to integrate, either, and you need engineers with the know-how to work with CUDA, NVLink, or ROCm to make full use of H100s and MI300Xs, respectively. Ethernet solves this problem. Gaudi 3 and Tenstorrent’s Wormhole aren’t the highest-performing chips on the market, but choosing them is a calculated trade - you swap a bit of peak efficiency for commodity networking, vendor independence, and easier staffing.
Export Controls: When Policy Picks Your Chip
U.S. regulations restrict shipments of advanced GPUs to China, so Chinese buyers evaluate domestic silicon on availability and software stack, not performance parity to cutting-edge models. It remains to be seen whether Chinese chips can or will catch up to American offerings in performance, but for the moment, China has to work with what they’ve got.
Come On, Which Chip Is Best, Though?
Once again, I’m not awarding a gold medal.
- NVIDIA H100: the safe default, balanced across prefill and decode with the strongest software moat.
- AMD MI300X: the memory monster, best for long prompts and batch-heavy inference.
- Intel Gaudi 3: the economics play, attractive for sovereign buyers and Ethernet-native data centers.
- Cerebras WSE-3: fewer, fatter, easy-to-implement nodes.
- Groq LPU: inference reliability, perfect for latency-sensitive products.
- Tenstorrent Wormhole: the tinkerer’s board, open and flexible for smaller-scale workloads.
My previous post was about how GPUs work; this essay is about how your workload decides which chip works best for you. There’s no single winner, and the right choice depends on whether you care most about prefill throughput, decode latency, context length, networking economics, or just having something you can actually buy.