Which Chip is Best? - Sam's Bits and Bolts Primer

Which graphics processing unit is best? I didn't know how to answer that question, or what that even meant. Here's a distillation of some materials that got me up to speed.

The more I tried to figure out how to compare GPUs, the more I realized how little I knew about how GPUs actually work. As part of my learning, I wrote a full write-up of 20-ish pages, but nobody has time to read that, so I’ve broken my monstrous write-up into two parts: this primer and a comparison of the GPU ecosystem. Depending on how much you already know, reading the introduction and the TL;DR at the end may be enough to get you ready for the next essay.

Introduction

Strip away the hype, and a graphics processing unit (GPU, or colloquially, “chip”) is a machine designed to do simple math millions of times over, very quickly. GPUs were originally intended to render computer graphics (surprise!), which requires solving massive numbers of linear algebra problems in parallel to shade millions of pixels and draw thousands of triangles at once. A GPU’s ability to “parallel process” with thousands of cores makes it better suited for these tasks than a central processing unit (CPU), which is more like a single, fast, general-purpose brain.

Artificial intelligence (AI) workloads are also built on linear algebra, which is why GPUs became the default hardware accelerator for AI. Kind of funny that the same architecture that once drew dragons in your favorite game can now generate them on demand, no?

Going Deep: How does AI use a Chip?

Tensors

To understand how AI makes use of GPUs, you first need to understand the transformer architecture.

Transformers, the neural networks powering today’s large language models, are massive stacks of linear algebra. Every prompt sent to an AI model is broken into tokens, small text chunks like “walk” or “run,” which an embedding layer maps into numbers. These numbers are multiplied against billions of learned weights stored in matrices, parameters that encode statistical relationships discovered during training. Running a prompt through a model is mostly a matter of applying those matrices over and over, layer by layer, to turn token relationships into predictions.
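
To make “stacks of linear algebra” concrete, here is a minimal NumPy sketch of that flow. The sizes are toy numbers and the weights are random; a real transformer adds attention, many layers, and billions of learned parameters, but the bones are the same:

```python
import numpy as np

# Toy sizes; real models use vocabularies of ~100k tokens and hidden
# sizes in the thousands.
vocab_size, d_model, seq_len = 1000, 64, 8

rng = np.random.default_rng(0)
embedding = rng.normal(size=(vocab_size, d_model))  # the embedding layer's lookup table
W_layer = rng.normal(size=(d_model, d_model))       # one layer's weight matrix

token_ids = rng.integers(0, vocab_size, size=seq_len)  # a "tokenized" prompt
x = embedding[token_ids]                 # map tokens to vectors: (seq_len, d_model)
x = np.maximum(x @ W_layer, 0.0)         # one layer = a big matrix multiply + a nonlinearity

logits = x @ embedding.T                 # score every vocabulary entry for each position
next_token = int(np.argmax(logits[-1]))  # the model's guess for the next token
```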

Training

Training is where a model is created, running the aforementioned process forward and then backward. The forward pass is straightforward: tokens flow through each layer, and attention calculates relationships between every token and every other token, all implemented as massive matrix multiplications. This is followed by a backward pass, where “learning” happens. After a model predicts the next token, a loss function measures how wrong it was. Backpropagation works backward through the network, calculating gradients that show how much each weight contributed to the error. Those gradients are aggregated across multiple GPUs, and an optimizer updates the weights accordingly. Repeat this billions of times, and you get an AI model.
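
Here is roughly what one training step looks like in PyTorch, using a tiny stand-in model rather than a real transformer; the shapes and hyperparameters are arbitrary, and the multi-GPU gradient exchange is only indicated in a comment:

```python
import torch
import torch.nn as nn

# A tiny stand-in for a transformer; a real model has billions of weights.
model = nn.Sequential(nn.Embedding(1000, 64), nn.Linear(64, 1000))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, 1000, (8, 16))     # a batch of token IDs
targets = torch.randint(0, 1000, (8, 16))    # the "next tokens" we want predicted

logits = model(tokens)                                         # forward pass: matrix multiplies, layer by layer
loss = loss_fn(logits.reshape(-1, 1000), targets.reshape(-1))  # how wrong was the model?
loss.backward()                                                # backward pass: a gradient for every weight
# In multi-GPU training, gradients would be averaged here with an all-reduce
# (e.g. torch.distributed.all_reduce) before the update.
optimizer.step()                                               # the optimizer nudges the weights
optimizer.zero_grad()                                          # reset for the next batch
```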

Inference

Inference is the process of using a trained model. When a trained model generates text, it first processes your prompt in a phase called prefill, where every prompt token is pushed through every layer in parallel; because so much math happens at once, prefill mostly stresses raw compute. Once that’s done, the model moves into decode, generating one token at a time, each step depending on the one before. Decode is latency-sensitive and bandwidth-bound: for every new token, the model’s weights have to be re-read from memory, along with an internal key-value cache that stores previous attention results so they don’t need to be recomputed. On large deployments, that cache is often split across GPUs, so every token generation involves fetching pieces over the interconnect between GPUs.
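
Here is a stripped-down sketch of prefill and decode, assuming a single attention head with random weights and no output projection, just to show why decode is sequential and leans on the key-value cache:

```python
import numpy as np

d = 64
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

def attend(q, K, V):
    """One attention read: score the query against every cached key."""
    scores = q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

# Prefill: process the whole (already embedded) prompt at once, filling the KV cache.
prompt = rng.normal(size=(10, d))
K_cache, V_cache = prompt @ W_k, prompt @ W_v

# Decode: one token at a time, each step reusing and growing the cache.
x = prompt[-1]
for _ in range(5):
    q = x @ W_q
    x = attend(q, K_cache, V_cache)           # reuse cached attention state
    K_cache = np.vstack([K_cache, x @ W_k])   # append this token's key...
    V_cache = np.vstack([V_cache, x @ W_v])   # ...and value for future steps
```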

Every aspect of GPU design is optimized to speed up these steps.

Going Deeper: Inside a GPU

Cores and Kernels: The Number Crunchers

A GPU does math by coordinating a swarm of small cores, or math-crunching units. Unlike CPU cores, which are large and versatile, GPU cores are small and specialized, though GPU core design has been trending toward CPU-like complexity in the pursuit of higher floating-point operations per second (FLOPS), at the cost of greater power draw.

Core-work on a GPU is expressed as a thread, a sequence of instructions acting on one piece of data. Threads are grouped into warps (32 threads on NVIDIA’s chips) or wavefronts (64 on AMD’s chips). All threads in a warp execute the same math instruction at the same time, each on its own slice of an array of data - in other words, they run in parallel. This single-instruction, multiple-data compute model is what makes GPUs so good at linear algebra.

Modern GPUs accelerate this further with Tensor Cores (NVIDIA) or Matrix Cores (AMD). These units are hard-wired for small-block matrix multiplications, the core operation of deep learning. A tensor core can multiply two 4x4 or 8x8 matrices in a single cycle. With thousands of tensor cores active, throughput can reach petaflops (10^15 floating-point operations per second, or 1 quadrillion).

All of this work is organized into kernels, small programs that tell the GPU how to perform operations like matrix multiplication. A kernel launches thousands of threads and (ideally) routes them so they are distributed across cores efficiently for maximal throughput. A bad kernel leaves cores idle when they could be doing math!
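
Real kernels are written in CUDA or similar and run their tiles concurrently, but this plain-Python analogy shows the decomposition a matrix-multiplication kernel performs, with each (bi, bj) pair standing in for one block of threads:

```python
import numpy as np

def matmul_kernel_sketch(A, B, tile=4):
    """Plain-Python analogy of a tiled matrix-multiplication kernel.

    Each (bi, bj) pair plays the role of one thread block assigned to one
    output tile; on a real GPU all of these tiles run concurrently, and the
    threads inside each block execute in warps, in lockstep.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N))
    for bi in range(0, M, tile):          # "grid" dimension 0
        for bj in range(0, N, tile):      # "grid" dimension 1
            # One "thread block": compute one tile of the output.
            C[bi:bi + tile, bj:bj + tile] = A[bi:bi + tile, :] @ B[:, bj:bj + tile]
    return C

rng = np.random.default_rng(0)
A, B = rng.random((8, 8)), rng.random((8, 8))
assert np.allclose(matmul_kernel_sketch(A, B), A @ B)
```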

Memory: The Number Feeders

If the cores are the brain, memory is the heart. Two metrics matter most: capacity, which determines whether a model fits on one device or needs to be split across several, and bandwidth, which determines how fast data can be pulled into compute.
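
Some back-of-envelope arithmetic shows why these two metrics dominate. The 80 GB and ~3.35 TB/s figures below are the commonly cited specs for an H100 SXM; the 70B-parameter model and FP16 precision are just illustrative choices:

```python
# Back-of-envelope arithmetic; all numbers are rough.
params = 70e9            # a 70B-parameter model
bytes_per_weight = 2     # FP16/BF16
hbm_capacity = 80e9      # bytes of HBM on one GPU
hbm_bandwidth = 3.35e12  # bytes per second

weight_bytes = params * bytes_per_weight
print(f"Weights need {weight_bytes / 1e9:.0f} GB; "
      f"fits in 80 GB of HBM? {weight_bytes < hbm_capacity}")   # 140 GB -> must be sharded

# If every decoded token has to re-read all the weights, bandwidth caps the token rate:
print(f"Bandwidth-bound decode ceiling: ~{hbm_bandwidth / weight_bytes:.0f} tokens/s")
```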

A GPU’s memory is not “one component” but a hierarchical system that trades speed for storage capacity. Most GPU hierarchies follow this structure:

  1. Registers, on-chip memory that stores the values being processed right now.
  2. L1 cache (shared memory), which sits close to the cores for very fast access.
  3. L2 cache, which bridges local memory (fast but small) and global memory (large but slow).
  4. High Bandwidth Memory (HBM), which provides the large storage for weights and activations, at the cost of higher latency. While technically “part” of the chip, HBM is mounted alongside the GPU die rather than built into it.

Training pushes this entire hierarchy to its limits. Weights, activations, gradients, and optimizer states all compete for memory space, while activations and attention maps are written and read layer by layer. Inference is memory intensive as well - prefill pushes the entire prompt through every layer and keeps the compute units saturated, while decode hammers HBM bandwidth, re-reading weights and the key-value cache for every generated token. If a model exceeds HBM capacity, it has to be split, or sharded, across devices, introducing interconnect traffic every time the pieces exchange data.

Vendors lean into these constraints differently. AMD emphasizes a large, high-bandwidth HBM stack on its MI300X chip. NVIDIA, on the other hand, spends enormous effort on kernel and compiler optimization so existing bandwidth is used more efficiently. Cerebras (to me, the most interesting) sidesteps the off-chip memory bottleneck by physically integrating its memory directly into its silicon.

Topology: The Number Nervous System

Most models are too large for a single GPU to hold, so multiple GPUs must be connected to serve them. The layout of these connections, their topology, dictates whether multiple GPUs scale well or stall. As with memory, different chip makers have designed their own solutions for coordinating many GPUs at once:

  • NVIDIA connects its flagship H100 GPUs with NVLink/NVSwitch, a fabric (a high-bandwidth switching interconnect) that lets each GPU communicate with its peers at ~900 GB/s, effectively flattening communication latency within an eight-GPU cluster.
  • AMD’s MI300X chip uses xGMI, its proprietary coherent interconnect protocol, providing ~128 GB/s per connection across up to seven connections (~896 GB/s aggregate).
  • Intel’s Gaudi 3, on the other hand, uses physical Ethernet ports: each accelerator exposes 24 x 200 Gb/s (gigabit) Ethernet links, remarkably fast for standard networking gear. This Ethernet approach lets Gaudi 3 users scale their clusters with the same equipment already available in data centers, lowering vendor lock-in but introducing slightly more latency.
  • Cerebras (once again!) takes an even more radical approach: instead of connecting multiple GPUs, it builds one massive wafer-scale chip, the WSE-3 (Wafer-Scale Engine), with 900,000 cores embedded on a single wafer and routed by its SwarmX software. For comparison, an NVIDIA H100 has only 18,432 cores.

Topology is a GPU cluster’s nervous system. Without a high-performance interconnect, adding more GPUs can make your system slower, not faster, because communication overhead starts to dominate computation.
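
For a rough sense of scale, here is a sketch using the standard ring all-reduce cost model (each GPU moves about 2 * (N - 1) / N of the data); the link speeds are illustrative round numbers, not measured benchmarks:

```python
# Rough cost of one gradient all-reduce with ring all-reduce.
grad_bytes = 70e9 * 2   # gradients for a 70B-parameter model in BF16
n_gpus = 8

for name, bytes_per_sec in [("NVLink-class (~900 GB/s)", 900e9),
                            ("Ethernet-class (~100 GB/s)", 100e9)]:
    traffic_per_gpu = 2 * (n_gpus - 1) / n_gpus * grad_bytes
    print(f"{name}: ~{traffic_per_gpu / bytes_per_sec:.2f} s per all-reduce")
```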

Where Things Fall Apart: Failure Modes and Scaling Limits

Slow models almost always boil down to a few problems:

Latency

In the context of AI, latency is the time required to generate the next token. Delays can accumulate across multiple stages of the inference pipeline, though the most noticeable impact occurs during the decode phase. Because each token must be generated sequentially, even small per-token slowdowns compound rapidly: a delay of just a few milliseconds per step can make the difference between a fluid interaction and one that feels stuck in molasses.
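
A quick illustration of how per-token delays compound over a full response; the token count and latencies here are arbitrary:

```python
# How per-token latency compounds over a full response (numbers are arbitrary).
response_tokens = 500
for ms_per_token in (20, 25, 50):
    total_s = response_tokens * ms_per_token / 1000
    print(f"{ms_per_token} ms/token -> {total_s:.1f} s for a {response_tokens}-token answer")
```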

Jitter

Jitter is variation in latency. Even if your average latency is tolerable, inconsistent latency ruins the user experience - think of jitter like riding in a car that speeds up or slows down at random intervals (nauseating!). Groq’s LPU (an ASIC, described later) has been designed specifically to reduce jitter by ensuring every token takes exactly the same amount of time to process.

Sharding

Sharding is how we make models that are too large for a single GPU fit across many devices. Instead of one GPU holding all the parameters and activations, the model is divided into pieces. Each strategy solves one problem but introduces new complexity:

  • Data Parallelism (not really sharding, but closely related): Every GPU holds a full copy of the model but processes a different slice of the training data. After each batch, the GPUs run an operation called all-reduce to average gradients and synchronize weights. This is easy to implement and scales well, but memory use is high because every device stores the entire model.
  • Tensor (Model) Parallelism: Large weight matrices are split across GPUs, so each device stores only part of the model. Every multiplication now requires GPUs to exchange partial results, increasing communication overhead but saving memory.
  • Pipeline Parallelism: Layers are split across GPUs like an assembly line. Tokens flow stage by stage, but if one stage takes longer than the others, “bubbles” appear in the pipeline, leaving GPUs idle.
  • Optimizer Sharding (ZeRO): Optimizer states are split across GPUs to allow training trillion-parameter models. This dramatically reduces memory per GPU but requires even more frequent communication.

Sharding is what allows frontier models to exist, but it comes at a price: every operation that crosses devices adds latency, consumes bandwidth, and complicates scheduling. Scaling becomes a balancing act between memory efficiency and communication overhead.
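
As a toy illustration of the tensor-parallel case above, here is a NumPy sketch with two simulated devices; real frameworks exchange the partial results with collective communication over the interconnect rather than a simple concatenate:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 512))      # activations for 4 tokens
W = rng.normal(size=(512, 1024))   # one large weight matrix

# Tensor parallelism: split W's columns across two "GPUs".
W_gpu0, W_gpu1 = np.split(W, 2, axis=1)

# Each device multiplies against only its shard of the weights...
partial0 = x @ W_gpu0
partial1 = x @ W_gpu1

# ...then the partial results are exchanged over the interconnect and combined.
y = np.concatenate([partial0, partial1], axis=1)
assert np.allclose(y, x @ W)       # same answer, half the weights per device
```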

New Chip Design Trade-offs: Generalists vs. Specialists

GPUs have been the mainstay of AI computation because they are flexible and programmable. But emerging, specialized ASICs (application-specific integrated circuits) take another route, hard-wiring transformer capabilities directly into a chip’s silicon. There are many trade-offs between ASICs and GPUs; the foremost is adaptability. GPUs can adapt through software improvements, gaining performance from new kernels and compiler optimizations without altering hardware. ASICs, on the other hand, cannot pivot once their silicon is fixed (think: Cerebras). As such, ASICs are efficient specialists for stable, well-defined workloads, while GPUs remain the general-purpose option for training and for evolving model architectures.

TL;DR

A GPU is a machine built to do one thing at scale: parallel linear algebra. Its thousands of cores execute the same simple operations on massive batches of data at once, which is why transformers, the neural networks behind modern AI, run so well on them. Training a model is a cycle of forward and backward passes that maxes out compute, memory, and interconnects; inference reuses that same architecture, stressing raw compute during prefill and memory bandwidth and per-token latency during decode. The memory hierarchy (registers, caches, and high-bandwidth memory) and GPU-to-GPU interconnects are just as important as raw FLOPS because data movement is the real bottleneck. Scaling models across many devices introduces complex trade-offs: sharding strategies like tensor or pipeline parallelism save memory but add communication overhead. This is why software (compilers, kernels, and scheduling) matters as much as hardware design. Specialized chips like TPUs or wafer-scale engines can outperform GPUs on fixed inference workloads, but GPUs remain the backbone of AI because they balance raw throughput with flexibility, evolving alongside ever-changing models and workloads.

With this foundation, you can (hopefully) start to see how AI workloads flow through a GPU and the design trade-offs that chip makers face. Now we can look at how different chips on the market from companies like NVIDIA, AMD, Intel, Cerebras, Groq, and others embody these trade-offs in practice - and assess which chip is best.