Inferencing made faster with hardware

In the digital world, we often take the speed of our tools for granted. We trust our computers to compute instantly, and we expect immediate gratification. But if you have ever watched a Large Language Model trickle out a response one word at a time, you might have felt a terrifying flashback to 1998 dial-up internet.

The truth is, today’s AI has hit a massive, silicon-reinforced brick wall. We are desperately trying to optimize Time To First Token (TTFT), streaming latency, and overall inference speeds. But the underlying architecture of how we compute is fundamentally betraying us.

Here is a deep dive into the engine room of modern AI hardware, the clever software tricks trying to save us, and the companies literally rewriting the rules of silicon to reach instant generation.

The Tyranny of the Memory Wall: A Deal with the Devil

To understand the problem, you have to understand how an LLM generates text. It is a strictly sequential process.

Take the popular Llama 3.1 70B model. It has 70 billion parameters. If you use standard 16-bit precision, that is roughly 140GB of data. Because an LLM calculates probabilities based on context, to generate just one single word, the GPU has to move that entire 140GB of parameters from its memory to its compute cores.

Want 100 words? You have to move 140GB of data 100 times.

Standard GPUs like the NVIDIA H100 only have tens of megabytes of on-chip memory (the H100's L2 cache is 50MB). The rest has to be fetched from outside the chip. This is the deal with the devil we made for general purpose computing. The chips are phenomenally fast at doing the math, but they spend most of their time twiddling their thumbs waiting for the data to arrive.
Imagine trying to empty a swimming pool using a cocktail straw.
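
To put rough numbers on the straw, here is a back-of-the-envelope sketch. The 70 billion parameters and 16-bit precision come from the text above. The 3.35 TB/s memory bandwidth is my assumption for an H100 SXM, and the function names are made up for illustration. It ignores batching, the KV cache, and the fact that 140GB does not even fit in one H100's 80GB of HBM, so read the result as a ceiling for a single stream, not a benchmark.

```python
def weight_bytes(num_params: float, bytes_per_param: int = 2) -> float:
    """Size of the model weights in bytes (2 bytes per parameter = 16-bit)."""
    return num_params * bytes_per_param


def max_decode_tokens_per_second(model_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Upper bound on tokens/s if every weight is read once per token."""
    return bandwidth_bytes_per_s / model_bytes


LLAMA_70B_PARAMS = 70e9
ASSUMED_HBM_BANDWIDTH = 3.35e12  # bytes/s, assumed figure for an H100 SXM

model_bytes = weight_bytes(LLAMA_70B_PARAMS)
traffic_for_100_tokens = 100 * model_bytes
ceiling = max_decode_tokens_per_second(model_bytes, ASSUMED_HBM_BANDWIDTH)

print(f"weights: {model_bytes / 1e9:.0f} GB")
print(f"traffic for 100 tokens: {traffic_for_100_tokens / 1e12:.0f} TB")
print(f"bandwidth-limited ceiling: {ceiling:.1f} tokens/s")
```

Under these assumptions the ceiling lands in the low twenties of tokens per second, no matter how fast the math units are.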

To anchor this entire battle against the memory wall, we have to look at the exact mechanics of how a Large Language Model actually works under the hood. Most people assume that an LLM just takes your text and spits out an answer. In reality, every single inference request is split into two completely different, warring phases: Prefill and Decode. Understanding these two concepts is the absolute key to understanding why AI generation feels so sluggish today.

The Heart of the Problem: Prefill vs. Decode

When you send a prompt to an AI model, the first thing it does is read your entire input all at once. This is the Prefill phase. Because the model is ingesting your whole block of text simultaneously, it can run many tokens through its weights in parallel. This is a massive math party. The GPU compute cores are firing at 100% capacity, processing matrix multiplications across all its processing units. Because the processor's speed determines how fast this phase finishes, it is considered compute-bound. During prefill, the model calculates intermediate mathematical results for every single word in your prompt and saves them in a temporary scratchpad called the KV Cache (Key-Value Cache) so it doesn't have to re-calculate them later.

But the moment the model finishes reading and starts typing its response, the architecture flips completely on its head.
This is the Decode phase, and it is a sequential nightmare. The model cannot generate a whole sentence at once. It can only predict one single token at a time. To generate token #2, it must process token #1 against the cached results for your prompt, and pull the entire weight matrix of the model through the chip to do a tiny bit of math. Then, to generate token #3, it has to do it all over again. Because the compute cores spend the overwhelming majority of their time just waiting for the model weights and the growing KV cache to travel from the memory chips onto the processor, decode is memory-bound.
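
One way to see why the two phases land on opposite sides is a crude roofline check. The sketch below assumes each weight does one multiply-add (2 FLOPs) per token in flight and is read from memory once per pass, and it ignores attention math and KV cache reads entirely. The peak numbers (989 TFLOPS dense BF16, 3.35 TB/s) are my assumptions for an H100 SXM, and the function names are hypothetical.

```python
def arithmetic_intensity(tokens_in_flight: int, bytes_per_param: int = 2) -> float:
    """FLOPs performed per byte of weights read in one forward pass."""
    flops_per_param = 2 * tokens_in_flight  # one multiply-add per token
    return flops_per_param / bytes_per_param


def bound_by(tokens_in_flight: int, peak_flops: float, mem_bandwidth: float) -> str:
    """Classify a pass by comparing its intensity to the chip's ridge point."""
    ridge_point = peak_flops / mem_bandwidth  # FLOPs the chip can do per byte fetched
    if arithmetic_intensity(tokens_in_flight) >= ridge_point:
        return "compute-bound"
    return "memory-bound"


ASSUMED_PEAK_FLOPS = 989e12     # assumed dense BF16 peak
ASSUMED_MEM_BANDWIDTH = 3.35e12  # assumed HBM bandwidth in bytes/s

print("prefill, 2048-token prompt:", bound_by(2048, ASSUMED_PEAK_FLOPS, ASSUMED_MEM_BANDWIDTH))
print("decode, 1 token at a time:", bound_by(1, ASSUMED_PEAK_FLOPS, ASSUMED_MEM_BANDWIDTH))
```

With these numbers the chip can do roughly 300 FLOPs in the time it takes to fetch one byte, so a pass that only does 1 FLOP per byte leaves almost all of the math units idle.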

📊 Visualizing the Bottleneck

To see exactly how the KV Cache attempts to solve this problem by removing redundant calculations, check out this architectural breakdown:

[Embedded media: NVIDIA & Cerebras: KV Cache Inference Mechanics]

When a data center tries to run both of these phases on the exact same GPU pool, they constantly trip over each other. A single user sending a massive, long prompt will trigger a heavy Prefill pass that completely hogs the GPU compute cores, causing the streaming output (Decode) of ten other users to grind to a halt. This interference is the ultimate bottleneck holding back instant AI.

This creates a brutal dichotomy in AI processing. Prefill is limited by how fast the chip can do math. Decode is limited by how fast the weights and the saved results can be read from memory. Put both on the exact same GPU and neither gets the hardware it actually wants.

The Multi-Rack Illusion and the HBM Bottleneck

You might be thinking that if one chip lacks enough memory, we could just buy a truckload of GPUs and duct tape them together.

Welcome to the multi-racking illusion.

Sure, you can string together thousands of H100s using NVLink and InfiniBand. This creates massive data center clusters that consume the power output of a small city and cost more than the GDP of a small island nation. But adding more GPUs across server racks mostly just increases your throughput, meaning how many users you can serve at once. It does almost nothing for your latency, or how fast a single user gets their answer.
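
The throughput versus latency distinction is easy to show with a toy calculation. This sketch only models adding identical copies of the model (data-parallel replicas); it ignores tensor parallelism inside a node, which can shave some latency. The 24 tokens per second per stream is a made-up illustrative number, and `cluster_stats` is a hypothetical helper.

```python
def cluster_stats(replicas: int, tokens_per_second_per_stream: float, response_tokens: int):
    """Return (aggregate tokens/s across the cluster, seconds one user waits)."""
    aggregate = replicas * tokens_per_second_per_stream
    wait_seconds = response_tokens / tokens_per_second_per_stream
    return aggregate, wait_seconds


for replicas in (1, 10, 100):
    aggregate, wait_seconds = cluster_stats(replicas, 24.0, 480)
    print(f"{replicas:>3} replicas: {aggregate:>6.0f} tokens/s total, "
          f"{wait_seconds:.0f} s for one 480-token answer")
```

The total climbs with every replica, while the wait for any single user does not move.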

Crossing the physical boundary from one server rack to another is the silicon equivalent of playing the world’s most expensive game of telephone. Every hop across cables and switches adds latency that extra compute cannot buy back.

To fix the on-chip memory issue, NVIDIA and AMD rely on High Bandwidth Memory (HBM). HBM is basically 3D stacked RAM that sits directly adjacent to the GPU die. It is an absolute marvel of engineering. It is also incredibly difficult to manufacture, incredibly expensive, and currently facing a massive global shortage. So, we are stuck trying to feed a 140GB model through a microscopic, gold plated straw that we can barely afford to manufacture.

The Amnesiac Chef: Understanding the Decode Nightmare

To really internalize why the decode phase of an LLM is so painfully memory bound, we have to look at the curse of auto-regressive generation and the heavy toll of the KV Cache.

Imagine a Michelin star chef who can chop, sauté, and plate at the speed of light. These are your GPU's compute cores, capable of trillions of operations per second. But there is a catch. This chef has absolutely zero short term memory, much like your GPU's tiny on-chip SRAM.

Now, thanks to a brilliant software invention called the KV Cache, the chef does not have to re-chop the first 99 carrots just to figure out what to do with the 100th. They keep a running tally on a notepad of everything they have prepped so far, which removes redundant processing.
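
Here is the notepad idea as a toy: a single attention head in pure Python, with made-up 2-dimensional token vectors and weight matrices. It is a minimal sketch of the caching trick, not how any real inference engine lays out its cache. The point it shows is that attending over cached keys and values gives the same answer as recomputing them all from scratch, with far fewer projections.

```python
import math


def project(vec, matrix):
    """Multiply a vector by a matrix given as a list of rows."""
    return [sum(x * w for x, w in zip(vec, column)) for column in zip(*matrix)]


def attend(query, keys, values):
    """Softmax attention of one query over all keys and values so far."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(len(values[0]))]


class KVCache:
    """The chef's notepad: keys and values for every token seen so far."""

    def __init__(self, w_k, w_v):
        self.w_k, self.w_v = w_k, w_v
        self.keys, self.values = [], []
        self.projections_done = 0

    def append(self, token_vec):
        self.keys.append(project(token_vec, self.w_k))
        self.values.append(project(token_vec, self.w_v))
        self.projections_done += 1


def decode_without_cache(tokens, w_q, w_k, w_v):
    """Re-chop every carrot: recompute all keys and values each step."""
    keys = [project(t, w_k) for t in tokens]
    values = [project(t, w_v) for t in tokens]
    return attend(project(tokens[-1], w_q), keys, values)


def decode_with_cache(cache, new_token, w_q):
    """Only the newest token is projected; the rest comes off the notepad."""
    cache.append(new_token)
    return attend(project(new_token, w_q), cache.keys, cache.values)


# Hypothetical toy weights and token embeddings.
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.5, 1.0], [-1.0, 0.5]]
W_V = [[1.0, 2.0], [3.0, 4.0]]
TOKENS = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]

cache = KVCache(W_K, W_V)
recompute_projections = 0
outputs_match = True
for step in range(1, len(TOKENS) + 1):
    seen = TOKENS[:step]
    slow = decode_without_cache(seen, W_Q, W_K, W_V)
    fast = decode_with_cache(cache, seen[-1], W_Q)
    recompute_projections += len(seen)
    outputs_match = outputs_match and all(math.isclose(a, b) for a, b in zip(slow, fast))

print("same outputs:", outputs_match)
print("key/value projections:", cache.projections_done, "cached vs",
      recompute_projections, "recomputed")
```

The saving grows quadratically with sequence length, which is why nobody runs decode without a cache. The catch, as the rest of this section explains, is that the cache only removes redundant math. It does nothing about the weights.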

But here is the fatal flaw. The chef's counter space is still too small to hold the 140-ton encyclopedia of culinary rules (the 70B model weights).

So to cook step 100, the chef still has to walk down to the basement, haul up the entire 140-ton recipe book, read the rule for step 100, apply it to the chopped carrot, and haul the book back down. Why can they not bring up the rules for step 101 at the same time? Because the chef does not know what to cook for step 101 until they taste exactly how step 100 turned out.

To make matters worse, as the tasting menu goes on, that notepad gets thicker and heavier. For a long prompt, the KV cache itself can grow to tens of gigabytes. So now our poor chef is hauling a 140-ton recipe book plus a massive notepad up and down the stairs for every single token.
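
The "tens of gigabytes" figure can be sanity-checked with a short calculation. The layer count, number of key/value heads, and head size below are what I understand the published Llama 3.1 70B configuration to be (80 layers, 8 KV heads, 128 dimensions per head), so treat them as assumptions, along with 16-bit cache entries. `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(num_tokens: int,
                   num_layers: int = 80,
                   num_kv_heads: int = 8,
                   head_dim: int = 128,
                   bytes_per_value: int = 2) -> int:
    """Bytes needed to cache one key and one value vector per layer per token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return num_tokens * per_token


for context_length in (1, 8_192, 131_072):
    size = kv_cache_bytes(context_length)
    print(f"{context_length:>7} tokens -> {size / 1024**3:.3f} GiB")
```

Under these assumptions each token costs about 320 KiB of notepad, and a single full 128K-token context needs about 40 GiB, for one user.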

Your insanely fast compute cores are basically sitting around playing solitaire, waiting for the memory bus to physically drag a mountain of data up the stairs just so they can do their one tiny calculation. We are not compute bound. We have more math doing power than we know what to do with. We are horribly, tragically memory bound.

The Software Hacks: Our Best Guesses

Since we cannot magically upgrade everyone's hardware overnight, engineers have come up with some brilliant software workarounds.

1. Speculative Decoding: The Architect and the Intern

Imagine a brilliant, highly paid Senior Architect (your massive 70B parameter model) and an over-caffeinated, slightly erratic Intern (a tiny, fast 1B parameter model).

If you ask the Architect to write a report, they will ponder every single word, taking a slow, heavy trip to the basement library (Main Memory) to haul up 140GB of data for every syllable. It is highly accurate but painfully slow.

Now, you might be thinking: If we make the Intern write a draft, aren't they also writing word-by-word? Doesn't that autoregressive process add the exact same latency?

Here is the thing: The Intern's brain is tiny. While the Architect is powerlifting a 140-ton boulder, the Intern is sprinting with a feather. The Intern can generate five words sequentially in a fraction of the time it takes the Architect to generate just one. Dragging 2GB of weights across the memory bus five times is vastly faster than dragging 140GB across it even once.

Once the Intern bangs out those five words, they hand the draft to the Architect.

Here is where the real magic happens. Because the Architect already has the proposed tokens, they do not need to generate them sequentially. They can process all five words simultaneously in a single, highly-parallel compute pass. One trip to the basement. Five words verified.

If the Architect says, "Yep, these first three words are exactly what I would have written," we keep them. If the Intern hallucinated on the fourth word, the Architect crosses it out, writes the correct fourth word, and throws away the fifth. We just generated up to five tokens for the time penalty of one. It is a beautiful exploitation of the fact that checking someone else's math in parallel is much faster than doing it yourself from scratch.
It is easy to verify an answer once you know what it is!
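
The Architect and Intern routine can be sketched in a few lines. This is a minimal greedy version with two toy stand-in "models" (plain functions I invented that map a context to the next token), not a real implementation: production systems verify all drafted tokens in one batched forward pass and often use probabilistic acceptance rather than exact matching. Here the verification loop stands in for that single pass, and we count it as one trip to the basement.

```python
from typing import Callable, List, Sequence, Tuple

Model = Callable[[Sequence[int]], int]  # context -> next token, greedy


def speculative_step(target: Model, draft: Model, context: List[int], k: int) -> List[int]:
    """Draft k tokens with the small model, then keep what the big model agrees with."""
    proposal: List[int] = []
    for _ in range(k):  # the Intern writes sequentially, but cheaply
        proposal.append(draft(context + proposal))

    accepted: List[int] = []
    for guess in proposal:  # stands in for ONE parallel pass of the Architect
        truth = target(context + accepted)
        if truth != guess:
            accepted.append(truth)  # cross out the wrong word, write the right one
            return accepted         # everything after it is thrown away
        accepted.append(guess)
    return accepted


def generate(target: Model, draft: Model, prompt: List[int],
             num_tokens: int, k: int = 5) -> Tuple[List[int], int]:
    """Generate num_tokens tokens, returning them plus the number of target passes."""
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < num_tokens:
        out.extend(speculative_step(target, draft, out, k))
        target_passes += 1
    return out[len(prompt):len(prompt) + num_tokens], target_passes


# Hypothetical toy models: the draft agrees with the target except at every 4th position.
def target_model(ctx: Sequence[int]) -> int:
    return (31 * sum(ctx) + len(ctx)) % 50


def draft_model(ctx: Sequence[int]) -> int:
    guess = target_model(ctx)
    return (guess + 1) % 50 if len(ctx) % 4 == 0 else guess


prompt = [1, 2, 3]
speculative_tokens, passes = generate(target_model, draft_model, prompt, num_tokens=20)

# Baseline: the Architect alone, one pass per token.
baseline = list(prompt)
for _ in range(20):
    baseline.append(target_model(baseline))

print("identical output:", speculative_tokens == baseline[len(prompt):])
print("target passes:", passes, "instead of 20")
```

The output is identical to what the big model would have written on its own. The only thing that changes is how many times it had to haul its weights up the stairs.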

2. MoEs and Self-Distillation

Chinese open source models have mastered the art of doing more with less. By heavily utilizing Mixture of Experts (MoE) and self-distillation, they drastically decrease the number of active parameters during any given inference pass. If the network only needs to wake up 15B parameters to answer a question instead of 70B, that is significantly less data you have to drag across the memory wall.
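
To make the active-parameter idea concrete, here is a sketch with an entirely hypothetical MoE layout: 6B shared parameters plus 16 experts of 4B each, with the router picking the top 2 experts per token. None of these numbers describe a real model. Note that all experts still have to be stored somewhere; what shrinks is how much has to be read per token.

```python
def top_k_experts(router_scores, k):
    """Indices of the k highest-scoring experts for one token."""
    ranked = sorted(range(len(router_scores)), key=lambda i: router_scores[i], reverse=True)
    return ranked[:k]


def active_params(shared_params, params_per_expert, experts_per_token):
    """Parameters actually read to process one token."""
    return shared_params + experts_per_token * params_per_expert


# Hypothetical layout, chosen so the total comes to 70B.
SHARED = 6e9
PER_EXPERT = 4e9
NUM_EXPERTS = 16
TOP_K = 2

total = SHARED + NUM_EXPERTS * PER_EXPERT
active = active_params(SHARED, PER_EXPERT, TOP_K)

print("experts chosen for one token:", top_k_experts([0.1, 0.7, 0.05, 0.15], TOP_K))
print(f"total: {total / 1e9:.0f}B, active per token: {active / 1e9:.0f}B")
print(f"16-bit bytes read per token: {2 * active / 1e9:.0f} GB instead of {2 * total / 1e9:.0f} GB")
```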

3. Disaggregated Inference

If prefill and decode hate sharing a room, why not separate them? Disaggregated inference solves this by splitting the work across dedicated hardware.

AWS and Cerebras paired up to use Trainium chips for prefill, passing the KV cache to Cerebras CS-3 systems for decode.
SambaNova uses GPUs for prefill and RDUs (SRAM, HBM, and DDR) for decode.
NVIDIA and Groq split the model itself, running the attention blocks on a Rubin GPU and the feed-forward blocks on a Groq LPU. Groq's LP30 chips are practically built for this, utilizing 500 MB of SRAM to deliver an insane 150 TB/s bandwidth per chip.
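
Why splitting helps can be shown with a tiny timeline model. This is a sketch under heavy assumptions: one worker that runs a single job at a time, a decode step that always takes 20 ms, and a 500 ms prefill from another user that barges in partway through. Real schedulers are smarter (chunked prefill, batching), and the cost of shipping the KV cache between machines is ignored. All names and numbers are hypothetical.

```python
def worst_token_gap(decode_step_ms: float, num_decode_steps: int,
                    prefill_ms: float = 0.0, prefill_before_step=None) -> float:
    """Largest pause (ms) between consecutive streamed tokens on one worker."""
    clock = 0.0
    last_token_time = 0.0
    worst = 0.0
    for step in range(num_decode_steps):
        if prefill_before_step is not None and step == prefill_before_step:
            clock += prefill_ms  # someone else's long prompt hogs the chip
        clock += decode_step_ms
        worst = max(worst, clock - last_token_time)
        last_token_time = clock
    return worst


shared_gap = worst_token_gap(20.0, 50, prefill_ms=500.0, prefill_before_step=10)
dedicated_gap = worst_token_gap(20.0, 50)  # prefill runs on separate hardware

print(f"shared pool: worst stall {shared_gap:.0f} ms")
print(f"disaggregated: worst stall {dedicated_gap:.0f} ms")
```

On the shared worker, one user's long prompt turns into a visible half-second freeze in someone else's stream. On the dedicated decode worker, the stream never notices.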

Cerebras: Keeping the Whole Pizza

But what if you did not want to use software workarounds? What if you just wanted to conquer the hardware problem with sheer, unapologetic engineering dominance?

Enter Cerebras.

To appreciate what Cerebras did, you have to understand how microchips are made. Silicon is manufactured in large, circular plates called wafers.

Normally, chipmakers use lithography to print hundreds of identical little rectangles onto this wafer. Then they slice the wafer up, throw away the pieces that have microscopic manufacturing defects, and sell the good ones as individual chips. Each rectangle can be no bigger than the reticle limit, the largest area a lithography machine can expose in a single shot (roughly 858 square millimeters), and that is why your CPU or GPU is the size of a postage stamp.

Cerebras looked at the memory bandwidth barrier and made a radical choice. They refused to cut the pizza.

Their third generation Wafer Scale Engine (WSE-3) is a single, uninterrupted square of silicon the size of a dinner plate. But what about the manufacturing defects? Cerebras designed a brilliant routing system that simply tests the giant wafer, finds the dead zones, and dynamically routes traffic around them.
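
To see why routing around defects is not optional at this scale, here is the classic Poisson yield model, which estimates the chance that a piece of silicon has zero defects as exp(-area × defect density). The defect density of 0.1 per square centimeter is purely an illustrative assumption, and the areas (about 814 mm² for a large GPU die, 46,225 mm² for the WSE-3) are figures I am assuming from public spec sheets.

```python
import math


def poisson_yield(area_cm2: float, defects_per_cm2: float) -> float:
    """Probability that a die of the given area has no defects at all."""
    return math.exp(-area_cm2 * defects_per_cm2)


ASSUMED_DEFECT_DENSITY = 0.1  # defects per cm^2, illustrative only
GPU_DIE_CM2 = 8.14            # assumed area of a large GPU die
WAFER_SCALE_CM2 = 462.25      # assumed area of the WSE-3

gpu_yield = poisson_yield(GPU_DIE_CM2, ASSUMED_DEFECT_DENSITY)
wafer_yield = poisson_yield(WAFER_SCALE_CM2, ASSUMED_DEFECT_DENSITY)
expected_wafer_defects = WAFER_SCALE_CM2 * ASSUMED_DEFECT_DENSITY

print(f"large GPU die with zero defects: {gpu_yield:.0%}")
print(f"wafer-scale chip with zero defects: {wafer_yield:.1e}")
print(f"expected defects on the wafer-scale chip: {expected_wafer_defects:.0f}")
```

Under these assumptions a flawless wafer-scale chip essentially never happens, so the design has to expect dozens of dead spots and keep working anyway.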

Because they did not chop the chip into tiny pieces, they do not have to rely on external memory. They managed to bake an incomprehensible 44GB of SRAM directly onto the silicon.

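There is no external memory. There are no slow lanes. The weights sit on chip! (A 16-bit 70B model at 140GB is bigger than one wafer's 44GB, so larger models have to be spread across several wafers, but the weights still live in SRAM.)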

By eliminating the commute to the basement, the WSE-3 achieves 21 holy-mother-of-god-freaking-petabytes per second of aggregate memory bandwidth. That is roughly 7,000 times the bandwidth of an NVIDIA H100.

The result? They run Llama 3.1 8B at 1,800 tokens per second, and the massive 70B model at an instantaneous 450 tokens per second. That means generating entire paragraphs before you even finish blinking.
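
A quick sanity check on what those speeds imply. The sketch assumes every 16-bit weight is read once per generated token for a single stream, with no batching or speculative tricks, which is a simplification. The token rates and the 44GB of SRAM per wafer come from the text; the helper names are hypothetical.

```python
import math


def required_bandwidth(tokens_per_second: float, model_bytes: float) -> float:
    """Bytes/s of weight traffic needed to sustain a given decode speed."""
    return tokens_per_second * model_bytes


def wafers_needed(model_bytes: float, sram_bytes_per_wafer: float = 44e9) -> int:
    """How many wafers it takes to hold the weights entirely in SRAM."""
    return math.ceil(model_bytes / sram_bytes_per_wafer)


MODELS = {
    "Llama 3.1 8B": (8e9 * 2, 1800),    # (bytes at 16-bit, tokens/s from the text)
    "Llama 3.1 70B": (70e9 * 2, 450),
}

for name, (model_bytes, tokens_per_second) in MODELS.items():
    bandwidth = required_bandwidth(tokens_per_second, model_bytes)
    print(f"{name}: needs about {bandwidth / 1e12:.1f} TB/s of weight traffic, "
          f"fits on {wafers_needed(model_bytes)} wafer(s)")
```

Both models need tens of terabytes per second of weight traffic at these speeds, which is well within a 21 PB/s budget and far beyond what a few terabytes per second of HBM can feed to a single stream.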

And the best part is they do not cheat. Many cloud providers quietly quantize their models to 8-bit precision just to fit them into standard GPU memory, which can degrade the model's reasoning and math skills. Cerebras says it has enough native capacity to run the original, uncompromised 16-bit weights. It is an honest system that bypasses one of the greatest bottlenecks in computing simply by refusing to play by the design rules everyone else accepted.

Chat Jimmy (Taalas): Hardwiring the Brain

If you have made it this far and think a model on a single wafer is cool (or crazy), let's go one level further down, to the assembly language of AI.

If Cerebras is a dinner plate of compute, a startup named Taalas with their tech demo "Chat Jimmy" is doing something that sounds like absolute science fiction.

They looked at the Von Neumann architecture, the very idea of loading software into hardware, and threw it out the window.

Instead of storing model weights in RAM and fetching them, Taalas physically etches the neural network architecture and quantized model weights directly into the silicon transistors using custom ASICs (Application-Specific Integrated Circuits).

The model is the computer. Try it yourself and DON'T BLINK (doesn't matter anyway) - https://chatjimmy.ai/

The results are staggering. Chat Jimmy runs a hardwired Llama 3.1 8B model at a Flash-level speed of roughly 15,000 to 17,000 tokens per second. Yeah, you read that right. It bypasses the memory wall entirely because there is no fetching. The input simply flows through the physical logic gates as an electrical signal. I bet you didn’t think physics could be leveraged quite like this :)

The Pros:

Unmatched Speed and Efficiency: It is reportedly around 20x cheaper to run and far more power efficient.
Zero HBM Required: It sidesteps the entire global shortage of high bandwidth memory.

The Cons:

Extreme Rigidity: This is the ultimate cost of specialization. If Meta releases Llama 3.2 tomorrow, or if you discover your model has a nasty hallucination bug, you cannot just push a software update. You literally have to manufacture a brand new physical microchip.

It is an incredibly pragmatic solution for stable, long term production models, but a massive gamble in a landscape where model architectures evolve in weeks instead of years.

God! If they etch Fable 5 onto a chip, I think we might open up a singularity of self-reproducing robots and a cataclysmic robo-lyptic war, but who knows how far off that day is.

We are at a fascinating crossroads. From routing requests cleverly in software, to building wafer-sized supercomputers, to literally baking intelligence into physical matter, the race to break the memory wall is the most exciting engineering challenge of the decade. We have realized that throwing more racks of GPUs into a data center is not the silver bullet we hoped it would be. It is just a very fast, very expensive band-aid.

Given how rigid the hardwired ASIC approach is, do you think the future of AI will lean more toward flexible behemoths like Cerebras, or highly specialized, printed-to-order chips like Taalas? Or will we just keep stacking HBM until the heat melts the Earth's crust?

Chip design from the bottom up – Reiner Pope

If you want to understand the foundational logic behind why GPUs, TPUs, and ASICs are architected the way they are to combat these memory bottlenecks, this blackboard lecture by Reiner Pope provides an excellent, bottom-up mathematical breakdown.
