5 Comments

Thanks Vikram! A good reminder of just how close to perfection a process has to be to avoid fabbing mostly expensive garbage. You already mentioned binning, which is crucial, especially for very large chips. Another factor is how sensitive or resistant the chip's design and function are to defects. A good design anticipates at least a few defects and tolerates them without losing function altogether. A further factor is the functional type of chip; for example, Nvidia's largest Blackwell GPU die is close to the size limit of the reticles TSMC currently uses for its EUV nodes, and is probably the last monolith of its kind and size. A GPU die has, by its nature, many (hundreds to thousands) identical functional cores. Combine that with good design and great manufacturing, and the functional yield is still pretty good, despite having many billions of transistors per die. In contrast, a CPU (more functions, less redundancy) is a lot more sensitive to defects, as the likelihood that a fatal defect occurs in an essential and non-redundant function is much higher. Hence, the incentive to move to a chiplet/tile type design is greater for larger CPUs, despite the cost of the packaging required to get a functioning CPU.

As a request: a follow up article on how finished dies are tested and binned would be great - Thanks!

The ability to, for example, test individual chiplets is, of course, foundational for a chiplet design to actually make sense.
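The redundancy argument above can be made concrete with a rough sketch. It assumes a simple Poisson defect model (yield = exp(-area × defect density)) and, for simplicity, treats the whole die as identical cores. The defect density, die area, and core counts are hypothetical values chosen for illustration, not figures from the article or from any real process.

```python
import math

def poisson_yield(area_cm2, d0_per_cm2):
    """Probability that a region of the given area has zero defects (Poisson model)."""
    return math.exp(-area_cm2 * d0_per_cm2)

def yield_with_spares(n_cores, n_required, core_area_cm2, d0_per_cm2):
    """Probability that at least n_required of n_cores identical cores are defect-free."""
    p = poisson_yield(core_area_cm2, d0_per_cm2)
    return sum(
        math.comb(n_cores, k) * p**k * (1 - p) ** (n_cores - k)
        for k in range(n_required, n_cores + 1)
    )

# Hypothetical values, for illustration only.
D0 = 0.1          # defects per cm^2 (assumed)
DIE_AREA = 8.0    # cm^2, a roughly reticle-sized die (assumed)
N_CORES = 144     # identical cores on the die (assumed)
N_REQUIRED = 132  # cores that must work for the die to be sellable (assumed)

# No redundancy: any single defect anywhere kills the die.
no_redundancy = poisson_yield(DIE_AREA, D0)

# With redundancy: up to 12 defective cores can be fused off.
with_redundancy = yield_with_spares(N_CORES, N_REQUIRED, DIE_AREA / N_CORES, D0)

print(f"yield without redundancy: {no_redundancy:.3f}")
print(f"yield with spare cores:   {with_redundancy:.3f}")
```

Under these assumptions the die with spare cores yields far better than the same die where every defect is fatal, which is the contrast drawn above between a highly repetitive GPU and a CPU with less redundancy. A real die also contains non-redundant logic, which this sketch ignores.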

Nice detailed comment! It's also interesting how the Cerebras WSE deals with defects, because at wafer scale they are unavoidable. Cerebras actually routes around the failing cells and reconfigures the chip. Elegant way to deal with it.

Test and binning are interesting. When dealing with chiplets, people often assume that splitting a design into smaller chiplets is always better for the system because of better yield. Reality is quite a bit more complex. I've been thinking about this.
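One way to see why smaller is not automatically better is a back-of-the-envelope cost comparison. The sketch below assumes a Poisson yield model, chiplets that are fully tested before assembly (test cost ignored), a fixed area overhead per chiplet for die-to-die interfaces, and a packaging step with its own cost and yield. Every number and function name here is hypothetical and in arbitrary cost units.

```python
import math

def die_yield(area_cm2, d0_per_cm2):
    """Poisson yield: probability that a die of this area has no defects."""
    return math.exp(-area_cm2 * d0_per_cm2)

def cost_monolithic(area_cm2, d0, cost_per_cm2):
    """Silicon cost per good monolithic die."""
    return area_cm2 * cost_per_cm2 / die_yield(area_cm2, d0)

def cost_chiplets(area_cm2, n, d0, cost_per_cm2, area_overhead, pkg_cost, pkg_yield):
    """Cost per good package assembled from n known-good chiplets.

    Each chiplet carries extra area for die-to-die interfaces, and a failed
    assembly scraps the good dies and the package along with it.
    """
    chiplet_area = area_cm2 / n * (1 + area_overhead)
    good_chiplet_cost = chiplet_area * cost_per_cm2 / die_yield(chiplet_area, d0)
    return (n * good_chiplet_cost + pkg_cost) / pkg_yield

# Hypothetical parameters, arbitrary cost units.
params = dict(d0=0.1, cost_per_cm2=100.0)
pkg = dict(n=4, area_overhead=0.10, pkg_cost=150.0, pkg_yield=0.95)

for area in (8.0, 1.0):  # a very large design and a small one, in cm^2
    mono = cost_monolithic(area, **params)
    split = cost_chiplets(area, **params, **pkg)
    winner = "chiplets" if split < mono else "monolithic"
    print(f"{area:.1f} cm^2: monolithic {mono:.0f}, chiplets {split:.0f} -> {winner}")
```

With these made-up inputs the split wins for the large design and loses for the small one, because the packaging cost, packaging yield, and interface overhead outweigh a yield gain that is small to begin with. Different assumptions move the crossover point, so the answer depends on the specific configuration.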

The cool part is that Cerebras's WSE is the only commercial implementation of the idea that works, despite the many decades since it was first conceived. Built-in redundancy and routing around defective cells is an absolute engineering marvel. For certain use cases, e.g., LLMs, the WSE is a compelling value prop.

Yes, if you have ever built a gaming PC, you might have purchased a slightly less performant CPU that came about as a result of binning. Moreover, nowadays it is not enough to make the chip and pay minimal attention to packaging; you also have to package it precisely and scalably with more advanced techniques. I would be careful with terms such as 'chiplets', since they can mean different things to different people. But if you are interested in this space, you should be on the lookout for people building out their advanced packaging capacity...

I just attended a whole day on die-to-die interconnect, and the term 'chiplet' was being used for any smaller piece of what would have been an SoC, serving a particular function such as logic, I/O, or power.

Yeah, but it is very unclear what the value proposition is unless you are talking about a specific configuration.
