
Processor architecture

Started by December 30, 2024 02:25 PM
2 comments, last by Calin 1 month ago

I would like to find out more about how CPUs function. Just the basic idea. Performing a mathematical operation on two bits takes more than 10 boolean gates (at least addition takes that many). 10 interconnected boolean gates make a farm field. A 32-bit number vs. another 32-bit number requires 32 fields of 10 gates each. That's one math operation. Four types of math operations is four times that: 4 x 32 x 10 = 1,280 gates. (There might be other non-basic math operations, like square root, which have dedicated processor real estate.)

Selecting the operation type and routing the bits towards the required field probably takes far fewer boolean gates. You're talking about comparisons (you have to compare the operation code/name so you can decide which field will do the operation), and that's one AND gate for a 1-bit vs. 1-bit comparison. Am I off by much?
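For illustration, here is a minimal Python sketch of the selection step asked about above, assuming a 2-bit operation code and four "fields". The opcode assignments and the names `decode_opcode` and `select_bit` are hypothetical and not taken from any real instruction set. The decoder is 2 NOT gates plus 4 AND gates, and each output bit then needs a handful more gates to pick the result of the chosen field.

```python
def decode_opcode(op1, op0):
    """Turn a 2-bit opcode into four one-hot select lines.

    Uses 2 NOT gates and 4 AND gates. The meaning attached to each
    opcode below is an assumption made for this example only.
    """
    n1, n0 = 1 - op1, 1 - op0   # the two NOT gates
    return (
        n1 & n0,    # opcode 00, e.g. add
        n1 & op0,   # opcode 01, e.g. subtract
        op1 & n0,   # opcode 10, e.g. multiply
        op1 & op0,  # opcode 11, e.g. divide
    )


def select_bit(selects, candidates):
    """One bit of the output selector.

    AND each field's result bit with its select line, then OR them all.
    Only the field whose select line is 1 can contribute.
    """
    out = 0
    for sel, bit in zip(selects, candidates):
        out |= sel & bit
    return out


selects = decode_opcode(1, 0)
print(selects)
# Result bits from the four fields; only the third one is selected.
print(select_bit(selects, (1, 1, 0, 1)))
```

In this sketch the decoder is shared by the whole operation, while the selector is repeated once per output bit, so the routing cost still grows with the word width.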

My project's Facebook page is “DreamLand Page”

Sounds about right. It's been decades since I've done those, but yes, each circuit grows as more bits are added. A 2-bit adder needs 7 logic gates, a 4-bit adder needs 17, an 8-bit adder needs 37, etc. (that is a 2-gate half adder for the lowest bit plus a 5-gate full adder for every bit after it). Since the processor is doing more than just adding, you need circuitry for each variation, and you need circuitry to choose which sub-circuit to route the signals to.
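The gate counts quoted above can be reproduced with a small Python sketch of a ripple-carry adder. It assumes the textbook layout of one half adder (XOR + AND) for the lowest bit and one full adder (two XOR, two AND, one OR) for each remaining bit. The names `GateCounter` and `ripple_add` are made up for this example, and real CPUs generally use faster adder designs that spend more gates.

```python
from dataclasses import dataclass


@dataclass
class GateCounter:
    """Counts every two-input gate evaluated during one addition."""
    count: int = 0

    def xor(self, a, b):
        self.count += 1
        return a ^ b

    def and_(self, a, b):
        self.count += 1
        return a & b

    def or_(self, a, b):
        self.count += 1
        return a | b


def half_adder(g, a, b):
    # 2 gates: XOR gives the sum bit, AND gives the carry.
    return g.xor(a, b), g.and_(a, b)


def full_adder(g, a, b, carry_in):
    # 5 gates: two XORs, two ANDs, one OR.
    partial = g.xor(a, b)
    total = g.xor(partial, carry_in)
    carry_out = g.or_(g.and_(a, b), g.and_(partial, carry_in))
    return total, carry_out


def ripple_add(x, y, bits):
    """Add two unsigned ints bit by bit; return (sum, carry_out, gates_used)."""
    g = GateCounter()
    result = 0
    carry = 0
    for i in range(bits):
        a = (x >> i) & 1
        b = (y >> i) & 1
        if i == 0:
            s, carry = half_adder(g, a, b)   # lowest bit has no carry in
        else:
            s, carry = full_adder(g, a, b, carry)
        result |= s << i
    return result, carry, g.count


for bits in (2, 4, 8, 32):
    _, _, gates = ripple_add(0, 0, bits)
    print(bits, "bits:", gates, "gates")
```

Under these assumptions the count works out to 5n - 3 gates for an n-bit adder, which matches 7, 17, and 37 and gives 157 gates for 32 bits.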

Intel tends to be quiet about numbers and many implementation details. The Apple M1 has 16 billion transistors, the M1 Ultra has 114 billion, and the M2 Ultra has 134 billion. That kind of complex mesh for all the various operations requires a tremendous amount of logic gate circuitry to get work done.

I have a few acquaintances who do VLSI work, basically running the software that tries to reroute and reorganize the billions of transistors on chips to efficiently compute all the traces on the hardware, and as I understand it, it's basically as you describe. Any student can put together the circuits for a 4-bit adder, and that's a typical assignment for anybody going into computing hardware and for many software devs. 8-bit and even 16-bit patterns for a few operations can also be done by hand, although it is tedious. But once you're into 32-bit, 64-bit, and even bigger 256-bit and 512-bit vector operations, you're looking at software determining the thousands of circuitry steps for each operation.

Many of the design decisions in processors go toward those requirements.

CPUs these days work in phases. An early one decodes instructions into micro-operations, or sub-steps of the bigger operation, so common circuit paths can be reused. Decoding and reordering are separate, dedicated steps as each type of operation gets sorted out over the hardware. Beyond that, instructions are broken down and divided across ‘execution ports’ where each port does a different subset of operations, reducing the overall needs internally. That's partly why code optimization includes mixing operations like loads, multiplies, and stores rather than doing 100 loads and saturating the port dedicated to loads, then 100 multiply operations saturating the execution port doing the multiplication, followed by 100 stores saturating the execution port doing that work. Even something like matrix multiply splits the work between multiplication and addition so multiple compute ports can stay busy while hopefully some other thread is using the load/store ports.
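The port-saturation point above can be sketched with a toy Python model. Everything in it is an assumption made for illustration: one port per operation type, every operation taking exactly one cycle, no data dependencies, and strictly in-order issue. A real core reorders operations within a window, so the gap in practice is smaller than this model suggests.

```python
# Hypothetical port counts for the toy model, not taken from any real chip.
PORTS = {"load": 1, "mul": 1, "store": 1}


def cycles_needed(ops, ports):
    """Count cycles for a strictly in-order toy core.

    Each cycle, operations issue from the front of the queue until one
    finds every port of its type already used in that cycle.
    """
    cycles = 0
    i = 0
    while i < len(ops):
        free = dict(ports)   # all ports are free at the start of a cycle
        while i < len(ops) and free[ops[i]] > 0:
            free[ops[i]] -= 1
            i += 1
        cycles += 1
    return cycles


# Same 300 operations, two different orderings.
blocked = ["load"] * 100 + ["mul"] * 100 + ["store"] * 100
mixed = ["load", "mul", "store"] * 100

print("blocked:", cycles_needed(blocked, PORTS))
print("mixed:  ", cycles_needed(mixed, PORTS))
```

In this model the mixed ordering keeps all three ports busy every cycle and finishes in roughly a third of the cycles of the blocked ordering, which leaves two ports idle most of the time.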

Each chip iteration has its own breakdown of what gets done where. According to Intel, for the current Golden Cove lineup the 12 ports are broken out as follows: 3 load data (ports 2, 3, and 11), 2 store data (ports 4 and 9), 2 store addresses (ports 7 and 8), and 5 do different subsets of math operations (ports 0, 1, 5, 6, and 10). They even break it down by which sets of instructions get handled by different ports. Basic operations like add get handled on 5 of them because the operation is common and relatively easy, but multiply, division, square root, and other complex operations are only on one of them. That further reduces the need for the whole chip to know how to do all the work. Instead the work gets farmed out to subsections of the chip, each subsection has the complex circuitry for its job, and the result is signaled back.
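As a quick sanity check, the port breakdown listed above can be written down as a small Python table. The port numbers come straight from the post; the dictionary name and the helper `role_of` are made up for this sketch, and it deliberately does not guess which math port handles multiply or division.

```python
# Port groups as listed in the post above.
GOLDEN_COVE_PORTS = {
    "load data": (2, 3, 11),
    "store data": (4, 9),
    "store address": (7, 8),
    "math": (0, 1, 5, 6, 10),
}

# Flatten the groups and confirm they cover ports 0 through 11 exactly once.
all_ports = sorted(p for group in GOLDEN_COVE_PORTS.values() for p in group)
assert all_ports == list(range(12))


def role_of(port):
    """Return the group name a port number belongs to."""
    for role, ports in GOLDEN_COVE_PORTS.items():
        if port in ports:
            return role
    raise ValueError(f"unknown port {port}")


for port in range(12):
    print(port, role_of(port))
```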


It's a wild transistor world. Thank you frob.


This topic is closed to new replies.
