Sounds about right. It's been decades since I've done those, but yes, the circuit grows as more bits are added. A 2-bit adder needs 7 logic gates, a 4-bit adder needs 17, an 8-bit adder needs 37, and so on. Since the processor is doing far more than just adding, you need circuitry for each variation, plus circuitry to route the signals to whichever sub-circuit does the work.
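Just to make those numbers concrete, here's a quick Python sketch (my own illustration, not anything from a real design flow) that wires up a ripple-carry adder out of individual gate operations and tallies the gates. It assumes a half adder (2 gates) for the low bit and full adders (5 gates each) above it, which is exactly where the 7/17/37 figures come from; real chips use faster carry-lookahead or carry-save structures, so treat it purely as illustration.

```python
def ripple_carry_add(a_bits, b_bits):
    """Add two equal-length bit lists (least significant bit first).
    Returns (sum_bits, carry_out, gates_used)."""
    gates = 0
    sum_bits = []
    # Bit 0: half adder -- one XOR, one AND.
    s = a_bits[0] ^ b_bits[0]
    carry = a_bits[0] & b_bits[0]
    gates += 2
    sum_bits.append(s)
    # Remaining bits: full adders -- two XORs, two ANDs, one OR each.
    for a, b in zip(a_bits[1:], b_bits[1:]):
        p = a ^ b
        s = p ^ carry
        carry = (a & b) | (p & carry)
        gates += 5
        sum_bits.append(s)
    return sum_bits, carry, gates

def to_bits(n, width):
    return [(n >> i) & 1 for i in range(width)]

for width in (2, 4, 8):
    _, _, gates = ripple_carry_add(to_bits(3, width), to_bits(1, width))
    print(f"{width}-bit ripple-carry adder: {gates} gates")
# Prints 7, 17, and 37 gates respectively.
```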
Intel tends to be quiet about numbers and many implementation details. The Apple M1 has 16 billion transistors, the M1 Ultra has 114 billion, and the M2 Ultra has 134 billion. That kind of complex mesh covering all the various operations requires a tremendous amount of logic gate circuitry to get work done.
I have a few acquaintances who do VLSI work, basically running the software that reroutes and reorganizes the billions of transistors on a chip so all the traces compute efficiently, and as I understand it, it's basically as you describe. Any student can put together the circuits for a 4-bit adder; that's a typical assignment for anybody going into computing hardware, and for many software devs too. 8-bit and even 16-bit versions of a few operations can also be done by hand, although it's tedious. But once you're into 32-bit and 64-bit operations, let alone 256-bit and 512-bit vector operations, you're looking at software determining the thousands of circuit steps for each operation.
Many of the design decisions in processors go toward meeting those requirements.
CPUs these days work in stages; an early one decodes instructions into micro-operations, sub-steps of the bigger operation, so common circuit paths can be reused. Decoding and reordering are separate, dedicated steps where each type of operation gets sorted out across the hardware. Beyond that, the micro-operations get divided across 'execution ports', where each port handles a different subset of operations, which reduces how much circuitry the chip needs overall. That's partly why code optimization includes mixing loads, multiplies, and stores rather than doing 100 loads and saturating the ports dedicated to loads, then 100 multiplies saturating the execution port that does multiplication, followed by 100 stores saturating the ports that handle those. Even something like a matrix multiply splits the work between multiplies and additions so multiple compute ports stay busy while, ideally, some other thread is using the load/store ports.
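If it helps to picture the port contention, here's a toy Python sketch of it. The port mix (3 load ports, 1 multiply port, 2 store ports) and the 16-entry scheduler window are made-up numbers for illustration, not any specific chip, but the effect is the same: a long run of one instruction type gets throttled by its port while the others sit idle, and an interleaved mix finishes sooner.

```python
from collections import deque

PORTS = {"load": 3, "mul": 1, "store": 2}   # hypothetical ports per op type
WINDOW = 16                                  # hypothetical scheduler window size

def cycles_to_run(instrs):
    """Toy model: each cycle, at most PORTS[op] instructions of each type
    can issue from a fixed-size window that refills in program order."""
    pending = deque(instrs)
    window = deque()
    cycles = 0
    while pending or window:
        # Refill the window in program order.
        while pending and len(window) < WINDOW:
            window.append(pending.popleft())
        # Issue up to the per-type port limit this cycle, oldest first.
        slots = dict(PORTS)
        kept = deque()
        for op in window:
            if slots.get(op, 0) > 0:
                slots[op] -= 1      # issued this cycle
            else:
                kept.append(op)     # no free port, waits
        window = kept
        cycles += 1
    return cycles

batched     = ["load"] * 100 + ["mul"] * 100 + ["store"] * 100
interleaved = ["load", "mul", "store"] * 100
print("batched:    ", cycles_to_run(batched), "cycles")
print("interleaved:", cycles_to_run(interleaved), "cycles")
# In this model the batched ordering takes noticeably more cycles,
# because each phase saturates a single port type while the rest idle.
```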
Each chip iteration has its own breakdown of what gets done where. According to Intel, for the current Golden Cove lineup the 12 ports break out like this: three handle load data (ports 2, 3, and 11), two handle store data (ports 4 and 9), two handle store addresses (ports 7 and 8), and five handle different subsets of math operations (ports 0, 1, 5, 6, and 10). They even break it down by which sets of instructions get handled by which ports; basic operations like add get handled on 5 of them because the operation is common and relatively easy, but multiply, division, square root, and other complex operations are only on one. That further reduces the need for the whole chip to know how to do all the work. Instead the work gets farmed out to a subsection of the chip that has the complex circuitry, and the result gets signaled back.
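Written out as data, that split looks something like the sketch below. The port numbers are the ones from Intel's breakdown above; exactly which of the math ports carries multiply and which carries divide/sqrt is my assumption here, since the paragraph only says they each live on a single port.

```python
# Rough map of the Golden Cove port split described above.
# "alu" = simple integer math; placement of "mul" and "div" is assumed.
GOLDEN_COVE_PORTS = {
    0: {"alu", "div"},        # assumption: divide/sqrt on one math port
    1: {"alu", "mul"},        # assumption: multiply on one math port
    2: {"load"},
    3: {"load"},
    4: {"store_data"},
    5: {"alu"},
    6: {"alu"},
    7: {"store_addr"},
    8: {"store_addr"},
    9: {"store_data"},
    10: {"alu"},
    11: {"load"},
}

def peak_per_cycle(op):
    """Upper bound on how many micro-ops of a given type can start per cycle."""
    return sum(1 for caps in GOLDEN_COVE_PORTS.values() if op in caps)

print("adds/cycle :", peak_per_cycle("alu"))    # 5 -- common op, replicated widely
print("mults/cycle:", peak_per_cycle("mul"))    # 1 -- complex circuitry, one port
print("loads/cycle:", peak_per_cycle("load"))   # 3
```

Dispatch just picks any free port whose capability set matches the micro-op, so only the ports that actually list multiply or divide need to carry that complex circuitry.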