Inside Intel's Penryn
Published: 12 Nov 2007
Architectural tweaks
During the design of Penryn, Intel tuned this predictive behaviour primarily to cut power consumption, which again leaves a far more relaxed thermal envelope than previous designs had when pushing for higher clock speeds. This also allows a trick that Intel calls Enhanced Dynamic Acceleration Technology, known to the rest of us as overclocking. If one core in a multicore design is quiescent, there's enough slack in the chip's overall power budget for the other core to be accelerated past its normal limits. That's particularly useful for single-threaded applications, which still make up the bulk of software.

Enhanced Dynamic Acceleration Technology allows an idle core to be powered down and an active core to be overclocked.
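The power-budget reasoning behind that trade can be sketched in a few lines of Python. This is only an illustration of the idea, not Intel's mechanism: the function name plan_clocks and every wattage and clock figure below are made-up assumptions chosen to keep the arithmetic easy to follow.

```python
def plan_clocks(active, base_ghz=2.4, boost_ghz=2.6, budget_w=35.0,
                core_base_w=15.0, core_boost_w=22.0, idle_w=1.0):
    """Pick a clock for each core without exceeding the chip's power budget.

    `active` is a list of booleans, one per core. All default figures are
    hypothetical; they are not Penryn specifications.
    """
    # Idle cores are powered down (0.0 GHz) and draw only a small idle power.
    clocks = [base_ghz if is_active else 0.0 for is_active in active]
    power = sum(core_base_w if is_active else idle_w for is_active in active)

    # Extra power one core needs to run at the boosted clock.
    extra = core_boost_w - core_base_w

    # Boost an active core only if the whole chip stays inside the budget.
    for i, is_active in enumerate(active):
        if is_active and power + extra <= budget_w:
            clocks[i] = boost_ghz
            power += extra
    return clocks


if __name__ == "__main__":
    print(plan_clocks([True, False]))  # one busy core, one idle core
    print(plan_clocks([True, True]))   # both cores busy
```

With these invented numbers, a single busy core has room to step up because its idle neighbour has handed back most of its share of the budget; with both cores busy there is no slack, so both stay at the base clock.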
Power savings come with performance enhancements in other ways too. The chip can spot when loops habitually generate cache misses, for example, and does a speculative pre-fetch ahead of time to populate the cache with the data in case it happens again. That's tied in with a dynamically resized window that trades off memory bandwidth against latency. If the memory bus is quiet, the window increases to soak up the spare bandwidth just in case; if the bus is very busy, the chip fetches less to reduce the load, at the risk of having to go back if it misses data that's needed later.
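The bandwidth-versus-latency trade-off can be pictured with a small Python sketch. It assumes the simplest possible policy: grow the prefetch window when the bus is quiet and shrink it when the bus is busy. The function name next_window, the thresholds and the window limits are all hypothetical; Intel has not published the actual heuristic.

```python
def next_window(current, bus_utilisation, low=0.3, high=0.7,
                min_window=1, max_window=8):
    """Return the prefetch window (in cache lines) for the next interval.

    `bus_utilisation` is the fraction of memory-bus capacity in use (0.0-1.0).
    Thresholds and limits are illustrative assumptions, not Penryn values.
    """
    if bus_utilisation < low:
        # Bus is quiet: fetch further ahead to soak up spare bandwidth.
        return min(current * 2, max_window)
    if bus_utilisation > high:
        # Bus is busy: fetch less, accepting the risk of a later miss.
        return max(current // 2, min_window)
    # In between: leave the window alone.
    return current


if __name__ == "__main__":
    window = 2
    for load in (0.1, 0.1, 0.5, 0.9, 0.9, 0.9):
        window = next_window(window, load)
        print(f"bus load {load:.1f} -> window {window}")
```

Doubling and halving is just one plausible choice here; the point is that the window follows bus load rather than staying fixed.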
Much of this behaviour is fine-tuned in the microcode, the internal program that defines how the different parts of the processor work together. The microcode can be configured differently for different products, with different trade-offs made for mobile and server chips. This is entirely invisible to the operating system and applications that run on the chip, but makes a difference to performance and power consumption.
There are plenty of straightforward design tweaks too. For example, Intel's basic mathematical divider circuit has remained unchanged since the original Pentium: it works by long division, and produces two bits of the quotient at a time. That's been upped to four bits for the Penryn divider, which effectively doubles the speed. Also, processors are notoriously bad at handling data that's not nicely stored in memory, that is, data that doesn't start and stop on convenient memory boundaries. This misalignment needs a lot of juggling to be handled efficiently, and Penryn's designers spent a lot of time tuning the chip to work well with what they call 'junk code'.
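To see why doubling the quotient bits per step roughly halves the work, here is a minimal Python sketch of long division that retires a configurable number of quotient bits on each pass. It is a software model only: a real divider picks each quotient digit with dedicated selection logic rather than a search loop, and the function name divide and its parameters are invented for this example.

```python
def divide(dividend, divisor, width=32, bits_per_step=2):
    """Long division producing `bits_per_step` quotient bits per iteration.

    Returns (quotient, remainder, steps). A model of the idea, not of
    Intel's circuit.
    """
    if divisor <= 0:
        raise ValueError("divisor must be positive")
    if not 0 <= dividend < (1 << width):
        raise ValueError("dividend does not fit in the given width")
    if width % bits_per_step:
        raise ValueError("width must be a multiple of bits_per_step")

    radix = 1 << bits_per_step
    quotient = remainder = steps = 0

    for shift in range(width - bits_per_step, -1, -bits_per_step):
        # Bring down the next group of dividend bits.
        chunk = (dividend >> shift) & (radix - 1)
        remainder = (remainder << bits_per_step) | chunk

        # Find the largest digit whose multiple of the divisor still fits.
        digit = 0
        while digit + 1 < radix and (digit + 1) * divisor <= remainder:
            digit += 1

        remainder -= digit * divisor
        quotient = (quotient << bits_per_step) | digit
        steps += 1

    return quotient, remainder, steps


if __name__ == "__main__":
    print(divide(1000, 7, bits_per_step=2))  # two quotient bits per step
    print(divide(1000, 7, bits_per_step=4))  # four quotient bits per step
```

For a 32-bit operand the two-bit version takes 16 passes and the four-bit version takes 8, with identical results, which is the "effectively doubles the speed" claim in miniature.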
Virtualisation performance is also improved: the chip keeps better internal track of the states of the virtual machines, and controls more precisely just those parts of the circuit that need to change when virtual states are entered and exited. Intel says that performance is better by between 25 and 75 percent for some instructions, although the company is still working out exactly how to benchmark virtualisation.