Taking advantage of new hardware has always required changing programs to make better use of it. A Pentium 4 wasn't just a faster Pentium Pro: it had a different architecture, new instructions, different latencies and throughputs for various instructions, and vector processing extensions. To make full use of the P4's power, people really did have to modify their code, all the way down to the assembly level. In fact, early in the release cycle there were reports of many programs actually running slower on P4s than on P3s under certain conditions. Software developers and compiler designers had to force themselves to use the new and largely unfamiliar MMX/SSE instruction sets to get the most out of those new chips.
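To make that concrete, here is a minimal sketch (my own illustration, not anything from the chip documentation) of the kind of rewrite those vector extensions asked for: a plain scalar sum next to a version that uses SSE intrinsics to process four floats per iteration.

    // Scalar vs. SSE summation: an illustrative sketch, not production code.
    // Compile with e.g.: g++ -O2 -msse sum.cpp
    #include <xmmintrin.h>  // SSE intrinsics
    #include <cstddef>
    #include <cstdio>

    // Plain scalar version: one addition per loop iteration.
    float sum_scalar(const float* data, std::size_t n) {
        float total = 0.0f;
        for (std::size_t i = 0; i < n; ++i) total += data[i];
        return total;
    }

    // SSE version: four additions per iteration using 128-bit registers.
    float sum_sse(const float* data, std::size_t n) {
        __m128 acc = _mm_setzero_ps();
        std::size_t i = 0;
        for (; i + 4 <= n; i += 4)
            acc = _mm_add_ps(acc, _mm_loadu_ps(data + i));  // load 4 floats at once

        float lanes[4];
        _mm_storeu_ps(lanes, acc);
        float total = lanes[0] + lanes[1] + lanes[2] + lanes[3];
        for (; i < n; ++i) total += data[i];  // handle leftover elements
        return total;
    }

    int main() {
        float data[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        std::printf("scalar=%f sse=%f\n", sum_scalar(data, 10), sum_sse(data, 10));
    }

Nobody got this for free: someone had to restructure the loop around the 4-wide registers, which is exactly the kind of work described above.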
But all of this is just part of the broader trend of hardware and software evolving together. Univac programs wouldn't be a very good fit for the power of a 386, for instance. Our programming practices have evolved greatly over the last several decades. One example is that x86 programmers had to learn how to write efficient programs using a relatively small number of registers and as few memory accesses as possible. This was something of a handicap, as programmers were used to memory access being roughly the same speed as register access (maybe 2-4 times slower), rather than 10 or 20 times slower (or more!), as it became on later x86 architectures. This forced the development of cache-aware algorithms and so on.
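As a toy illustration of what "cache-aware" means in practice (my own sketch, not something from the discussion above): the two functions below perform exactly the same additions, but one walks memory sequentially while the other strides across it, so on most machines the second runs several times slower purely because of cache misses.

    #include <cstdio>
    #include <vector>

    constexpr int N = 2048;  // 2048 x 2048 doubles, roughly 32 MB

    double sum_row_major(const std::vector<double>& m) {
        double total = 0.0;
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                total += m[i * N + j];   // consecutive addresses: cache-friendly
        return total;
    }

    double sum_col_major(const std::vector<double>& m) {
        double total = 0.0;
        for (int j = 0; j < N; ++j)
            for (int i = 0; i < N; ++i)
                total += m[i * N + j];   // stride of N doubles: cache-hostile
        return total;
    }

    int main() {
        std::vector<double> m(static_cast<std::size_t>(N) * N, 1.0);
        std::printf("%f %f\n", sum_row_major(m), sum_col_major(m));
    }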
And where performance really matters (HPC and servers), software developers have always had to modify their code, usually for every single chip iteration. It is very uncommon, for instance, for code written on one supercomputer to run well on a new supercomputer without updating the code and the compilers.
Anyway, it's not just misleading to say that Moore's law has broken down over the past 6-8 years; it's flat-out wrong. Moore's law is about the number of transistors that can fit on a single chip, and by that measure it has been going strong and keeps going strong:
http://en.wikipedia.org/wiki/File:Transistor_Count_and_Moore%27s_Law_-_2011.svg
When you shrink transistors, the distance between them gets smaller, so you can run circuits at higher speed. That is still true today. The problem is this: in the past, when you shrank transistors, you could also lower their operating voltage without increasing the error rate. That is no longer the case, and it appears to be a limitation of silicon CMOS technology rather than of photolithography. So today we have chips that are in principle capable of operating at 10 GHz or more, but they would dissipate impractical amounts of power doing so, and they are therefore run far below the speeds they could theoretically reach.

The cure for this problem is to do the same amount of work with fewer transistors, even if that means somewhat lower speeds. The payoff of using 2x fewer transistors for a task more than outweighs the disadvantage of having it run 2x slower: dynamic power scales roughly with frequency times the square of the voltage, so a 2x reduction in frequency and voltage results in far more than a 2x reduction in power usage. This is in many ways the opposite of the trend of the late 1990s and early 2000s.

Thus we now have hybrid architectures that use a huge number of very simple, low-transistor-count cores (like nVidia's CUDA cores or Intel's MIC architecture) running at low clock speeds but with very high parallelism. These architectures have made computing MUCH faster. The price of this speedup is that mainstream computer hardware has become parallel, so non-HPC programmers now have to deal with issues that were traditionally reserved for HPC programmers. Hence the tension and anxiety we now see in mainstream programming.
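To give a flavour of what that shift looks like from the programmer's side, here is a minimal sketch (my own example; the chunking scheme and thread count are illustrative choices) of the kind of explicit parallelism that used to be an HPC concern and is now routine on an ordinary multicore machine: splitting a reduction across hardware threads.

    // Parallel sum with std::thread: a sketch of mainstream parallelism.
    // Build with e.g.: g++ -O2 -std=c++14 -pthread psum.cpp
    #include <algorithm>
    #include <cstdio>
    #include <numeric>
    #include <thread>
    #include <vector>

    double parallel_sum(const std::vector<double>& data) {
        unsigned n_threads = std::max(1u, std::thread::hardware_concurrency());
        std::vector<double> partial(n_threads, 0.0);
        std::vector<std::thread> workers;

        std::size_t chunk = data.size() / n_threads;
        for (unsigned t = 0; t < n_threads; ++t) {
            std::size_t begin = t * chunk;
            std::size_t end = (t + 1 == n_threads) ? data.size() : begin + chunk;
            // Each thread writes only its own slot of `partial`, so no locking is needed.
            workers.emplace_back([&data, &partial, t, begin, end] {
                partial[t] = std::accumulate(data.begin() + begin, data.begin() + end, 0.0);
            });
        }
        for (auto& w : workers) w.join();
        return std::accumulate(partial.begin(), partial.end(), 0.0);
    }

    int main() {
        std::vector<double> data(1000000, 1.0);
        std::printf("sum = %f\n", parallel_sum(data));
    }

Even in this tiny example the programmer has to think about how to split the work, how to avoid sharing mutable state between threads, and what happens when the data doesn't divide evenly: exactly the kind of reasoning that used to live mainly in HPC codes.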
Essentially, as computing has advanced, the average CPU in your laptop has come to look more and more like a supercomputer of 20 years ago, and as a result it has inherited the same programming difficulties.