What is Delaying the Adoption of EPIC Computers Like Itanium?
The first computers were nothing more than a few hundred gates, so
all they could implement was a "reduced instruction set". As technology
improved, the instruction sets grew and we entered an era of CISC,
"Complex Instruction Set Computers". The problem with CISC is that
these chips need a lot of circuitry to decode their instructions, so
designers went back to basics and built modern "Reduced Instruction
Set Computers" or RISC. Ten years ago the battle between CISC and
RISC was supposed to end, and the winner was supposed to be EPIC
computers like the Intel Itanium.
Simple Computers
When computers were built from simple components (tubes, transistors,
and simple integrated circuits) every "bit" was expensive. Every
component also added its own probability of failure so large complex
designs had failure rates too high to make them commercially viable.
Instruction sets were very limited and commercial computers would
have only 10 to 20 instructions. Programmers understood these instructions
so well they remembered the hex values and could write in machine
code. Instruction sets grew in size and complexity as the semiconductor
industry learned how to put more transistors on an integrated chip.
Simple computers have not gone away; they are now called embedded
computers and they live on in TV remotes, microwave ovens, automobiles,
toys, and practically anything with a keypad. Soon these simple computers
will show up as RFID "smart dust" that replaces the bar codes on everything
you buy. Programmers who can program in hex will never be without
a job.
Complex Instruction Set Computers (CISC)
Moore's Law drove Large Scale Integration (LSI) to Very Large Scale
Integration (VLSI) and soon there were several million transistors
available on the CPU chip to implement more complex instructions.
Memory chips also used the same technology but several million transistors
only provided systems with a few megabytes. Memory systems were
slow because they needed lots of interconnected chips for capacity
and they needed to spread them out to keep them cool. CPU designers
had to be very careful with memory use and when they fetched an
instruction from memory they wanted it to do a lot before they had
to get the next instruction.
The solution was a large, complex instruction set. Current instruction
sets are so complete that simple programming languages like C or
Fortran only need one or two CPU instructions for every line of high-level
language. The downside is that the instruction set is too large
for the average programmer to remember, so to get the most from a
CISC you need a high-level-language compiler. Programmers writing
compilers have to know all the instructions, but everyone else can forget
how to program in assembler or machine code.
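As a rough illustration (the instruction counts vary by compiler and
target, and the assembly in the comments is representative rather than
the output of any particular compiler), one line of C that increments
an array element can map to a single CISC read-modify-write instruction,
where a classic RISC machine needs a separate load, add, and store:

    /* One line of high-level language, two instruction-set styles. */
    void bump(int *a, int i)
    {
        a[i] += 1;
        /* CISC (x86 style): one read-modify-write instruction, roughly
         *     addl $1, (%rdi,%rsi,4)
         * RISC (MIPS style): three simple instructions, roughly
         *     lw   t0, 0(t1)      load the element
         *     addi t0, t0, 1      add one
         *     sw   t0, 0(t1)      store it back
         */
    }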
Microcoded CPUs
The complex instructions required ever more complex decode and
execution units on the CPU chip, and just designing and testing all
the connections became unmanageable. The solution was a CPU within a CPU:
a simple CPU with its own memory and simple instruction set that
implemented the complex instructions. The instruction "add register
one to register two" becomes a little program that runs on the little
CPU on the chip. This way designers could add new, even more complex
instructions just by writing little programs in the microcode
language used by the little CPU. All CISC chips are now microcoded
and some (not Intel's) will actually let you write and load microcode
so you can add new instructions to the CPU.
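A minimal sketch of the idea in C; the micro-operation names and the
toy sequencer are invented for illustration and are not any vendor's
real microcode:

    #include <stdio.h>

    /* Invented micro-operations for a toy microcoded CPU. */
    enum micro_op { UOP_READ_R1, UOP_READ_R2, UOP_ALU_ADD, UOP_WRITE_R2, UOP_END };

    /* The "little program" behind one complex instruction:
       "add register one to register two". */
    static const enum micro_op add_r1_r2[] = {
        UOP_READ_R1,   /* latch register 1 onto ALU input A */
        UOP_READ_R2,   /* latch register 2 onto ALU input B */
        UOP_ALU_ADD,   /* add the two inputs */
        UOP_WRITE_R2,  /* write the result back to register 2 */
        UOP_END
    };

    int main(void)
    {
        int r1 = 3, r2 = 4, in_a = 0, in_b = 0, alu = 0;
        /* The micro-sequencer steps through the little program. */
        for (const enum micro_op *upc = add_r1_r2; *upc != UOP_END; upc++) {
            switch (*upc) {
            case UOP_READ_R1:  in_a = r1;          break;
            case UOP_READ_R2:  in_b = r2;          break;
            case UOP_ALU_ADD:  alu  = in_a + in_b; break;
            case UOP_WRITE_R2: r2   = alu;         break;
            default:                               break;
            }
        }
        printf("r2 = %d\n", r2);  /* prints 7 */
        return 0;
    }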
Reduced Instruction Set Computers (RISC)
When advances in semiconductors increased the capacity and speed
of memory chips the memory system was no longer a bottleneck. Now
CPU designers could throw away all the microcoding circuitry and
drive the little microcoded CPU directly. More important, without
all that extra circuitry they could make the chips smaller and increase
the clock speed. Early RISC systems had clock speeds three or four
times faster than CISC systems. They needed more of the simple instructions
to do useful work but even with the extra memory fetch cycles the
computer system was still two or three times faster for the same
price.
Pipelining was one more advantage of a RISC CPU. Since there are
a limited number of simple instructions the program flow is more
predictable and you can do the fetch, decode, execute functions
of the CPU in parallel. This depends on the fact that most programs
have long stretches of code without any branching. If you hit a
branch (is z = 0 ?) you have to flush the pipeline and figure out
which path to take before you start pipelining again.
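A back-of-the-envelope model in C makes the win concrete; the three-stage
split, the branch rate, and the flush penalty are illustrative assumptions,
not measurements of any real chip:

    #include <stdio.h>

    /* Toy model: three pipeline stages (fetch, decode, execute),
       each taking one clock cycle. */
    int main(void)
    {
        const long stages = 3;
        const long instructions = 1000;
        const long branches = 100;              /* assume 1 in 10 branches   */
        const long flush_penalty = stages - 1;  /* refill cost after a flush */

        long unpipelined = instructions * stages;
        long pipelined   = instructions + (stages - 1)  /* fill the pipe once */
                         + branches * flush_penalty;    /* flush per branch   */

        printf("unpipelined: %ld cycles\n", unpipelined); /* 3000 cycles */
        printf("pipelined:   %ld cycles\n", pipelined);   /* 1202 cycles */
        return 0;
    }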
The RISC designers had one more advantage: since the instruction
decode and execution units are simple you can put a bunch of them
on the CPU chip. When the program hits a branch you can just load
both paths in parallel and throw away the results from the path
not taken. As the RISC CPU reads ahead it creates a tree of these
code branches. Some CPUs have enough extra units to manage a tree
with four or five branches before they have to wait for the results
of the first branch. Keeping track of this branching table and the
status of all the units requires a little administrative CPU. At
some point the administrative CPU is doing more work than the user's
program so this strategy has a limit.
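In software terms the trick looks like the sketch below: evaluate both
sides of the branch, then keep one result and discard the other. This is
only a loose analogy; on a real chip the spare hardware units do both
computations at once:

    #include <stdio.h>

    /* Stand-ins for the work on each side of "is z == 0 ?". */
    static int path_taken(int x)     { return x * 2; }
    static int path_not_taken(int x) { return x + 100; }

    int main(void)
    {
        int z = 0, x = 21;
        int a = path_taken(x);       /* one execution unit works this path */
        int b = path_not_taken(x);   /* another unit works the other path  */
        int result = (z == 0) ? a : b;  /* once z is known, one result is
                                           kept and the other thrown away  */
        printf("result = %d\n", result);
        return 0;
    }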
RISC vs. CISC: a Battle of Heat and Light
If there were no other limits a RISC CPU would always be faster
than a CISC CPU for a given semiconductor technology. At higher
clock speeds there are serious problems with providing clean power
and removing heat from the active areas of the chip surface. When
the clock speeds get to the gigahertz range the speed of light starts
to limit how long a signal takes to cross the chip surface. With
these limits you cannot just increase the clock speed for a RISC
CPU to give it an advantage over a CISC CPU. At very high clock
speeds the RISC CPU is forced to run at the same speed as the CISC
CPU and the RISC advantage disappears entirely. While the battle
raged, the CISC designers were not sleeping: they were optimizing
their microcode so common instructions had fewer lines in their
microprograms, and they were implementing pipelining. The net result
is that CISC CPUs have the advantage at high clock speeds.
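The speed-of-light limit is easy to put numbers on. Assuming on-chip
signals travel at very roughly half the vacuum speed of light (a generous
rule of thumb; real wire delays are usually worse):

    #include <stdio.h>

    int main(void)
    {
        double c = 3.0e8;             /* speed of light, m/s        */
        double clock_hz = 3.0e9;      /* a 3 GHz clock              */
        double period_s = 1.0 / clock_hz;
        double wire_speed = c / 2.0;  /* rough on-chip signal speed */
        double reach_m = wire_speed * period_s;

        printf("clock period: %.0f ps\n", period_s * 1e12);          /* ~333 ps */
        printf("signal reach per tick: %.1f cm\n", reach_m * 100.0); /* ~5 cm   */
        /* A signal can barely cross a few-centimetre die and return in
           one tick, so simply raising the clock stops paying off. */
        return 0;
    }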
Explicitly Parallel Instruction Set Computers (EPIC)
The battle between RISC and CISC had created a number of advances
in CPU design that made EPIC CPUs possible. The extra decode and
execution units in RISC CPUs were able to do parallel actions. The
user of the RISC CPU only saw one thread of execution but the little
administrative CPU on the RISC chip was managing multiple parallel
threads.
What if the administrative CPU on the RISC chip ran microcode like
the little CPU on the CISC chip? The user of the CPU could then explicitly
control all the parallel units from a high-level language. In one
clock tick the CPU can be executing several independent threads
of parallel activity. This means the clock is no longer the only limit
to raw CPU speed. The CPU speed is now the clock speed times the
number of parallel threads or "issues".
A cheap six issue one gigahertz CPU can run as fast as an expensive
six gigahertz CPU. If you want more parallel threads you just add
more units. If you have a hundred floating point units on the chip
surface you could organize them as an array processor and your Excel
spreadsheet runs blindingly fast. If you have a hundred bit-blitter
units you can move pixels around on your computer screen faster
than taking the time to load the problem into a separate video card.
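The arithmetic behind these claims is simple, and the "explicit" part of
EPIC is that the compiler packs independent operations into one wide
instruction word ahead of time. The six-slot bundle below is a made-up
illustration, not the real IA-64 encoding:

    #include <stdio.h>

    /* Made-up six-slot bundle: six independent operations the compiler
       has proven can all run in the same clock tick. */
    struct bundle {
        const char *slot[6];
    };

    int main(void)
    {
        struct bundle b = {{ "add r1,r2", "mul r3,r4", "load r5,[r6]",
                             "store [r7],r8", "fadd f1,f2", "cmp r9,0" }};
        double clock_ghz = 1.0;
        int issues = 6;

        /* 1 GHz x 6 issues = 6 billion operations per second, the same
           peak as a single-issue 6 GHz CPU - if the slots can be filled. */
        printf("peak rate: %.0f billion ops/s\n", clock_ghz * issues);
        for (int i = 0; i < issues; i++)
            printf("slot %d: %s\n", i, b.slot[i]);
        return 0;
    }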
So When Do I Get One?
If EPIC computers have so many advantages, why don't I have one on
my desktop? The simple answer is that they are very difficult to build
and program properly. In 1989 the now-defunct Digital Equipment
built a 64-bit EPIC computer called the Alpha. They had the hardware
for two years before they could get a compiler that worked. When
they started shipping Alphas in 1992 they would only do one "issue".
By the next year they had a two-issue compiler, and suddenly all
the old Alphas ran twice as fast. Both HP and Intel had EPIC design
teams, but Digital was well ahead of the pack in shipping a commercial
product. In 1994 HP dropped their EPIC project in favour of the
Intel IA-64 design. In 1997, as part of a long, painful decline of missed
deadlines and problems trying to build faster EPIC designs, Digital sold its
Alpha technology and chip foundry to Intel for $700 million. Intel
still makes six-issue Alphas, but HP stopped selling new
systems after October 26, 2006.
If Intel had Alpha technology in 1997 why aren't they shipping
EPIC computers in volume in 2004? The major reason is they thought
EPIC was easy and they wanted to ship a "made in Intel" design.
Like Digital, Intel found it was harder than it looked. The problem is
an Itanium is more than just a bunch of decode and execute units - it is
a very complex machine.
The latest generation of Itanium 2 systems is finally showing the promised performance.
Meanwhile AMD has added 32 extra bits to the old IA-32
architecture and is selling 64-bit 8086 chips. It looks like Intel
has given in and done the same, adding 32 bits to its Xeon
chips. It is very likely the distraction of 64-bit 8086 chips will
once again delay volume shipments of Itanium chips.
2005 Update - Cell Architecture Trumps EPIC
The main advantage of EPIC computers is the number of "issues" or threads of activity
they can run. An eight issue EPIC with lots of work packets to keep it busy could run eight times
as fast as a single threaded RISC or CISC chip. Now IBM, Sony and Toshiba have announced
a Cell processor technology
that gets multiple threads of activity from slave CPUs managed by a modified PowerPC master.
The PowerPC actually runs two threads of activity and the eight slaves (called SPEs or
Synergistic Processing Elements) can each run one thread. The PowerPC is only responsible
for starting and stopping activity on the SPEs, so in a well-behaved application the PowerPC
threads can also do useful work.
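A minimal sketch of that master/worker shape using ordinary POSIX threads;
the threads merely stand in for the PowerPC master and the SPEs, and this
is not the actual Cell SDK API:

    #include <pthread.h>
    #include <stdio.h>

    #define WORKERS 8   /* stand-ins for the eight SPEs */

    /* Each "SPE" runs one thread of activity on its own work packet. */
    static void *worker(void *arg)
    {
        int id = *(int *)arg;
        printf("worker %d: crunching its work packet\n", id);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[WORKERS];
        int id[WORKERS];

        /* The "PowerPC master" only starts and stops the workers... */
        for (int i = 0; i < WORKERS; i++) {
            id[i] = i;
            pthread_create(&t[i], NULL, worker, &id[i]);
        }
        /* ...so between start and join it can do useful work of its own. */
        printf("master: doing useful work while the workers run\n");
        for (int i = 0; i < WORKERS; i++)
            pthread_join(t[i], NULL);
        return 0;
    }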
One ominous feature is that each of the SPEs has its own serial number (a GUID, or
globally unique identifier), which it will need to do
GRID processing. GRID processing allows
computational work packages to be scheduled on any CPU registered with the GRID. The
result is a massively parallel computer which could be scattered across the internet
(like the SETI project). Registration requires that CPUs have a unique identifier, so each
of the SPEs has its own serial number to register with the GRID. Of course these serial
numbers could also be used to register software licenses or even track the activities
of the owner. There was a huge protest when Intel started to put serial numbers in
its Pentium chips - so much so that Intel backed down and removed the serial numbers.
For a detailed look at Cell, the Wikipedia article
provides an excellent analysis. The chips (with only two SPEs) should first start showing up in Sony PlayStation 3
units under the Christmas tree in 2006. Game developers and GRID academics are likely to get Cell
workstations first, and news about Cell Linux should start to appear as a regular headline.
2006 Update - More Delays for Intel EPIC - Cell Pulls Ahead
The Itanium project remains stalled at a two-issue core. The Tanglewood project to create a four-issue core has
been delayed until at least 2008. As another sign of bad news, the Tanglewood project has been renamed Tukwila.
Troubled projects often try to reinvent themselves with a name change long before a product release, and this
appears to be the case with Tanglewood.
On the Cell processor front, Linus Torvalds has included Cell support in the Linux kernel version 2.6.16, so
it should quickly show up in your favorite Fedora, Debian, and SuSE distributions. With the Cell being multi-threaded,
cheap, simple, and supported by many vendors, it probably means the EPIC design that started with the DEC Alpha
will fade into the museum of dead-end architectures.