|
Although the first implementation was not superscalar, the 960 was designed to allow dispatching of instructions to multiple (undefined, but generally including at least one integer) execution units, which could include internal registers (such as the four 80 bit registers in the floating point unit (32, 64, and 80 bit IEEE operations)) - the 960 CA version (1989) was superscalar. There are sixteen 32 bit global registers which can be shared by all excution units and sixteen register "caches" - similar to the SPARC register windows, but not overlapping (originally four banks). It's a load/store Harvard architecture (32-bit flat addressing), but has some complex microcoded instructions (such as CALL/RET). There are also thirty-two 32 bit special function registers.
It's a very clean embedded architecture, not designed for high level applications, but very effective and scalable - something that can't be said for all Intel's processor designs.
The 860 has several modes, from regular scaler mode to a superscalar mode that executes two instructions per cycle and a user visible pipeline mode (instructions using the result register of a multi-cycle op would take the current value instead of stalling and waiting for the result). It can use the 8K data cache in a limited way as a small vector register (like those in supercomputers). The unusual cache uses virtual addresses, instead of physical, so the cache has to be flushed any time the page tables changes, even if the data is unchanged. Instruction and data busses are separate, with 4 G of memory, using segments. It also includes a Memory Management Unit for virtual storage.
The 860 has thirty two 32 bit registers and thirty two 32 bit (or sixteen 64 bit) floating point registers. It was one of the first microprocessors to contains not only an FPU as well as an integer ALU, and also included a 3-D graphics unit (attached to the FPU) that supports lines drawing, Gouraud shading, Z-buffering for hidden line removal, and operations in conjunction with the FPU. It was also the first able to do an integer operation, and a (unique at the time) multiply and add floating point instruction, for the equivalent of three instructions, at the same time.
However actually getting the chip at top speed usually requires using assembly language - using standard compilers gives it a speed closer to other processors. Because of this, it was used as a coprocessor, either for graphics, or floating point acceleration, like add in parallel units for workstations. Another problem with using the Intel 860 as a general purpose CPU is the difficulty handling interrupts. It is extensively pipelined, having as many as four pipes operating at once, and when an interrupt occurs, the pipes can spill and lose data unless complex code is used to clean up. Delays range from 62 cycles (best case) to 50 microseconds (almost 2000 cycles).
The first POWER CPU (POWER1) was implemented using three ICs for the processor - branch, integer and floating point units - plus two or four cache chips, and defined the basic architecture. The branch unit was unusually complex, and contained the program counter, as well as a condition code (CC) register and a loop register. The CC register has eight field sets, the first two reserved for fixed and floating point operations, the seventh (later) for vector operations, and the rest which could be set separately, and combined or checked several instructions later. The loop register is a counter for 'decrement and branch on zero' loops with no branch penalty (similar to certain DSPs like the TMS320C30). POWER1 was also one of the first superscalar CPUs of its generation, the branch unit could dispatch multiple instructions to the two functional unit input queues while itself executing a program control operation (up to four operations at once, even out of order). Speculative branches were supported using a prediction bit in the branch instructions (results discarded before being saved if not taken, the alternate instruction was buffered and discarded if the branch was taken), and the branch unit manages subroutine calls without branch penalties, as well as hardware interrupts. Results are forwarded to instructions in the pipeline which use them before they are written to the registers.
Thirty two 32-bit registers were defined for the POWER1 integer unit, which also included certain string operations, as well as all load/store operations. In addition, it included a special MQ register for extended precision multiply/divides, similar to the MIPS HI/LO registers. Like many other load-store CPUs, register R0 is treated as constant 0 for some instructions, but it is used like a normal register most of the time. The POWER/PowerPC architecture supports memory/data-style 'update' operations, incrementing/decrementing the used address register before a load/store.
The floating point unit had thirty two 64 bit registers, performing only double precision operations, and including a DSP-like multiply-accumulate operation. Floating point exceptions are imprecise - in fact, don't produce exceptions at all, but set a condition bit on an error. The bit must be tested by software to determine if an error occurred.
IBM, Motorola, and Apple formed a coalition (around 1992) to produce a microprocessor version of the POWER design as a successor to both the Motorola 68000 and Intel 80x86, resulting in the PowerPC. The architectural differences began with the elimination of the MQ register, since it would add complexity to possible superscalar versions. This was replaced with separate instructions to calculate the upper and lower parts of a multiplication, which would (with two integer units) execute simultaneously anyway. Division was handled similarly using general registers. In addition, the more complex string operations and three-source instructions were removed, and finally, 32 bit floating point support was added. Dropped POWER instructions were to be emulated in the PowerPC CPUs.
The first PowerPC 601 (1993) was a bridge (considered first generation or G1), and included both POWER and PowerPC features, based strongly on the POWER1, except it had a single 32K cache rather than separate I/D caches. It defined the Motorola 88000 as the standard PowerPC bus. The 603 (1993?, first second generation G2) separated the main functional units further, removing load/store operations from the integer unit (four functional units total - integer, floating point, load/store (using integer registers), branch), and splitting the branch unit into a fetch/branch unit, a dispatch unit, and a completion/exception unit. The 603 also added a rename buffer in the dispatch unit for speculative execution using renamed integer and floating point registers, which are ordered properly by the completion/exception unit, or discarded for mispredicted branches and exceptions. Separate 8K and 16K I/D cache versions were available.
The PowerPC 604 (mid 1995) added dynamic branch prediction using a branch history table, and added two simplified integer units - three integer, two for single-cycle operations, one for multicycle operations such as multiply/divide, plus floating point, load/store and branch, total of six. Four instructions could be dispatched at once The CC register could also be renamed. The PowerPC 620 expanded the 604 design to 64 bits (but with a 'backside' L2 cache bus), and added new 64 bit instructions, but was delivered much later and slower than promised, and was further delayed when it was with drawn for a redesign. The 32 bit PowerPC 750 (G3, early 1998) refined the design and performance, adding a P620-style backside cache bus, but made no other significant changes.
Workstation versions continued with the POWER2 (1993), a high bandwidth design with two floating point load/store units, 256K of data cache, and added 128-bit floating point support and a square root instruction. Initially a multichip design, it was later combined into one chip (P2SC), and then into an eight CPU "SuperChip". It could issue up to six instructions and four simultaneous loads or stores. It was superceded by the POWER3 (Early 1998), with eight functional units (two FPU, three integer (two single cycle, one multicycle), two load/store, and branch unit), but capable of operating at much higher clock speeds. In addition, a 64 bit designed as the CPU for the AS/400 E series, the PowerPC A35 (Apache) with added decimal arithmatic and string instructions, was also used in the RS/6000 S70 workstation (called the PowerPC RS64).
In addition, IBM and Motorola have designed simplified embedded versions, such as the IBM 40x series, and Motorola's 8xx versions, though complexity limits how small the designs can be - for the lower end, Motorola designed the ARM-like MCore low cost/power RISC CPU, while IBM simply licensed the ARM itself.
In direct response to Intel's MMX instructions, AltiVec extensions are planned for fourth (G4) generation PowerPC CPUs from Motorola (but not IBM). Unlike multimedia extensions which use integer (HP PA-RISC MAX) or floating point registers (Sun VIS, Intel MMX), AltiVec adds an entire new set of 128-bit registers (enough for a vector of four 32-bit floating point numbers) and a separate vector execution unit and instruction set (four operand - three source, one result), supported by the complex PowerPC branch unit. That means that operating system software needs to be modified to preserve additional CPU state information (like the MIPS MDMX which adds a 192-bit accumulator to hold intermediate results, but uses 64-bit floating point registers for data), but it allows multimedia instructions to be executed in parallel with both integer and floating point operations, and to reduce the number of registers to save, an additional register (VRSAVE) is added to track which vector registers are being used - unused registers don't need to be stored. In addition to subword vector operations, AltiVec also includes permutation operations along the same lines as PA-RISC MAX instructions, and subword floating point operations like MIPS MDMX which can also perform vector multiplication allowing 3-D graphics support (see Appendix D) like the Hitachi SH4.
It's interesting to note that the AltiVec data formats are based primarily on Java standards (based on IEEE), then on IEEE, and lastly on ANSI C9X floating point standards. A "Java mode" provides strict adherence to these standards, a "Non-Jave mode" relaxes adherence to allow faster operations.
A very high clock rate (500MHz) BiCMOS version called the 704 (based on a simplified 604) was being developed in 1996 by Exponential Technologies, expanding on the type of technology which Intel found necessary to keep its Pentium and Pentium Pro CPUs competitive, but advances in CMOS and a slower initial product (410MHz) sharply reduced the clock speed advantages, cancelling the project (faster, lower power, fully CMOS Pentium and Pentium Pro CPUs have replaced earlier BiCMOS versions). IBM went so far as to produce a 1GHz integer-only demonstration version of a CMOS PowerPC, and used the PowerPC as the first product to replace aluminum conductors with lower resistance copper, boosting clock speeds by about 33%.
Overall, the POWER/PowerPC architecture is a very powerful, almost mainframe-like archetecture which could easily have fit into the "Wierd and Innovative" section, violating the traditional RISC philosophy of simplicity and fewer instructions (with over a hundred, including many duplicate which implicitly set CC bits and other which don't), versus only about 34 for the ARM and 52 for the Motorola 88000 (including FPU instructions)). The complexity is very effective, but has somewhat limited the clock speed of the designs (but less so than the even more complex Intel Pentium and Pentium II designs). It's an interesting tradeoff, considering that a highly parallel 71.5 MHz POWER2 managed to be faster than a 200MHz DEC Alpha 21064 of the same generation.
Alpha is a 64 bit architecture (32 bit instructions) that doesn't support 8- or 16-bit operations, but allows conversions, so no functionality is lost (Most processors of this generation are similar, but have instructions with implicit conversions). Alpha 32-bit operations differ from 64 bit only in overflow detection. Alpha does not provide a divide instruction due to difficulty in pipelining it. It's very much like the MIPS R2000, including use of general registers to hold condition codes. However, Alpha has an interlocked pipeline, so no special multiply/divide registers are needed, and Alpha is meant to avoid the significant growth in complexity which the R2000 family experienced as it evolved into the R8000 and R10000.
One of Alpha's roles is to replace DEC's two prior architectures - the MIPS-based workstations and VAX minicomputers (Alpha evolved from a VAX replacment project codenamed PRISM, not to be confused with the Apollo Prism acquired by Hewlett Packard). To do this, the chip provides both IEEE and VAX 32 and 64 bit floating point operations, and features Privileged Architecture Library (PAL) calls, a set of programmable (non-interruptable) macros written in the Alpha instruction set, similar to the programmable microcode of the Western Digital MCP-1600 or the AMD Am2910 CPUs, to simplify conversion from other instruction sets using a binary translator, as well as providing flexible support for a variety of operating systems.
Alpha was also designed for the future for a 1000-fold eventual increase in performance (10 X by clock rate, 10 X by superscalar execution, and 10 X by multiprocessing) Because of this, superscalar instructions may be reordered, and trap conditions are imprecise (like in the 88100). Special instructions (memory and trap barriers) are available to syncronise both occurrences when needed (different from the POWER use of a trap condition bit which is explicitly by software, but similar in effect. SPARC also has a specification for similar barrier instructions). And there are no branch delay slots like in the R2000, since they produce scheduling problems in superscalar execution, and compatibility problems with extended pipelines. Instead speculative execution (branch instructions include hint bits) and a branch cache are used.
The 21064 was introduced with one integer, one floating point, and
one load/store unit. The 21164 (Early 1995) began expanding instruction
parallelism by adding one integer/load/store
unit with byte vector (multimedia-type) instructions (replacing the
load/store unit) and one floating point unit, and
increased clock speed from 200 MHz to 300 MHz (still roughly twice that
of competing CPUs), and introduced the idea of a level 2 cache on
chip (8K each inst/data level 1, 96K combined level 2). The 21264
(mid 1998) expanded this to four integer units (two
add/logic/shift/branch (one also with multiply, one
with multimedia) and two add/logic/load/store), two different floating
point units (one for add/div/square root and one for multiply), with the
ability to load four, dispatch six, and retire eight instructions per
cycle (and for the first time including 40 integer and 40 floating
point rename registers
and out of order execution), at up to 500MHz.
The 21364 (expected 2000 or 2001) began the multiprocessor strategy by
adding five high speed interconnects (four CPU (10 GB/s), almost
identical in concept to the Transputer
CPUs, and one I/O (3 GB/s)) to an enhanced 21264 core.
Multimedia extensions introduced with the 21264 are simple, but
include VIS-type
motion estimation (MPEG).
DEC's Alpha is in many ways the antithesis of
IBM's POWER design, which gains performance from
complexity, and the expense
of a large transistor count, while the Alpha concentrates on the
original RISC idea of simplicity and a higher clock rate - though that
also has its drawback, in terms of very high power consumption.
|
Best viewed with Netscape Communicator
![]()
Last update: October 1998
The most recent version of this paper you could find at its original location