          |
     Accumulator
          |
  Accumulator+Index
      /       \
 Stack...   Memory-Data (CISC)
                 |
            Load-Store (RISC)
At that time, RISC and "load-store" were often synonymous, but
RISC usually referred to a list of features:
By contrast, the CISC philosophy has been that if added hardware can result in an overall increase in speed, it's good - with the ultimate goal being to map every high level language statement onto a single CPU instruction. The disadvantage is that it's harder to increase the clock speed of a complex chip. The PowerPC is a good example of this idea applied to a load-store architecture.
A two stage pipeline was first introduced in the IBM 3033 (1977). Instructions were fetched from the cache into three 32 bit buffers. The Instruction Pre-Processing Function (IPPF) then decoded them, generated operand addresses and stored them in operand address registers, and placed source operands in operand buffers. Decoded instructions were placed into a 4 entry queue until the execution unit was ready.
In some models (such as 360/91) when a conditional branch occurs, the most likely next instruction is loaded into the IPPF buffer, but the previous next instruction is not discarded, so either can be executed without penalty. Two speculative branches can be buffered this way. Some had a "loop mode" like the Motorola 68010.
Addressing was originally 24 bit, but was extended to 31 bits (the high bit indicated whether to use 24 or 31 bit addressing) with the XA architecture. This caused problems with software which stored type information in the unused 8 bits of a 32 bit word (the same thing happened when the Motorola 68000 was expanded from 24 to 32 bit addressing). The S/360 used completely position independent (register+offset and register+index) addressing modes. Virtual memory was added in the S/370, and used a segment and paging method - the first 8 bits of an address select an entry in a segment table, which is added to the next 4 or 8 bits to get the index of a page table entry. That entry contains the upper (12 or 20) bits of the physical memory address, and the rest of the virtual address provides the lower 12 bits (the Intel 80386 uses a similar method, while the Motorola 68030 uses fixed length logical/physical pages instead of variable length segments).
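The segment and page lookup described above can be sketched in a few lines. This is only an illustration, assuming a 24 bit virtual address split 8/4/12 (segment index, page index, byte offset). The function name `translate`, the `segment_table` mapping and the flat `table_memory` holding page table entries are hypothetical, and do not reflect the real S/370 table formats.

```python
OFFSET_BITS = 12   # assumed: lower 12 bits pass through untranslated
PAGE_BITS = 4      # assumed: next 4 bits select a page table entry
SEGMENT_BITS = 8   # assumed: top 8 bits select a segment table entry


def translate(vaddr, segment_table, table_memory):
    """Translate a 24 bit virtual address into a physical address."""
    offset = vaddr & ((1 << OFFSET_BITS) - 1)
    page = (vaddr >> OFFSET_BITS) & ((1 << PAGE_BITS) - 1)
    segment = (vaddr >> (OFFSET_BITS + PAGE_BITS)) & ((1 << SEGMENT_BITS) - 1)

    # The segment table entry gives the origin of a page table; adding the
    # page index selects the page table entry.
    page_table_origin = segment_table[segment]
    frame = table_memory[page_table_origin + page]

    # The entry supplies the upper physical address bits, the offset the rest.
    return (frame << OFFSET_BITS) | offset


# Hypothetical tables: segment 0x12 has its page table at 0x100,
# and page 3 of that segment maps to physical frame 0xABC.
segment_table = {0x12: 0x100}
table_memory = {0x103: 0xABC}
physical = translate(0x123456, segment_table, table_memory)
```

Here the virtual address 0x123456 splits into segment 0x12, page 0x3 and offset 0x456, so the offset is carried unchanged into the physical address.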
Like the DEC VAX, the S/370 has been implemented as a microprocessor. The Micro/370 discarded all but 102 instructions (some supervisor instructions differed), with a coprocessor providing support for 60 others, while the rest are emulated (as in the MicroVAX). The Micro/370 had a 68000 compatible bus, but was otherwise completely unique (some legends claim it was a 68000 with modified microcode plus a modified 8087 as the coprocessor, others say IBM started with the 68000 design and completely replaced most of the core, keeping the bus interface, ALU, and other reusable parts, which is more likely).
More recently, with increased microprocessor complexity, a complete S/390 superscalar microprocessor with 64K L1 cache (at up to 350MHz, a higher clock rate than Intel's 200MHz Pentium Pro available at the time) has been designed.
The VAX was a 32 bit architecture, with a 32 bit address range (split into 1G sections for process space, process specific system space, system space, and unused/reserved for future use). Each process has its own 1G process and 1G process system address space, with memory allocated in pages.
It features sixteen user visible 32 bit registers. Registers 12 to 15 are special - AP (Argument Pointer), FP (Frame Pointer), SP and PC (user, supervisor, executive, and kernel modes have separate SPs in R14, like the 68000 user and supervisor modes). All these registers can be used for data, addressing and indexing. A 32 bit PSL (Processor Status Longword) keeps track of interrupt levels, program status, condition codes, and access mode (kernel (hardware management), executive (files/records), supervisor (interpreters), user (programs/data)).
The VAX 11 featured an 8 byte instruction prefetch buffer, like the 8086, while the VAX 8600 has a full 6 stage pipeline. Instructions mimic high level language constructs, and provide dense code. For example, the CALL instruction not only handles the argument list itself, but enforces a standard procedure call for all compilers. However, the complex instructions aren't always the fastest way of doing things. For example, code using the INDEX instruction ran 45% to 60% faster when the instruction was replaced by simpler VAX instructions. This was one inspiration for the RISC philosophy.
Further inspiration came from the MicroVAX (VAX 78032) implementation, since in order to reduce the architecture to a single (integer) chip, only 175 of the 304 instructions (and 6 of 14 native data types) were implemented (through microcode), while the rest were emulated - this subset included 98% of instructions in a typical program. The optional FPU implemented 70 instructions and 3 VAX data types, which was another 1.7% of VAX instructions. All remaining VAX instructions were only used 0.2% of the time, and this allowed MicroVAX designs to eventually exceed the speed of full VAX implementations, before being replaced by the Alpha architecture.
The CDC 6600 was a 60-bit machine ('bytes' were 6 bits each), with an 18-bit address range. It had eight 18 bit A and eight 18 bit B (address) registers and eight 60 bit X (data) registers, with useful side effects - loading an address into A1, A2, A3, A4 or A5 caused a load from memory at that address into registers X1, X2, X3, X4 or X5. Similarly, A6 and A7 registers had a store effect on X6 and X7 registers - loading an address into A0 had no side effects. As an example, to add two arrays into a third, the starting addresses of the source arrays could be loaded into A2 and A3 causing data to load into X2 and X3, the values could be added into X6, and the destination address loaded into A6, causing the result to be stored in memory. Incrementing A2, A3, and A6 (after adding) would step through the array. Side effects such as this are decidedly anti-RISC, but very nifty. This vector-oriented philosophy is more directly expressed in later Cray computers.
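The array addition described above can be mimicked with a toy model of the A and X registers. This is a rough sketch, not a CDC 6600 emulator: the class name `CDC6600Regs`, the method `set_a` and the helper `add_arrays` are made up for illustration, and the addition is plain integer addition masked to 60 bits rather than the machine's real arithmetic.

```python
WORD_MASK = (1 << 60) - 1   # X registers and memory words are 60 bits wide


class CDC6600Regs:
    """Toy model of the A (address) and X (data) register side effects."""

    def __init__(self, memory):
        self.memory = memory
        self.A = [0] * 8
        self.X = [0] * 8

    def set_a(self, n, address):
        """Load an address into register An, with its side effect."""
        self.A[n] = address
        if 1 <= n <= 5:
            self.X[n] = self.memory[address]    # A1-A5: load into X1-X5
        elif n >= 6:
            self.memory[address] = self.X[n]    # A6-A7: store from X6-X7
        # A0: no side effect


def add_arrays(cpu, source1, source2, destination, length):
    """Add two arrays into a third using only A register loads."""
    for i in range(length):
        cpu.set_a(2, source1 + i)                        # fetches X2
        cpu.set_a(3, source2 + i)                        # fetches X3
        cpu.X[6] = (cpu.X[2] + cpu.X[3]) & WORD_MASK     # simplified add
        cpu.set_a(6, destination + i)                    # stores X6


memory = [1, 2, 3, 10, 20, 30, 0, 0, 0]
cpu = CDC6600Regs(memory)
add_arrays(cpu, 0, 3, 6, 3)
```

Note that the loop body contains no explicit load or store operations at all, only address register updates and one add.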
Only one instruction could be issued per cycle, but multiple independent functional units in the CDC 6600 meant instruction execution in different units could overlap (a scoreboard register prevented instructions from issuing to a unit if the operands weren't available). The units weren't pipelined until the CDC 7600 (1969), at which point instructions could be issued without waiting for operands (they would wait for them in the functional unit if necessary). Compared to the variable instruction lengths of other machines, instructions were only 15 or 30 bits, packed within 30 bit half-words (a 30 bit instruction could not occupy the upper 15 bit "parcel" of one half-word and the lower 15 bits of the next, so the compiler would insert NOPs to align instructions) to simplify decoding (a RISC-like feature). Branches were 60-bit-word aligned. Like the DEC Alpha, there were no byte or character operations, until later versions added a CMU (Compare and Move Unit) for character, string and block operations.
The 801 had thirty two 32 bit registers, but no floating point unit/registers, and no separate user/supervisor mode, since it was an experimental system - security was enforced by the compiler. It implemented Harvard architecture with separate data and instruction caches, and had flexible addressing modes.
IBM tried to commercialise the 801 design starting in 1977 (before RISC workstations first became popular) with the ROMP CPU (Research OPD (Office Products Division) Mini Processor, 1986, first chips as early as 1981) used in the PC/RT workstation, but it wasn't successful. Originally designed for wordprocessor systems, changes to reduce cost included eliminating the caches and Harvard architecture (but adding 40 bit virtual memory), reducing registers to sixteen, variable length (16/32 bit) instructions (to increase instruction density), and floating point support via an adaptor to an NS32081 FPU (later, a 68881 or 68882 were available). This allowed a small CPU, only 45,000 transistors, but an average instruction took around 3 cycles.
The 801 itself morphed into an I/O processor for the IBM 3090 mainframes.
This wasn't the only innovative design developed by IBM which never saw daylight. Slightly earlier (around 1971?) the Advanced Computer System pioneered superscalar (seven issue) design, speculative execution, delayed condition codes, multithreading, imprecise traps and instruction streamed interrupts, and load/store buffers, plus compiler optimisation to support these features. It was expensive and incompatible with the System/360, so was not pursued, but many ideas did find their way into the expensive high end mainframes.
The Berkeley project also produced an instruction cache with some innovative features, such as instruction line prefetch that identified jump instructions, frequently used instructions compacted in memory and expanded upon cache load, multiple cache chips support, and bits to map out defective cache lines.
The Stanford MIPS project was the basis for the MIPS R2000, and as with the Berkeley project, there are close similarities. MIPS stood for Microprocessor without Interlocked Pipeline Stages, using the compiler to eliminate register conflicts. Like the R2000, the MIPS had no condition code register, and a special HI/LO multiply and divide register pair.
Unlike the R2000, the MIPS had only 16 registers, and two delay slots for LOAD/STORE and branch instructions. The PC and last three PC values were tracked for exception handling. In addition, instructions were 'packed' (like the Berkeley RISC), in that many instructions specified two operations that were dispatched in consecutive cycles (not decoded by the cache). In this way, it was a 2 operation VLIW, but executed sequentially. User assembly language was translated to 'packed' format by the assembler.
Being experimental, there was no support for floating point operations.
SOAR (Smalltalk On A RISC) modified the RISC II design to support Smalltalk.
Complex/ Simple/
CISC____________________________________________________________RISC
| 14500B*
4-bit | *Am2901
| *4004
| *4040
8-bit | 6800,650x *1802
| 8051* * *8008 * SC/MP
| Z8 * * *F8
| F100-L* 8080/5 2650
| * *NOVA * *PIC16x
| MCP1600* *Z-80 *6809 IMS6100
16-bit| *Z-280 *PDP11 80C166* *M17
| *8086 *TMS9900
| *Z8000 *65816
| *56002
| 32016* *68000 ACE HOBBIT Clipper R3000
32-bit|432 [3] 96002 *68020 * * * * *29000 * *ARM
| * *VAX * 80486 68040 *PSC i960 *SPARC *SH
| Z80000* * * TRON48 PA-RISC
| PPro Pent* [1]---*------- * *88100
| * * [2]--<860>-*--*----- * *88110
64-bit|Rekurs POWER PowerPC * CDC6600 *R4000
| 620* U-SPARC * *R8000 *Alpha
| R10000
[1] - About here, from left to right, the Swordfish and 68060.
Okay, an explanation. Since this is only a 2-dimensional graph, and I want to get a lot more across than that allows, design features 'pull' a CPU along the RISC/CISC axis, and the complexity of the design (given the number of bits and other considerations) also tugs it - thus much of the POWER's RISC-ness is offset by its inherently complex (though effective) design. And it also depends on my mood that day - hey, it's ultimately subjective anyway.
Because virtual machines have to be mapped on to the widest range of hardware possible, they have to make as few assumptions as they can (such as number of CPU registers in particular). This is the main reason why most virtual machines are stack based designs - almost all processors can implement one or two stacks fairly easily.
The inverse isn't true. Some programming languages are based entirely on stack operations (Forth), but most are based on stack frames (C, Pascal, and their common ancestor ALGOL), or patternless memory access (FORTRAN, Smalltalk). Forth processors are effective because of the simplicity which comes from eliminating non-Forth features, but implementing a stack frame can be a real headache.
The Forth virtual machine contains two stacks. The first is the data stack, which consists of 16 bit entries (double entries can hold 32 bit values). The second is the return stack, used to hold PC values during subroutines.
The Forth equivalent to an instruction is a 'word', and can either be a predefined operation, or a programmer defined word made up of a sequence of executable words (the Forth version of subroutines, similar to Smalltalk). Forth also allows a word to be deleted with the "forget" word, normally only used for interactive Forth development (the language INTERCAL also includes a FORGET statement, but it is used for more evil purposes). Operations typically pull operands from the stack and push the results back onto it, which reduces instruction size since operands don't need to be specified. A subroutine is called by pushing the operands and executing the subroutine word, which leaves the results in the stack.
Operations can be either 16 bit or 32 bit, but there are two cases where types can be mixed - mixed multiplication will multiply two 16 bit numbers and leave a 32 bit result on the stack, while mixed division will divide a 16 bit (top of stack) number into a 32 bit number, producing a 16 bit quotient and 16 bit remainder (note that these two operations are directly supported by the PDP-11 architecture). There are I/O instructions as well.
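To make the two stacks and the mixed operations concrete, here is a toy interpreter. It is a minimal sketch under several assumptions: cells are unsigned 16 bit values, there is no error handling, and only a handful of words are provided. The class `ForthVM` and its `define` method are hypothetical names (real Forth defines words with `:` and `;`), while `um*` and `um/mod` follow the usual Forth names for the unsigned mixed multiply and divide.

```python
MASK16 = 0xFFFF


class ForthVM:
    """Toy two-stack machine with unsigned 16 bit cells."""

    def __init__(self):
        self.data = []    # data stack of 16 bit cells
        self.ret = []     # return stack of saved (word body, position) pairs
        self.words = {}   # programmer defined words: name -> list of tokens

    def define(self, name, body):
        self.words[name] = body.split()

    def run(self, source):
        body, pc = source.split(), 0
        while True:
            if pc == len(body):
                if not self.ret:
                    return self.data
                body, pc = self.ret.pop()       # return from a defined word
                continue
            token = body[pc]
            pc += 1
            if token in self.words:
                self.ret.append((body, pc))     # remember where to resume
                body, pc = self.words[token], 0
            elif token == "dup":
                self.data.append(self.data[-1])
            elif token in ("+", "*"):
                b, a = self.data.pop(), self.data.pop()
                result = a + b if token == "+" else a * b
                self.data.append(result & MASK16)
            elif token == "um*":                # 16 x 16 -> 32 bit double
                b, a = self.data.pop(), self.data.pop()
                product = a * b
                self.data += [product & MASK16, product >> 16]  # low, high
            elif token == "um/mod":             # 32 / 16 -> remainder, quotient
                divisor = self.data.pop()
                high, low = self.data.pop(), self.data.pop()
                dividend = (high << 16) | low
                self.data += [dividend % divisor,
                              (dividend // divisor) & MASK16]
            else:
                self.data.append(int(token) & MASK16)


vm = ForthVM()
vm.define("square", "dup *")
result = list(vm.run("3 square 4 +"))
```

Calling `square` pushes the caller's position on the return stack and pops it again when the word's body runs out, while the operands never leave the data stack.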
The Forth two-stack machine has been implemented in the M17 CPU, among many others. The Transputer is stack oriented to a lesser extent (single evaluation stack only), and provides direct memory access abilities (for stack frames and other structures) without penalty.
As for Forth, although it has dedicated advocates, its explicit stack orientation and its lack of modularity limit the scale of Forth programs. One of the largest Forth efforts was an integrated operating system called Valdocs (Valuable Document System) on the Epson QX-10. The software remained buggy and couldn't be updated quickly enough for the machine to remain competitive - although you could just as easily blame the computer's Z-80 processor (since at the time the 8088 based IBM PC and 68000 based Apple Macintosh were being introduced) and difficulty in finding experienced Forth programmers. Whatever the cause, this soured the acceptance of Forth for large scale projects.
Pascal, like Algol and C, is a stack frame oriented language, and so the p-Machine is a stack oriented machine. Memory is arranged from the top down as follows: p-System operating system code, system stack (growing down), relocatable p-Code pool, system heap (growing up), a series of process stacks as needed (growing down), a series of global data segments, and the p-Machine interpreter. The code pool contains compiled procedure segments in a linked list. Segments can be swapped into and out of memory, and relocated - if the stack needs more space to grow, the highest code segment can be relocated below the code pool. Similarly if the heap needs more space, code segments can be relocated upwards, and if both stack and heap need memory, code segments can be swapped out of memory altogether.
The UCSD p-System used a 64K memory map (standard for microcomputers of the time), but could also keep code in a separate 64K bank, freeing up data memory. The p-System also defined terminal I/O, a simple file system, serial and printer I/O, and allowed other device drivers to be added like any other operating system. It included an interactive program development system (all written in Pascal).
Western Digital implemented the p-Machine in the WD9000 Pascal Microengine (1980), based on the WD MCP-1600 programmable processor.
The JVM contains a stack used for parameters and instruction operands as in Forth, and a 'vars' register which points to the memory segment containing any number of local variables (like the workspace register in the Transputers).
Data typing is strongly enforced - while in Forth pushing two integers on the stack and treating them as a double is allowed, the JVM prohibits this. Object oriented support is also defined in the JVM, but not the architectural mechanisms, so implementation can vary. Objects are dynamically linked and can be swapped in or out (similar to the UCSD p-Machine, but the p-Machine segments are not grouped like objects and methods, and must be part of the program being executed, while JVM objects can be linked from external sources at run time). The other main difference between the JVM and the p-Machine is that the JVM memory segments (heap (data) and method area (code)) are not tied to a memory map, but may be allocated any way the operating or run-time system supports. Apart from that, the concept and implementation are quite similar (including multitasking support).
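The combination of an operand stack, a separate block of local variables, and strict typing can be sketched as below. This is a simplified illustration, not the real JVM: types are checked while the code runs rather than by a verifier beforehand, integer overflow is ignored, and the `ipush` mnemonic and the `run` and `pop_typed` helpers are made up for the example (`iload`, `istore`, `iadd` and `ladd` borrow the names of real bytecodes).

```python
def pop_typed(stack, expected):
    """Pop a value, refusing it if its type tag is not the expected one."""
    kind, value = stack.pop()
    if kind != expected:
        raise TypeError(f"expected {expected}, found {kind}")
    return value


def run(code, num_locals):
    stack = []                          # operand stack of (type, value) pairs
    local_vars = [None] * num_locals    # the block the 'vars' register points to
    for op, *args in code:
        if op == "ipush":               # hypothetical: push an int constant
            stack.append(("int", args[0]))
        elif op == "istore":            # pop an int into a local variable
            local_vars[args[0]] = ("int", pop_typed(stack, "int"))
        elif op == "iload":             # push a local variable
            stack.append(local_vars[args[0]])
        elif op == "iadd":
            b, a = pop_typed(stack, "int"), pop_typed(stack, "int")
            stack.append(("int", a + b))
        elif op == "ladd":              # only accepts two longs
            b, a = pop_typed(stack, "long"), pop_typed(stack, "long")
            stack.append(("long", a + b))
        else:
            raise ValueError(f"unknown instruction {op}")
    return stack


program = [("ipush", 2), ("istore", 0), ("ipush", 3), ("iload", 0), ("iadd",)]
final_stack = run(program, num_locals=1)
```

Feeding two ints to `ladd` raises a `TypeError` in this sketch, which is the kind of reinterpretation that Forth would silently allow.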
The Java language relies heavily on garbage collection, which is accomplished using a background thread and is not part of the JVM itself.
One other thing about the Java Virtual Machine is that some versions need to run code of unknown reliability which has been transferred over networks, and so include security features to prevent a program from gaining unauthorised access to the computer that it's running on.
Sun intends to produce Java processors (starting with the picoJava CPU) to execute Java bytecode directly, faster than a virtual machine or recompiled code.
RISC processors use a load/store architecture instead - to add
memory to a register, it must be loaded into an intermediate register
first.
Smaller caches are faster, so often a small level 1 cache is used, with a larger but slower level 2 cache supporting it. Level 3 caches can even be used in some cases.
Some cache controllers monitor the memory bus to detect when a
cached memory value has been modified by another CPU, or a peripheral.
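Bus monitoring of this kind can be illustrated with a small model. It is a sketch only, assuming a simple invalidate-on-write policy with one value per cache line; the `Bus` and `SnoopingCache` classes and their methods are hypothetical names, and real controllers use more elaborate protocols.

```python
class SnoopingCache:
    """Cache that drops a line when it sees another bus master write to it."""

    def __init__(self, bus):
        self.bus = bus
        self.lines = {}                 # address -> cached value
        bus.caches.append(self)

    def read(self, address):
        if address not in self.lines:   # miss: fill the line from memory
            self.lines[address] = self.bus.memory[address]
        return self.lines[address]

    def snoop(self, address):
        self.lines.pop(address, None)   # stale copy: invalidate it


class Bus:
    def __init__(self, memory):
        self.memory = memory
        self.caches = []

    def write(self, address, value, source=None):
        """A write by some CPU or peripheral, seen by every other cache."""
        self.memory[address] = value
        for cache in self.caches:
            if cache is not source:
                cache.snoop(address)


bus = Bus({0x10: 1})
cache = SnoopingCache(bus)
before = cache.read(0x10)       # cached copy of the old value
bus.write(0x10, 2)              # a peripheral modifies memory
after = cache.read(0x10)        # line was invalidated, so memory is re-read
```

Without the `snoop` call the second read would keep returning the old cached value.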
Implementations generally use either 'horizontal' or 'vertical'
microcode, which differ mainly in number of bits. Microinstructions
include a condition code and jump address (jump if condition is true,
next instruction if false), and the operation to be performed.
In horizontal microcode, each operation bit triggers an individual
control line (simple CPU controller but large microcode storage). In
vertical microcode, the operation field is decoded to produce the
control signals (smaller microcode but more complex controller). Some
CPUs used a combination.
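The difference between the two styles can be shown with a toy control unit. This is a sketch under invented assumptions: the six control line names, the four operation codes and the function names are all hypothetical, chosen only to show that the same signals can come from a wide undecoded field or a narrow decoded one.

```python
# Hypothetical control lines of a very small CPU.
CONTROL_LINES = ["reg_read", "reg_write", "alu_add",
                 "alu_sub", "mem_read", "mem_write"]


def horizontal_signals(operation_bits):
    """Horizontal: one bit per control line, no decoding needed."""
    return {line for i, line in enumerate(CONTROL_LINES)
            if (operation_bits >> i) & 1}


# Vertical: a short operation code is looked up (decoded) by the controller.
VERTICAL_DECODE = {
    0: set(),
    1: {"reg_read", "alu_add", "reg_write"},
    2: {"reg_read", "alu_sub", "reg_write"},
    3: {"mem_read", "reg_write"},
}


def vertical_signals(operation_code):
    return VERTICAL_DECODE[operation_code]


def next_address(current, condition_true, jump_address):
    """Sequencing: jump if the condition is true, else fall through."""
    return jump_address if condition_true else current + 1


horizontal_width = len(CONTROL_LINES)                   # bits per operation
vertical_width = (len(VERTICAL_DECODE) - 1).bit_length()
```

In this example the horizontal operation field needs six bits while the vertical one needs two, at the cost of the lookup table standing in for the decoder.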
1: add r1,r2->r8
2: sub r8,r3->r3
3: add r4,r5->r8
4: sub r8,r6->r6
Instructions 1 and 3 can be executed in parallel if r8 is
renamed for one of them, and instructions 2 and 4 can then be
executed in parallel. Instruction 3 is executed before 2, out of the
order in which they appear in the program.
The circuitry required to keep track of renamed registers can be
complex.
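The four instruction example can be run through a simple renaming pass to see the conflicts disappear. This is a minimal sketch, assuming an unlimited supply of physical registers named p0, p1, and so on; the functions `rename` and `conflicts` are illustrative names, and real hardware also has to free and reuse physical registers.

```python
def rename(program):
    """Give every result a fresh physical register."""
    mapping = {}    # architectural register -> current physical register
    count = 0

    def fresh():
        nonlocal count
        name = f"p{count}"
        count += 1
        return name

    def current(register):
        if register not in mapping:
            mapping[register] = fresh()
        return mapping[register]

    renamed = []
    for op, source1, source2, dest in program:
        p1, p2 = current(source1), current(source2)   # read current names
        mapping[dest] = fresh()                       # new name for the result
        renamed.append((op, p1, p2, mapping[dest]))
    return renamed


def conflicts(earlier, later):
    """True if 'later' cannot run in parallel with or ahead of 'earlier'."""
    _, e1, e2, e_dest = earlier
    _, l1, l2, l_dest = later
    return (e_dest in (l1, l2)         # later reads earlier's result
            or l_dest in (e1, e2)      # later overwrites earlier's input
            or l_dest == e_dest)       # both write the same register


PROGRAM = [("add", "r1", "r2", "r8"),
           ("sub", "r8", "r3", "r3"),
           ("add", "r4", "r5", "r8"),
           ("sub", "r8", "r6", "r6")]
RENAMED = rename(PROGRAM)
```

After renaming, the only remaining conflicts are the true dependencies of instruction 2 on 1 and of instruction 4 on 3, which is why the pairs 1 and 3, then 2 and 4, can be issued together.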
[X, Y, Z, W]
...where X, Y, and Z are the point's 3-D coordinates, and W is the 'weight', used to normalise the result after an operation by multiplying each element by 1/W so that W ends equal to 1.
Points can be moved around by matrix multiplication with 4x4 transformation matrices. Multiplying a vector with a matrix produces a new vector, which is the transformed point. Standard transformation matrices are:
Identity (does not transform point):
[ 1 0 0 0 ]
[ 0 1 0 0 ]
[ 0 0 1 0 ]
[ 0 0 0 1 ]
Translate (move along X, Y, Z axes):
[ 1 0 0 0 ]
[ 0 1 0 0 ]
[ 0 0 1 0 ]
[ Tx Ty Tz 1 ]
Scale (translate to larger or smaller coordinates):
[ Sx 0 0 0 ]
[ 0 Sy 0 0 ]
[ 0 0 Sz 0 ]
[ 0 0 0 1 ]
Rotate (around X, Y, or Z axis by angle U):
Axis X:                  Axis Y:                  Axis Z:
[ 1    0     0    0 ]    [ cosU  0  -sinU  0 ]    [ cosU  sinU  0  0 ]
[ 0   cosU  sinU  0 ]    [  0    1    0    0 ]    [-sinU  cosU  0  0 ]
[ 0  -sinU  cosU  0 ]    [ sinU  0   cosU  0 ]    [  0     0    1  0 ]
[ 0    0     0    1 ]    [  0    0    0    1 ]    [  0     0    0  1 ]
Perspective (d is the distance of "eye" behind "screen"):
[ 1 0 0 0 ]
[ 0 1 0 0 ]
[ 0 0 1 0 ]
[ 0 0 1/d 0 ]
Transformation matrices can be combined by multiplying them
together, so a single matrix can be used to shift, rotate, and scale a
point in a single operation. Other 3-D operations using vectors are
also frequently used, such as to determine intersection points or
the reflection of light rays.
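Combining transformations can be tried out with plain lists. This sketch assumes the row vector convention implied by the translation matrix above (the point is multiplied on the left, [X Y Z W] times the matrix); the function names are illustrative, and the perspective matrix is left out.

```python
import math


def mat_mul(a, b):
    """Multiply two 4x4 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def transform(point, m):
    """Row vector times matrix: [X Y Z W] * M."""
    return [sum(point[k] * m[k][j] for k in range(4)) for j in range(4)]


def normalise(point):
    """Multiply each element by 1/W so that W ends equal to 1."""
    w = point[3]
    return [c / w for c in point]


def translate(tx, ty, tz):
    return [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [tx, ty, tz, 1]]


def scale(sx, sy, sz):
    return [[sx, 0, 0, 0], [0, sy, 0, 0], [0, 0, sz, 0], [0, 0, 0, 1]]


def rotate_z(u):
    c, s = math.cos(u), math.sin(u)
    return [[c, s, 0, 0], [-s, c, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]


# Scale by 2, rotate 90 degrees around Z, then move 10 along X,
# all folded into one matrix that is applied in a single operation.
combined = mat_mul(mat_mul(scale(2, 2, 2), rotate_z(math.pi / 2)),
                   translate(10, 0, 0))
moved = transform([1, 0, 0, 1], combined)
```

With row vectors the matrices are multiplied in the order the transformations are applied, left to right; the point (1, 0, 0) ends up at approximately (10, 2, 0).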
NEW
PRODUCTS
FEATURE PRODUCT
COMPUTER ON A CHIP
Intel has introduced an integrated CPU complete with
a 4-bit parallel adder, sixteen 4-bit registers, an accumula-
tor and a push-down stack on one chip. It's one of a
family of four new ICs which comprise the MCS-4 micro
computer system--the first system to bring the power and
flexibility of a dedicated general-purpose computer at low
cost in as few as two dual in-line packages.
MCS-4 systems provide complete computing and control
functions for test systems, data terminals, billing
machines, measuring systems, numeric control systems
and process control systems.
The heart of any MCS-4 system is a Type 4004 CPU,
which includes a set of 45 instructions. Adding one or
more Type 4001 ROMs for program storage and data
tables gives a fully functioning micro-programmed com-
puter. Add Type 4002 RAMs for read-write memory and
Type 4003 registers to expand the output ports.
Using no circuitry other than ICs from this family of
four, a system with 4096 8-bit bytes of ROM storage and
5120 bits of RAM storage can be created. For rapid
turn-around or only a few systems, Intel's erasable and
re-programmable ROM, Type 1701, may be substituted
for the Type 4001 mask-programmed ROM.
MCS-4 systems interface easily with switches, key-
boards, displays, teletypewriters, printers, readers, A-D
converters and other popular peripherals. For further
information, circle the reader service card 87 or call Intel
at (408) 246-7501.
Circle 87 on Reader Service Card
COMPUTER/JANUARY/FEBRUARY 1972/71
There was also an ad for the 4004 in Electronic News, Nov.
1971.
The age of the affordable computer.
MITS announces the dawning of the Altair 8800
Computer. A lot of brain power at a price that's
bound to create love and understanding. To say
nothing of excitement.
The Altair 8800 uses a parallel, 8-bit processor
(the Intel 8080) with a 16-bit address. It has 78
basic machine instructions with variances over 200
instructions. It can directly address up to 65K bytes
of memory and it is fast. Very fast. The Altair
8800's basic instruction cycle time is 2 microseconds.
Combine this speed and power with Altair's
flexibility (it can directly address 256 input and 256
output devices) and you have a computer that's
competitive with most mini's on the market today.
The basic Altair 8800 Computer includes the
CPU, front panel control board, front panel lights
and switches, power supply (enough to power any
additional cards), and expander board (with room
for 3 extra cards) all enclosed in a handsome, alum-
inum case. Up to 16 cards can be added inside the
main case.
Options now available include 4K dynamic mem-
ory cards, 1K static memory cards, parallel I/O
cards, three serial I/O cards (TTL, R232, and TTY),
octal to binary computer terminal, 32 character
alpha-numeric display terminal, ASCII keyboard,
audio tape interface, 4 channel storage scope (for
testing), and expander cards.
Options under development include a floppy disc
system, CRT terminal, line printer, floating point
processor, vectored interrupt (8 levels), PROM
programmer, direct memory access controller and
much more.
PRICE
Altair 8800 Computer: $439.00* kit
$621.00* assembled
prices and specifications subject to change without notice
For more information or our free Altair Systems
Catalogue phone or write: MITS, 6328 Linn N.E.,
Albuquerque, N.M. 87108, 505/265-7553.
*In quantities of 1 (one). Substantial OEM discounts available.
[Picture of computer, with switches and lights]
A bubble can be formed by reversing the field in a small spot, and can be destroyed by increasing the field.
The bubbles are anchored to tiny magnetic posts arranged in lines, usually in a 'V V V' or a 'T T T' pattern. Another magnetic field is applied across the chip, which is picked up by the posts and holds the bubble. The field is rotated 90 degrees, and the bubble is attracted to another part of the post. After four rotations, a bubble gets moved to the next post:
o o o
\/ \/ \/ \/ \/ \/ \/ \/
o
o_|_ _|_ _|_ _|_ _|_o _|_ _|_ o _|_ _|_ o_|_
| o | | | |
I hope that diagram makes sense.
These bubbles move in long thin loops arranged in rows. At the end of the row, the bits to be read are copied to another loop that shifts to read and write units that create or destroy bubbles. Access time for a particular bit depends on where it is, so it's not consistent.
One of the limitations of bubble memories, and the reason they were superseded, was the slow access. A large bubble memory would require large loops, so accessing a bit could require cycling through a huge number of other bits first. The speed of propagation is limited by how fast magnetic fields could be switched back and forth, a limit of about 1 MHz. On the plus side, they are non-volatile, but EEPROMs, flash memories, and ferroelectric technologies are also non-volatile and are faster.
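The uneven access time is easy to see in a model of a single loop. This is a toy sketch, assuming the loop only shifts in one direction and that one read station sits at a fixed point on it; the `BubbleLoop` class and its `read` method are illustrative names, not a description of any real device.

```python
class BubbleLoop:
    """A loop of bits that must be shifted round to a fixed read station."""

    def __init__(self, bits):
        self.bits = list(bits)
        self.position = 0    # index of the bit currently at the read station

    def read(self, index):
        """Return (bit, shifts): the value and how far the loop had to move."""
        shifts = (index - self.position) % len(self.bits)
        self.position = index
        return self.bits[index], shifts


loop = BubbleLoop([1, 0, 1, 1, 0, 0, 1, 0])
first = loop.read(3)     # three steps forward from the start
second = loop.read(2)    # one step 'back' means going all the way round
```

Reading bit 2 just after bit 3 costs seven shifts in an eight bit loop, and the penalty grows with the length of the loop.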
Core memories consist of ferromagnetic rings strung together on tiny wires. The wires will induce magnetic fields in the rings, which can later be read back. Usually reading this memory will erase it, so once a bit is read, it is written back. This type of memory is expensive because it has to be constructed physically, but is very fast and non-volatile. Unfortunately it's also large and heavy, compared to other technologies.
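The read-then-rewrite cycle can be sketched as follows. This is a simplified model, assuming that sensing a core always leaves it holding 0; the `CoreMemory` class and its method names are made up for illustration.

```python
class CoreMemory:
    """Toy core memory: sensing a bit destroys it, so reads write it back."""

    def __init__(self, size):
        self.cores = [0] * size

    def write(self, address, bit):
        self.cores[address] = bit

    def sense(self, address):
        """Raw destructive read: the core is left holding 0."""
        bit = self.cores[address]
        self.cores[address] = 0
        return bit

    def read(self, address):
        """Normal read cycle: sense the bit, then restore it."""
        bit = self.sense(address)
        self.write(address, bit)
        return bit


memory = CoreMemory(4)
memory.write(1, 1)
values = [memory.read(1), memory.read(1)]    # survives repeated reads
raw = [memory.sense(1), memory.sense(1)]     # without write-back it is lost
```

The second raw sense returns 0 because nothing restored the bit after the first one.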
Ferroelectric materials retain an electric field rather than a magnetic field. Like core memories, they are fast and non-volatile, but bits have to be rewritten when read. Unlike core memories, ferroelectric memories can be fabricated on silicon chips.
Legend reports that a Swedish jet prototype (the Viggen I believe) once crashed, but the magnetic tape flight recorders weren't fast enough to record the cause of the crash. The flight computers used core memory, though, so they were hooked up and read out, and they still contained the data from microseconds before the crash occurred, allowing the cause to be determined. A similar trick was used when investigating the crash of the Space Shuttle Challenger.
On a similar note, the IBM 7740 communication controller was shipped with diagnostics code in its core memory, so it could be checked out on arrival without a host machine being operational.
Interestingly enough, newer flight recorders have replaced magnetic tape with flash memory, which is a newer and more reliable form of EEPROM (Electrically Erasable Programmable ROM). This actually has nothing to do with either ferromagnetic or ferroelectric memories, though. Oh well, this is an appendix. Who reads appendices anyway?