|
The GDP was exclusively object oriented - normal linear memory access wasn't allowed, and there was hardware support for data hiding, methods, inheritance, late binding, and access protection, and it was promoted as being ideal for the Ada programming language. To enforce this, permission and type checks for every memory access (via a 2 stage segmentation) slowed execution (despite cached segment tables). It supported up to 2^24 segments, each limited to 64K in size (within a 2^32 address space), but the object oriented nature of the design meant that was not a real limitation. The stack oriented design meant the GDP had no user data registers. Instructions were bit encoded (and bit-aligned in memory), ranging from 6 bits to 321 bits long (the T-9000 has variable length byte encoded/aligned instructions) and could be very complex.
The BIU defined the bus, designed for multiprocessor support allowing up to 63 modules (BIU or MCU) on a bus and up to 8 independent buses (allowing memory interleaving to speed access). The MCU did automatic parity checking and ECC error correcting. The total system was designed to be fault tolerant to a large degree, and each of these parts contributes to that reliability.
Despite these advanced features, the 432 didn't catch on. The main reason was that it was slow, sometimes up to five or ten times slower than a 68000 or Intel's own 80286. Part of this was the lack of local (user) data registers, or a data cache. Part of this was the fault-tolerant BIU, which defined an (asynchronous protocol) clocked bus that resulted in 25% to 40% of the access time being used by wait states. The instructions weren't aligned on bytes or words, and took longer to decode. In addition, the protections imposed on the objects slowed data access. Finally, the implementation of the GDP on two chips instead of one produced a slower product. However, the fact that this complex design was produced and bug free is impressive.
Its high level architecture was similar to the Transputer systems, but it was implemented in a way that was much slower than other processors, while the T-414 wasn't just innovative, but much faster than other processors of the time.
This is not the only processor designed specifically for a language that is slow on other CPUs. Several specialized LISP processors, such as the Scheme-79 lisp processor, were created, but this chip is unique in its object oriented features. It also manages to support objects without the slowness of the Intel 432.
The Rekursiv processor features a writable instruction set, and is
highly parallel. It uses 40 bits for objects, and 24 bit addressing,
kind of. Memory can't be addressed directly, only through the object
identifiers (segments),
which are 40 bit tags. The hardware handles all objects in
memory and on disk, and swapping them to disk. It has no real program -
all data and code/methods are embedded in the objects, and loaded when
a message is sent to them. There is a page table which stores the
object tags and maps them into memory.
There is a 64k area, arranged as 16k X 128 bit words, for
microcode, allowing an instruction set to be
constructed on the fly. It can change for different objects.
The CPU hardware creates, loads, saves, destroys, and manipulates
objects. The manipulation is accomplished with a standard
AMD 29203
CPU, but the other parts are specially designed. It executes LINGO
entirely fast enough, and is a perfect match between language and CPU,
but it can execute more conventional languages, such as Smalltalk or C
if needed - possible simultaneously, as separate complete objects.
Unfortunately, Linn did not have the resources to pursue this
very promising (the prototype was "surprisingly easy" to implement)
architecture.
The 320C30 is a 32 bit floating point DSP, based on the earlier 320C20/10 16 bit fixed point DSPs (1982). It has eight 40 bit extended precision registers R0 to R7 (32 bits plus 8 guard bits for floating, 32 bits for fixed), eight 32 bit auxiliary registers AR0 to AR7 (used for pointers) with two separate arithmetic units for address calculation, and twelve 32 bit control registers (including status, an index register, stack, interrupt mask, and repeat block loop registers).
It includes on chip memory in the form of one 4K ROM block, and two 1K RAM blocks - each bus has its own bus, for a total of three (compared to one instruction and one data bus in a Harvard architecture), which essentially function as programer controlled caches. Two arguments to the ALU can be from memory or registers, and the result is written to a register, through a 4 stage pipeline.
The ALU, address controller and control logic are separate - much clearer in the AT&T DSP32, ADSP 2100 and Motorola 56000 designs, and is even reflected in the MIPS R8000 processor FPU and IBM POWER architecture with its Branch Unit loop counter. The idea is to allow the separate parts to operate as independently as possible (for example, a memory access, pointer increment, and ALU operation), for the highest throughput, so instructions accessing loop and condition registers don't take the same path as data processing instructions.
The TMS320Cx0 series also includes the 320C8x (1994?), which can have two or four DSP cores on a single chip as well as a load-store CPU (thirty two 32-bit regs, load/store, plus FPU) for control.
The 320C6x (February 1997) expands this idea by integrating functional units more tightly with the 320C8x load-store core in variable length instruction groups (VLIG) - up to eight instructions can be dispatched (grouped at compile time into packages of eight instruction plus a template indicating which are independent and can be executed in parallel) to eight functional units (two multipliers and six DSP-type ALUs). Replacing hardware with compiler scheduling (some simple hardware support, in that all instructions are predicated like the ARM) requires only 550,000 transistors for the 320C6201, while improving performance by about 10 times, based on a FFT benchmark.
In fact, it's almost a completely different processor.
Like the TMS320C30, the 96002 has a separate program memory (RAM in this case, with a bootstrap ROM used to load the initial external program) and two blocks of data RAM, each with a separate data and address busses. The data blocks can also be switched to ROM blocks (such as sine and cosine tables). There's also a data bus for access to external memory. Separate units work independently, with their own registers (generally organised as three 32 bit parts of a single 96 bit register in the 96002 (where the '96' comes from).
The program control unit has a register containing 32 bit PC, status, and operating mode registers, plus 32 bit loop address and 32 bit loop counter registers (branches are 2 cycles, conditional branches are 3 cycles - with conditional execution support), and a fifteen element 64 bit stack (with separate 6 bit stack pointer).
The address generation unit has seven 96 bit registers, divided into three 32 bit (24 in the 56000/1) registers - R0-R7 address, N0-N7 offset, and M0-M7 modify (containing increment values) registers.
The Data Unit includes ten 96-bit floating point/integer registers, grouped as two 96 bit accumulators (A and B = three 32 bit registers each: A2, A1, A0 and B2, B1, B0) and two 64 bit input registers (X and Y = two 32 bit registers each: X1, X0 and Y1, Y0). Input registers are general purpose, but allow new operands to be loaded for the next instruction while the current contents are being used (accumulators are 8+24+24 = 56 bit in the 56000/1, where the '56' comes from). The DSP96000 was one of the first to perform fully IEEE floating point compliant operations.
The processor is not pipelined, but designed for single cycle independent execution within each unit (actually this could be considered a three stage pipeline). With multiple units and the large number of registers, it can perform a floating point multiply, add and subtract while loading two registers, performing a DMA transfer, and four address calculations within a two clock tick processor cycle, at peak speeds.
It's very similar to the Analog Devices ADSP2100 series - the latter has two address units, but replaces the separate data unit with three execution units (ALU, a multiplier, and a barrel shifter).
The DSP56K and 680xx CPUs have been combined in one package (similar idea as the TMS320C8x) in the Motorola 68456.
The DSP56K was part of the ill-fated NeXT system, as well as the lesser known Atari Falcon (still made in low volumes for music buffs).
The Minimum Instruction Set Computer (MISC) Inc. M17 CPU wasn't the first Forth microprocessor (the Novix NC4000/4016 (1985?) designed by Forth inventor Chuck Moore came before), but the M17 is a good example of low cost Forth CPUs. It featured two 16 bit stack pointers (Data and Return (subroutine) stacks), plus three 16-bit top of stack data registers (X, Y, Z, plus an extra LastX which could hold values popped from X). An I/O register buffered data during I/O while the ALU operated concurrently. Finally, there was an Index register which normally held the top element of the Return stack, but could also be used as a loop counter, and a 6 instruction buffer (for short loops, like the Motorola 68010).
Address space was 64K, but external memory could be either a single bank or up to five banks, signaled by status pins, depending on the context - data stack, return stack, program code, A or B buffers. Some other Forth processors include on chip stack memory, and while most (including the M17) were 16 bit, some 32 bit Forth processors have also been developed.
The simplicity of design allows the M17 (and most other Forth CPUs, such as the more recent 7,000 transistor MuP21 (also designed by Chuck Moore), which includes a composite video generator on chip) to execute instructions in only two cycles (load, execute), or one cycle each from the instruction cache, making them faster than more complex CPUs (though instructions do less, the higher clock speed usually compensates). Stack advocates often cite this as the strongest advantage for stack based designs, though critics contend that the state nature of stacks compared to registers make conventional speedup tricks such as pipelining and superscalar execution far more complex than using a register array. As it is, register-based load-store processors dominate when it comes to speed.
Other prominent Forth-based microprocessors include the Harris
RTX-2000, a descendant of the NC4016 (the "-2000" like the name of the
Motorola 68000 comes from the fact that it only
uses about 2000 gates in its design) which has the ability to group
certain instructions like the T-9000 Transputer
and microJava
processors. Chuck Moore went on to design the 20-bit MuP21, and is
involved in the highly
integrated F21 (expected late 1998/early 1999) CPUs. A 32 bit CPU, the
FRISC-3 (Forth Reduced Instruction Set Computer) was produced by
Silicon Composers and renamed the SC-32, and includes an automatic
stack-to-memory cache, eliminating the main weakness of Forth chips,
the fixed stack sizes.
[1] Sun Microelectronics' first slogan for its Java Processors was
"Casting Java in Silicon".
Instead of registers, a thirty-two entry 32 bit two ported stack cache is provided. This is similar to the stack cache of the AMD 29000 (in Hobbit it's much smaller (64 32-bit words) but is easily expandable), and Hobbit has no global registers. Addresses can be memory direct or indirect (for pointers) relative to the stack pointer without extra instructions or operand bits. The cache is not optimised for multiprocessors.
Hobbit has an instruction prefetch buffer (3K in 92010, 6K in the 92020), like the 8086, but decodes the variable length (1, 3 or 5 halfword (16 bit)) instructions into a thirty-two entry instruction cache. Branches are not delayed, and a prediction bit directs speculative branch execution. The decode unit folds branches into the decoded instructions (which include next and alternate next PC), so a predicted branch does not take any clock cycles. The three stage execution unit takes instructions from the decode cache. Results can be forwarded when available to any prior stage as needed.
Though CISC in philosophy, the Hobbit is greatly simplified compared to traditional memory-data designs, and features some very elegant design features. AT&T prefers to call it a RISC processor, and performance is comparable to similar load-store designs such as the ARM. Its most prominent use was in the EO Personal Communicator, a competitor to Apple's Newton which uses the ARM processor.
While the transputers were originally faster than their contemporaries, recent load-store designs have surpassed them. The T-9000 was an attempt to regain the lead. It starts with the architecture of the T-800 which contains only three 32 bit integer and three 64 bit floating point registers which are used as an evaluation stack - they are not general purpose. Instead, like the TMS 9900, it uses memory, addressed relative to the workspace register (the 9900 workspace contained only sixteen registers, the Transputer workspace can be any length, though access slows down with every 4 bits used for offset from the workspace register - sixteen bytes can be accessed with just one instruction, 256 needs two, and so on). This allows very fast context switching, less than a microsecond, speeding and simplifying process scheduling enough that it is automated in hardware (supporting two priority levels and event handling (link messages and interrupts)). The Intel 432 also attempted some hardware process scheduling, but was unsuccessful.
Unlike the TMS 9900, the T-9000 is far faster than memory, so the CPU has several levels of high speed caches and memory types. The main cache is 16K, and is designed for 3 reads and 1 write simultaneously. The workspace cache is based on 32 word rotating buffers, allows 2 reads and 1 write simultaneously.
Instructions are in bytes, consisting of 4 bit op code and 4 bit data (usually a 16 byte offset into the workspace), but prefix instructions can load extra data for an instruction which follows, 4 bits at a time. Less frequent instructions can be encoded with 2 (such as process start, message I/O) or more bytes (CRC calculations, floating point operations, 2D block copies and scheduler queue management). The stack architecture makes instructions very compact, but executing one instruction byte per clock can be slow for multibyte instructions, so the T-9000 has a grouper which gathers instruction bytes (up to eight) into a single CISC-type instruction then sent into the 5 stage pipeline (fetching four per cycle, grouping up to 8 if slow earlier instructions allow it to catch up). For example, two concurrent memory loads (simple or indexed), a stack/ALU operation and a store (a[i] = b[2] + c[3]) can be grouped.
The T-9000 contains 4 main internal units, the CPU, the VCP (handling the individual links of the previous chips, which needed software for communication), the PMI, which manages memory, and the Scheduler.
This processor is ideal for a model of parallel processing known as systolic arrays (a pipeline is a simple example). Even larger networks can be created with the C104 crossbar switch, which can connect 32 transputers or other C104 switches into a network hundreds of thousands of processors large. The C104 acts like a instant switch, not a network node, so the message is passed through, not stored. Communication can be at close to the speed of direct memory access.
Like the many CPUs, the Transputers can adapt to a 64, 32, 16, or 8 bit bus. They can also feed off a 5 MHz clock, generating their own internal clock (up to 50MHz for the T-9000) from this signal, and contain internal RAM, making them good for high performance embedded applications.
Unfortunately excessive delays in the T-9000 design (partly because of the stack based design) left it uncompetitive with other CPUs (roughly 36 MIPS at 50 MHz). The T-4xx and T-8xx architecture still exist in the SGS-Thomson ST20 microcore family.
At 100MHz, the microprocessing unit (MPU) executes about one instruction per cycle, without normal instruction/data caches. Byte instructions are loaded in groups of four (32 bits), and executed sequentially. The problem of loading constants is handled in a unique way. The 68000 and PDP-11 could load a constant stored in program memory following the current instruction, and the Hitachi SH uses a similar PC-relative mode to load constants. Processors like the Mips R3000 load half a constant at a time using two instructions. Transputers always contain 4 bits of data and 4 bits of op code in each byte instruction.
The ShBoom loads single bytes of data from the rightmost bytes of the current instruction group, and words from program memory following the current group. For example, a load byte instruction could be in position one, two or three from the left, the data would always be in the fourth (rightmost) byte. Four consecutive load word instructions would be grouped together, and the constants taken fromthe four 32 bit words following the group. This ensures data alignment without extra circuitry (but may get in the way in the future, such as for 64 bit versions).
There are sixteen 32 bit global registers (g0 to g15), a sixteen register local stack (r0 to r14 can be used as a stack frame (R15 is not user visible), or as a Forth return stack), and an eighteen element operand stack (s0 to s17, accessed only by data stack operations) - the stacks automatically spill and refill to and from memory, s0 and r0 can also be used as index registers, g0 is used for multiply and divide instructions. There's also an extra index register x, a loop counter ct, and a mode register (like a CC or PSW register).
The CPU also contains an I/O coprocessor on chip for simultanious I/O (much more advanced than the I/O buffer register of the M17, but the same idea), which communicates with the MPU via the global data registers. It's a simple, independent unit which executes small data transfer programs until I/O is complete. There are also a programmable memory interface, 8 channel DMA controller, and interrupt controller.
The ShBoom architecture is a very innovative and elegant attempt at combining stack and register oriented architectures, with emphasis on the stack operation simplicity. It would give Java a good home.
The picoJava I (early 1997) is a stack oriented CPU core like the JVM, with a 64 entry stack cache (similar to the Patriot Scientific ShBoom PSC1000), but there are interesting differences between it and Forth-style stack CPUs. Java only uses a single stack (like many languages such as C, which the AT&T Hobbit and AMD 29K were designed to support) and the picoJava CPU enhances performance with a 'dribbler' unit which constantly updates a complete copy of the stack cache in memory, without affecting other CPU operations (similar to a write-back cache), so stack frames can be added without waiting for a stack frame to be stored. Some Java instructions are complex, so the CPU has microcoded instructions, and a 4 stage pipeline (fetch, decode, execute/cache, stack writeback). Finally, picoJava groups (or 'folds') load and stack operations together, executing both at once (treating the top of stack as an accumulator) (this is a much simpler version of instruction grouping tried in the Transputer T-9000), This usually eliminates 60% of stack operation inefficiency. Seldom used instructions aren't implemented, but are emulated using trap handlers.
The picoJava II (October 1997) core is used in the first actual CPU from Sun, the microJava701. It extends the pipeline to 6 stages, and can four up to four instructions into one operation. It also adds a FPU and separate 16Kb I/D caches.
|
Best viewed with Netscape Communicator
![]()
Last update: October 1998
The most recent version of this paper you could find at its original location