RISC: The Processor Architecture of the Future

Introduction

In this essay I shall argue the benefits of the RISC school of processor design over more traditional instruction set architectures, while at the same time telling the story of RISC's development in the wider context of the history of computers.

RISC stands for 'Reduced Instruction Set Computer', the basic premise of which is as follows: Conventional processors spend most of their time executing only a small subset of simple instructions. By providing only these commonly used instructions and abstaining from complex patterns of memory access, RISC processors can be simpler and hence run faster and more efficiently.

The story of RISC really begins with the early computers of the 1940s such as Colossus and ENIAC, where there were no stored programs at all - only hardware patch-panels. Programming these machines was a matter of altering the physical configuration of the circuits.

The subsequent development of stored-program machines such as the Manchester Mark I allowed programs to be loaded into memory along with data. Now programs could be written in machine code, or its symbolic equivalent - assembly language. However, writing software in assembly language with a simple instruction set is a laborious and time-consuming process: "It is like directing someone by telling him which muscles to move" (Ceramalus, 1998).

In order to make life easier for the programmers, hardware designers began providing more and more powerful instructions - a trend that would continue until the end of the 1970s. The instructions being built into machines became so complex that they had to be realised using microcode rather than random logic: high-level instructions were interpreted into sequences of microinstructions, each designed to execute in a single cycle.

Another strategy that was developed to ease life for the programmers was the creation of high-level languages, where programs were interpreted or compiled into low-level machine instructions. The full significance of this development was not initially realised by the hardware engineers, however, who continued designing computers to be programmed in assembly language.

The 801 project

The birthplace of the modern-day RISC architecture was undoubtedly IBM's Thomas J. Watson Research Center, where in 1974 a team of engineers began work on a project to design a large telephone switching network capable of handling 300 calls per second. With stringent real-time response requirements, an ambitious performance target of 12 MIPS was set. "This specialised application required a very fast processor, but did not have to perform complicated instructions and had little demand for floating-point calculations." (Cocke and Markstein, 1990)

The telephone project was terminated in 1975, with no machine having been built. However, the design had a number of promising features that seemed an ideal basis for a high-performance minicomputer with a good cost/performance ratio: instructions simple enough to execute in a single cycle, register-to-register operations with memory accessed only through load and store instructions, and heavy reliance on an optimising compiler.

These are still the basic principles underlying RISC machines today, although of course at the time that acronym had not yet been invented. Instead, the new minicomputer project was named '801' after the number of the building where the research was taking place.

Contemporary machines such as the System/370 had a large number of complex instructions, the prevailing wisdom in the 1970s being that "...the more instructions you could pack into a machine the better..." (Ceramalus, 1998). However, programmers and compilers were making little or no use of a large number of these instructions. As Radin (1983) explains "...in fact these instructions are often hard to use, since the compiler must find those cases which exactly fit the architected construct."

It was not until researchers analysed the vast amount of data that IBM had gathered on instruction frequencies that this waste was noticed: "...it was clear that LOAD, STORE, BRANCH, FIXED-POINT ADD, and FIXED-POINT COMPARE accounted for well over half of the total execution time in most application areas." (Cocke and Markstein, 1990)
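
To make the nature of such an analysis concrete, the following minimal C sketch tallies the share of execution time taken by each opcode in a toy instruction trace. The trace format, the cycle counts and the opcode mix are all invented for illustration (the sources do not describe IBM's actual tooling); only the conclusion they mimic - that a handful of simple instructions dominates - comes from the sources.

#include <stdio.h>
#include <string.h>

/* Hypothetical trace record: an opcode name and the cycles it consumed. */
struct trace_entry { const char *opcode; unsigned cycles; };

int main(void)
{
    static const struct trace_entry trace[] = {
        { "LOAD", 2 }, { "ADD", 1 }, { "LOAD", 2 },
        { "BRANCH", 2 }, { "STORE", 2 }, { "MVCL", 40 },
    };
    const int n = sizeof trace / sizeof trace[0];
    unsigned long total = 0;
    for (int i = 0; i < n; i++)
        total += trace[i].cycles;

    /* Tally the share of execution time per opcode (a linear scan
       is fine for a sketch of this size). */
    for (int i = 0; i < n; i++) {
        int seen_before = 0;
        for (int j = 0; j < i; j++)
            if (strcmp(trace[j].opcode, trace[i].opcode) == 0)
                seen_before = 1;
        if (seen_before)
            continue;
        unsigned long cycles = 0;
        for (int j = i; j < n; j++)
            if (strcmp(trace[j].opcode, trace[i].opcode) == 0)
                cycles += trace[j].cycles;
        printf("%-8s %5.1f%% of execution time\n",
               trace[i].opcode, 100.0 * cycles / total);
    }
    return 0;
}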

The key realisation of the 801 team was that not only were the complex high-level instructions rarely used, but they were having a "...pernicious effect on the primitive instructions..." (Radin, 1983).

If the presence of complex instructions adds extra logic levels to the basic machine cycle, or if instructions have to be interpreted into microcode, then the whole CPU is slowed down: "Imposing microcode between a computer and its users imposes an expensive overhead in performing the most frequently executed instructions." (Cocke and Markstein, 1990)

Since the goal of the 801 was to execute the most frequently used instructions as fast as possible, many of the more complex System/370 instructions were deliberately excluded from the instruction set. This left a machine that was "...very similar to a vertical microcode engine..." (Cocke and Markstein, 1990), "...but instead of 'hiding' this attribute behind a complex instruction set in microcode, we exposed it directly to the end user."

Apart from efficiency gains in the execution of the simplest instructions, programming complex functions as macros or procedures rather than in hardware was found to have interesting advantages: a function implemented in software cost nothing unless it was actually used, it could be tailored by the compiler to each particular call, and it could be corrected or improved without any change to the hardware.

Of course there were also some disadvantages to dispensing with microcode: the equivalent code sequences threatened to occupy more memory, and instructions had to be fetched from main storage rather than from a fast dedicated control store.

In order to match the performance characteristics of vertical microcode, the 801 fetched instructions via a high-speed cache with "least-recently-used" replacement, in which all frequently used functions were likely to be found. In any case, the simplified pipeline allowed instructions to be fetched more easily.

In the event, serious problems with the code density of RISC machines did not materialise. "The code sequences were not unduly long or unnatural. In later years, path-length comparisons between RISC and CISC architectures have been shown to be very nearly equal." (Cocke and Markstein, 1990)

This result stems partly from the comparative rarity of complex instructions in conventional code, but also because the 801's regular instruction format and large number of registers allowed the compiler to perform greater optimisation. In any case, at the time "...memory became cheaper... and the motivation for making small but really powerful instructions faded." (Ho, 1998)

When the 801 minicomputer was eventually built in 1978 it was IBM's fastest experimental processor. However, the project was terminated by IBM in 1980 without the 801 ever reaching the market.

Berkeley RISC

In 1980, at the University of California, Berkeley, Dr David Patterson and his team began a Reduced Instruction Set Computer project, in order to investigate "...an alternative to the general trend toward computers with increasingly complex instruction sets..." (Patterson and Sequin, 1981).

Their objective was the design of a VLSI computer that would reduce "...the delay-power penalty of data transfers across chip boundaries...", and make better use of "...the still-limited amount of resources (devices) available on a single chip." This revolutionary machine was named 'RISC-I' - the first time that this acronym had been used.

We cannot know for certain whether or not Patterson truly had heard "...rumours of the 801...", as Ceramalus (1998) suggests, but he was certainly averse to the complexity of architectures such as the Intel iAPX-432, citing "...increased design time, increased design errors and inconsistent implementations." (Patterson and Sequin, 1981)

Ceramalus (1998) is also voluble in his condemnation of the iAPX-432 as a system that "...represents the height of CISC lunacy." The ill-fated iAPX reportedly ran "...slower than older 8-bit systems...", being spread across 12 chips, with 222 instructions that varied in length between 6 and 321 bits!

At Berkeley, they independently came to similar conclusions about desirable architectural constraints as had the 801 project team, although at the time none of that research had been published: execute one instruction per cycle, make all instructions the same size, access memory only through load and store instructions, and support high-level languages.

Like Radin (1983) before them, Patterson and Sequin (1981) observed that "...this simplicity makes microcode control unnecessary. Skipping this extra level of interpretation appears to enhance performance while reducing chip size." Furthermore, "...the RISC programs were only about 50% larger than the programs for the other machines, even though size optimisation was virtually ignored."

Apart from these general RISC findings, the main contribution of the Berkeley project to the field was the invention of overlapping sets of register banks (or 'register windows') to enable parameters to be passed directly to subroutines. This system was developed with the goal that "...procedure CALL must be as fast as possible...", because "...the complex instructions found in CISCs are subroutines in RISC..." (Patterson and Sequin, 1981)

The register window scheme divided the register set into four groups: GLOBAL registers, common to all procedures; HIGH registers, containing parameters passed from 'above' the current procedure; LOCAL registers, used for local variables; and LOW registers, used to pass parameters to procedures 'below'. On a procedure call, the hardware overlaps the register windows so that the caller's LOW registers appear as the called procedure's HIGH registers.

This innovative approach avoids the time-consuming operations of saving registers to memory on procedure call, and restoring them on return (thus making the desired saving in "...data transfers across chip boundaries..."). The system was later adopted for Sun's SPARC architecture.
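
To illustrate the mechanism, here is a minimal C sketch of how overlapping windows can map each procedure's register numbers onto a single large physical register file. The sizes and names are my own invention, loosely following the RISC-I scheme rather than reproducing the actual Berkeley design:

#include <stdio.h>

/* Hypothetical window layout, loosely after Berkeley's RISC-I: each
   procedure sees 8 GLOBAL, 8 HIGH, 8 LOCAL and 8 LOW registers, and
   consecutive windows overlap so that the caller's LOW registers are
   the callee's HIGH registers. All sizes here are illustrative. */
#define GLOBALS  8     /* virtual r0..r7, common to all procedures  */
#define WINDOW   24    /* HIGH + LOCAL + LOW seen by one procedure  */
#define OVERLAP  8     /* caller's LOW == callee's HIGH             */
#define PHYS     128   /* size of the physical register file        */

static int phys[PHYS]; /* the single large register file            */
static int cwp = 0;    /* current window pointer (call depth)       */

/* Map a virtual register number to a physical one. A real machine
   would also spill the oldest window to memory when the file fills. */
static int reg(int v)
{
    if (v < GLOBALS)
        return v;                                /* shared globals   */
    return GLOBALS + cwp * (WINDOW - OVERLAP) + (v - GLOBALS);
}

int main(void)
{
    /* The caller places a parameter in its first LOW register,
       which is virtual register 24 in this layout. */
    phys[reg(24)] = 42;

    cwp++;  /* procedure CALL: just slide the window pointer on */

    /* The callee reads its first HIGH register (virtual r8) and finds
       the same physical register - the parameter has been passed with
       no saving or restoring of registers to memory. */
    printf("parameter seen by callee: %d\n", phys[reg(8)]);

    cwp--;  /* RETURN: slide the window back */
    return 0;
}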

Patterson and Sequin (1981) conclude that having "...taken out most of the complexity of modern computers... we can build a single-chip computer much sooner than the traditional architectures..."

One such modern-day VLSI computer is the ARM7500FE 'system chip', which integrates a RISC processor, FP co-processor, video/sound controller and memory/IO controller. The fact that this is possible supports Dr. Patterson's scepticism about whether "...the extra hardware needed to implement CISC is the best way to use [the limited number of transistors available on a single chip]..." - It isn't!

Of course, the actual manner in which the spare 'room' on RISC chips is used is a matter for the designers. In the case of the the high-performance Digital Alpha 21164 (of which more later), the extra space allows a large 96KB on-chip level 2 cache, which inflates the transistor count from a modest 1.8 million to 9.3 million.

The road to acceptance

Although the IBM 801 project had been terminated in 1980, it had indirectly seeded the industry: Various disgruntled members of the team went to other companies, and rumours of the project began to emerge. However, nothing of the actual research was published until 1982.

Apparently these managerial decisions caused a considerable degree of acrimony: Joel Birnbaum was so angry at the cancellation of the 801 project that he went to Hewlett-Packard and started them on the road toward their own PA-RISC architecture. Ahmed Chibib, who worked on compiler technology at IBM, recalled: "We had a 3 MIPS RISC PC on paper in 1978 and we could have launched it in 1979" (quoted in Ceramalus, 1998).

Whatever the truth of such claims, the fact is that IBM chose the Intel 8088 to be the heart of their PC, which was launched shortly afterwards. The 8088 was little different to a processor invented by Datapoint in 1967/8, and ARM's David Jagger scornfully describes it as "The worst chip they could have chosen" (quoted in Ceramalus, 1998). Ceramalus (1998) theatrically describes IBM's decision as "...the world's Great Leap Backward into the Intel-based PC..."

Behind the scenes, however, work on RISC systems continued, with the MIPS architecture growing out of research done at Stanford University, and Sun's SPARC processors based on Berkeley's register-window designs. In the mid-eighties there was a sudden flurry of announcements from Hewlett-Packard, Sun and MIPS Computer Systems as all this R&D finally reached the marketplace. Spurred into action, IBM hurriedly dusted off their own RT (RISC Technology) workstation, which had been languishing since the beginning of the 80s.

The 1970s had been the heyday of CISC, but during the 1980s the market for high-end workstations became dominated by RISC computers. Ceramalus (1998) makes unflattering comparison between the MIPS R3000 (20 MIPS on 115,000 transistors) and its Intel contemporary the 80386 (4 MIPS on 350,000 transistors).

Ho (1998) describes how people began to associate RISC with buzzwords such as "superscalar, lots of registers, fast floating point performance". Meanwhile CISC was becoming associated with unflattering characteristics such as "segmented memory model, few registers and crappy floating point performance". (Although as Ho points out, these were really characteristics specific to Intel's chips of the 1980s.)

With the launch of the ARM-based Archimedes in 1987, Ahmed Chibib's vision of a RISC PC was finally realised. However, in the intervening eight years the Intel x86 CISC PC had grown to dominate the microcomputer landscape, and would continue to do so for decades to come.

Are RISC and CISC architectures converging?

Today, some people argue that the gap between CISC and RISC architectures has become blurred to the extent that the distinction is no longer relevant. This point of view is highlighted by Ho (1998), who states that "Intel's machines still run the old instruction set... but they do so with otherwise RISC-like characteristics", while "RISC-like machines from Sun and others have added more instructions to their architectures".

First, let me tackle the question of RISC machines acquiring what Mashey (1997) terms "...certain baggage that they'd wish weren't there...".

Even in the early Berkeley RISC research there was compromise on the goal of one instruction execution per cycle: "We decided to make an exception to our constraint of single-cycle execution rather than extend the general cycle to permit a complete memory access" (Patterson and Sequin, 1981).

Interestingly, Radin (1983) says of the pioneering 801 project that "We have no objection to [implementing a complex function in random logic], provided the frequency of use justifies the cost and, more importantly, provided these complex instructions in no way slow down the primitive instructions". Of the IBM RS/6000 that evolved from the 801, Cocke and Markstein (1990) admit that the instruction set "has been enhanced... with some decidedly complex instructions", and that "no one-cycle instructions have led to acceptable performance for floating-point computation".

The arguments for convergence from the other direction are generally based on the perceived "RISC-like characteristics" of more recent x86 CISC processors. Whilst the instruction-set architecture of modern PCs dates from the archaic 8088 "...with its warts: variable-length, variable memory interface, lots of types of instructions..." (Ho 1998), the latest implementations use an aggressive design to overcome these limitations.

In a recent case study, Bhandarkar compared the performance of two organisationally similar "state of the art implementations" from the RISC and CISC architectural schools: the Digital Alpha 21164 and the Intel Pentium Pro. Whilst a previous paper had found that the advantage (in cycles per program) of a MIPS M/2000 over a Digital VAX 8700 ranged from a factor of 2 to almost a factor of 4, this new study concludes that the Pentium Pro achieves 80% to 90% of the performance of the Alpha on integer benchmarks and transaction processing workloads. How has this performance breakthrough been achieved?

The Pentium Pro implements an out-of-order, speculative execution engine, with register renaming and memory access reordering. The flow of CISC instructions is predicted and decoded into micro-operations (uops). These uops are register-renamed and placed into an out-of-order pool of pending operations. It is this RISC-like use of simple uops, memory access reordering and a large set of 40 physical registers that has led some to describe modern x86 processors as "RISC implementations of CISC architectures" (Mashey, 1997).
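
As an illustration of the idea (the real uop encodings are proprietary, so the format and register numbers in this C sketch are invented), a single x86 instruction with a memory destination, such as ADD [EBX], EAX, might be decoded into three simple, RISC-like micro-operations:

#include <stdio.h>

/* An invented micro-operation format, for illustration only. */
typedef enum { UOP_LOAD, UOP_ADD, UOP_STORE } uop_kind;
typedef struct {
    uop_kind kind;
    int dst, src1, src2;  /* renamed physical register numbers */
} uop;

int main(void)
{
    /* Suppose the architectural registers EAX and EBX have been
       renamed to physical registers p12 and p7, and p31 is a fresh
       temporary drawn from the pool of 40 physical registers. */
    enum { pEAX = 12, pEBX = 7, pTMP = 31 };

    /* ADD [EBX], EAX  ->  load, add, store */
    const uop decoded[] = {
        { UOP_LOAD,  pTMP, pEBX, 0    },  /* tmp      <- mem[EBX]  */
        { UOP_ADD,   pTMP, pTMP, pEAX },  /* tmp      <- tmp + EAX */
        { UOP_STORE, 0,    pEBX, pTMP },  /* mem[EBX] <- tmp       */
    };

    for (unsigned i = 0; i < sizeof decoded / sizeof decoded[0]; i++)
        printf("uop %u: kind=%d dst=p%d src1=p%d src2=p%d\n",
               i, (int)decoded[i].kind, decoded[i].dst,
               decoded[i].src1, decoded[i].src2);
    return 0;
}

Once in this form, each micro-operation can be scheduled independently in the out-of-order pool, much as the simple instructions of a RISC would be.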

However, Mashey (1997) also makes a strong argument that real differences in the characteristics of RISC and CISC architectures persist, and emphasises the distinction between "ARCHITECTURE and IMPLEMENTATION". In other words, whilst modern implementations of CISC may be able to "...overcome the instruction set level limitations..." (Bhandarkar), this does not remove the real implementation difficulties inherent in their architectural addressing features.

In answer to observations about the actual number of instructions in modern RISC architectures, Mashey (1997) points out that "More instructions aren't what really hurts you, anywhere near as much as features that are hard to pipeline." This is why many engineers feel that 'load/store' is a more useful term than RISC for describing architectures such as ARM and SPARC.

In a discussion of the relative difficulty of pipelining CISC instructions, Mashey outlines the many stages in the execution of the complex VAX instruction "ADDL @(R1)+, @(R1)+, @(R2)+". In summary, his observations were as follows:

Although the worst-case scenarios are very unlikely, they are legal and the CPU must be designed to cope with them: one has to worry about the unlikely cases, going fast costs in both buffering and state, and the resulting pipeline is potentially very complex.

By way of contrast, on a RISC processor every load/store uses exactly one MMU access, the compiler often has the freedom to rearrange the order of these accesses, and the data alignment requirement means that a cache miss or page fault cannot strike after one half of a load/store has already completed. (See Appendix I for a comparison with RISC code performing the operation equivalent to the VAX instruction.)

Whilst RISC processors may legitimately support weird and wonderful instructions for FP arithmetic or graphics transforms, "...compare that with the consequence of adding a single instruction that has 2-3 memory operands, each of which can go indirect, with auto-increments, and unaligned data..." (Mashey, 1997). In general, load/store architectures should have no more than one memory access per instruction, no indirect addressing, and no instructions that combine load/store with arithmetic.

If we accept that these characteristics define the 'line in the sand' between RISC and CISC, rather than blind instruction-counting (which is in any case problematic for many architectures), then the distinction between them is as valid as ever. Or, as one journal article bluntly puts it "CISCs are Not RISCs, and Not Converging Either"! (Microprocessor Report, March 26, 1992)

Conclusion

The research that led to the development of RISC architectures represented an important shift in computer science, with emphasis moving from hardware to software. The eventual dominance of RISC technology in high-performance workstations from the mid to late 1980s was a deserved success.

In recent years CISC processors have been designed that successfully overcome the limitations of their instruction set architecture. This feat has been achieved by using an impressive combination of new technologies such as out-of-order execution, branch prediction, speculative execution, register renaming and micro-dataflow.

Far from refuting the early observations about efficient ISAs made at Berkeley and IBM's 801 project, these developments in modern CISC processor design are a recognition of the importance of this research. The difference is that the optimisations that Radin, Cocke et al. expected to be made by a sophisticated compiler for a high-level language (intelligent allocation of many registers, instruction re-ordering) are being done by complex and power-hungry hardware.

It was perhaps significant that when Intel acquired the StrongARM from Digital, one analyst commented "It's everything Intel's chips are not: Fast, inexpensive, low power" (Turley, quoted in Ceramalus, 1998).

Of course the proponents of RISC systems should not rest on their laurels, as they seem to have done during recent years, confident that the "...Intel x86 was doomed to slowly plod along with its complex instructions" (Ho, 1998). RISC machines may be more elegant and power-efficient, but compilers need to be improved and clock speeds need to increase to match the aggressive design of the latest Intel processors.

To conclude, the question is not whether the limitations of a CISC instruction set architecture can (by clever implementation) be overcome, but for how much longer the difficulty and complexity of doing so will be worthwhile. With the current trend towards low-power mobile computing, and a diminishing need to run legacy 16-bit Windows binaries, it seems unlikely that the future of computing will hold a place for the expensive dynamic interpretation of archaic instruction set architectures such as the Intel 80x86.

References

George Radin, "The 801 Minicomputer", IBM Journal of Research and Development, Vol.27 No.3, 1983

John Cocke and V. Markstein, "The evolution of RISC technology at IBM", IBM Journal of Research and Development, Vol.34 No.1, 1990

Dileep Bhandarkar, "RISC versus CISC: A Tale of Two Chips", Intel Corporation, Santa Clara, California

Ron Ho, "RISC versus CISC", 1998

David A. Patterson and Carlo H. Sequin, "RISC I: A Reduced Instruction Set VLSI Computer", Proceedings of the Eighth Annual Symposium on Computer Architecture, May, 1981

John Mashey, "RISC vs. CISC", http://www.inf.tu-dresden.de/~ag7/mashey/RISCvsCISC.html, 1997 (accessed 29th January 2002)

Nobilangelo Ceramalus, "A Brief History of RISC", Acorn User, September and October 1998, pp.48-49

Appendix I

In his discussion of the difficulty of pipelining CISC instructions, Mashey (1997) cites the following VAX code, equivalent to the C expression "**r2++ = **r1++ + **r1++;":

ADDL @(R1)+, @(R1)+, @(R2)+

On an ARM processor, the equivalent operation could be coded as follows, using two temporary registers R3 and R4. (This is unlikely to present a problem since, in general, RISC architectures have more registers than CISC architectures.)

LDR R3,[R1],#4 ; load the pointer to operand 1, then post-increment R1
LDR R4,[R1],#4 ; load the pointer to operand 2, then post-increment R1
LDR R3,[R3] ; load operand 1 itself
LDR R4,[R4] ; load operand 2 itself
ADD R3,R3,R4 ; add the two operands
LDR R4,[R2],#4 ; load the destination pointer, then post-increment R2
STR R3,[R4] ; store the result through the destination pointer

As can be seen, a number of RISC instructions are needed to perform the same task as the original VAX instruction. However, these simple instructions execute in very few cycles, and there are none of the attendant overheads in design complexity, nor the same difficulties in handling unaligned data and mid-instruction occurrences of cache misses, page faults or exceptions.

In actual fact, my code sequence could be optimised further by replacing the first two LDR instructions with an LDMIA (LoaD Multiple - Increment After) instruction:

LDMIA R1!,{R3,R4} ; load the pointers to both operands, then increment R1 by two words
LDR R3,[R3] ; load operand 1 itself
LDR R4,[R4] ; load operand 2 itself
ADD R3,R3,R4 ; add the two operands
LDR R4,[R2],#4 ; load the destination pointer, then post-increment R2
STR R3,[R4] ; store the result through the destination pointer

This brings the total number of RISC instructions needed to perform the VAX operation down to six, although of course the LDM will take a few cycles to complete. Both the multiple-register load and the LDR instructions have the increment of the pointer register encoded within them; relative to some other RISC processors, ARM code is therefore comparatively compact.

N.B. The ARM's LDM and STM instructions do break the general 'rule' that load/store architectures should have no more than one memory access per instruction.

This is mitigated somewhat in that performing 'load many' and 'store many' operations is not really in the same class as having combinations of load and store operations within a single instruction, possibly indirected and/or combined with arithmetic operations.

However, some purists have argued that "...ARM is not a 'pure RISC'..." (Hot Chips IV, quoted in Mashey, 1997) or somehow on the 'RISC-CISC border'. In my opinion this is rather an extreme attitude, and not one that I believe is very widely held.
