Naturally, RISC OS already has its own built-in graphics API. I shall examine this in some detail in order to explain why it is inadequate for many advanced graphics applications, and then look at some alternative graphics systems.
The main graphics API for RISC OS is the VDU drivers, which are provided by the kernel. These are substantially unchanged from the BBC Microcomputer's graphics system, offering a high degree of backwards-compatibility.
Graphics commands are invoked by sending special character
codes to the VDU stream (e.g. by calling
"OS_WriteC"). Up to 9 following bytes may be sent as
parameters to a VDU command. These are queued until the
required number arrive, before the command is executed. The
VDU drivers do not maintain a large amount of internal state,
apart from the graphics origin, text/graphics windows and the
last three graphics cursor positions. 
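To make the queuing mechanism concrete, here is a sketch of it in C. The parameter counts for PLOT (VDU 25) and the graphics origin command (VDU 29) follow the documented interface, but the function names (`vdu_put`, `execute_command`) are my own illustrative inventions, not RISC OS calls:

```c
#include <stdio.h>

/* Sketch of VDU parameter queuing: bytes sent after a command byte
 * are held until the required number have arrived, at which point
 * the command is executed.  Names here are illustrative only. */

static unsigned char queue[9];   /* a VDU command takes at most 9 parameter bytes */
static int needed = 0;           /* parameter bytes still awaited */
static int count = 0;            /* parameter bytes received so far */
static int pending = 0;          /* command byte awaiting its parameters */
static int executed = 0;         /* commands completed (for demonstration) */

/* Parameter-byte counts for a few VDU commands. */
static int params_for(int cmd)
{
    switch (cmd) {
    case 25: return 5;   /* PLOT: plot type, then x and y as 2 bytes each */
    case 29: return 4;   /* set graphics origin: x and y as 2 bytes each */
    default: return 0;   /* commands taking no parameters */
    }
}

static void execute_command(int cmd, const unsigned char *p, int n)
{
    (void)p;
    executed++;
    printf("VDU %d executed with %d parameter bytes\n", cmd, n);
}

/* Analogous in spirit to sending one character with OS_WriteC. */
void vdu_put(unsigned char byte)
{
    if (needed > 0) {
        queue[count++] = byte;
        if (--needed == 0)
            execute_command(pending, queue, count);
        return;
    }
    pending = byte;
    count = 0;
    needed = params_for(byte);
    if (needed == 0)
        execute_command(byte, queue, 0);
}
```

Sending the byte 25 followed by five parameter bytes thus results in exactly one command being executed.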
Unlike many modern graphics systems, there is no direct support for matrix multiplication or 3-D graphics. All co-ordinates must be specified by the program pre-transformed (e.g. to 2-D co-ordinates), for every graphics primitive that is to be drawn. Thus the opportunities for hardware graphics acceleration of the API are limited to a raster subsystem, rather than the full 3-D transformation and clipping pipeline provided by many modern graphics cards.
Vector graphics are specified in an internal 2-D coordinate
space known as "OS co-ordinates". Because these do not
necessarily equate to actual screen pixels, the same graphics
can be plotted at different screen resolutions without
distortion. The relationship between OS co-ordinates and
pixels is described by "eigen factors", which scale
co-ordinates by means of arithmetic shifts (e.g.
screen_x = OS_x >> x_eigen_factor).
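The conversion can be sketched in C as follows. Eigen factors of 2 (as in a 320x256 pixel mode spanning 1280x1024 OS units) are assumed purely for illustration, and the function names are my own:

```c
/* Conversion between OS co-ordinates and pixels by arithmetic shift.
 * Eigen factors of 2 are assumed, as in a 320x256 pixel mode
 * spanning 1280x1024 OS units. */
static const int x_eigen_factor = 2;
static const int y_eigen_factor = 2;

int os_to_pixel_x(int os_x) { return os_x >> x_eigen_factor; }
int os_to_pixel_y(int os_y) { return os_y >> y_eigen_factor; }

/* The reverse mapping recovers only pixel-aligned OS co-ordinates;
 * sub-pixel precision is lost. */
int pixel_to_os_x(int px)   { return px << x_eigen_factor; }
```

Note that because the shift discards the low bits, several distinct OS co-ordinates map to the same pixel.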
A wide variety of graphics primitives are supported by the VDU drivers: points, lines, triangles, rectangles, parallelograms, circles, arcs, sectors and ellipses. These may be plotted in different styles, e.g. filled or outline shapes, and solid or dotted lines. All primitives are drawn in uniform colour, with no support for advanced rendering operations such as Gouraud shading, anti-aliasing or texture mapping.
All operations relating to vector graphics are accessed by
VDU command 25, although the SWI "OS_Plot"
provides a nicer interface to the same facilities. A plot
command code specifies the operation, such as "draw circle"
or "move graphics cursor". Custom plot routines may be
implemented by claiming
UKPLOTV, the vector to
which plot commands unknown to the VDU drivers are passed.
Only one set of XY co-ordinates may be specified per plot command. This constraint explains why the VDU drivers remember two previous graphics cursor positions - to plot a shape such as a triangle, three sets of co-ordinates are needed. Here is an example (in BASIC) of how a filled triangle would be drawn using the VDU drivers:
Figure 1: Calling the VDU drivers from BASIC
Figure 2: Three VDU commands to plot a triangle
It should be obvious from the example above that this form of graphics system has major drawbacks when it comes to high-speed plotting. For every redraw of a scene, the co-ordinates of each primitive must be re-specified to the VDU drivers, with a very high overhead in terms of the number of SWI calls needed.
In terms of efficiency this compares very badly with other
graphics systems such as OpenGL, which can maintain
internal lists of primitives to draw. Rather than
re-specifying the co-ordinates of every primitive, a redraw
of the whole scene can be invoked by use of a single function
call such as "glCallList".
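The retained-mode approach can be illustrated with a sketch in C. The structure below is not OpenGL itself (which retains primitives in display lists built between glNewList and glEndList), merely an illustration of how a library that remembers its primitives allows a whole scene to be redrawn with one call; all names are hypothetical:

```c
#include <stddef.h>

/* A minimal retained-mode sketch: primitives are recorded once into
 * a list, and the whole scene is replayed with a single call.  The
 * types and function names are illustrative, not part of any API. */

#define MAX_PRIMS 256

typedef struct { int x1, y1, x2, y2, x3, y3; } triangle;

static triangle scene[MAX_PRIMS];
static size_t scene_len = 0;
static size_t plotted = 0;      /* raster operations performed, for demonstration */

/* Immediate mode: every call re-specifies co-ordinates and plots at once. */
void plot_triangle(triangle t)
{
    (void)t;
    plotted++;
}

/* Retained mode: record each primitive once... */
void record_triangle(triangle t)
{
    if (scene_len < MAX_PRIMS)
        scene[scene_len++] = t;
}

/* ...then redraw the whole scene with one call per frame. */
void redraw_scene(void)
{
    for (size_t i = 0; i < scene_len; i++)
        plot_triangle(scene[i]);
}
```

With the VDU drivers, by contrast, every frame requires the client to repeat the co-ordinate traffic itself.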
The design of the RISC OS graphics API also has implications for hardware graphics acceleration. Because the VDU drivers are part of the kernel, they can only be modified by claiming vectors. This makes writing drivers for graphics hardware rather difficult, compared to simply replacing a system module with another that provides the same functional interface.
Where complex high-speed graphics are needed (such as for
games), most programmers bypass the VDU drivers entirely in
order to avoid making a huge number of SWI calls. Instead,
they call the SWI
"OS_ReadVduVariables" to find
out the memory address of the frame-buffer, and plot to this
directly using custom ARM code routines. Typically these
routines will be optimised for a particular screen resolution
and colour depth, and may perform texture mapping, Gouraud
shading or other advanced rendering operations unsupported by
the VDU drivers.
Bypassing a graphics API in this manner is generally considered to be a Bad Thing4, but it is often a necessary evil under RISC OS. Unfortunately, this means that advanced graphics programming has remained something of a 'black art', known only to a few highly skilled ARM coders.
There are also implications for the future compatibility of programs that write to the frame-buffer directly: highly-optimised programs have a nasty habit of failing as minor changes are made to the underlying hardware specification.
In 1996 this was demonstrated to RISC OS users in harsh fashion, when the vast majority of games failed on the new Digital StrongARM processor. The StrongARM architecture has significant differences, notably a 5-stage rather than 3-stage pipeline, and a Harvard architecture (split data and instruction caches). All self-modifying code fails because "instructions written into memory may linger in the data cache, and not be in real memory when the instruction cache comes to fetch them".
Recently Pace have been working to move RISC OS from 26-bit mode to 32-bit mode, because modern ARM processors no longer support 26-bit mode. This will cause a crisis of similar proportions for low-level ARM code, and therefore the more programs that access facilities through high-level APIs, the easier it will be to make the transition.
As I have explained, a large number of applications, games in particular, bypass the VDU drivers and write directly to screen memory. This works fine on older RISC OS systems, where the CPU and memory co-exist in close proximity on the motherboard. However the current trend in computer hardware is for the system's video memory to live on an expansion card that handles all graphics plotting.
Windfall Engineering's Viewfinder podule5 plugs into the expansion bus on RISC OS machines, providing an interface into which an off-the-shelf AGP graphics card can be plugged. The podule bus is slow by modern standards, and therefore directly accessing the frame buffer on the AGP card is also very slow. In practice, this means that whilst desktop applications that use the VDU drivers benefit massively from hardware acceleration, most games are unplayable.
Having identified the RISC OS graphics system's deficiencies in terms of support for modern 3-D graphics hardware and advanced rendering, I looked at some of the industry standard graphics systems by way of comparison. In the main my focus was on Silicon Graphics' OpenGL, because of its current predominance in the realm of PC graphics accelerators and computer games.
"OpenGL (for 'Open Graphics Library') is a software interface to graphics hardware. The interface consists of a set of several hundred procedures and functions that allow a programmer to specify the objects and operations involved in producing high-quality graphics images, specifically color images of three-dimensional objects." 
OpenGL is a state-driven graphics API with a rich feature set - it supports all the rendering operations that most client programs are likely to require. Originally designed by Silicon Graphics for high-end graphics workstations, it infiltrated the desktop market when the first games supporting 3-D graphics cards began to appear.
It has a client-server architecture, meaning that the GL may be viewed as the server, which the client program addresses through an API. This design makes it "network transparent" - the server may actually be on a different computer to the client application.
The phrase "state-driven" means that an OpenGL implementation maintains a large amount of state information that controls how graphics are rendered. This is why there are relatively few OpenGL commands, despite the large number of rendering options. The exact effect of a given command on the framebuffer is dependent on the current state.
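The pattern can be sketched in a few lines of C. The names `set_colour` and `draw_point` are my own inventions, not OpenGL functions, though the principle mirrors real calls such as glColor3f, whose effect persists until the state is next changed:

```c
/* Sketch of a state-driven interface: a small piece of state (here,
 * the current colour) is set once and silently affects every later
 * drawing command.  All names are illustrative. */

static int current_colour = 0;
static int framebuffer[4][4];

void set_colour(int c) { current_colour = c; }

void draw_point(int x, int y)
{
    /* The command's effect on the framebuffer depends on state
     * established by earlier, separate calls. */
    framebuffer[y][x] = current_colour;
}
```

Two identical `draw_point` calls can thus produce different results, depending purely on the state in force at the time.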
Unsupported rendering options can simply be ignored by an implementation: if a client asks for texture-mapped polygons but the GL doesn't support them, flat-shaded polygons are used instead. If the library is subsequently improved to support texture-mapping then this can be implemented transparently to client applications.
OpenGL does much more than simple graphics rendering, however, with support for matrix transformations on 3-D vertices and projection of a scene onto the screen. This contrasts with the lack of transformation facilities provided by the RISC OS system. Lighting calculations and depth fogging may also be delegated to the GL.
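As a sketch of the transformation work that such a library takes on, here is a 4x4 matrix applied to a homogeneous vertex, followed by the perspective divide. This is illustrative only; a real OpenGL pipeline also performs clipping and viewport mapping, and the type and function names below are my own:

```c
/* Sketch of transform-and-project: a 4x4 matrix applied to a
 * homogeneous vertex, then division by w.  Illustrative names. */

typedef struct { double x, y, z, w; } vec4;
typedef struct { double m[4][4]; } mat4;   /* row-major */

vec4 transform(const mat4 *m, vec4 v)
{
    double in[4] = { v.x, v.y, v.z, v.w };
    double out[4];
    for (int i = 0; i < 4; i++)
        out[i] = m->m[i][0] * in[0] + m->m[i][1] * in[1]
               + m->m[i][2] * in[2] + m->m[i][3] * in[3];
    return (vec4){ out[0], out[1], out[2], out[3] };
}

/* Project to 2-D screen co-ordinates by the perspective divide. */
void project(vec4 v, double *sx, double *sy)
{
    *sx = v.x / v.w;
    *sy = v.y / v.w;
}
```

Under the VDU drivers this arithmetic must all be done by the client, for every vertex, before any plot command is issued.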
A client may choose to bypass these facilities by performing its own transformations and lighting, reducing the role of OpenGL to rasterisation. However, as Lee Johnston observes, "...with the advent of graphics accelerators capable of performing this task this is likely to change."
Others have noticed the need for a graphics library for RISC OS that provides superior facilities to the kernel's VDU drivers. Over the last decade many libraries have been proposed - some falling by the wayside, but many coming to fruition. Before embarking upon designing my own library, I looked at some of these previous attempts.
Although not strictly speaking a graphics library in the normal sense, having no properly defined API, Andrew Hutchings's polygon plotter deserves a mention: Partly because it was the basis of so many of the best-loved RISC OS games of yesteryear ("Chocks Away", "Stunt Racer 2000", "Star Fighter 3000"), but mainly because it typifies the 'bad old way' of doing things.
The plotter is written in highly optimised ARM code, and it is basically designed to draw flat-coloured polygons as fast as possible to a fixed-resolution 8-bpp framebuffer. It does manage an impressive throughput of polygons, allowing smooth 3-D games in an era when an 8MHz Archimedes was standard hardware.
In order to achieve this kind of performance some horrendous coding tricks are employed, notably unrestrained use of self-modifying code. For example when the clipping rectangle is changed, this is done by overwriting the relevant comparison instructions in the plotter. Similarly, when a different style of polygon plotting is set up (i.e. wireframe mode), a branch instruction to the relevant plotting code is inserted.
The division operation used to find the gradient of a polygon edge is also of interest, using two unrolled loops with a total length of 330 instructions (though most of these will not be executed on any given run). This is a clear trade-off between memory usage (1320 bytes of code) and execution speed.
Figure 3: Unrolled division loop, as coded in the BASIC assembler
Apart from dating the code as archaic, both unrolled loops and self modification demonstrate the vulnerability of over-optimised code to the passage of time. I have already discussed the failure of self-modifying code on StrongARM. As for the unrolled loops, the Acorn Assembler manual has this to say: "...unrolling loops (e.g. divide loops) speeds execution on an ARM2, but can slow execution on an ARM3, which has a cache." 
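For reference, the underlying shift-and-subtract division can be sketched in C. This is a rolled version written for clarity; the ARM original effectively repeats the body of the loop once per bit, trading code size for the absence of loop-control overhead. The function name is my own:

```c
/* Restoring (shift-and-subtract) division of the kind used to find
 * edge gradients.  An unrolled ARM version repeats this loop body
 * once per bit; the comparison inside it is what the 330-instruction
 * unrolled code performs explicitly at each step. */

unsigned div_u32(unsigned num, unsigned den, unsigned *rem)
{
    unsigned quot = 0, r = 0;
    for (int bit = 31; bit >= 0; bit--) {
        r = (r << 1) | ((num >> bit) & 1u);   /* bring down the next bit */
        if (r >= den) {                        /* does the divisor fit? */
            r -= den;
            quot |= 1u << bit;
        }
    }
    if (rem)
        *rem = r;
    return quot;
}
```

Unrolling simply copies this body 32 times (or more, with early-exit paths), which is how the original plotter arrives at its 1320 bytes of division code.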
Unusually for a RISC OS graphics library, the TAG (TBA Advanced Graphics) engine was genuinely advanced for its time. Obviously it was bound by the usual constraints of limited processing power, but within the realm of fast games-oriented graphics its capabilities have not yet been surpassed.
Written by Martin Piper, TAG started life in late 1992 as a small 3D game demo at an Acorn World show. "That little demo generated so much interest that we started work on a proper 3D graphics engine." Many years later, TAG has actually been used as the core for various commercial games in diverse genres, such as "Cobalt Seed" (flying blast-em-up) and "Brutal Horse Power" (motor racing).
Unlike the other libraries described here, TAG is a fully integrated game engine capable of much more than simply plotting 3-D graphics. In the words of the author, it "handles all the difficult stuff" such as memory management, sound & music, object movement and collision detection. The gameplay itself is implemented by means of client-supplied 'handlers' - small ARM code routines that are activated upon events, such as a collision between objects in the game world. A desktop editor named 'Holograph' manages resources, and allows editing of 3-D objects and game parameters. 
In terms of actual rendering capabilities, TAG supports texture-mapping of polygons and lightsourcing effects, unlike many earlier renderers. TAG1 was restricted to a fixed low-resolution (320×256) screen mode, but TAG2 supposedly supports variable screen resolutions, Gouraud shading and "much improved" lighting and texture mapping.
TBA Software claim a speed of 20,000-27,000 filled polygons or 9,000-12,000 texture-mapped polygons per second on a 40MHz Risc PC under realistic conditions ("while running a game and shooting things"). This impressive performance derives from the fact that, like Andrew Hutchings's plotter, TAG is "100% hand coded assembly".
So why does a game engine as powerful as TAG not enjoy widespread use? There seems to be some disagreement about its utility to outside developers. Lee Johnston  writes that "...there is no documentation, no pretence at support (these are busy people)..." This is in direct contrast to TBA Software's own view: "...all of our developers have access to [TAG2] and other technical support we at TBA can offer. To become a developer all you have to do is ask..." 
Also, I would cite the lack of more general language support - nowadays many programmers prefer a high-level language (typically C) rather than ARM code, for the nuts-and-bolts aspects of writing a game. Ironically, the powerful game-creation facilities offered by TAG may also have discouraged many prospective developers. Most programmers have their own ideas about how their game engine should be written, and dislike tools that provide an insufficiently flexible interface.
This project is typical of many RISC OS graphics libraries, in that it apparently never came to fruition. Scorpion has been touted on the internet for a number of years, with an extremely impressive feature list6 raising high expectations:
Despite all the impressive rendering capabilities claimed by the author, in my opinion the most interesting feature is the language interface. Apparently the library is written in C++ and Assembler, with a C++ programmer's API. The idea of a direct interface to a modern object-oriented programming language contrasts starkly with the esoteric assembly language interfaces provided by earlier rendering libraries.
Scorpion is designed only for computers fast enough to support high-quality graphics rendering, with no pretensions to support pre-StrongARM processors.
Although a very early version was used in a graphics demo for a 1997 competition, nothing has been heard of the project since February 2000. It is unknown when, if ever, Stuart Lovegrove will publicly release Scorpion.
If I were to characterise the main problem with this library in one phrase it would be "over-ambitious". The project could have been released years ago, and still have been the most advanced rendering library available.
Also, the fact that there is no support for older ARM processors is likely to discourage developers from using Scorpion, since they would not want to restrict themselves to a niche market. Even today, most software publishers make a significant effort to make sure that their products are at least usable on ARM7500-based computers.
There exists a RISC OS port of Brian Paul's Mesa, a well known open-source implementation of the OpenGL graphics standard. I have examined OpenGL in detail earlier in this document.
The port is maintained by David Boddie, who has written a RISC OS driver for Mesa ('drivers' provide low-level OS-specific functions, such as plotting to the framebuffer). The majority of the library is written in C, with a small core of ARM code routines to speed up rendering and some transformation functions.
As a full implementation of OpenGL, Mesa has impressive capabilities. There is support for niceties such as fog effects and dynamic lighting, as well as a large range of polygon plotting styles. All available combinations of screen resolution and colour depth are supported.
Although it is valuable to have a full OpenGL implementation for RISC OS, on current hardware Mesa is really too slow where fast graphics are required. On an ARM7 the simple graphical demos work (slowly) at low-resolution, but a StrongARM is really needed for acceptable performance.
Another consideration for developers is that at well over a megabyte the Mesa library is very large, certainly compared to a native Acorn game engine such as TAG, which TBA Software claims to be only 77KB in size. Small demo programs require between 3 and 4 megabytes of RAM, a not insignificant amount considering that even very large and powerful programs for RISC OS typically require only 1 or 2 megabytes.
Finally, Mesa makes exceptionally heavy usage of floating point arithmetic. Running a crude code analysis program over the supplied Mesa demos reveals that an astonishing 14% of the total code size consists of floating point instructions. By way of contrast, the same analysis of other programs suggests an average incidence of less than 1%. The issue of floating point arithmetic under RISC OS is discussed further in section 5.
Although hardware abstraction of a kind can be seen, none of the RISC OS systems I have looked at have anything approaching the 'plug in' driver interface for graphics renderers (whether hardware or software) seen on other platforms. Whilst Mesa has a type of driver support, this is a compilation facility rather than an interface for run-time driver selection. Given the recent appearance of AGP graphics cards in the RISC OS desktop computer market (see section 2.2), this is an issue that needs to be addressed.
Floating-point arithmetic, and the speed at which it can be performed, is often a major factor in the design of graphics libraries and the suitability of computer hardware for graphics applications. So, what exactly are the capabilities of my chosen hardware and software platform in this area?
The ARM has a general co-processor interface, and for many years a floating-point co-processor has been a feature of some RISC OS computers7. Early machines could be interfaced to the AT&T WE32206 via a FPPC support chip, whilst ARM Ltd later produced their own FPA10 and FPA11 co-processors.
However, co-processors have a maximum clock speed and rely on a particular electrical interface to the ARM. The FPA11 is incompatible with more recent processors such as the ARM8, ARM9 and StrongARM. Since today's ARM processors are targeted mainly at the embedded device market, the silicon cost and power consumption of a co-processor are also a significant factor.
Where no floating-point hardware is available, a combination of fixed-point arithmetic and software emulation of floating-point operations must be used. Programs may either invoke a floating-point library directly via function calls, or use the floating-point instruction set and rely on transparent emulation of the hardware.
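As an illustration of the fixed-point approach, here is a sketch in C of 16.16 fixed-point arithmetic, the format commonly substituted for floating point on FPA-less ARMs. The type and helper names are my own:

```c
#include <stdint.h>

/* 16.16 fixed-point arithmetic: 16 integer bits, 16 fraction bits.
 * Multiplication and division go via a 64-bit intermediate so that
 * the extra fraction bits are not lost.  Names are illustrative. */

typedef int32_t fix16;
#define FIX_ONE (1 << 16)

static fix16 fix_from_int(int i) { return (fix16)(i << 16); }
static int   fix_to_int(fix16 f) { return (int)(f >> 16); }

/* (a * b) has 32 fraction bits; shift 16 of them back out. */
static fix16 fix_mul(fix16 a, fix16 b)
{
    return (fix16)(((int64_t)a * b) >> 16);
}

/* Pre-shift the numerator so the quotient keeps 16 fraction bits. */
static fix16 fix_div(fix16 a, fix16 b)
{
    return (fix16)(((int64_t)a << 16) / b);
}
```

Every operation here compiles to a handful of integer instructions, avoiding both the co-processor and the emulator entirely, at the cost of a fixed range and precision.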
Under RISC OS, the instruction set is implemented by the FPEmulator module, which handles undefined instruction traps caused by the absence of a FP co-processor. The major advantage of this system over compiling code against a FP library is that the same programs can run unmodified using either a co-processor or the emulator, and yet immediately benefit when the hardware is available.
Unfortunately, any emulator of floating-point hardware is "intrinsically slow". On execution of every FP instruction the processor's state must be saved, the instruction decoded and emulated (possibly referring to the saved state), the saved state modified (e.g. if the instruction changes flags/registers), and then restored so that execution of the program can continue.
The undefined instruction trap and all this state-saving and decoding add up to "a substantial overhead per floating-point instruction emulated". ARM Ltd report this to be typically about 2.5 times the actual overhead of emulating the operation, when compared to a statically linked library.
Since the vast majority of RISC OS computers rely on software
emulation of floating-point instructions, it might be a good
idea to avoid relying on floating-point arithmetic in the
design of my graphics library.
4 As per "1066 and all that."
5 Name given to Archimedes/Risc PC hardware expansion cards.
6 For brevity I have omitted a large number of minor features from the list as reproduced here.
7 First as an expensive and exotic upgrade for the ARM3-based A5000, and then as a standard (though largely unexploited) feature of the ARM7500FE-based A7000+ and clones.