

1 Language of implementation

Relocatable modules for RISC OS may be written either in C or in assembly language. However, because modules operate below the boundary between user programs and the operating system, special tools are needed to protect compiled C code from its environment.

Compilers for high-level languages such as C and Pascal produce code that conforms to the ARM Procedure Call Standard. The APCS defines strict rules regarding register usage, procedure call/return, stack backtrace and software stack-limit checking. In contrast, operating system code and extension modules written in assembly language are coded more freely, use the non-extensible SVC mode stack, and do not perform stack limit checking.

A tool called CMHG (C Module Header Generator) creates a module header, initialisation/finalisation code and a set of entry and exit veneers to interface the C run-time environment with low-level system entry points such as SWIs, CLI commands and IRQ handlers. The code to implement the functionality of the module is written in C, and must be compiled with a special compiler option to allow static data to be separated from the module code and multiply-instantiated.

The call overheads in implementing a C library as a module.

Modules written in C tend to be both larger and slower than modules written in assembly language, partly because code generated by compilers is less than optimal, but also because of the amount of code required to interface with the C library upon initialisation, and the need for veneers to insulate the APCS run-time environment from the operating system level. Software stack-limit checking also imposes a minor performance penalty on every function call.

For these reasons, and particularly because speed is paramount for this application area, I chose ARM assembly language for the implementation of the GL module. Portability is not a real consideration, since OpenGL implementations for most other platforms already exist, and in any case the need to obtain optimal performance on ARM systems would probably preclude it.

2 Choosing an assembler

I had access to three different ARM code assemblers, the BASIC assembler, Acorn's ObjAsm and Nick Roberts' ASM. I investigated the features of each and enquired within the programming community as to their relative merits before deciding which to use.

BBC BASIC has a built-in assembler, which I have used for previous programming projects. Its main advantage as described by Philip Ludlam is that you can do "...neat BASIC tricks. Things like calling functions within the assembler code or looping through loops to generate data or code." [25] Thus the BASIC assembler effectively supports macros, conditional assembly and repetitive assembly 'for free', using the power of the underlying language.

The main disadvantage of the BASIC assembler is that it does not output linkable object files. Therefore large programs cannot be split into separately generated code modules, and it is not possible to link the output with external code libraries or object files produced by C and other compiled languages.

Acorn's ObjAsm was part of their commercial C/C++ product, which also includes a C compiler, C++ translator, debugger and linker. ObjAsm supports macros, conditional assembly (IF...ELSE...ENDIF), repetitive assembly (WHILE...WEND) and sophisticated symbolic capabilities (arithmetic, logical and string variables). According to Weiss Niklaus it has the advantage over the BASIC inline assembler that it can generate debugging data for use with DDT, an interactive debugging tool. [25]

ObjAsm generates object files in the standard AOF format. These can be linked to other object files (including those produced by the C compiler) by any AOF linker, such as Acorn's Link or the freeware DrLink alternative by David Daniels.

Lee Johnston expresses some doubt about the future state of support for ObjAsm: "I wouldn't say that ObjAsm is being actively maintained, or rather that it is but the latest versions don't seem to be available outside of Pace." [25]

Nick Roberts' freeware ASM is a simple ARM code assembler. Like ObjAsm it produces AOF files suitable for use with the linker, but it doesn't support repetitive assembly or many of ObjAsm's complex symbolic capabilities. Simple conditional assembly is supported however, using IFDEF/IFNDEF directives to test whether certain assembly-time flags are defined or undefined. Macros are supported.

The main feature that ASM has over ObjAsm is that it supports structures for organising blocks of memory. Lee Johnston describes how you "...load a register with the address of an instance of that structure and then access all the fields using a named label rather than offset. It's quite nice." [25]

I wanted to structure my project as a series of independently-compilable code modules, and wanted the option of statically linking these with test programs written in C. These requirements pointed to using one of the AOF generating assemblers.

I was attracted by ASM's support for named data structures. I anticipated that these would be invaluable in writing a module, since all data must be accessed via a workspace pointer for the current instantiation rather than directly. Being able to refer to a static variable as LDR r0,[r12,#workspace.screen_mode] rather than LDR r0,[r12,#12] makes code more readable and easier to maintain when structure formats are changed.

Compared to the relative simplicity of ASM, the daunting complexity of ObjAsm discouraged me from learning how to use it. Due to project deadlines I needed to minimise the time spent learning to use a new assembler and concentrate on coding. At the time I did not anticipate needing to use repetitive assembly, though this assumption was later proved wrong.

Taking all these factors into account I eventually decided to use Nick Roberts' ASM, version 4.10. Should it ever become necessary to move the source code to ObjAsm format then this will be considerably easier than if the BASIC assembler had been used, because the ASM source format is broadly similar to ObjAsm's and the two support many of the same assembler directives.

I used Acorn's ARM Linker version 5.25 to link the object files generated by ASM. The -AIF option was used during the development process, to link the code with a test program and output an executable AIF image. The -RMF option was used for the final build, to output the graphics library as a relocatable module.

3 Computer hardware

The two machines used for development were:

Model name             Acorn Risc PC (c.1994)          Microdigital Mico (c.2000)
Processor              202 MHz SA-110 rev. K           56 MHz ARM7500FE
Floating point system  FPEmulator 4.06 (14 Mar 1996)   FPA11 floating point unit
Operating system       RISC OS 3.70 (30 Jul 1996)      RISC OS 4.03 (09 Nov 1999)
Memory                 20 Mb SIMM + 1 Mb VRAM          64 Mb EDO RAM
Hard disc              4 Gb                            8.4 Gb
Integer benchmark      341,157 Dhrystones/sec          63,137 Dhrystones/sec
FP benchmark           2,621 kWhetstones/sec           8,314 kWhetstones/sec

This pair of machines is fairly representative of the RISC OS computers still in use, occupying opposite ends of the spectrum. The Mico represents the various third-party ARM7500-based machines that have flooded the market since the demise of Acorn Computers Ltd in 1998, whilst the StrongARM Risc PC represents the substantially upgraded Acorn systems still used by computing enthusiasts.

3.1 StrongARM system

The Risc PC is based around Digital's StrongARM 110 processor, which was first released as an upgrade in 1996.

Because it has a Harvard architecture with separate 16 kb data and instruction caches, code modification is much more expensive than on previous ARM processors - cleaning the data cache can take up to half a millisecond (i.e. 100,000 processor cycles) and the entire instruction cache must be flushed. [10]

However, it is clocked 5 times faster than any previous ARM, and furthermore many instructions execute in fewer cycles. It implements the ARMv4 instruction set, which includes 64-bit multiply and multiply-accumulate instructions that are unavailable on other RISC OS machines.

No floating point co-processor is available, so floating point instructions must be emulated in software. Given the relative rarity of such instructions, however, this deficiency is more than offset by the high speed of integer arithmetic and general data processing.

Unlike ARM7500-based machines, Risc PCs may include up to 2 Mb of fast dedicated video memory. This plays an important part in the machine's high performance, freeing the main data bus for the use of the processor and other DMA.

To make optimal use of StrongARM, software should avoid using the floating point instruction set if possible, and where appropriate make use of the long multiplication instructions provided. Use of self modifying code is strongly deprecated.

3.2 ARM7500FE system

This single chip computer has a rather different architecture from the StrongARM.

An ARM7 core is integrated with a video controller, memory/IO controller and FPA11 floating point unit on a single chip. Like earlier ARMs, it has a von Neumann architecture, with a write buffer and a 4 kb combined data/instruction cache. Thus there is no penalty for self-modifying code.

For software developers, the significant attributes of this chip are that floating point operations are very fast relative to the FP emulator, yet the main processor clock is relatively slow and is made slower still in high-bandwidth screen modes, where up to 50% of data bus bandwidth can be taken by video DMA.

To make optimal use of ARM7500FE machines, software should be designed to run in low bandwidth screen modes and use the floating point co-processor to maximum possible effect (e.g. the FP divide instruction may actually execute faster than an integer divide, since the latter must be implemented as a macro).

The stated disparities in features and architecture between the two target computer systems provide good justification for my design objective that the API of my 3-D graphics library should transparently support different machine-specific implementations.

4 Memory management

In the main, the memory management facilities required by my module were provided through the operating system's OS_Heap and OS_DynamicArea SWIs.

Static data

The static data for each instantiation of the module is held in a small fixed-size block of the relocatable module area (RMA), allocated upon initialisation. This holds state variables such as the current colour, normal vector and matrix mode, and pointers to the matrix stacks and other dynamic buffers.

Dynamic heap

A separate dynamic area holds a heap managed by OS_Heap, from which blocks for non-static data such as the render buffer and vertex buffer are allocated. A private dynamic area is used rather than polluting the RMA because this allows any substantial amounts of memory used by the module to be deallocated and returned to the free pool at termination. If the RMA were used then heap fragmentation would prevent this.

In order to allow the user to resize the GL module's dynamic area, a handler for post-grow and pre-shrink events is installed. This asks OS_Heap to shrink the heap when the area is reduced (or refuses the movement if this is impossible), and when the area grows it informs OS_Heap that the heap may expand.

The GL module only shrinks the heap of its own volition when rendering of all the buffered graphics is completed, by GL_Finish. By doing this occasionally rather than whenever a heap block is freed, the performance hit of constantly expanding and reducing the size of the heap is avoided, whilst ensuring that excess memory does not linger in the heap indefinitely.

Vertex and render buffers

The vertex buffer extends in 'chunks' which are essentially separate heap blocks that form a crude forward-linked list. This linked list structure has been chosen because extending the vertex buffer contiguously (e.g. using OS_Heap 4) would cause the block to be moved within the heap. But the vertex records must stay at the same absolute memory addresses for their entire lifetime, otherwise pointers within the render buffer would become invalid.

The computational costs of traversing the linked list are minor. Within the primitive construction routines invoked upon a call to GL_End, the overhead to check the vertex buffer read pointer against a chunk limit is just two instructions per vertex. If the limit is passed then a small function is invoked to set up the vertex pointer and limit for the next chunk, or else quit if that was the last chunk (indicated by link = 0):

; We reached end of chunk
; r6 = vertex buffer read pointer ( >= r7)
; r7 = vertex buffer limit
LDR r6,[r7,#0] ; first word after chunk limit points to next chunk...
CMP r6,#0 ; ...or null if no more
  BEQ process_vb_exit ; finished reading from vertex buffer

The render buffer, conversely, is extended contiguously using OS_Heap 4. In other words it may shift around within the heap, but it can always be traversed from top to bottom simply by incrementing a pointer from its current base address.

Extending a very large buffer in this fashion is not particularly good practice, since it can cause serious heap fragmentation. However, these effects are unlikely to cause real problems on my system given the small number of primitives likely to be stored in the render buffer and the infrequency with which it needs to be extended.

It is important that the render buffer be contiguous in memory, for the implementation of the sort required by the painter's algorithm. Theoretically it may be possible to sort data stored in a linked list (sort individual chunks and then merge them?), but I did not investigate any algorithms for doing so. Instead, my sole focus was on finding the fastest sort algorithm possible (see section 6).

In the case of both the render buffer and the vertex buffer, the granularity with which they are extended is deliberately high. The vertex buffer allows room for 64 vertex records per chunk, and the render buffer chunk allows 32 shape records per chunk. Thus, for many simple scenes neither will need to be extended. This is an important factor in performance, since the process of extension is costly and we therefore want to do it as rarely as possible. The speed at which the vertex buffer can be traversed during primitive construction would also be crippled if it were necessary to stop and change to a new VB chunk every few vertices.

5 Primitive construction

I now present a walk through the process of primitive construction, in an attempt to explain in detail the operation of the vertex and render buffers.

Imagine that SYS "GL_Begin",5 has been called, signalling the start of a triangle strip. VB write pointer is the location where the first triangle strip vertex will be written, and VB write limit points to the end of the current VB chunk. The initial values of these pointers are saved in VB read pointer and VB read limit, for later use during primitive construction.

There follow six calls to SYS "GL_Vertex" - the coordinates and colour data of the vertices are unimportant in the context of this explanation. As each vertex arrives, it is stored in the vertex buffer at VB write pointer, which is then incremented for the next vertex.

When VB write pointer meets VB write limit (after vertex 5), a new VB chunk is allocated and the address is stored in the link word at VB write limit. When VB write pointer and VB write limit have been set up for the new chunk, buffering of vertices continues until GL_End is called shortly afterwards.

The following diagram shows the state of the vertex and render buffers and their associated pointers as they would appear at this point in time:

The vertex and render buffers and associated pointers as they appear at the beginning of primitive construction, e.g. on entry to GL_End

When GL_End is called, the buffered vertices are read sequentially from VB read pointer. Triangle primitives are constructed from this stream of vertices and recorded in the render buffer with pointers back to their corner vertices.

When VB read pointer reaches VB read limit, it is reset to the base of the second chunk using the link address stored at VB read limit. We detect that we have reached the 'current' (final) vertex buffer chunk, because VB read pointer + VB chunk size = VB write limit. Therefore, VB read limit is set to VB write pointer rather than the end of the chunk.

Construction of triangle primitives continues using vertex 6 from the second VB chunk, until VB read pointer reaches VB read limit for the second time. We detect that we have reached the end of the stored vertices rather than the end of the current block, because VB read limit = VB write pointer.

The following diagram shows the state of the vertex and render buffers and their associated pointers as they would appear at this point in time:

The same imaginary scenario as it would appear at the close of primitive construction, e.g. on exit from GL_End

6 Depth sort

The painter's algorithm requires surfaces to be sorted by depth prior to rasterisation, so that surfaces can be painted to the screen in decreasing depth order.

A test program showing the sorting of many graphics primitives by depth

Because of the serious impact that an inefficient sort (such as bubble sort) would have on the system's performance, I did some research on the relative performance and suitability of different sorting algorithms. I found that the two algorithms most often recommended for depth sorting were 'heap sort' and 'quick sort'.

According to Todd Sundsted [23] "The most efficient, general purpose sort is one based on C.A.R. Hoare's quicksort algorithm." Robert Ohannessian disagrees, favouring heap sort over quick sort: "The Quick Sort, however, is less than optimal as we shall see... Unlike Quick Sort, [heap sort's] worst case is O(n log n), making it very, very useful." [22]

"Quicksort is one of the fastest and simplest sorting algorithms. It works recursively by a divide-and-conquer strategy."[24] One of the values in the array is chosen as the 'pivot' - all values less than the pivot are moved to its left, and all values greater than it to its right. The algorithm is then applied recursively to the left and right halves of the array, yielding smaller and smaller subsets until the whole array is sorted.

Quick sort's best case time complexity is O(n log n), obtained when each partitioning produces two parts of equal length. In the worst case scenario (where the partitioning is maximally unbalanced) the complexity is O(n^2) - as bad as bubble sort.

Heap sort is based upon a left-complete binary tree, where each node has a value greater than that of its children. When this is true, the tree is a 'heap', and the process of making sure that each parent node is greater than its children is called 'heapification'. This structure can be organised in memory as a conventional array by laying the tree out flat, row by row. Once the array contents have been 'heapified', the root node (guaranteed to be the largest value) is removed to a section of the array that is designated 'sorted'. The remainder of the array is then 'reheapified' so that the new root node is now the second largest value... this cycle of removal of the root node and reheapification of the remainder is repeated until all the values are sorted.

Given that timing analysis shows quick sort's best case to be only as good as heap sort's worst case, when I wrote an optimised implementation of heap sort I fully expected it to outperform the non-optimised quick sort, which I had compiled with minimal attention from some example C source code.

I was therefore rather surprised to find that quite the reverse was true, as shown by the following timings on data sets of different sizes. The test data was an array of 32-bit integers generated randomly for each comparative test run. The same test data was sorted by each algorithm to ensure a fair comparison, and a subset of the sorted set was output to show that the two algorithms were producing consistent results.

Number of elements   Quick sort   Heap sort
1,000                0.01s        0.01s
100,000              1.49s        2.91s
1,000,000            27.86s       37.80s
2,000,000            70.76s       72.93s
3,000,000            149.44s      122.76s

As can be seen from these timings (each averaged over several runs), the heap sort did not even begin to outperform the quick sort until there were more than 2 million elements in the array.

I did eventually find another voice to give credence to my findings - "There are sorting algorithms with a time complexity of O(n log n) even in the worst case, e.g. Heapsort and Mergesort. But on the average, these algorithms are by a constant factor slower than Quicksort." [24] The explanation appears to be that although heap sort has the better worst-case complexity, it moves more data: re-ordering the array causes more swaps, and its access pattern has poor locality of reference. It can be imagined that this has a major impact on performance, particularly since the test system was a 200 MHz processor coping with a 16 MHz memory bottleneck.

I felt that this deficiency in the heap sort algorithm would be compounded when swapping real data (i.e. render buffer records, at 32 bytes each) rather than plain integer values. Also, the likelihood of the render buffer ever containing 2 million primitives is virtually nil. Everything pointed to quicksort as the better choice.

7 Rasterisation

First, various information about the current screen mode and the state of the VDU drivers is read, by calling OS_ReadVduVariables with a pointer to a list of variable numbers. This tells us the configuration of the frame buffer, its address in memory, the screen resolution, and the size in bytes of the whole screen and of an individual screen line. The bounding coordinates of the graphics window and the position of the graphics origin are also obtained.

If the current screen mode is inappropriate for rendering, e.g. it is a teletext or other non graphics mode, or a graphics mode with less than 8 bits per pixel, then the module returns a polite error message "Cannot plot in this screen mode". The flashing text cursor is removed from the screen for the duration of the rendering process, since this might interfere with graphics plotting.

7.1 Lines

The obvious way to draw lines is to calculate the line's gradient (requiring a division) and use fractional x/y increments. However, for long line segments accumulated rounding errors "can cause the calculated pixel positions to drift away from the true line path", and furthermore the necessary gradient division and floating-point arithmetic are costly. [2]

Instead, the GL module plots lines using an accurate and efficient algorithm developed by Bresenham, that uses only integer calculations. Sampling a line at unit intervals, an integer 'decision parameter' p_k determines for a given pixel (x_k, y_k) whether the next pixel to plot should be (x_k+1, y_k) or (x_k+1, y_k+1).

Successive values of p_k are obtained recursively using incremental integer calculations. The algorithm for lines with a positive gradient < 1 may be summarised as follows, where dx and dy are the horizontal and vertical distances between the line's endpoints (from Hearn and Baker):

Plot the first point (x_0, y_0), with initial decision parameter p_0 = 2dy - dx.

For each x_k along the line, starting at k = 0:

If p_k < 0, next point to plot is (x_k+1, y_k), and p_k+1 = p_k + 2dy

Otherwise, next point to plot is (x_k+1, y_k+1), and p_k+1 = p_k + 2dy - 2dx
This algorithm is generalised to lines with arbitrary slope by selecting one of four plotting routines, depending on the relative positions of the line endpoints A and B:

Two signed differences dx = xB - xA and dy = yB - yA are calculated, in order to determine the direction of the line.

Firstly, if B is left of A (i.e. dx < 0) then the two endpoints are reversed. This immediately eliminates one half of the eight possible octants in which a line can lie (the shaded area on the diagram below), by guaranteeing that all lines are drawn from left to right (A to B).

If B is higher than or level with A (i.e. dy >= 0) then the line is identified as having a positive slope: line_pos_shallow is called if the gradient is 1 or less (i.e. dy <= dx), otherwise line_pos_steep is called.

If B is lower than A (i.e. dy < 0) then the line is identified as having a negative slope: line_neg_shallow is called if the gradient is -1 or greater (i.e. -dy <= dx), otherwise line_neg_steep is called.

Selection of Bresenham plotting code, depending on line slope

There is a reason that I implemented four separate plotting routines rather than a single routine for x-major lines and one for y-major lines, both generalised to accept signed increments:

The procedure for clipping lines to the graphics window is different, depending on whether the line is being drawn upwards or downwards. For negative slope lines (those in the bottom right quadrant of the diagram) plotting terminates when the line passes the bottom or righthand side of the clipping window, whereas for upward facing lines (in the top right quadrant) line plotting terminates when the line passes the top or righthand side of the clipping window.

Output from a test program that allows the vertices of a 3-segment line loop construct to be moved about the screen. The lines have been clipped to the current graphics window.

7.2 Flat shaded triangles

The first step in triangle rasterisation is to find which of the corner vertices A, B and C has the lowest y coordinate. The vertices are rearranged if necessary to ensure that vertex A is lowest (as with lines, this allows greater use of generic code). Vertex A will be the starting point for plotting the triangle.

Next, plane equations are used to determine which of the two higher vertices B and C is 'left' and which is 'right', relative to each other. With the y axis pointing up, this reduces to the sign of a 2-D cross product:

(yB - yA)(xC - xA) - (xB - xA)(yC - yA) > 0

If the above expression evaluates as false, then vertex B is not leftmost and it is swapped with vertex C. We now know the endpoints of the initial lefthand (A to B) and initial righthand (A to C) edges of the triangle, which allows us to calculate the left and right gradients (as fractional x increments per scanline).

The final step before we can draw the triangle is to find which of vertices B and C is the 'middle' vertex and which is the 'top' vertex. This is established by a simple comparison of y coordinates.

Scan conversion of the lower half of a triangle, by flat_tri_segment

The bottom and top halves of the triangle are drawn separately by calling a generalised routine flat_tri_segment, which plots successive pixel spans of a polygon between any two y-major edges. The necessary parameters are a colour, upper & lower y bounds and the origin and gradient of the segment edges. The edges are specified as initial left x and right x coordinates and a fractional increment to be applied to each side per unit y. For fractional accuracy, all x coordinate calculations use fixed-point q=12 form (accurate to 1/4096 of a pixel).

The bottom half is plotted from yA up to yB if B is the 'middle' vertex, otherwise from yA up to yC. Upon reaching a height level with the 'middle' vertex, a new gradient is calculated for whichever edge, left or right, spans from the 'middle' vertex to the 'top' vertex. (The righthand edge C to B in the example above.) Finally, flat_tri_segment is called again to plot the top half of the triangle.

7.3 Gouraud shaded triangles

Hearn and Baker: "This intensity-interpolation scheme, developed by Gouraud and generally referred to as Gouraud shading, renders a polygon surface by linearly interpolating intensity values across the surface." [2]

If perspective-correct shading is not required then colours may be calculated iteratively using a fixed increment for red, green and blue colour components. Perspective-correct interpolation of either colours or texture coordinates across a polygon surface is very much more costly, so it is fortunate that linear approximation is generally adequate where individual surfaces do not span large z distances.

My Gouraud triangle renderer was implemented by extending the basic mono-colour triangle-plotting routine I had implemented before, with a lot of extra code to calculate fractional increments for colour components across each pixel span, and along each outside edge between vertex colours.

I managed to cut down on the number of individual divisions necessary, by using the technique described by ARM Ltd for instead multiplying by the reciprocal of the divisor [8]: Having calculated 1/dy (the reciprocal of the edge's height in scanlines) in fixed point q=16 form, I can multiply by the red difference to get the red increment per line, multiply by the x difference to get the x increment per line... and similarly for the green and blue components.

Another clever trick used is that of packing the intensity value and increment-per-line for a colour component into a single register. Each part is stored in fixed point q=8 form, with the increment occupying the top 16 bits of the intensity value. This massively reduces the frequency of variable spillage to memory, and hence the complexity of the whole routine. Updating the edge colour components upon moving up a scanline is reduced to:

ADD r0,r0,r0,ASR#16
ADD r1,r1,r1,ASR#16
ADD r2,r2,r2,ASR#16 ; inc left colour by left colour gradient

Because pixel colours vary across the triangle, it is no longer possible to call ColourTrans just once before plotting, to obtain a single colour number for an 8bpp screen mode. Fortunately a scheme has been implemented to avoid the alternative of calling ColourTrans_ReturnColourNumber for every pixel plotted.

The 24bpp palette entry and corresponding colour number are retained from the previous pixel plotted. If the new palette entry calculated from the colour components and increments matches the previous colour, then the old colour number is used again. Thus when plotting a shaded span of pixels ColourTrans is only called once per change of colour:

CMP r11,lr ; same as last palette entry?
  BEQ gouraud_hozline_8bpp_plot ; yes
MOV lr,r11 ; remember new palette entry
MOV r0,r11 ; look up colour nearest new palette entry
SWI "ColourTrans_ReturnColourNumber"
; returns r0 = new colour number

STRB r0,[r8,r6] ; plot pixel

(A far preferable solution would be to have an internal colour lookup table that could be consulted as frequently as necessary, at much less than the cost of a SWI call. This has not yet been implemented, however.)

Gouraud shading is inevitably slower than flat shading, because of the extra divisions required to determine colour gradients across pixel spans, and because holding colour components and their increments in addition to all the other working data puts a severe strain on register allocation.

Output from a test program that allows the vertices of a gouraud-shaded triangle to be moved about the screen, and the R,G,B colours at each vertex to be updated. The triangle has been clipped to the current graphics window.

After all the primitives in the render buffer have been rasterised, both the render buffer and the vertex buffer are deallocated. In the case of the render buffer this is a simple matter of freeing the block. In the case of the vertex buffer, the linked list of VB chunks must be traversed from the bottom upwards, freeing each block in turn.

8 Optimisation

The most general form of optimisation performed was careful allocation of register usage, so that variables are not spilled to memory if at all possible. If variables must be spilled to memory then preferably they should be the values that are accessed relatively infrequently.

A C compiler allocates a fixed amount of stack space for local non-register variables on function entry. These non-register variables are then loaded and stored as needed, using individual LDR/STR instructions.

When hand-coding assembly language you tend to use the stack in a more conventional LIFO fashion. A common action is to push whole groups of variables onto the stack in a single motion, for the duration of some subtask. This can be done relatively cheaply using a single STM instruction, giving the following benefits over a sequence of individual STR instructions: "The code is smaller (and thus will cache better...", and "...an instruction fetch cycle and a register copy back cycle is saved for each LDR or STR eliminated." [9]

9 Division

Because the ARM processor has no divide instruction, a division routine is a common feature of most RISC OS programs of any complexity. Essentially these come in two varieties - rolled and unrolled. In unrolled implementations the inner loop is iterated at assembly time rather than at run time. Replicating the body of the loop results in very much larger output, but the branch back to the beginning of the loop is eliminated.

I decided against using an unrolled division loop for two reasons - firstly, I was mindful of the warning in the ARM cookbook that "it doesn't always gain anything, and in some situations can reduce performance" [9], and secondly, an unrolled loop would make the code too bulky for inclusion in a macro38.

I wanted to use a division macro rather than a real subroutine not so much because of the cost of the call (a branch is relatively cheap on the ARM), but because of the indirect cost of marshalling arguments into specific registers (which may already be in use for other purposes), and of the subroutine saving any registers that it corrupts, ignorant of whether or not this is actually necessary.

Use of a macro allowed me more control over which registers were the divisor and dividend, which was the destination register for the quotient, and which registers could safely be corrupted by the operation. Given the choice, I found that I could invariably manage to find a couple of scratch registers without resorting to using the stack as a subroutine would do.

MACRO unsigned_divide,Rquotient,Rdividend,Rdivisor,Rtemp
  CMP   Rdivisor,#0                   ; check for divide-by-zero
  MVNEQ Rquotient,#&80000000          ; largest signed 32-bit number
  BEQ   divide_end
  MOV   Rtemp,Rdivisor                ; Put the divisor in Rtemp.
  CMP   Rtemp,Rdividend,LSR #1
divide_loop1
  MOVLS Rtemp,Rtemp,LSL #1            ; Then double it until
  CMP   Rtemp,Rdividend,LSR #1        ; 2 * Rtemp > dividend.
  BLS   divide_loop1
  MOV   Rquotient,#0                  ; Initialise the quotient.
divide_loop2
  CMP   Rdividend,Rtemp               ; Can we subtract Rtemp?
  SUBCS Rdividend,Rdividend,Rtemp     ; If we can, do so.
  ADC   Rquotient,Rquotient,Rquotient ; Double quotient and add new bit.
  MOV   Rtemp,Rtemp,LSR #1            ; Halve Rtemp,
  CMP   Rtemp,Rdivisor                ; and loop until we've gone
  BHS   divide_loop2                  ; past the original divisor.
divide_end
MEND

Since there is no point re-inventing the wheel, my division macro is based upon the divide-and-remainder code given in the Acorn Assembler manual [4]. At only fifteen instructions (for unsigned division), it is not too onerous for the code to be included every time it is used.
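The shift-and-subtract algorithm translates directly into C. The sketch below (the function name is mine, not the module's) mirrors the macro's structure, including the divide-by-zero result of the largest signed 32-bit number:

```c
#include <assert.h>
#include <stdint.h>

/* C model of the macro's shift-and-subtract division. The variable temp
   plays the role of Rtemp: the divisor is doubled until it dominates the
   dividend, then one quotient bit is produced per halving. */
uint32_t unsigned_divide(uint32_t dividend, uint32_t divisor)
{
    if (divisor == 0)
        return 0x7FFFFFFFu;          /* largest signed 32-bit number */

    uint32_t temp = divisor;
    uint32_t quotient = 0;

    /* Double the divisor until 2 * temp > dividend. */
    while (temp <= (dividend >> 1))
        temp <<= 1;

    /* One quotient bit per iteration; dividend ends up as the remainder. */
    do {
        quotient <<= 1;
        if (dividend >= temp) {
            dividend -= temp;
            quotient |= 1;           /* the ADC picks up this carry bit */
        }
        temp >>= 1;
    } while (temp >= divisor);

    return quotient;
}
```

The number of iterations of the second loop equals the number of significant bits in the quotient, which is why the rolled version remains acceptably fast for modest operands.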

10 Debugging and execution trace

It is actually quite difficult to print out debugging and trace information from an assembly language program compared to, say, calling printf() from a C program. The operating system provides a number of SWIs that aid assembly language programmers in this task, chiefly OS_Write0 (write text string pointed to by R0), OS_WriteS (write embedded text string) and OS_NewLine (write line feed/carriage return).

The OS_ConvertHex8 and OS_ConvertInteger4 SWIs are also invaluable for converting integer values into strings for output to the screen, but there is unfortunately no equivalent call that will output values with a fractional component. This proved to be a major stumbling block, since fixed-point fractional values are predominant in my GL module.

The solution eventually adopted is quite grotesque: the fixed-point value to be output is first transferred to a floating-point register and divided by 2^q, then stored to memory in packed decimal format (STFP). A complex subroutine is invoked to laboriously print the fractional value in standard form, character by character:

STMFD sp!,{lr}         ; safeguard R14_SVC
SFM f0,2,[sp,#-2*12]!  ; preserve f0 and f1
FLTS f0,regno          ; move fixed-point value to a floating-point register
LDFS f1,conv_divisor
FDVS f0,f0,f1          ; convert to q=0 form
BL debug_printdouble   ; convert to packed decimal and print it
LFM f0,2,[sp],#2*12    ; restore f0 and f1
LDMFD sp!,{lr}
B continue             ; skip over the embedded literal

conv_divisor: DCFS float(1<<q)

continue:
SWI "OS_NewLine"
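An integer-only conversion would also have been possible. The following sketch (the value of Q and the function name are my own assumptions, not taken from the module) scales the fraction bits up to four decimal places without touching floating point:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define Q 16   /* assumed number of fraction bits; the module's q may differ */

/* Format a signed q-format fixed-point value to 4 decimal places using
   integer arithmetic only: split off the whole part, then scale the
   fraction bits by 10000 / 2^Q (truncating). */
void format_fixed(int32_t value, char *buf, size_t size)
{
    uint32_t mag   = (value < 0) ? (uint32_t)-(int64_t)value : (uint32_t)value;
    uint32_t whole = mag >> Q;
    uint32_t frac  = mag & ((1u << Q) - 1u);
    uint32_t dec   = (uint32_t)(((uint64_t)frac * 10000u) >> Q);

    snprintf(buf, size, "%s%lu.%04lu", value < 0 ? "-" : "",
             (unsigned long)whole, (unsigned long)dec);
}
```

The 64-bit intermediate avoids overflow in the multiply; on the ARM6 the same scaling would need a shift-and-add sequence or the long-multiply instructions mentioned earlier.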

Obviously, frequently typing out complex debugging code such as that shown above would have been a complete nightmare, almost certainly introducing more bugs than it would have enabled me to find. Therefore, all the common debugging/trace operations are encapsulated in helpfully named macros such as tracelabel, printdec, printhex and printfxp.

The use of macros allowed me to output large amounts of verbose debugging and trace information, whilst maintaining relatively clean-looking and readable code. By bracketing each macro definition with IFDEF trace... ENDIF assembler directives, all that is required to produce a clean version of the module is to undefine a single symbol.
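One such macro might look like the sketch below. This is illustrative only - the buffer label and register shuffling are my own; the SWI register interfaces (OS_ConvertHex8 takes the value in R0, a buffer pointer in R1 and the buffer size in R2, returning the string pointer in R0) are as documented in the PRMs:

  IFDEF trace
  MACRO printhex,reg
    STMFD sp!,{r0-r2,lr}    ; preserve registers the SWIs corrupt
    MOV   r0,reg
    ADR   r1,hexbuffer      ; scratch buffer for the conversion
    MOV   r2,#12
    SWI   "OS_ConvertHex8"  ; R0 now points at the terminated string
    SWI   "OS_Write0"       ; print it
    SWI   "OS_NewLine"
    LDMFD sp!,{r0-r2,lr}
  MEND
  ENDIF

When the trace symbol is undefined, the macro definition (and hence every expansion of it) simply vanishes from the assembled output.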

The only real blot on this picture is that ASM does not appear to accept text strings as macro parameters. Fortunately, OS_WriteS allows plain text strings to be output without too much pain, mainly due to the fact that invocation requires no registers (the string to print is embedded in the program code right after the SWI instruction).

27  Acorn's Desktop Debugging Tool.

28  ARM Object Format, the standard intermediate format for code output by language processors.

29  Whilst Nemo (comp.sys.acorn.programmer) gives a method of creating ObjAsm constructions similar to the structures explicitly supported by ASM, the equivalent ObjAsm syntax appears more complicated.

30  ARM Image Format, the usual format for executable program code under RISC OS.

31  These benchmarks were measured using SICK v0.94, a reputable system analyser and speed indexer.

32  Note that this is distinct from the ARM7M, which first introduced the long multiply instructions supported by the StrongARM.

33  Dynamic areas are large areas of memory such as the free pool, the module area and the screen memory, for which the mapping of logical to physical addresses is managed by hardware. Thus, their size can be changed transparently without the base address appearing to change.

34  For the sake of clarity, I have drawn the link between primitives and vertices using arrowed lines going from vertex buffer to render buffer. Whilst this may be thought of as the direction of data flow, the actual pointer reference points in the other direction.

35  The speed of sorting can vary considerably depending on the distribution of test values in a particular data set, and how unsorted the set is. This is especially true for quicksort, where choosing the wrong pivot will cause all the data to be moved.

36  Especially given the ARM's lack of a divide instruction - the architecture was designed for normal programs where divisions are rarely necessary, not for rendering perspective-correct polygons at one division-per-pixel!

37  Last-In-First-Out. The conventional stack model where items enter and leave only at the head of the stack, via 'push' and 'pull' operations. Items further down the stack are temporarily inaccessible until any items pushed on after them have been pulled off.

38  There was also the 'minor consideration' that ASM doesn't actually support repetitive assembly(!) This is probably symptomatic of the fact that unrolling loops has gone out of fashion.
