Useful programming tips

Revision as of 12:41, 30 May 2020 by Darc

The Dreamcast's CPU, model SH7091, is virtually identical to the Renesas SH7750 series of SH4 CPUs. As such, anything that would normally apply to the SuperH-4 architecture applies here. Given that the SH4 is a processor from 1998, many programming paradigms that we have grown accustomed to on more recent x86 and ARM64 processors either do not apply or behave much more primitively on the SH4. For example, SH4 does not have branch prediction, speculative execution, or multiple cores, but it does have a 64-bit floating point unit,[1] a couple of 128-bit vector operations on 4x packed 32-bit floats,[2] a memory management unit (MMU), and a direct memory access controller (DMAC).

In truth, at a very basic level the SH4 architecture is fundamentally not that different from these other, more mainstream architectures (in fact, ARM Thumb is based on SuperH[3]), so programming on the SH4 does not require much in the way of "re-learning" how to do things, especially since the Dreamcast uses it in little endian mode exclusively. Mainly, SH4 programming just requires paying a lot more attention to things that modern architectures have made very convenient, like data alignment, cache management, and pipelining.

This page is a collection of programming tips and tricks to help with optimizing programs to make full use of the SH4 CPU. It is not meant to be a substitute for reading the SH7750 series hardware and software manuals; rather, it should be seen as an additional reference based on experience working with the chip (in fact, certain terms and hardware-specific concepts assume familiarity with those manuals). Both manuals, "SH7750, SH7750S, SH7750R Group User's Manual: Hardware" and "SH-4 Software Manual," can be downloaded from the "Documents" section of any SH7750 series processor's product page on Renesas's website. The SH4 C ABI, "RM0197: SH-4 generic and C specific application binary interface," is incredibly handy, too; it can be found by searching for RM0197 on STMicroelectronics's website.

A very convenient SuperH assembly reference can be found here, as well:

This page refers to the documents as follows:

  • SH7750 Hardware Manual: "SH7750, SH7750S, SH7750R Group User's Manual: Hardware"
  • SH7750 Software Manual: "SH-4 Software Manual"
  • SH4 C ABI: "RM0197: SH-4 generic and C specific application binary interface"

Data Alignment


(Refer to: SH7750 Hardware Manual, Section 5 "Exceptions")

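The SH4 lives and dies by alignment: it strictly requires data in memory to be aligned according to its type (2-byte accesses on 2-byte boundaries, 4-byte accesses on 4-byte boundaries, and so on). A misaligned access raises an address error exception, which usually shows up as a crash.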

Take the following example, which defines a packed structure aligned to 4 bytes:

typedef struct __attribute__ ((packed, aligned(4))) {
	unsigned char id[4];
	unsigned int address;
	unsigned int size;
	unsigned char data[]; // Flexible array member
} command_t;

Accessing data from this struct is simple: because the struct is aligned to 4 bytes, each 4-byte member can be read with a single 4-byte access.

Doing this:

unsigned int cmd_addr = ntohl(command->address);
unsigned int cmd_size = ntohl(command->size);

Produces this output (GCC 9.2.0):

02e4 862F mov.l r8,@-r15
02e6 4365 mov r4,r5
02e8 962F mov.l r9,@-r15
02ea 0C75 add #12,r5
02ec 224F sts.l pr,@-r15
02ee 4159 mov.l @(4,r4),r9
02f0 4256 mov.l @(8,r4),r6
02f2 12D0 mov.l .L71,r0
02f4 9869 swap.b r9,r9
02f6 6866 swap.b r6,r6
02f8 11D4 mov.l .L72,r4
02fa 6966 swap.w r6,r6
02fc 9969 swap.w r9,r9
02fe 9869 swap.b r9,r9
0300 6868 swap.b r6,r8

But what if it weren't aligned to 4 bytes? Just this:

typedef struct __attribute__ ((packed)) {
	unsigned char id[4];
	unsigned int address;
	unsigned int size;
	unsigned char data[]; // Flexible array member
} command_t;

Accessing the data looks like this, in that case (GCC 9.2.0):

03f0 862F mov.l r8,@-r15
03f2 4365 mov r4,r5
03f4 962F mov.l r9,@-r15
03f6 0C75 add #12,r5
03f8 224F sts.l pr,@-r15
03fa 4484 mov.b @(4,r4),r0
03fc 0C63 extu.b r0,r3
03fe 4584 mov.b @(5,r4),r0
0400 0C61 extu.b r0,r1
0402 4684 mov.b @(6,r4),r0
0404 1861 swap.b r1,r1
0406 0C60 extu.b r0,r0
0408 3B21 or r3,r1
040a 2840 shll16 r0
040c 0B21 or r0,r1
040e 4784 mov.b @(7,r4),r0
0410 2840 shll16 r0
0412 1840 shll8 r0
0414 1B20 or r1,r0
0416 0869 swap.b r0,r9
0418 4884 mov.b @(8,r4),r0
041a 9969 swap.w r9,r9
041c 0C63 extu.b r0,r3
041e 4984 mov.b @(9,r4),r0
0420 9869 swap.b r9,r9
0422 0C62 extu.b r0,r2
0424 4A84 mov.b @(10,r4),r0
0426 2862 swap.b r2,r2
0428 0C60 extu.b r0,r0
042a 3B22 or r3,r2
042c 2840 shll16 r0
042e 0B22 or r0,r2
0430 4B84 mov.b @(11,r4),r0
0432 12D4 mov.l .L73,r4
0434 2840 shll16 r0
0436 1840 shll8 r0
0438 2B20 or r2,r0
043a 0860 swap.b r0,r0
043c 0961 swap.w r0,r1

All of this is just from this simple operation:

unsigned int cmd_addr = ntohl(command->address);
unsigned int cmd_size = ntohl(command->size);

What's going on here is that GCC is avoiding the address alignment crash that could occur from reading 1-byte-aligned data with a 4-byte load. Struct packing on its own aligns to 1 byte, so GCC has to assume the struct may start at any address and must do the following to build each unsigned 4-byte integer from 1-byte accesses:

mov.b, zero-extend
mov.b, zero-extend, shift, or
mov.b, zero-extend, shift, or
mov.b, zero-extend, shift, or

(Note: all the swap instructions come from ntohl, as this code is from a network driver that needs to byte swap data after receiving a network transmission.)
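
For reference, the byte-by-byte sequence above corresponds roughly to the following C. This is only a sketch of what the compiler is doing on the programmer's behalf; the helper name load_u32_unaligned is made up for illustration and does not come from the original driver.

```c
#include <stdint.h>

/* Hypothetical helper: assembles a little-endian 32-bit value from four
 * single-byte loads, so it is safe at any address regardless of alignment. */
static uint32_t load_u32_unaligned(const unsigned char *p)
{
    uint32_t value = p[0];            /* mov.b, zero-extend            */
    value |= (uint32_t)p[1] << 8;     /* mov.b, zero-extend, shift, or */
    value |= (uint32_t)p[2] << 16;    /* mov.b, zero-extend, shift, or */
    value |= (uint32_t)p[3] << 24;    /* mov.b, zero-extend, shift, or */
    return value;
}
```

Four loads plus the shifts and ORs replace what would otherwise be one mov.l, which is the cost visible in the listing.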

Considering that a single mov.b takes the same amount of time as a single mov.l, and that all the other operations needed to build 4-byte data out of 1-byte accesses come on top of that, it's easy to see how big the performance hit from mismanaging alignment can be!
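
One way to catch this kind of mistake early is to have the compiler check the alignment of the two struct variants. The sketch below assumes GCC (or a compatible compiler) and C11; the type names command_aligned_t and command_packed_t and the is_aligned helper are illustrative, not part of any existing codebase.

```c
#include <stddef.h>
#include <stdint.h>

/* Same members as the two variants shown earlier; names are illustrative. */
typedef struct __attribute__ ((packed, aligned(4))) {
    unsigned char id[4];
    unsigned int address;
    unsigned int size;
    unsigned char data[]; /* Flexible array member */
} command_aligned_t;

typedef struct __attribute__ ((packed)) {
    unsigned char id[4];
    unsigned int address;
    unsigned int size;
    unsigned char data[]; /* Flexible array member */
} command_packed_t;

/* Compile-time checks: the build fails if the alignment is not as expected. */
_Static_assert(_Alignof(command_aligned_t) == 4, "struct should be 4-byte aligned");
_Static_assert(_Alignof(command_packed_t) == 1, "packed alone drops alignment to 1");
_Static_assert(offsetof(command_aligned_t, address) % 4 == 0, "member is 4-byte aligned");

/* Run-time check for pointers that arrive from elsewhere (e.g. a receive buffer). */
static int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p % alignment) == 0;
}
```

Both structs have identical member offsets; only the alignment the compiler is allowed to assume differs, and that alone decides between the short and the long instruction sequence.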

Cache Management

(Refer to: SH7750 Hardware Manual, Section 4 "Caches" and SH7750 Software Manual, Section 9 "Instruction Descriptions")

Unlike modern processors, where caches are several megabytes in size and can therefore hold entire programs, the SH4 in the Dreamcast only has a 16kB data cache and 8kB instruction cache. Consequently, cache management is very important in order to achieve maximum performance. As is always true of cache optimization, write-back memory mode is required to make much use of it (it's used everywhere by default in KallistiOS and enabled for P0/U0/P3--but not P1--in DreamHAL's startup file).

Half of the data cache can also be used as a form of high-speed RAM (referred to as OCRAM), but in most cases programs should stick to using the full cache size for cache purposes. The SH4 uses a direct-mapped cache, meaning that each memory address maps to exactly one cache entry, and addresses that are a multiple of 16kB apart (half that when used in OCRAM mode) share the same entry. Cache thrashing can therefore happen when trying to do something like copy data from some address to a destination address that is an integer multiple of the cache size away from that address (e.g. source = address 8 and destination = address 16kB + 8).
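
The conflict described above can be checked with a little arithmetic. The following is a minimal sketch, assuming the full 16kB data cache with 32-byte cache blocks and the default indexing mode (no OCRAM, no index option bits set); the macro and function names are made up for illustration.

```c
#include <stdint.h>

#define OC_SIZE_BYTES   (16u * 1024u)                    /* data cache size     */
#define OC_LINE_BYTES   32u                              /* one cache block     */
#define OC_NUM_ENTRIES  (OC_SIZE_BYTES / OC_LINE_BYTES)  /* 512 entries         */

/* Which cache entry an address lands in (assumed default indexing). */
static uint32_t oc_entry_index(uint32_t address)
{
    return (address / OC_LINE_BYTES) % OC_NUM_ENTRIES;
}

/* Two addresses fight over one entry when they are in different 32-byte
 * blocks of memory but map to the same cache entry. */
static int oc_lines_conflict(uint32_t a, uint32_t b)
{
    return oc_entry_index(a) == oc_entry_index(b)
        && (a / OC_LINE_BYTES) != (b / OC_LINE_BYTES);
}
```

With the example from the text, oc_lines_conflict(8, 16 * 1024 + 8) is true, so a copy loop alternating between those two addresses would evict its own data on every access. Offsetting one buffer by at least one cache block avoids the collision.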

The SH4 provides the following instructions in addition to the two 32-byte "store queues" (SQs) to make efficient use of the cache:

  • movca.l: Store register data to cache, if there's a cache miss just allocate a cache block and write to it without first reading that cache block from memory
  • ocbp: Purge cache block; write back cache block and invalidate it
  • ocbi: Invalidate cache block without writing it back
  • ocbwb: Write cache block back to external memory, and keep it in the cache
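
As an illustration of how these instructions tend to be used from C, here is a hedged sketch that writes back every cache block overlapping a buffer, for example before handing that buffer to DMA. It assumes GCC inline assembly, that the compiler defines __sh__ when targeting SuperH, and a 32-byte cache block; the function names are hypothetical and are not taken from KallistiOS or DreamHAL. On non-SH hosts the instruction is replaced by a no-op so the code still compiles.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_BLOCK_BYTES 32u

/* Write one cache block back to memory and keep it in the cache. */
static inline void cache_block_writeback(const void *block)
{
#if defined(__sh__)
    __asm__ __volatile__ ("ocbwb @%0" : : "r" (block) : "memory");
#else
    (void)block; /* No-op when not building for SuperH */
#endif
}

/* Hypothetical helper: write back all cache blocks that overlap
 * [start, start + len). Returns the number of blocks touched. */
static size_t cache_writeback_range(const void *start, size_t len)
{
    if (len == 0)
        return 0;

    /* Round down to the start of the first cache block. */
    uintptr_t addr = (uintptr_t)start & ~(uintptr_t)(CACHE_BLOCK_BYTES - 1u);
    uintptr_t end = (uintptr_t)start + len;
    size_t blocks = 0;

    for (; addr < end; addr += CACHE_BLOCK_BYTES) {
        cache_block_writeback((const void *)addr);
        blocks++;
    }
    return blocks;
}
```

Swapping ocbwb for ocbp or ocbi in the inline assembly gives the purge and invalidate variants. Note that ocbi discards dirty data, so it is only appropriate when the cached contents are known to be stale.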

C Function Register Allocation

(Refer to: SH4 C ABI)

(Note: when using GCC 9.x at various optimization levels, like -O3, it tries its best to coalesce output code into this format wherever it can. Of course, if GCC is able to inline a function, parameter-passing becomes a moot point.)

The SH4 C ABI specifies that 4 integers (r4-r7) and 8 floats (fr4-fr11) can be passed in registers as function call arguments, and that r0-r3 and fr0-fr3 are also call-clobbered. Passing arguments in registers means that functions can take 4 integers and 8 floats without forcing arguments to be pushed on the stack, saving the cycle penalties that would otherwise occur from stack pushes and associated memory accesses. The call-clobbering of r0-r3 and fr0-fr3 means that those can be used as 4 integer local variables and 4 float local variables, as well. Additionally, any of the argument registers not used for parameters can be repurposed for local variables, so a function that only takes 4 floats can define 4 more float locals on top of the 4 provided by fr0-fr3, and they will simply use the unused argument registers.
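
As a small illustration of the above, the function below takes exactly 4 integers and 8 floats, which is the most that fits in the argument registers. The function itself is made up for this example, and which specific register each argument ends up in depends on the compiler and floating-point mode, so the comments only name the register ranges given by the ABI.

```c
/* Hypothetical example: 4 integer + 8 float parameters, the maximum that
 * can be passed entirely in registers (r4-r7 and fr4-fr11) per the ABI. */
static float sum_scaled(int a, int b, int c, int d,
                        float f0, float f1, float f2, float f3,
                        float f4, float f5, float f6, float f7)
{
    /* Locals like these are candidates for the call-clobbered
     * r0-r3 / fr0-fr3 registers, so no stack traffic is required. */
    int isum = a + b + c + d;
    float fsum = f0 + f1 + f2 + f3 + f4 + f5 + f6 + f7;

    return fsum * (float)isum;
}

/* Adding a fifth integer or a ninth float parameter would exceed the
 * argument registers, and the extra argument would go on the stack. */
```

When a function needs more data than this, passing a pointer to a struct in one of r4-r7 is often cheaper than spilling individual arguments.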

Pipelining and Instruction-Level Parallelism

(Refer to: SH7750 Hardware Manual, Section 8 "Pipelining")

This section is really only relevant when writing assembly. If you write code in a high-level language like C/C++, compilers [try to] take care of this for you and there isn't much you can do about it.

Because of the SH4's dual-issue superscalar design, the CPU preloads two instructions at once, and under the right circumstances these instructions can be executed in parallel. The SH4 architecture organizes various instructions into "instruction groups," and parallel execution primarily occurs when two instructions of different groups are issued together. There are a variety of special cases to this rule of thumb, however, and more advanced code can be structured to take advantage of these properties.

For example, if the two instructions are of different groups but have a dependency chain, the second instruction will stall into the next cycle, and there is also the fact that CO group instructions do not parallelize with anything. Conversely, there are special cases like 0-cycle instructions that can execute in parallel despite having dependency chains (e.g. a "mov Rn, Rm" followed by an "add #imm8, Rm"), and MT group instructions that can parallelize with other MT group instructions (unless there's a non-special-case dependency chain).
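
The pairing rules of thumb above can be sketched as a toy model. This is deliberately simplified and is not a cycle-accurate simulator: it ignores dependency chains, instruction latencies, and the 0-cycle special cases, and it only looks at instruction groups. Only MT and CO are named in the text; the other group names (EX, BR, LS, FE) are assumed here to match the hardware manual's instruction-group table. All identifiers are made up for illustration.

```c
#include <stddef.h>

/* Instruction groups (names other than MT and CO are assumptions). */
typedef enum { GRP_MT, GRP_EX, GRP_BR, GRP_LS, GRP_FE, GRP_CO } insn_group_t;

/* Rule of thumb: CO never pairs, MT pairs with MT, and otherwise two
 * instructions pair only when they belong to different groups. */
static int can_pair(insn_group_t first, insn_group_t second)
{
    if (first == GRP_CO || second == GRP_CO)
        return 0;
    if (first == GRP_MT && second == GRP_MT)
        return 1;
    return first != second;
}

/* Count issue cycles for an in-order instruction stream, issuing two
 * instructions per cycle whenever the next two are allowed to pair. */
static size_t count_issue_cycles(const insn_group_t *groups, size_t n)
{
    size_t cycles = 0;
    size_t i = 0;

    while (i < n) {
        if (i + 1 < n && can_pair(groups[i], groups[i + 1]))
            i += 2; /* dual issue */
        else
            i += 1; /* single issue */
        cycles++;
    }
    return cycles;
}
```

In this model, alternating EX and LS instructions issue two per cycle, while a run of EX instructions issues one per cycle. That difference is why interleaving independent arithmetic with loads and stores matters in hand-written assembly.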

Appropriate usage of instruction-level parallelism is the only way to achieve >200 MIPS (millions of instructions per second) on a 200MHz SH4.


  1. It's predominantly used for single-precision operations: it can do doubles, but that doesn't mean it's a great idea!
  2. See fipr, ftrv: