Useful programming tips

Latest revision as of 08:21, 14 October 2023

The Dreamcast's CPU, model SH7091, is virtually identical to the Renesas SH7750 series of SH4 CPUs. As such, anything that would normally apply to the SuperH-4 architecture applies here. Given that the SH4 is a processor from 1998, many hardware features that we have grown accustomed to on more recent x86 and ARM64 processors are either absent or much more primitive on the SH4. For example, the SH4 does not have branch prediction, speculative execution, or multiple cores, but it does have a 64-bit floating point unit,[1] a couple of 128-bit vector operations on 4x packed 32-bit floats,[2] a memory management unit (MMU), and a direct memory access controller (DMAC).

In truth, the SH4 architecture is fundamentally not that different from these other, more mainstream architectures (in fact, ARM Thumb is based on SuperH[3]), so programming on the SH4 does not require much in the way of "re-learning" how to do things, especially since the Dreamcast uses it in little-endian mode exclusively. Mainly, SH4 programming just requires paying a lot more attention to things that modern architectures have made very convenient, like data alignment, cache management, and pipelining.

This page is a collection of programming tips and tricks to help with optimizing programs to make full use of the SH4 CPU. This is not meant to be a substitute for reading the SH7750 series hardware and software manuals. Rather, it should be seen as an additional reference based on experience working with the chip (and, in fact, certain terms and hardware-specific concepts assume familiarity with those manuals). Both manuals, "SH7750, SH7750S, SH7750R Group User's Manual: Hardware" and "SH-4 Software Manual," can be downloaded from the "Documents" section of any SH7750 series processor's product page on Renesas's website: https://www.renesas.com/eu/en/products/microcontrollers-microprocessors/superh/sh7750/sh7750r.html. The SH4 C ABI, "RM0197: SH-4 generic and C specific application binary interface," is incredibly handy, too; it can be found on STMicroelectronics's website by searching for RM0197: https://www.st.com/content/st_com/en.html.

A very convenient SuperH assembly reference can be found here, as well: http://www.shared-ptr.com/sh_insns.html.

This page refers to the documents as follows:

  • SH7750 Hardware Manual: "SH7750, SH7750S, SH7750R Group User's Manual: Hardware"
  • SH7750 Software Manual: "SH-4 Software Manual"
  • SH4 C ABI: "RM0197: SH-4 generic and C specific application binary interface"


Alignment

(Refer to: SH7750 Hardware Manual, Section 5 "Exceptions")

The SH4 lives and dies by alignment: it strictly requires data in memory to be aligned according to its type. A misaligned access raises an address error exception, which in practice usually means a crash.

Take the following example, which defines a packed structure aligned to 4 bytes:

typedef struct __attribute__ ((packed, aligned(4))) {
	unsigned char id[4];
	unsigned int address;
	unsigned int size;
	unsigned char data[]; // Flexible array member
} command_t;

Accessing data from this struct is pretty simple: each 4-byte member needs just a single 4-byte access. This works because the struct is aligned to 4 bytes.

Doing this:

unsigned int cmd_addr = ntohl(command->address);
unsigned int cmd_size = ntohl(command->size);

Produces this output (GCC 9.2.0):

02e4 862F    mov.l   r8,@-r15
02e6 4365    mov     r4,r5
02e8 962F    mov.l   r9,@-r15
02ea 0C75    add     #12,r5
02ec 224F    sts.l   pr,@-r15
02ee 4159    mov.l   @(4,r4),r9
02f0 4256    mov.l   @(8,r4),r6
02f2 12D0    mov.l   .L71,r0
02f4 9869    swap.b  r9,r9
02f6 6866    swap.b  r6,r6
02f8 11D4    mov.l   .L72,r4
02fa 6966    swap.w  r6,r6
02fc 9969    swap.w  r9,r9
02fe 9869    swap.b  r9,r9
0300 6868    swap.b  r6,r8

But what if it weren't aligned to 4 bytes? Just this:

typedef struct __attribute__ ((packed)) {
	unsigned char id[4];
	unsigned int address;
	unsigned int size;
	unsigned char data[]; // Flexible array member
} command_t;

Accessing the data looks like this, in that case (GCC 9.2.0):

03f0 862F    mov.l   r8,@-r15
03f2 4365    mov     r4,r5
03f4 962F    mov.l   r9,@-r15
03f6 0C75    add     #12,r5
03f8 224F    sts.l   pr,@-r15
03fa 4484    mov.b   @(4,r4),r0
03fc 0C63    extu.b  r0,r3
03fe 4584    mov.b   @(5,r4),r0
0400 0C61    extu.b  r0,r1
0402 4684    mov.b   @(6,r4),r0
0404 1861    swap.b  r1,r1
0406 0C60    extu.b  r0,r0
0408 3B21    or      r3,r1
040a 2840    shll16  r0
040c 0B21    or      r0,r1
040e 4784    mov.b   @(7,r4),r0
0410 2840    shll16  r0
0412 1840    shll8   r0
0414 1B20    or      r1,r0
0416 0869    swap.b  r0,r9
0418 4884    mov.b   @(8,r4),r0
041a 9969    swap.w  r9,r9
041c 0C63    extu.b  r0,r3
041e 4984    mov.b   @(9,r4),r0
0420 9869    swap.b  r9,r9
0422 0C62    extu.b  r0,r2
0424 4A84    mov.b   @(10,r4),r0
0426 2862    swap.b  r2,r2
0428 0C60    extu.b  r0,r0
042a 3B22    or      r3,r2
042c 2840    shll16  r0
042e 0B22    or      r0,r2
0430 4B84    mov.b   @(11,r4),r0
0432 12D4    mov.l   .L73,r4
0434 2840    shll16  r0
0436 1840    shll8   r0
0438 2B20    or      r2,r0
043a 0860    swap.b  r0,r0
043c 0961    swap.w  r0,r1

All of this is just from this simple operation:

unsigned int cmd_addr = ntohl(command->address);
unsigned int cmd_size = ntohl(command->size);

What is happening here is that GCC is avoiding the address alignment crash that would occur from a 4-byte access to data that is only guaranteed to be 1-byte-aligned. Struct packing by itself sets the alignment to 1 byte, so GCC has to build each unsigned 4-byte integer out of four 1-byte accesses:

mov.b, zero-extend
mov.b, zero-extend, shift, or
mov.b, zero-extend, shift, or
mov.b, zero-extend, shift, or

(Note: all the swap instructions come from ntohl, as this code is from a network driver that needs to byte swap data after receiving a network transmission.)
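The four-step sequence above can be mirrored in C. The following is only a minimal sketch of the idea, not the code GCC actually emits; `load_u32_bytes` is a hypothetical helper name, and it assumes the Dreamcast's little-endian byte order.

```c
#include <stdint.h>

/* Hypothetical helper: build a 32-bit value from four single-byte reads,
 * lowest byte first (little-endian), so that no 4-byte-aligned access is
 * ever required. */
static uint32_t load_u32_bytes(const unsigned char *p)
{
    uint32_t v = p[0];           /* byte load, zero-extend                 */
    v |= (uint32_t)p[1] << 8;    /* byte load, zero-extend, shift, combine */
    v |= (uint32_t)p[2] << 16;   /* byte load, zero-extend, shift, combine */
    v |= (uint32_t)p[3] << 24;   /* byte load, zero-extend, shift, combine */
    return v;
}
```

A call such as `load_u32_bytes(buf + 1)` is safe on an odd address, at the cost of four loads plus the combining work instead of one mov.l.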

Considering that 1x mov.b takes the same amount of time as 1x mov.l, plus all the other operations that must be done to build the 4-byte data out of 1-byte accesses, it's easy to see how big the performance hit from mismanaging alignment can be!
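One way to catch this kind of mistake early is to ask the compiler which alignment it assigned to each variant of the struct. This is a minimal sketch, assuming a C11 compiler that accepts GCC-style attributes; the type names `command_aligned_t` and `command_packed_t` are made up here so that both variants can coexist in one file.

```c
#include <stddef.h>

/* Variant 1: packed, but the struct as a whole is 4-byte aligned. */
typedef struct __attribute__ ((packed, aligned(4))) {
	unsigned char id[4];
	unsigned int address;
	unsigned int size;
	unsigned char data[]; // Flexible array member
} command_aligned_t;

/* Variant 2: packed only, so the struct is just 1-byte aligned. */
typedef struct __attribute__ ((packed)) {
	unsigned char id[4];
	unsigned int address;
	unsigned int size;
	unsigned char data[]; // Flexible array member
} command_packed_t;

/* Compile-time checks: the build fails if an assumption does not hold. */
_Static_assert(_Alignof(command_aligned_t) == 4, "expected 4-byte alignment");
_Static_assert(_Alignof(command_packed_t) == 1, "expected 1-byte alignment");
_Static_assert(offsetof(command_aligned_t, address) == 4, "address follows id[4]");
```

Both variants have the same member layout; only the guaranteed alignment of the struct's starting address differs, and that is what decides whether the compiler may use 4-byte accesses.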

Cache Management

(Refer to: SH7750 Hardware Manual, Section 4 "Caches" and SH7750 Software Manual, Section 9 "Instruction Descriptions")

Unlike modern processors, where caches are several megabytes in size and can therefore hold entire programs, the SH4 in the Dreamcast only has a 16kB data cache and 8kB instruction cache. Consequently, cache management is very important in order to achieve maximum performance. As is always true of cache optimization, write-back memory mode is required to make much use of it (it's used everywhere by default in KallistiOS and enabled for P0/U0/P3--but not P1--in DreamHAL's startup file).

Half of the data cache can also be used as a form of high-speed RAM (referred to as OCRAM), but in most cases programs should stick to using the full cache size for cache purposes. The SH4 uses a direct-mapped cache, meaning that each memory address maps to exactly one cache entry and addresses that are a multiple of 16kB apart (half that when used in OCRAM mode) compete for the same entry. Cache thrashing can therefore happen when, for example, copying data from some address to a destination that is an integer multiple of the cache size away from it (e.g. source = address offset 8 and destination = address 16kB + 8).
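The mapping can be sketched in C to check whether two addresses would evict each other. This is only an illustration: it assumes the 32-byte cache block size given in the hardware manual, the full 16kB cache (no OCRAM) and the default indexing mode, and the names `dcache_index` and `dcache_conflict` are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define DCACHE_SIZE (16u * 1024u) /* 16kB data cache                    */
#define DCACHE_LINE 32u           /* assumed cache block size, in bytes */

/* Which cache entry an address falls into (0..511 under the assumptions). */
static uint32_t dcache_index(uint32_t addr)
{
    return (addr % DCACHE_SIZE) / DCACHE_LINE;
}

/* True if a and b are in different memory blocks that share one cache
 * entry, i.e. accessing them alternately keeps evicting the other. */
static bool dcache_conflict(uint32_t a, uint32_t b)
{
    return dcache_index(a) == dcache_index(b)
        && (a / DCACHE_LINE) != (b / DCACHE_LINE);
}
```

With the example from the text, `dcache_conflict(8, 16 * 1024 + 8)` reports a conflict, while offsetting the destination by one cache block avoids it.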

The SH4 provides the following instructions in addition to the two 32-byte "store queues" (SQs) to make efficient use of the cache:

  • movca.l: Store register data to cache; on a cache miss, just allocate a cache block and write to it without first reading that cache block from memory
  • ocbp: Purge cache block; write back cache block and invalidate it
  • ocbi: Invalidate cache block without writing it back
  • ocbwb: Write cache block back to external memory, and keep it in the cache
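As an illustration of how one of these instructions might be used from C, the sketch below writes back every cache block that overlaps a buffer, for example before handing the buffer to DMA. It is a minimal sketch under stated assumptions: a 32-byte cache block, GCC-style inline assembly, and GCC's `__SH4__` predefined macro to detect the target. The names `sh4_ocbwb` and `writeback_range` are hypothetical; on any other target the instruction is replaced by a no-op so that the loop logic can still be exercised.

```c
#include <stddef.h>
#include <stdint.h>

#define SH4_CACHE_LINE 32u /* assumed cache block size, in bytes */

/* Write back one cache block without invalidating it. */
static inline void sh4_ocbwb(const void *addr)
{
#if defined(__SH4__)
    __asm__ __volatile__("ocbwb @%0" : : "r"(addr) : "memory");
#else
    (void)addr; /* not an SH4 build: nothing to do */
#endif
}

/* Write back every cache block touched by [start, start + len).
 * Returns the number of blocks visited. */
static size_t writeback_range(const void *start, size_t len)
{
    uintptr_t p, end;
    size_t lines = 0;

    if (len == 0)
        return 0;

    /* Round down to the start of the first cache block. */
    p = (uintptr_t)start & ~(uintptr_t)(SH4_CACHE_LINE - 1u);
    end = (uintptr_t)start + len;

    for (; p < end; p += SH4_CACHE_LINE) {
        sh4_ocbwb((const void *)p);
        lines++;
    }
    return lines;
}
```

The same loop shape applies to ocbp and ocbi; only the instruction in the helper changes. Note that ocbi discards data, so using it on a range that is not aligned to whole cache blocks can throw away neighboring variables.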

C Function Register Allocation

(Refer to: SH4 C ABI)

(Note: GCC 9.x, at optimization levels such as -O3, tries its best to generate code that follows the register usage described below wherever it can. Of course, if GCC is able to inline a function, parameter-passing becomes a moot point.)

The SH4 C ABI specifies that 4 integers (r4-r7) and 8 floats (fr4-fr11) can be passed in registers as function call arguments, and that r0-r3 and fr0-fr3 are also call-clobbered. Passing arguments in registers means that a function can take 4 integers and 8 floats without any arguments being pushed on the stack, which saves the cycle penalties of the stack pushes and their associated memory accesses. Because r0-r3 and fr0-fr3 are call-clobbered, they can also be used as 4 integer and 4 float local variables. Additionally, any argument registers not used for parameters can be repurposed as local variables: a function that takes only 4 floats can define 4 more local float variables on top of the 4 from fr0-fr3, and they will simply use the unused argument registers.
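To make the limits concrete, here is a small sketch of two signatures, one that fits entirely within the argument registers described above and one that does not. The function names are made up for illustration, and where each argument actually ends up is an expectation based on the ABI description, not something this code can verify by itself; compiling with -S and reading the output is the way to confirm it.

```c
/* 4 ints and 8 floats: under the ABI described above, all twelve
 * arguments are expected to arrive in r4-r7 and fr4-fr11. */
float weighted_sum(int n0, int n1, int n2, int n3,
                   float a, float b, float c, float d,
                   float e, float f, float g, float h)
{
    return a * n0 + b * n1 + c * n2 + d * n3 + e + f + g + h;
}

/* 5 ints: the first four are expected in r4-r7, while the fifth no
 * longer fits in an argument register and is expected on the stack. */
int sum5(int a, int b, int c, int d, int e)
{
    return a + b + c + d + e;
}
```

If a hot function needs more than 4 integer arguments, ordering them so that the most frequently used ones come first keeps those in registers.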

Pipelining and Instruction-Level Parallelism

(Refer to: SH7750 Hardware Manual, Section 8 "Pipelining")

This section is really only relevant when writing assembly. If you write code in a high-level language like C/C++, compilers [try to] take care of this for you and there isn't much you can do about it.

Because of the SH4's dual-issue superscalar design, the CPU preloads two instructions at once, and under the right circumstances these instructions can be executed in parallel. The SH4 architecture organizes various instructions into "instruction groups," and parallel execution primarily occurs when two instructions of different groups are issued together. There are a variety of special cases to this rule of thumb, however, and more advanced code can be structured to take advantage of these properties.

For example, if the two instructions are of different groups but have a dependency chain, the second instruction will stall into the next cycle, and there is also the fact that CO group instructions do not parallelize with anything. Conversely, there are special cases like 0-cycle instructions that can execute in parallel despite having dependency chains (e.g. a "mov Rn, Rm" followed by an "add #imm8, Rm"), and MT group instructions that can parallelize with other MT group instructions (unless there's a non-special-case dependency chain).
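The rule of thumb above can be written down as a tiny C model. This is a deliberately simplified sketch: it encodes only the group rules stated in this section (different groups pair, CO never pairs, MT pairs with MT) and ignores dependency stalls and the 0-cycle special cases. The group names other than MT and CO follow the list in the hardware manual; `sh4_group_t` and `may_dual_issue` are hypothetical names.

```c
#include <stdbool.h>

/* Instruction groups as listed in the hardware manual. */
typedef enum {
    GRP_MT, GRP_EX, GRP_BR, GRP_LS, GRP_FE, GRP_CO
} sh4_group_t;

/* Simplified check: can two adjacent instructions issue in the same
 * cycle, judging by their groups alone? */
static bool may_dual_issue(sh4_group_t first, sh4_group_t second)
{
    if (first == GRP_CO || second == GRP_CO)
        return false;        /* CO never parallelizes */
    if (first == GRP_MT && second == GRP_MT)
        return true;         /* MT pairs with MT      */
    return first != second;  /* otherwise groups must differ */
}
```

When hand-scheduling assembly, a model like this is mostly useful as a checklist: interleave instructions so that neighbors come from different groups, then consult the manual's tables for the dependency and special-case details.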

Appropriate usage of instruction-level parallelism is the only way to exceed 200 MIPS (millions of instructions per second) on a 200MHz SH4, since without dual issue the CPU completes at most one instruction per clock cycle.

References

  1. It's predominantly used for single-precision operations: it can do doubles, but that doesn't mean it's a great idea!
  2. See fipr, ftrv: http://www.shared-ptr.com/sh_insns.html
  3. https://lwn.net/Articles/647636/