Useful programming tips: Difference between revisions
Revision as of 17:41, 30 May 2020
The Dreamcast's CPU, model SH7091, is virtually identical to the Renesas SH7750 series of SH4 CPUs. As such, anything that would normally apply to the SuperH-4 architecture applies here. Given that the SH4 is a processor from 1998, many programming paradigms that we have grown accustomed to on more recent x86 and ARM64 processors either do not apply or behave much more primitively on the SH4. For example, SH4 does not have branch prediction, speculative execution, or multiple cores, but it does have a 64-bit floating point unit,[1] a couple of 128-bit vector operations on 4x packed 32-bit floats,[2] a memory management unit (MMU), and a direct memory access controller (DMAC).
In truth, at a very basic level the SH4 architecture is fundamentally not that different from these other, more mainstream architectures (in fact, ARM Thumb is based on SuperH[3]), so programming on the SH4 does not require much in the way of "re-learning" how to do things, especially since the Dreamcast uses it in little endian mode exclusively. Mainly, SH4 programming just requires paying a lot more attention to things that modern architectures have made very convenient, like data alignment, cache management, and pipelining.
This page is a collection of programming tips and tricks for optimizing programs to make full use of the SH4 CPU. It is not meant to be a substitute for reading the SH7750 series hardware and software manuals; rather, it should be seen as an additional reference based on experience working with the chip (and, in fact, certain terms and hardware-specific concepts assume familiarity with those manuals). Both manuals, "SH7750, SH7750S, SH7750R Group User's Manual: Hardware" and "SH-4 Software Manual," can be downloaded from the "Documents" section of any SH7750 series processor's product page on Renesas's website: https://www.renesas.com/eu/en/products/microcontrollers-microprocessors/superh/sh7750/sh7750r.html. The SH4 C ABI, "RM0197: SH-4 generic and C specific application binary interface," is incredibly handy, too; it is available from STMicroelectronics's website by searching for RM0197: https://www.st.com/content/st_com/en.html.
A very convenient SuperH assembly reference can be found here, as well: http://www.shared-ptr.com/sh_insns.html.
This page refers to the documents as follows:
- SH7750 Hardware Manual: "SH7750, SH7750S, SH7750R Group User's Manual: Hardware"
- SH7750 Software Manual: "SH-4 Software Manual"
- SH4 C ABI: "RM0197: SH-4 generic and C specific application binary interface"
Alignment
(Refer to: SH7750 Hardware Manual, Section 5 "Exceptions")
The SH4 lives and dies by alignment, and very strictly requires data to be aligned according to its type in memory. Crashes will otherwise ensue.
Take the following example, which defines a packed structure aligned to 4 bytes:
typedef struct __attribute__ ((packed, aligned(4))) {
	unsigned char id[4];
	unsigned int address;
	unsigned int size;
	unsigned char data[]; // Flexible array member
} command_t;
Accessing data from this struct is simple: each 4-byte member should need just a single 4-byte access. This works because the struct is aligned to 4 bytes and its 4-byte members sit at offsets 4 and 8, so they always land on 4-byte boundaries.
Doing this:
unsigned int cmd_addr = ntohl(command->address);
unsigned int cmd_size = ntohl(command->size);
Produces this output (GCC 9.2.0):
| 420 02e4 862F | mov.l r8,@-r15 | 
| 421 02e6 4365 | mov r4,r5 | 
| 422 02e8 962F | mov.l r9,@-r15 | 
| 423 02ea 0C75 | add #12,r5 | 
| 424 02ec 224F | sts.l pr,@-r15 | 
| 425 02ee 4159 | mov.l @(4,r4),r9 | 
| 426 02f0 4256 | mov.l @(8,r4),r6 | 
| 427 02f2 12D0 | mov.l .L71,r0 | 
| 428 02f4 9869 | swap.b r9,r9 | 
| 429 02f6 6866 | swap.b r6,r6 | 
| 430 02f8 11D4 | mov.l .L72,r4 | 
| 431 02fa 6966 | swap.w r6,r6 | 
| 432 02fc 9969 | swap.w r9,r9 | 
| 433 02fe 9869 | swap.b r9,r9 | 
| 434 0300 6868 | swap.b r6,r8 | 
But what if it weren't aligned to 4 bytes? Just this:
typedef struct __attribute__ ((packed)) {
	unsigned char id[4];
	unsigned int address;
	unsigned int size;
	unsigned char data[]; // Flexible array member
} command_t;
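The difference between the two attribute combinations can be checked at compile time. The following is a minimal sketch, assuming GCC or a compatible compiler with C11 support and a 4-byte unsigned int; the type names command_aligned_t and command_packed_t are hypothetical, chosen only so that both variants can coexist in one file.

```c
#include <stddef.h>  /* offsetof */

/* Variant 1: packed, but the struct as a whole is aligned to 4 bytes. */
typedef struct __attribute__ ((packed, aligned(4))) {
	unsigned char id[4];
	unsigned int address;
	unsigned int size;
	unsigned char data[]; // Flexible array member
} command_aligned_t;

/* Variant 2: packed only, so the struct's alignment drops to 1 byte. */
typedef struct __attribute__ ((packed)) {
	unsigned char id[4];
	unsigned int address;
	unsigned int size;
	unsigned char data[]; // Flexible array member
} command_packed_t;

/* The member layout is identical in both variants... */
_Static_assert(offsetof(command_aligned_t, address) == 4, "address at offset 4");
_Static_assert(offsetof(command_packed_t, address) == 4, "address at offset 4");
_Static_assert(offsetof(command_aligned_t, size) == 8, "size at offset 8");
_Static_assert(offsetof(command_packed_t, size) == 8, "size at offset 8");

/* ...only the alignment the compiler is allowed to assume differs. */
_Static_assert(_Alignof(command_aligned_t) == 4, "4-byte accesses are safe");
_Static_assert(_Alignof(command_packed_t) == 1, "only 1-byte accesses are safe");
```

Since the offsets are the same either way, the only thing that changes is what the compiler may assume about the address of the struct itself.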
With only the packed attribute, accessing the same data looks like this (GCC 9.2.0):
| 555 03f0 862F | mov.l r8,@-r15 | 
| 556 03f2 4365 | mov r4,r5 | 
| 557 03f4 962F | mov.l r9,@-r15 | 
| 558 03f6 0C75 | add #12,r5 | 
| 559 03f8 224F | sts.l pr,@-r15 | 
| 560 03fa 4484 | mov.b @(4,r4),r0 | 
| 561 03fc 0C63 | extu.b r0,r3 | 
| 562 03fe 4584 | mov.b @(5,r4),r0 | 
| 563 0400 0C61 | extu.b r0,r1 | 
| 564 0402 4684 | mov.b @(6,r4),r0 | 
| 565 0404 1861 | swap.b r1,r1 | 
| 566 0406 0C60 | extu.b r0,r0 | 
| 567 0408 3B21 | or r3,r1 | 
| 568 040a 2840 | shll16 r0 | 
| 569 040c 0B21 | or r0,r1 | 
| 570 040e 4784 | mov.b @(7,r4),r0 | 
| 571 0410 2840 | shll16 r0 | 
| 572 0412 1840 | shll8 r0 | 
| 573 0414 1B20 | or r1,r0 | 
| 574 0416 0869 | swap.b r0,r9 | 
| 575 0418 4884 | mov.b @(8,r4),r0 | 
| 576 041a 9969 | swap.w r9,r9 | 
| 577 041c 0C63 | extu.b r0,r3 | 
| 578 041e 4984 | mov.b @(9,r4),r0 | 
| 579 0420 9869 | swap.b r9,r9 | 
| 580 0422 0C62 | extu.b r0,r2 | 
| 581 0424 4A84 | mov.b @(10,r4),r0 | 
| 582 0426 2862 | swap.b r2,r2 | 
| 583 0428 0C60 | extu.b r0,r0 | 
| 584 042a 3B22 | or r3,r2 | 
| 585 042c 2840 | shll16 r0 | 
| 586 042e 0B22 | or r0,r2 | 
| 587 0430 4B84 | mov.b @(11,r4),r0 | 
| 588 0432 12D4 | mov.l .L73,r4 | 
| 589 0434 2840 | shll16 r0 | 
| 590 0436 1840 | shll8 r0 | 
| 591 0438 2B20 | or r2,r0 | 
| 592 043a 0860 | swap.b r0,r0 | 
| 593 043c 0961 | swap.w r0,r1 | 
All of this is just from this simple operation:
unsigned int cmd_addr = ntohl(command->address);
unsigned int cmd_size = ntohl(command->size);
What's going on here is that GCC is avoiding the address alignment crash that a 4-byte access to potentially misaligned data would cause. Packing a struct reduces its alignment to 1 byte, so GCC can no longer assume that the 4-byte members sit on 4-byte boundaries, and it must go through the following process to build each unsigned 4-byte integer from 1-byte accesses:
mov.b, zero-extend
mov.b, zero-extend, shift, add
mov.b, zero-extend, shift, add
mov.b, zero-extend, shift, add
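In C terms, the byte-by-byte sequence above amounts to something like the following sketch. The function name load_u32_bytewise is hypothetical, and the byte order assumes the little-endian mode that the Dreamcast uses; the sketch combines the bytes with OR, as the compiler output in the listing does.

```c
#include <stdint.h>

/* Build a 32-bit value from four 1-byte loads, lowest address first.
   Works for any address, since no access is wider than 1 byte. */
uint32_t load_u32_bytewise(const unsigned char *p)
{
	uint32_t value;

	value  = (uint32_t)p[0];        /* mov.b, zero-extend */
	value |= (uint32_t)p[1] << 8;   /* mov.b, zero-extend, shift, combine */
	value |= (uint32_t)p[2] << 16;  /* mov.b, zero-extend, shift, combine */
	value |= (uint32_t)p[3] << 24;  /* mov.b, zero-extend, shift, combine */

	return value;
}
```

Each line corresponds to one row of the process above, which is why a single misaligned-safe load costs many more instructions than one mov.l.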
(Note: all the swap instructions come from ntohl, as this code is from a network driver that needs to byte swap data after receiving a network transmission.)
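The swap.b, swap.w, swap.b sequence visible in both listings reverses all four bytes of a register. A minimal C model of that sequence might look like the following; the helper names are hypothetical and exist only to mirror the instructions.

```c
#include <stdint.h>

/* Model of swap.b: exchange the two lowest bytes, keep the upper 16 bits. */
uint32_t sh4_swap_b(uint32_t x)
{
	return (x & 0xFFFF0000u) | ((x & 0x000000FFu) << 8) | ((x & 0x0000FF00u) >> 8);
}

/* Model of swap.w: exchange the upper and lower 16-bit halves. */
uint32_t sh4_swap_w(uint32_t x)
{
	return (x << 16) | (x >> 16);
}

/* Full 32-bit byte reversal, as ntohl needs on a little-endian CPU. */
uint32_t byteswap32(uint32_t x)
{
	return sh4_swap_b(sh4_swap_w(sh4_swap_b(x)));
}
```

For example, 0x12345678 becomes 0x12347856 after the first swap.b, 0x78561234 after swap.w, and 0x78563412 after the final swap.b.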
Considering that 1x mov.b takes the same amount of time as 1x mov.l, plus all the other operations that must be done to build the 4-byte data out of 1-byte accesses, it's easy to see how big the performance hit from mismanaging alignment can be!
Cache Management
(Refer to: SH7750 Hardware Manual, Section 4 "Caches" and SH7750 Software Manual, Section 9 "Instruction Descriptions")
Unlike modern processors, where caches are several megabytes in size and can therefore hold entire programs, the SH4 in the Dreamcast only has a 16kB data cache and 8kB instruction cache. Consequently, cache management is very important in order to achieve maximum performance. As is always true of cache optimization, write-back memory mode is required to make much use of it (it's used everywhere by default in KallistiOS and enabled for P0/U0/P3--but not P1--in DreamHAL's startup file).
Half of the data cache can also be used as a form of high-speed RAM (referred to as OCRAM), but in most cases programs should stick to using the full cache size for cache purposes. The SH4 uses a direct-mapped cache, meaning that there is only one cache entry for every 16kB memory chunk (half that when used in OCRAM mode). Cache thrashing can therefore happen when, for example, copying data from a source address to a destination address that is an integer multiple of the cache size away from it (e.g. source = address offset 8 and destination = address 16kB + 8).
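To see why those two example addresses collide, the mapping from address to cache entry can be sketched as below. This assumes the 16kB data cache is organized as 512 entries of 32 bytes each, selected by address bits 13 to 5 under default cache settings (see the SH7750 Hardware Manual, Section 4); the macro and function names are hypothetical.

```c
#include <stdint.h>

#define DCACHE_SIZE      (16u * 1024u)                     /* 16kB data cache */
#define CACHE_LINE_SIZE  32u                               /* assumed bytes per entry */
#define DCACHE_ENTRIES   (DCACHE_SIZE / CACHE_LINE_SIZE)   /* 512 entries */

/* Index of the single cache entry an address can occupy (direct-mapped). */
uint32_t dcache_entry_index(uint32_t addr)
{
	return (addr / CACHE_LINE_SIZE) % DCACHE_ENTRIES;
}

/* Nonzero if two addresses lie in different lines that compete for one entry. */
int dcache_conflict(uint32_t a, uint32_t b)
{
	return (a / CACHE_LINE_SIZE) != (b / CACHE_LINE_SIZE)
	    && dcache_entry_index(a) == dcache_entry_index(b);
}
```

Here dcache_conflict(8, 16 * 1024 + 8) is nonzero, matching the example in the text: both addresses map to entry 0, so alternating accesses to them keep evicting each other.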
The SH4 provides the following instructions in addition to the two 32-byte "store queues" (SQs) to make efficient use of the cache:
- movca.l: Store register data to cache; on a cache miss, just allocate a cache block and write to it without first reading that cache block from memory
- ocbp: Purge cache block; write back cache block and invalidate it
- ocbi: Invalidate cache block without writing it back
- ocbwb: Write cache block back to external memory, and keep it in the cache
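As a hedged illustration of how one of these instructions might be applied to a whole buffer, the sketch below writes back every cache block that a memory range touches. It assumes 32-byte cache blocks and GCC-style inline assembly; dcache_writeback_range and ocbwb_block are hypothetical names, not functions taken from KallistiOS or DreamHAL, and on non-SuperH compilers the instruction is replaced by a no-op so that the loop logic can still be exercised.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_BLOCK_SIZE 32u  /* assumed SH4 cache block size in bytes */

/* Write one cache block back to memory, keeping it in the cache. */
static inline void ocbwb_block(const void *addr)
{
#if defined(__sh__)
	__asm__ __volatile__ ("ocbwb @%0" : : "r" (addr) : "memory");
#else
	(void)addr;  /* no-op stand-in when not compiling for SuperH */
#endif
}

/* Write back all cache blocks overlapping [start, start + len).
   Returns the number of blocks processed. */
size_t dcache_writeback_range(const void *start, size_t len)
{
	uintptr_t addr = (uintptr_t)start & ~(uintptr_t)(CACHE_BLOCK_SIZE - 1u);
	uintptr_t end = (uintptr_t)start + len;
	size_t blocks = 0;

	if (len == 0)
		return 0;

	/* Step through the range one cache block at a time. */
	for (; addr < end; addr += CACHE_BLOCK_SIZE) {
		ocbwb_block((const void *)addr);
		blocks++;
	}

	return blocks;
}
```

The same loop shape would apply to ocbp or ocbi by swapping the instruction, with the caveat that ocbi discards any data in a block that has not yet been written back.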
C Function Register Allocation
(Refer to: SH4 C ABI)
(Note: when using GCC 9.x at various optimization levels, like -O3, it tries its best to coalesce output code into this format wherever it can. Of course, if GCC is able to inline a function, parameter-passing becomes a moot point.)
The SH4 C ABI specifies that 4 integers (r4-r7) and 8 floats (fr4-fr11) can be passed in registers as function call arguments, and that r0-r3 and fr0-fr3 are also call-clobbered. Passing arguments in registers means that functions can take 4 integers and 8 floats without forcing arguments to be pushed on the stack, saving the cycle penalties that would otherwise occur from stack pushes and associated memory accesses. The call-clobbering of r0-r3 and fr0-fr3 means that those can be used as 4 integer local variables and 4 float variables, as well. Additionally, any of these registers not used for parameters can be repurposed as local variables, so if one only needs to pass in 4 floats to a function, one can then define 4 more local variable floats on top of the 4 we get from fr0-fr3 and they will just use the unused registers.
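As a small illustration, the hypothetical function below takes exactly 8 float arguments, so under the ABI described above all of them would be expected to arrive in registers (fr4-fr11) with nothing pushed on the stack. Whether the four locals actually end up in fr0-fr3 is up to the compiler's register allocator, so this is a sketch of the idea rather than a guarantee.

```c
/* Dot product of two 4-component vectors passed by value: 8 float arguments,
   which fits the 8 float argument registers described by the SH4 C ABI. */
float dot4(float ax, float ay, float az, float aw,
           float bx, float by, float bz, float bw)
{
	/* Up to 4 float temporaries can live in the call-clobbered fr0-fr3. */
	float x = ax * bx;
	float y = ay * by;
	float z = az * bz;
	float w = aw * bw;

	return (x + y) + (z + w);
}
```

Adding a ninth float parameter, or a fifth integer parameter, would exceed the argument registers and force that argument onto the stack.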
Pipelining and Instruction-Level Parallelism
(Refer to: SH7750 Hardware Manual, Section 8 "Pipelining")
This section is really only relevant when writing assembly. If you write code in a high-level language like C/C++, compilers [try to] take care of this for you and there isn't much you can do about it.
Because of the SH4's dual-issue superscalar design, the CPU preloads two instructions at once, and under the right circumstances these instructions can be executed in parallel. The SH4 architecture organizes various instructions into "instruction groups," and parallel execution primarily occurs when two instructions of different groups are issued together. There are a variety of special cases to this rule of thumb, however, and more advanced code can be structured to take advantage of these properties.
For example, if the two instructions are of different groups but have a dependency chain, the second instruction will stall into the next cycle, and there is also the fact that CO group instructions do not parallelize with anything. Conversely, there are special cases like 0-cycle instructions that can execute in parallel despite having dependency chains (e.g. a "mov Rn, Rm" followed by an "add #imm8, Rm"), and MT group instructions that can parallelize with other MT group instructions (unless there's a non-special-case dependency chain).
Appropriate usage of instruction-level parallelism is the only way to achieve >200 MIPS (millions of instructions per second) on a 200MHz SH4.
References
- [1] It's predominantly used for single-precision operations: it can do doubles, but that doesn't mean it's a great idea!
- [2] See fipr, ftrv: http://www.shared-ptr.com/sh_insns.html
- [3] https://lwn.net/Articles/647636/