Useful programming tips

From dreamcast.wiki

The Dreamcast's CPU, model SH7091, is virtually identical to the Renesas SH7750 series of SH4 CPUs. As such, anything that would normally apply to the SuperH-4 architecture applies here. Given that the SH4 is a processor from 1998, many programming paradigms that we have grown accustomed to on more recent x86 and ARM64 processors either do not apply or behave much more primitively on the SH4. For example, SH4 does not have branch prediction, speculative execution, or multiple cores, but it does have a 64-bit floating point unit,[1] a couple of 128-bit vector operations on 4x packed 32-bit floats,[2] a memory management unit (MMU), and a direct memory access controller (DMAC).

In truth, at a very basic level the SH4 architecture is fundamentally not that different from these other, more mainstream architectures (in fact, ARM Thumb is based on SuperH[3]), so programming on the SH4 does not require much in the way of "re-learning" how to do things, especially since the Dreamcast uses it in little endian mode exclusively. Mainly, SH4 programming just requires paying a lot more attention to things that modern architectures have made very convenient, like data alignment, cache management, and pipelining.

Alignment

The SH4 lives and dies by alignment, and very strictly requires data to be aligned according to its type in memory. Crashes will otherwise ensue.

Take the following example, which defines a packed structure aligned to 4 bytes:

 typedef struct __attribute__ ((packed, aligned(4))) {
 	unsigned char id[4];
 	unsigned int address;
 	unsigned int size;
 	unsigned char data[]; // Flexible array member
 } command_t;

Accessing a field of this struct is simple: a single 4-byte load is enough. This works because the struct is aligned to 4 bytes and each unsigned int member sits at an offset that is a multiple of 4.
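To make that layout assumption explicit, here is a minimal sketch that checks it at compile time. It assumes a C11 compiler with GCC-style attributes and a 4-byte unsigned int (true on SH4); the checks are an illustration and not part of the original code.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* offsetof */

/* Same declaration as in the text, using GCC's attribute syntax. */
typedef struct __attribute__ ((packed, aligned(4))) {
	unsigned char id[4];
	unsigned int address;
	unsigned int size;
	unsigned char data[]; /* Flexible array member */
} command_t;

/* Assumption: unsigned int is 4 bytes wide, as it is on SH4. */
static_assert(sizeof(unsigned int) == 4, "unsigned int must be 4 bytes");

/* The struct as a whole starts on a 4-byte boundary... */
static_assert(_Alignof(command_t) == 4, "struct must be 4-byte aligned");

/* ...and both integer members sit at offsets that are multiples of 4,
   so each one can be fetched with a single 4-byte load. */
static_assert(offsetof(command_t, address) % 4 == 0, "address is misaligned");
static_assert(offsetof(command_t, size) % 4 == 0, "size is misaligned");
```

If someone later inserts a 1-byte member before address, the build fails instead of the code silently falling back to byte-wise access.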

Doing this:

 unsigned int cmd_addr = ntohl(command->address);
 unsigned int cmd_size = ntohl(command->size);

Produces this output (GCC 9.2.0):

 02e4 862F	mov.l	r8,@-r15
 02e6 4365	mov	r4,r5
 02e8 962F	mov.l	r9,@-r15
 02ea 0C75	add	#12,r5
 02ec 224F	sts.l	pr,@-r15
 02ee 4159	mov.l	@(4,r4),r9	! load command->address
 02f0 4256	mov.l	@(8,r4),r6	! load command->size
 02f2 12D0	mov.l	.L71,r0
 02f4 9869	swap.b	r9,r9
 02f6 6866	swap.b	r6,r6
 02f8 11D4	mov.l	.L72,r4
 02fa 6966	swap.w	r6,r6
 02fc 9969	swap.w	r9,r9
 02fe 9869	swap.b	r9,r9
 0300 6868	swap.b	r6,r8

But what if the struct were only packed, without the 4-byte alignment? Just this:

 typedef struct __attribute__ ((packed)) {
 	unsigned char id[4];
 	unsigned int address;
 	unsigned int size;
 	unsigned char data[]; // Flexible array member
 } command_t;

In that case, accessing the same data looks like this (GCC 9.2.0):

 03f0 862F	mov.l	r8,@-r15
 03f2 4365	mov	r4,r5
 03f4 962F	mov.l	r9,@-r15
 03f6 0C75	add	#12,r5
 03f8 224F	sts.l	pr,@-r15
 03fa 4484	mov.b	@(4,r4),r0	! address, byte 0
 03fc 0C63	extu.b	r0,r3
 03fe 4584	mov.b	@(5,r4),r0	! address, byte 1
 0400 0C61	extu.b	r0,r1
 0402 4684	mov.b	@(6,r4),r0	! address, byte 2
 0404 1861	swap.b	r1,r1
 0406 0C60	extu.b	r0,r0
 0408 3B21	or	r3,r1
 040a 2840	shll16	r0
 040c 0B21	or	r0,r1
 040e 4784	mov.b	@(7,r4),r0	! address, byte 3
 0410 2840	shll16	r0
 0412 1840	shll8	r0
 0414 1B20	or	r1,r0
 0416 0869	swap.b	r0,r9
 0418 4884	mov.b	@(8,r4),r0	! size, byte 0
 041a 9969	swap.w	r9,r9
 041c 0C63	extu.b	r0,r3
 041e 4984	mov.b	@(9,r4),r0	! size, byte 1
 0420 9869	swap.b	r9,r9
 0422 0C62	extu.b	r0,r2
 0424 4A84	mov.b	@(10,r4),r0	! size, byte 2
 0426 2862	swap.b	r2,r2
 0428 0C60	extu.b	r0,r0
 042a 3B22	or	r3,r2
 042c 2840	shll16	r0
 042e 0B22	or	r0,r2
 0430 4B84	mov.b	@(11,r4),r0	! size, byte 3
 0432 12D4	mov.l	.L73,r4
 0434 2840	shll16	r0
 0436 1840	shll8	r0
 0438 2B20	or	r2,r0
 043a 0860	swap.b	r0,r0
 043c 0961	swap.w	r0,r1

All of this is just from this simple operation:

 unsigned int cmd_addr = ntohl(command->address);
 unsigned int cmd_size = ntohl(command->size);

What's going on here is that GCC is avoiding the address alignment crash that a 4-byte load from 1-byte-aligned data could cause. A packed struct with no aligned attribute has 1-byte alignment, so GCC cannot assume its members sit on 4-byte boundaries. Instead it builds each unsigned 4-byte integer from four 1-byte accesses:

 mov.b, zero-extend
 mov.b, zero-extend, shift, or
 mov.b, zero-extend, shift, or
 mov.b, shift, or (no zero-extend needed: the 24-bit shift discards the sign-extended bits)
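In C terms, the byte-by-byte sequence above corresponds roughly to the following sketch. The helper name load_le32_bytewise is hypothetical, and the code assumes little-endian byte order, which is how the Dreamcast runs the SH4.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: read a little-endian 32-bit value one byte at a
   time, so no memory access wider than 1 byte is ever made. This is safe
   at any address, at the cost of four loads plus shifts and ORs. */
static inline uint32_t load_le32_bytewise(const unsigned char *p)
{
	assert(p != NULL);

	uint32_t value;
	value  = (uint32_t)p[0];        /* byte load, zero-extend */
	value |= (uint32_t)p[1] << 8;   /* byte load, shift, combine */
	value |= (uint32_t)p[2] << 16;  /* byte load, shift, combine */
	value |= (uint32_t)p[3] << 24;  /* byte load, shift, combine */
	return value;
}
```

Called on a pointer that is deliberately off by one byte, for example load_le32_bytewise(buf + 1), it still returns the correct value, which is exactly the guarantee GCC is paying for in the listing.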

(Note: all the swap instructions come from ntohl, as this code is from a network driver that needs to byte swap data after receiving a network transmission.)
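For readers unfamiliar with the swap instructions, the sketch below models them in plain C and shows why the swap.b, swap.w, swap.b sequence seen in the listings reverses all four bytes. The function names are hypothetical; only the behavior of swap.b (exchange the two low bytes) and swap.w (exchange the two 16-bit halves) is taken from the SH4 instruction set.

```c
#include <assert.h>
#include <stdint.h>

/* Models SH4 swap.b: exchange the two low-order bytes, keep the upper half. */
static inline uint32_t swap_b(uint32_t x)
{
	return (x & 0xFFFF0000u) | ((x & 0x000000FFu) << 8) | ((x & 0x0000FF00u) >> 8);
}

/* Models SH4 swap.w: exchange the upper and lower 16-bit halves. */
static inline uint32_t swap_w(uint32_t x)
{
	return (x << 16) | (x >> 16);
}

/* Hypothetical name: full 32-bit byte reversal built from the same three
   steps the compiler emitted for ntohl (swap.b, swap.w, swap.b). */
static inline uint32_t bswap32_sh4_style(uint32_t x)
{
	return swap_b(swap_w(swap_b(x)));
}

/* Reversing the bytes twice must give back the original value. */
static inline void bswap32_roundtrip_check(uint32_t x)
{
	assert(bswap32_sh4_style(bswap32_sh4_style(x)) == x);
}
```

Tracing 0x11223344 through the three steps gives 0x11224433, then 0x44331122, then 0x44332211.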

Considering that one mov.b takes the same amount of time as one mov.l, each 4-byte read now costs four loads instead of one, plus all the extend, shift, and or instructions needed to build the 4-byte value out of 1-byte accesses. It's easy to see how big the performance hit from mismanaging alignment can be!
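One caveat when choosing the aligned variant: the aligned(4) attribute tells the compiler what it may assume, but it does not check the pointers that are later cast to the struct type. A minimal sketch of a runtime guard follows; is_aligned4 and as_command are hypothetical helpers, and the pointer-to-integer conversion assumes a flat address space, as GCC provides on SH4.

```c
#include <assert.h>
#include <stdint.h>

typedef struct __attribute__ ((packed, aligned(4))) {
	unsigned char id[4];
	unsigned int address;
	unsigned int size;
	unsigned char data[]; /* Flexible array member */
} command_t;

/* Hypothetical helper: nonzero if ptr is a multiple of 4. */
static inline int is_aligned4(const void *ptr)
{
	return ((uintptr_t)ptr & 3u) == 0;
}

/* Hypothetical helper: view a raw receive buffer as a command_t, but only
   after confirming the buffer really has the alignment the type promises.
   Without this, a misaligned buffer would be read with 4-byte loads. */
static inline const command_t *as_command(const void *buf)
{
	assert(is_aligned4(buf));
	return (const command_t *)buf;
}
```

In a release build the assert compiles away, so this costs nothing on the hot path while still catching a misaligned buffer during development.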

Cache Management

Functions and registers

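The first 8 float arguments and the first 4 int arguments are passed in registers. Anything beyond that is passed on the stack.

Passing a pointer to a struct is also a good option. You can prefetch the struct, and a struct that was just filled in as a local variable is probably in cache already. Each value read from it then costs 2 cycles, the standard mov.l cycle count.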
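The two calling styles can be sketched as follows. rect_params_t and both function names are hypothetical and exist only for illustration; __builtin_prefetch is a GCC builtin that acts as a hint and may compile to nothing on some targets.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical parameter block, for illustration only. */
typedef struct {
	int x, y, width, height, color, flags;
} rect_params_t;

/* Six int arguments: per the rule above, only the first four travel in
   registers; color and flags are passed on the stack. */
int rect_checksum_args(int x, int y, int width, int height, int color, int flags)
{
	return x + y + width * height + color + flags;
}

/* One pointer argument: the pointer itself fits in a register, and the
   fields are then read from memory (ideally from cache). */
int rect_checksum_ptr(const rect_params_t *p)
{
	assert(p != NULL);

	/* Issued here only to show the call; in real code the prefetch would
	   be issued well before the first read so the data has time to arrive. */
	__builtin_prefetch(p);

	/* Copy fields into locals so each one is loaded from memory once. */
	const int width = p->width;
	const int height = p->height;
	return p->x + p->y + width * height + p->color + p->flags;
}
```

Both functions compute the same result; which one is cheaper depends on how often the arguments are used and whether the struct is already in cache, as the discussion below explains.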


 mrneo240: It's 8 floats?
 [11:14 AM] Moopthehedgehog: yes
 [11:14 AM] DanB91: yea i remember hearing register allocation used to be terrible in gcc but now it's pretty good
 [11:15 AM] Moopthehedgehog: I haven't had any problems with it
 [11:15 AM] Moopthehedgehog: but in addition to those passed in, you get 4 ints to use as local variables
 [11:15 AM] mrneo240: So if I pass 6 args, in an int situation pass struct by pointer but for floats just pass all in arguments
 [11:16 AM] Moopthehedgehog: well, it depends. If the first four are used more than the last two, you can still benefit from passing those first 4 in regs
 [11:16 AM] Moopthehedgehog: in write-back memory the last two will be passed on stack in the cache
 [11:17 AM] Moopthehedgehog: however, if they cause the stack to cross a cacheline, they'll cause a cache fetch, and that's gonna be a hidden perf cost
 [11:17 AM] Moopthehedgehog: if you pass a pointer to a struct, and you prefetch the struct (or it's already in cache), the penalty is instead the standard mov.l cycle count of 2 cycles per data unit read
 [11:19 AM] Moopthehedgehog: Any args not passed in can be used as local variables
 [11:20 AM] Moopthehedgehog: In general, r0-r7 and fr0-fr7 are considered always clobbered by functions, so unless you are using higher levels of GCC optimization it's use them or lose them
 [11:20 AM] mrneo240: I've been generally creating the struct then immediately passing by pointer
 [11:20 AM] Moopthehedgehog: that's fine
 [11:20 AM] Moopthehedgehog: they're probably cached in that case
 [11:21 AM] Moopthehedgehog: you only get the 2-cycle penalty the first time they get read from cache into regs

References

  1. It's predominantly used for single-precision operations: it can do doubles, but that doesn't mean it's a great idea!
  2. See fipr, ftrv: http://www.shared-ptr.com/sh_insns.html
  3. https://lwn.net/Articles/647636/