SH4 in Compiler Explorer

Thanks to the efforts of Matt Godbolt (who, hilariously enough, is a former Dreamcast developer himself), the SuperH GCC toolchain is now available for use with Compiler Explorer, along with all of the SH4-specific compiler flags and options typically used when targeting the Dreamcast. This gives us an invaluable tool for getting quick and immediate feedback on how well a given C or C++ source segment translates into SH4 assembly, offering a little sandbox for testing and optimizing code targeting the Dreamcast.

= Configuration =
[[File:SH4 Compiler Explorer Configuration.png|thumb|SH GCC Toolchain configured for the Dreamcast's SH4 CPU in Compiler Explorer]]
To arrive at a configuration mirroring a Dreamcast development environment, first select one of the GCC compiler versions for the SH architecture, then add the following compiler options as the baseline configuration (a short snippet exercising these flags follows the list):
* <code>-ml</code>: Compile code for the processor in little-endian mode
* <code>-m4-single-only</code>: Generate code for the SH4 with a floating-point unit that only supports single-precision arithmetic
* <code>-ffast-math</code>: Breaks strict IEEE compliance and allows for faster floating-point approximations
* <code>-O3</code>: Optimization level 3
* <code>-mfsrra</code>: Enables emission of the <code>fsrra</code> instruction for reciprocal square root approximations (not available in GCC 4.7.4)
* <code>-mfsca</code>: Enables emission of the <code>fsca</code> instruction for sine and cosine approximations (not available in GCC 4.7.4)
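The following minimal snippet (the function names are purely illustrative, not taken from any particular project) can be pasted into Compiler Explorer to verify the configuration: with <code>-ffast-math</code>, <code>-m4-single-only</code>, <code>-mfsrra</code>, and <code>-mfsca</code> in effect, GCC should lower the reciprocal square root to a single <code>fsrra</code> and the paired sine/cosine to a single <code>fsca</code> rather than emitting library calls.

<syntaxhighlight lang="c">
#include <math.h>

/* With -ml -m4-single-only -ffast-math -O3 -mfsrra -mfsca, the 1.0f/sqrtf()
 * pattern below should compile down to an fsrra instruction. */
float inv_length(float x, float y, float z) {
    return 1.0f / sqrtf(x * x + y * y + z * z);
}

/* A sine/cosine pair on the same angle is the pattern -mfsca can turn into a
 * single fsca instruction (via GCC combining the calls under -ffast-math). */
void angle_to_vector(float angle, float *out_x, float *out_y) {
    *out_x = cosf(angle);
    *out_y = sinf(angle);
}
</syntaxhighlight>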
 
= Convenience Templates =
[[File:GCC Compiler Benchmarks.png|thumb|Runtime performance and binary sizes for different GCC versions using various compiler flags on KOS's pvrmark example: [https://dcemulation.org/phpBB/viewtopic.php?p=1059978#p1059978 Source]]]
The following are pre-configured templates you can use as sample Dreamcast build configurations:
* GCC 4.9.4:
** [https://godbolt.org/z/9MKzeMfMj C11 Hello World]
** [https://godbolt.org/z/qGzoeo4sj C++14 Hello World]
* GCC 9.5.0:
** [https://godbolt.org/z/rvW3s3594 C17 Hello World]
** [https://godbolt.org/z/qYfE5G6Mx C++17 Hello World]
* GCC 12.2.0:
** [https://godbolt.org/z/94TKvxazn C17 Hello World]
** [https://godbolt.org/z/61jqhE3zn C++20 Hello World]
* GCC 13.1.0:
** [https://godbolt.org/z/Kb9bKe8ro C2X Hello World]
** [https://godbolt.org/z/51dv4ePsG C++23 Hello World]


= Tips and Notes =  
* Although recent GCC documentation states that <code>-O3</code> is the highest optimization level, some code generation differences can still be seen under certain circumstances when using <code>-O4</code> and beyond.
* The compiler seems to ignore both <code>-mfsrra</code> and <code>-mfsca</code> without the <code>-ffast-math</code> and <code>-m4-single-only</code> options.
* It is highly recommended to write C code that triggers <code>-mfsrra</code> (via <code>1.0f/sqrtf(N)</code>) and <code>-mfsca</code> (via the built-in sin/cos functions) rather than using inline assembly directly, as this seems to give the compiler more context for optimizing the code around these instructions (see the example in the Configuration section above).
* The <code>__builtin_prefetch</code> intrinsic does seem to generate a single <code>pref</code> instruction and should be preferred over inline assembly (a short sketch follows this list).
* The compiler does not seem smart enough to utilize the FIPR (inner/dot product), FMAC (multiply and accumulate), or FTRV (transform vector) instructions regardless of how embarrassingly vectorizable the supplied C code seems to be, so linear algebra routines are forced to use inline assembly to fully leverage the SH4's SIMD instructions (see the inline assembly sketch after this list).
* Typically smaller code sizes and more tightly optimized code are seen with newer versions of GCC versus the older ones; however, this is not always the case.
* Evidently, even without a branch predictor, the C++20 <code><nowiki>[[likely]]</nowiki></code> and <code><nowiki>[[unlikely]]</nowiki></code> attributes as well as the GCC intrinsic <code>__builtin_expect()</code> can have a fairly profound impact on code generation and optimization for conditionals and branches (a small example follows this list). More information can be found [https://dcemulation.org/phpBB/viewtopic.php?t=106029 here].
* <code>-fipa-pta</code> allows the compiler to analyze pointer and reference usage beyond the scope of the current compiling function, which very often results in pretty decent performance increases at the cost of increased compile times and RAM usage.
* <code>-flto</code> allows GCC to perform optimizations over the entire program and all translation units as a single entity during the linking phase, for the cost of increased compile times and RAM usage. This frequently results in more performant code.
* An in-depth benchmark comparing the run-time performance and compiled binary size output of every toolchain version officially supported by KOS with various optimization levels can be found [https://dcemulation.org/phpBB/viewtopic.php?t=106068 here].
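A minimal sketch of the <code>__builtin_prefetch</code> tip above (the function and the prefetch distance are purely illustrative): on the SH4 the intrinsic should compile down to a single <code>pref</code> instruction that pulls upcoming data into the operand cache while the current element is being processed.

<syntaxhighlight lang="c">
/* Illustrative only: sums an array while hinting a later element into the
 * cache. The prefetch distance (8 elements here) is a guess and should be
 * tuned against real measurements. */
float sum_array(const float *data, int count) {
    float total = 0.0f;

    for(int i = 0; i < count; ++i) {
        __builtin_prefetch(&data[i + 8]);   /* should emit a single pref @Rn */
        total += data[i];
    }

    return total;
}
</syntaxhighlight>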
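Because the compiler will not emit FIPR on its own, dot products end up as inline assembly. The sketch below is modeled on the style of the vector helper macros found in KOS and is an assumption rather than a drop-in implementation: the explicit <code>fr0</code>-<code>fr7</code> register bindings build the <code>fv0</code> and <code>fv4</code> banks that <code>fipr</code> operates on, and the constraints may need adjusting for your toolchain. It requires single-precision FPU mode (<code>-m4-single-only</code>).

<syntaxhighlight lang="c">
/* Hedged sketch of a 4-component dot product using fipr. Register choices
 * and constraints are assumptions and may need tweaking. */
static inline float dot4(const float *a, const float *b) {
    register float ax __asm__("fr0") = a[0];
    register float ay __asm__("fr1") = a[1];
    register float az __asm__("fr2") = a[2];
    register float aw __asm__("fr3") = a[3];
    register float bx __asm__("fr4") = b[0];
    register float by __asm__("fr5") = b[1];
    register float bz __asm__("fr6") = b[2];
    register float bw __asm__("fr7") = b[3];

    /* "fipr fv0, fv4" computes fr0*fr4 + fr1*fr5 + fr2*fr6 + fr3*fr7 and
     * writes the result into fr7 (the last element of fv4). */
    __asm__("fipr fv0, fv4"
            : "+f" (bw)
            : "f" (ax), "f" (ay), "f" (az), "f" (aw),
              "f" (bx), "f" (by), "f" (bz));

    return bw;
}
</syntaxhighlight>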
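As a small illustration of the branch-hinting note above (the function is hypothetical), marking the error path as unlikely with <code>__builtin_expect()</code> encourages GCC to keep the hot path as straight-line fall-through code; in C++20 the <code><nowiki>[[likely]]</nowiki></code>/<code><nowiki>[[unlikely]]</nowiki></code> attributes express the same hint.

<syntaxhighlight lang="c">
/* Illustrative only: the cold error check is hinted as unlikely so the
 * common case falls straight through without a taken branch. */
int process_sample(const unsigned char *buf, int len) {
    if(__builtin_expect(buf == 0 || len <= 0, 0)) {
        return -1;               /* cold path, moved out of the hot sequence */
    }

    return buf[0] + len;         /* hot path */
}
</syntaxhighlight>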
