------------------------------------------- Helpful tips and tricks - using GCC and G++ ------------------------------------------- => Use struct copies (instead of copying individual elements of a struct). GCC and G++ generate code with weak scheduling when copying a struct by individual elements. GCC and G++ generate code with better instruction scheduling when copying a struct via struct assignment. => Use C++ "inline" methods sparingly Many people overuse the C++ "inline" facility because it eliminates a function call and therefore "always makes code faster". This isn't necessarily true. When a C++ method which requires many registers is inlined into another function which uses many registers, this creates a code block which requires an extremely large number of registers. This can overload G++'s register allocator, and force it to generate many register spills/restores to/from the stack frame (e.g. generate slow code). If the two functions were compiled separately, then GCC could probably generate more efficient code for both functions. Good C++ methods for inlining are usually: o Small Small functions are good for inlining because the cost of the function call is fairly large compared to the work performed. Large functions are usually not worth inlining because the cost of the function call is small compared to the actual work performed. o Use only a few variables (registers) So you don't overload the register allocator, as described above. => I see an "Out of memory" dialogue box when loading my executables into Codescape. Probably you have compiled for big-endian instead of little-endian... => Passing long command lines to GCC/AS/LD GCC supports the Microsoft convention of using "@filename" to pass long command lines. So, instead of invoking "gcc -option1 -option2 ..." you can put "-option1 -option2" into a file, and invoke "gcc @filename" instead. => Useful compiler options GCC supports three calling conventions on the SH4. These are: -m4 FPU is assumed double-precision on function entry and exit. single = 32 bits, double = 64 bits. -m4-single FPU is assumed single-precision on function entry and exit. single = 32 bits, double = 64 bits. -m4-single-only FPU is always in single-precision mode. single = double = 32 bits. This mode is the fastest so we recommend using it over -m4 or -m4-single. A few other useful options: -Wall Turns on all warnings. This will detect uninitialized variables, unused variables, etc. Generally a Good Thing. -ansi Turns on ANSI C compliance warnings. For strict ANSI compliance (no GNU C extensions) use -pedantic instead (somewhat deprecated). -mrelax (C and C++, bigger win on C++) Inserts branch shortening hints for shortening at link time. Can convert some mova Lfoo, rn; jsr @rn; Lfoo .long func to bsr func which saves six bytes. Note: At source compile time, the -mrelax option must be used. At link time, if using gcc -mrelax must be used. If linking with ld, then -relax must be used. Note: As of this writing -mrelax does not update the debug info properly - it causes source lines not to line up properly with the disassembly. If this happens, please do not use. -mspace (C and C++) Requests GCC to minimize code size. Tells gcc not to unroll memory moves/struct copies and minimizes CSE movement. Usually extracts about a 5% penalty on code speed, but very useful when you're tight on memory. Note: Optimizing for space with -mspace disables inlining of __inline__ functions! (for obvious reasons) -ffast-math (C and C++) Prevents GCC from generating calls to software sqrt, etc when -m4-single-only is used. -mdalign (C and C++) Preserves 64-bit alignment of the stack pointer. This allows hand-written assembly to use fmov.d to access doubles on the stack. Note: Neither libc nor libcross are fully compatible with this option, so great care will be needed when using this option. -mhitachi (C and C++) Uses the Hitachi C calling convention instead of the standard calling convention. The differences are: o GNU C will pass structures in registers if possible. Hitachi C always passes structures on the stack. o GNU C can pass some varargs parameters in registers. Hitachi C always passes all varargs parameters on the stack. o GNU C assumes MACL and MACH are clobbered across function calls. Hitachi C assumes MACL and MACH are preserved. Note: You must use the Hitachi C standard C libraries when using this option, because the GNU C libraries assume the standard calling convention. Note: You must properly prototype varargs/stdargs functions when using -mhitachi. Failure to prototype these functions will cause the callee to receive garbage data. -fstrict-aliasing This option will become available in early to mid '99 and improves alias analysis. -fargument-noalias -fargument-noalias-global (C and C++; bigger win for C++) (Roughly equivalent to "/Oa" on MSVC++) "-fargument-no alias" assumes arguments may not alias each other and "-fargument-noalias-global" assumes arguments may not alias each other and globals. This means if you have func(char *a, char *b) then gcc can assume a will never point to the same data items as b, so when it writes through pointer b, it doesn't need to flush registers loaded with data from pointer a. G++ by default generates awful code for class variables; take this example: typedef struct { float x, y, z; } VERTEX; class foo { public: VERTEX offset; void translate(VERTEX *vertex, int vertices_num); }; void foo::translate(VERTEX *vertex, int vertices_num) { int i; for (i=0; i Recommended compiler options for C files o Add -mhitachi to these options if using Shinobi/Kamui/Ninja For C code (using COFF, debugging): gcc -O2 -m4-single-only -mrelax -ffast-math -fargument-noalias-global -Wall -ansi -g -fno-schedule-insns2 For C code (using COFF, release): gcc -O2 -m4-single-only -mrelax -ffast-math -fargument-noalias-global -Wall -ansi For C code (using ELF, debugging): gcc -O2 -m4-single-only -mrelax -ffast-math -fargument-noalias-global -Wall -ansi -gdwarf+ -fno-schedule-insns2 For C code (using ELF, release): gcc -O2 -m4-single-only -mrelax -ffast-math -fargument-noalias-global -Wall -ansi (Please see caution notes on -fargument-noalias-global) => Recommended compiler options for C++ files o Add -mhitachi to these options if using Shinobi/Kamui/Ninja For C++ code (using COFF, debugging): gcc -O2 -m4-single-only -mrelax -ffast-math -fno-exceptions -fno-rtti -fargument-noalias-global -Wall -ansi -g -fno-schedule-insns2 For C++ code (using COFF, release): gcc -O2 -m4-single-only -mrelax -ffast-math -fno-exceptions -fno-rtti -fargument-noalias-global -Wall -ansi For C++ code (using ELF, debugging): gcc -O2 -m4-single-only -mrelax -ffast-math -fno-exceptions -fno-rtti -fargument-noalias-global -Wall -ansi -gdwarf+ -fno-schedule-insns2 For C++ code (using ELF, release): gcc -O2 -m4-single-only -mrelax -ffast-math -fno-exceptions -fno-rtti -fargument-noalias-global -Wall -ansi (Please see caution notes on -fargument-noalias-global) => Viewing gcc phase execution If you'd like to see the commands gcc passes to cpp, as and ld, then pass the "-v" option to gcc. => Can't redirect compiler errors to a file. The compiler outputs the errors to stderr - you can redirect stderr under Windows NT with "2>" like this: "gcc .... 2> errors.txt" Under Windows 95/98 the stderr is not redirectable. => Long command lines GCC supports "command files" - you can use "gcc @filename" and GCC will fetch its command line options from the specified filename. => Generating mixed C/assembly listings To generate a nice assembly listing with interspersed C source, try: gcc -O2 -m2 -g -S foo.c as -ahld foo.s > foo.lst "foo.lst" will now contain a nicely formatted mixed C/assembly listing. Sometimes the previous method does not work well, and generating an assembly listing from the object file is better. Copy the source file to the same directory as the object file, then do: objdump --disassemble --source foo.o > foo.lst => Linking in general - C or assembly It is much better to use gcc to link, and use "-Xlinker option" to pass options to the linker, than to use ld itself to link. There are at least two reasons: 1) If ld is used to link, then the library paths must be specified explicitly. This is a problem because when gcc is upgraded, the makefile will need to be changed as well to use the new libraries. If this is not done. then the makefile will use the new gcc with the old libraries, which can cause problems. 2) When linking using gcc, gcc will automatically set the ld parameters so ld will search the SH4 libraries before the SH1 libraries if -m4(-single, -single-only) is specified. If ld is used directly, then the SH4 libraries must be specified manually in the correct order relative to the SH1 libraries. => Linker cannot find symbols which are in your library files. The linker only makes one pass through your libraries, so if you have two libraries which are dependent on each other you must mention the first library, then the second library, then the first library again. It is possible to create pathological cases where the linker cannot link your libraries. In such a case, you probably want to merge the offending libraries into one single library. For reference, the default link order is: Without floating point math: -lgcc -lc -lgcc With floating point math: -lm -lgcc -lc -lgcc => Linker cannot find symbols in your application If you are programming in C++ and calling C functions, did you remember to prototype your C functions and wrap the prototypes in extern "C"? Failure to do this will cause the C function names to be C++ mangled, and the linker will not be able to find your C functions. => Anonymous unions in C files Anonymous unions are not a part of the ANSI C (X3J11) standard; they are a C++ feature implemented by some C compilers, but not by gcc. => Standard C library floating point functions If you use standard C library floating point functions, then you must link with the floating point library (libm.a) by specifying: gcc -lm The "-lm" MUST be after the filename, not before. => Undefined reference to 'pow'/'sin'/'cos' (transcendental functions) Most of the math library is in libm.a, so when you link using the gcc library, specify "-lm". => GCC complains about "garbage at end of number". You probably wrote something like "0x1800E+OFFSET". GCC interprets this as a floating pointer number because of "0E+OFFSET" looks like a floating point number. The solution is to place a space between "E" and "+". => Parallel makes If you have a dual processor Windows NT box, try "make -j4", or if you have a top-level makfile which spawns makes in subdirectories, try "make MAKE="make -j4"" ... this will start four compilers compiling in parallel. Very nifty! => Linker is slow. We have heard various reports that linking to a RAMdisk accelerates the link process by a factor of 5 or 6. Part 3 Code: ------------------------------------------------------ Helpful tips and tricks - programming with GCC and G++ ------------------------------------------------------ => Link errors occur when linking stating that constructors and virtual tables are undefined. Cause: The first virtual function declaration in the class is never defined in any of your source modules. Therefore, the virtual table/ctor/dtor definitions are never emitted. Solution: Make sure there exists a definition for all virtual functions declared in the header file - especially the first one declared (and not defined in the class declaration) - even if the functions are never used. => BSS must be cleared and stack (r15) must be set. The standard initialization code in crt0.o will set clear your BSS and gcc will automagically generate a call to ___main in your main() function which will call your global constructors for you. If you choose not to use crt0.o, you must clear your own BSS. If you choose not to call your main function "main()" then you will need to call your global constructors manually. Variables such as "int foo = 0;" will be placed in your BSS, and if your BSS isn't cleared, then (obviously) foo may have a nonzero value. The default stack in crt0.o is eight bytes - be sure to set up your own stack if you plan to do anything at all. => Use signed char/short/ints wherever possible. The SH4 automagically sign-extends when loading chars and shorts; so to use unsigned chars and shorts it must zero out the upper bits after loading. This is inefficient. Also, float-to-signed-int conversion is one instruction, but float-to-unsigned int requires a library call. => Access to hardware registers is faster using structs. GCC generates better code to access hardware registers if you properly typedef a group of hardware registers as a struct, then access the registers relative to the base pointer. This way, when accessing a group of related registers, GCC only needs to load a literal for the base pointer once for the group, instead of once for each individual register. => 'register' keyword The 'register' keyword hints to GCC that the value will not be aliased, e.g. when a value is written through a pointer the 'register'ed variable will not be overwritten. This helps GCC generate much better code. => Use "volatile" qualifier and cache-through address where necessary. GCC will load a value once and test it, if it can: int wait_flag; while (wait_flag); may generate: mov @r4,r0 L5: cmp/eq #0,r0 bf L5 unless wait_flag is declared properly. The volatile qualifier MUST be used on: a) Semaphores between mainline and interrupt code b) Pointers to hardware registers The cache-through address MUST be used on: a) Pointers to hardware registers Also, be sure variables which are accessed by cache-through are CONSISTENTLY accessed via cache-through - if you write to non-cached space, then read in cached space, you may get the cached version of the data, which will not be correct. => Be careful with cache coherency There are many ways to destroy cache coherency on the SH4, so one must be quite careful of various hardware features: a) SH4 DMA is NOT hardware cache coherent, so you must: 1) Perform a writeback (flush) of the source data, 2) Perform an invalidate of the destination data, except for the first and last cache lines which must be flushed. b) Store queue is similarly not cache coherent, so you must: 1) Invalidate the destination data if you plan to read the data. => Porting from Hitachi C Two problems we see often when converting code from SHC to GCC: A) Missing volatile qualifiers. SHC does not optimize heavily so you can be fairly sloppy with the "volatile" qualifier and code will still work. Not so with GCC - it optimizes very heavily and it will remove redundant loads. See item #3 for specific circumstances where volatile must be used. B) Not preserving MACL and MACH in interrupt-called routines. SHC implements "callee preserves MACL and MACH" calling convention, whereas GCC does not. If you have interrupt routines written in SHC and you port them to GCC, they will not preserve MACL and MACH and your mainline code will become flaky due to MACL/MACH corruption. => #pragma interrupt "#pragma interrupt" specifies the next function defined is an interrupt handler. This pragma instructs GCC to: 1) Save all modified registers. (r0-r14, macl & mach) 2) Terminate the function with an "rte" instead of an "rts". If you are declaring several interrupt handlers you must insert "#pragma interrupt" before every interrupt function. => GCC code quality GCC produces code which is (IMHO) pretty good for SH4, but not great. GCC has very good target-independent optimization (common subexpression elimination, invariant loop expression hoisting, etc) but mediocre SH-specific optimizations. On complicated functions one can produce code which runs 50% faster than GCC output. Factors which affect GCC's code quality for complicated functions: 1) Since the SH4 has only a "few" registers (compared to PPC/MIPS) the first scheduling pass is disabled, because it creates many register spills. The main effect of this is that code scheduling can be somewhat weak. => Inline assembly in C functions Be sure to declare your inline assembly as "__asm__ volatile" instead of just "__asm__" so gcc won't try to optimize and/or remove your inline assembly code. GCC has been known to optimize bits of inline assembly which aren't volatile, producing strange results. Also, if the two colons which separate the register constraints at the end of an inline assembly fragment are missing, GCC will occasionally crash. Be sure to place two colons even if there are no register constraints in the inline assembly. Incidentally, GCC will allow you to put an inline assembly section outside of a function, so you can write whole assembly routines in a C source file without wrapping them inside a C function! When using assembly outside of a C function, omit the "volatile" keyword since it isn't necessary and will only confuse the compiler. Note: Inline assembly in a C source file outside of a function is not source-level debuggable - if you wish to write large amounts of assembly code it's recommended you create a separate assembly file so it may be properly source-level debugged. => gprof/gcov gprof and gcov are included, but do not work well on the target system - there is no filesystem support to create files. => Standard C library calls - malloc(), free(), creat(), etc. They exist, but they require host operating system support through trap #34. See the source for more details. => .ctor and .dtor sections .ctor is the section for global C++ constructors, and .dtors are the section for global C++ destructors. They require appropriate labels when linking to delimit their start and end addresses, like this: . = ALIGN(4); __ctors = *; *(.ctors) __ctors_end = *; . = ALIGN(4); __dtors = *; *(.dtors) __dtors_end = *; (this needs to be in your linker script) => Structure padding The SH series microprocessors cannot access words on byte boundaries or longwords on byte boundaries. Words must be word-aligned; longwords must be longword-aligned. This means if your structures contain shorts or longs, the compiler will add padding on the previous element to align the short or longs to an acceptable address. The "-mpadstruct" option does NOT control this behavior. It controls a different structure padding behavior; it enables structure padding at the end of the structure to pad it out to a multiple of four bytes, which is backwards compatible with gcc-2.6.3's broken behavior. => _end The "_end" label is where sbrk() (called by malloc()) starts to allocate memory. Basically, it designates the beginning of the heap. When the heap pointer overlaps r15, the standard C library assumes the heap has overflowed into the stack. See "Interfacing with host OS" for more details. => Overlays See the file "overlay.txt" for more info on how to create overlays. => ___set_fpscr If you are compiling with an option which supports 64-bit floating-point values (-m4 or -m4-single) then the routine ___set_fpscr MUST be called with a single argument during initialization. This routine intializes the ___fpscr_values table which is used by the gcc code to switch the FPU from single to double precision and back. The value passed to ___set_fpscr is the default fpscr bits used in both single and double modes. Normally this is zero. Also, if you desire to change an FPU bit permanently, this can be done by calling ___set_fpscr since it modifies the values which gcc loads into the fpscr. The first entry in the fpscr_values table specifies the fpscr value for the opposite of the default mode (single for -m4, double for -m4-single) and the second value specifies the fpscr value for the default mode (double for -m4, single for -m4-single). NOTE: GCC assumes the SZ bit is always zero in little-endian!! Do not set bit 20 when calling ___set_fpscr!! => Spinlocks The atomic "test and set" instruction "tas.b" should be used to implement spinlocks. Here is a macro which generates tas.b: #define tas_byte(x) \ ({ \ char *address; \ int flag; \ address = &x; \ __asm__ volatile ("tas.b @%1; movt %0": "=r" (flag): "r" (address): "t"); \ flag; \ }) => Memory allocation functions & notes Documentation on the memory allocation functions used in this package can be obtained at http://g.oswego.edu/dl/html/malloc.html. => Interfacing with the host OS The standard C library included with gcc performs a "trapa #34" to call the host operating system for basic services. Here are the routines which are called: read: trapa #34, r4: SYS_read, r5: file, r6: ptr, r7: len lseek: trapa #34, r4: SYS_lseek, r5: file, r6: ptr, r7: dir write: trapa #34, r4: SYS_write, r5: file, r6: ptr, r7: len close: trapa #34, r4: SYS_close, r5: file, r6: 0, r7: 0 sbrk: this is internal to libc now, just set "_end" to the end of your program + data area, and it will allocate from "_end" until it hits the stack pointer. open: trapa #34, r4: SYS_open, r5: path, r6: flags, r7: 0 creat: trapa #34, r4: SYS_creat, r5: path, r6: mode, r7: 0 exit: trapa #34, r4: SYS_exit, r5: n, r6: 0, r7: 0 kill: trapa #34, r4: SYS_exit, r5: 0xdead, r6: 0, r7: 0 stat: trapa #34, r4: SYS_stat, r5: path, r6: st chmod: trapa #34, r4: SYS_chmod, r5: path, r6: mode chown: trapa #34, r4: SYS_chown, r5: path, r6: owner, r7: group utime: trapa #34, r4: SYS_utime, r5: path, r6: times fork: trapa #34, r4: SYS_fork wait: trapa #34, r4: SYS_wait execve: trapa #34, r4: SYS_execve, r5: path, r6: argv, r7: envp execv: trapa #34, r4: SYS_execv r5: path, r6: argv pipe: trapa #34, r4: SYS_pipe, r5: fd ----------------------------------- Helpful tips and tricks - debugging ----------------------------------- => Easier source-code debugging If you disable the GCC/G++ second scheduling pass with "-fno-schedule-insns2" your code will be easier to debug because your source code pointer will jump around less in the debugger. This changes the compiler's code generation, and the code will be slower, so it could affect the bug you're trying to isolate, so please be careful with its usage. I also recommend NOT using -fomit-frame-pointer when debugging. The debugger relies upon the frame pointer to access local variables, so if the frame pointer is omitted, the debugger will not be able to properly display local variables for that function! => Post-morteming crashes When the SH4 crashes, the best thing to do is to examine the PR register to determine the return address of the offending routine. This will often tell you which routine has crashed. This applies to C and C++ code as well as assembly code. => C++ name demangler There is a C++ name demangler called "c++filt" in this package. If you type in a mangled name, the utility will return a demangled function name and arguments. ---------------------------------------------- Helpful tips and tricks - assembly programming ---------------------------------------------- => The .align directive .align in GAS specifies the number of least-significant zero bits, NOT the actual alignment value. For example. .align 1 is 2 byte aligned, .align 2 is 4 byte aligned, .align 5 is 32-byte aligned. => Using literals in assembly Don't forget to align your literals! (.align 2) Failure to do so will cause you to spend hours tracking down bus error bugs! Also, if you have word literals put them at the end of your literal section so they won't misalign any following longword literals. When using literals in assembly, be sure to place your byte literals first, then an .align 1 (for 2 byte alignment), then all the word literals, then a .align 2 (for 4 byte alignment) then all your longword literals. We recommend this because byte and word literals have smaller pc-relative offsets than longword literals. GAS also will not report illegal instructions (jump/branch) or stupid (i.e. pc-relative) instructions in delay slots. => GCC parameter passing conventions: Integer registers: r0: Function integer return value (if any) (r1 also used if return is type "long long") r0-r3: Temporaries. Assumed destroyed by called function. r4-r7: First four integer arguments. Assumed destroyed by called function. r8-r15: Assumed preserved by called functions. Note: Be sure to sign or zero-extend your chars and shorts in r4-r7 before calling a C function! GCC expects arguments in registers to be fully sign extended to an int. Floating-point register: fr0: Function floating-point return value (if any) fr0-fr3: Temporaries. Asssumed destroyed by called function. fr4-fr11: First eight floating-point arguments. Assumed destroyed by called function. fr12-fr14: Assumed preserved by called function. Other registers: pr: Processor return - must be preserved, obviously. gbr: gcc currently does not generate code which utilizes the gbr - so your assembly routines are free to use the gbr. macl, mach, fpul: Assumed to be destroyed. fpscr: depends on -m4, -m4-single, or -m4-single-only. => Source-level debugging GAS assembly The easiest way to assemble assembly files into .o files with debugging information is to use the ".S" extension (MUST be uppercase) and assemble them with gcc: gcc -g -c file.S gcc will call cpp and as with the appropriate commands to properly assemble the files with source-level debug information. Incidentally, this also enables you to use all C preprocessor commands, e.g. "#define", "#if ... #else ... #endif", in your assembly source files. => Stupid (but useful) assembly tricks: A) Use MACL, MACH, and GBR as temporaries. B) Save pr in a register (sts.l pr,r14) Part 4 Code: ---------------------- Writing efficient code ---------------------- => Use local variables. Global variables are slow - to retrieve the value, the SH4 typically must execute: mov.l L2,r1 mov.l @r1,r1 Local variables are faster - it's stack-relative, and parameters are even faster because the first four integers parameters are passed in r4-r7 and first eight floating-point parameters in fr4-fr11. => Write small functions. We've noticed GCC generates very pessimal code when it starts to spill registers, so try to avoid doing too much in one function. A function which exceeds more than about a hundred lines should be broken into smaller functions. => Read structure fields sequentially forwards; but write them sequentially in reverse (if possible). This works well because gcc can use post-increment reads when reading, and pre-decrement writes when writing. => Use the "register" keyword (mostly for locals) The usage of the "register" keyword allows gcc to determine that memory writes cannot possibly overwrite this variable, so it's more likely to be registerized. => Avoid weird casts The ANSI C standard allows the compiler to assume that pointers to different types do not alias; e.g. int * and float * never point to the same location. If you cast floats to ints or to other types then you may confuse the compiler and incorrect code may be generated. => When defining automatics in functions, define non-array items first, in order of number of times accessed, then array items. For this sample: void func(void) { int foo[256]; volatile int bar; bar = 0; } GCC will generate this code, which requires two instructions to access "bar": _func: mov.w L3,r3 mov.l r14,@-r15 mov.w L4,r0 <- r0 = 1024 mov.w L3,r7 sub r3,r15 mov #0,r2 mov r15,r14 mov.l r2,@(r0,r14) <- bar at r15 + 1024 add r7,r14 mov r14,r15 rts mov.l @r15+,r14 .align 1 L3: .short 1028 L4: .short 1024 If you rearrage the function so that "bar" is defined first: void func(void) { volatile int bar; int foo[256]; bar = 0; } ~ then GCC will generate this code, which only requires a single instruction to access "bar": _func: mov.w L3,r3 mov.l r14,@-r15 mov.w L3,r7 sub r3,r15 mov #0,r1 mov r15,r14 mov.l r1,@r14 <- bar at @r14 add r7,r14 mov r14,r15 rts mov.l @r15+,r14 .align 1 L3: .short 1028