Summary of changes from 5.57 to 5.60
- Added support for switch on a long long expression.
- Added complex and imaginary number support (including
C99 Annex G).
- _Bool (and _Complex/_Imaginary) now available in pcc and C90 modes.
- MLA instruction used more often, and handled more intelligently.
- CSE aliasing optimisations take account of "restrict" qualifier.
- Optimised handling of narrow (<32 bit) data and computations.
- Will use ARMv5TE's SMULxy and SMLAxy instructions where appropriate.
- Improved checking of printf/scanf format strings.
- Inlines signbit().
- Improved compilation of (int) longlong, (int) longlong >> 32, and
longlong >> 1 or << 1. Also, long long multiplication and division by
powers of two transformed into shifts.
- Can transform integer division by constant into a 32x32->64
multiplication (if available on CPU, and -Otime selected).
- Pointer subtraction optimised to use only multiply and/or shift.
- Added inter-statement compile-time evaluation of long long arithmetic.
- Improved CSE handling, especially for FP constants and comparisons.
- IEEE 754 conformance improved; generally edging the compiler closer to
C99 Annex F.
- asm keyword recognised in C++ mode.
- -arch command-line parameter added.
- Some improvements to treatment of volatile objects.
- Changes to handling of floating arguments for -apcs /fpregargs.
- Numerous code generation improvements and bug fixes.
- Banner and help now sent to stderr instead of stdout.
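Two of the items above replace division with cheaper operations. The sketch below shows, in portable C, the arithmetic that such transformations rely on. It illustrates the standard techniques, not the compiler's actual output, and the function names div10 and elems12 are hypothetical. div10 divides by the constant 10 using one 32x32->64 multiply and a shift. elems12 converts a pointer difference in bytes into an element count for an assumed 12-byte element type: because the byte difference is always an exact multiple of 12, it can be multiplied by the inverse of 3 modulo 2^32 and then shifted right by 2, with no division at all.

```c
#include <stdint.h>

/* Hypothetical helper: x / 10 for any 32-bit unsigned x, using one
   32x32->64 multiply and a shift. 0xCCCCCCCD is ceil(2^35 / 10). */
uint32_t div10(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
}

/* Hypothetical helper: number of 12-byte elements between two pointers,
   given their difference in bytes (always an exact multiple of 12).
   0xAAAAAAAB is the inverse of 3 modulo 2^32, so the multiply yields
   exactly bytes / 3; an arithmetic shift right by 2 then divides by 4.
   Assumes the usual two's-complement conversion to int32_t and that >>
   on a negative int32_t is an arithmetic shift. */
int32_t elems12(int32_t bytes)
{
    int32_t four_k = (int32_t)((uint32_t)bytes * 0xAAAAAAABu);
    return four_k >> 2;
}
```

The multiplier for div10 works for every 32-bit input because ceil(2^35 / 10) * 10 exceeds 2^35 by only 2, which is too small an error to change any quotient.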
Extra notes
restrict
Previously the restrict qualifier was recognised, but didn't affect code
generation. As of version 5.59, restrict will lead to improved code in some
circumstances. See the separate document for detailed examples.
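The function below is a hypothetical example of the kind of code restrict is meant to help. Without the qualifiers, a store through dst could in principle modify *factor, so the compiler would have to reload *factor on every iteration. With them, the programmer promises that the three pointers do not refer to overlapping objects, so *factor can be loaded once and kept in a register. This note does not say whether version 5.59 performs this particular optimisation; the separate document remains the authoritative source of examples.

```c
/* Hypothetical example: scale src by *factor into dst.
   restrict promises that dst, src and factor never overlap, so a store
   to dst[i] cannot change *factor or any src element. */
void scale(int *restrict dst, const int *restrict src,
           const int *restrict factor, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = src[i] * *factor;
}
```

Calling scale with overlapping arguments (for example, factor pointing into dst) would break that promise and give undefined behaviour.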
SMULxy/SMLAxy
These instructions, new in architecture 5TE, provide the ability to work on
16-bit signed numbers packed into 32-bit words, and are of particular use in
certain signal processing applications (such as video decompressors).
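To make the semantics concrete, here is a small C model of three of these instructions. It is a sketch for illustration only: smulbb, smultb and smlatb are hypothetical function names, not compiler intrinsics, and the model ignores the Q flag that SMLAxy sets when the accumulation overflows. In each mnemonic, the first letter selects the bottom (B) or top (T) half of the first source register and the second letter selects the half of the second source register; each selected half is treated as a signed 16-bit number.

```c
#include <stdint.h>

/* Signed bottom (B) and top (T) 16-bit halves of a 32-bit register value.
   The narrowing casts assume the usual two's-complement conversion. */
static int32_t half_b(uint32_t r) { return (int16_t)(r & 0xFFFFu); }
static int32_t half_t(uint32_t r) { return (int16_t)(r >> 16); }

/* SMULBB Rd, Rm, Rs  ->  Rd = Rm.B * Rs.B */
int32_t smulbb(uint32_t rm, uint32_t rs)
{
    return half_b(rm) * half_b(rs);
}

/* SMULTB Rd, Rm, Rs  ->  Rd = Rm.T * Rs.B */
int32_t smultb(uint32_t rm, uint32_t rs)
{
    return half_t(rm) * half_b(rs);
}

/* SMLATB Rd, Rm, Rs, Rn  ->  Rd = Rm.T * Rs.B + Rn
   (this model assumes the addition does not overflow) */
int32_t smlatb(uint32_t rm, uint32_t rs, int32_t rn)
{
    return half_t(rm) * half_b(rs) + rn;
}
```

With the words 0x0003FFFE (top 3, bottom -2) and 0xFFFB0007 (top -5, bottom 7), smultb followed by smlatb evaluates 3*7 + (-5)*(-2) = 31, the same pairing of halves that mul2 computes in this section.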
The compiler can now generate these instructions for multiplications of
narrow, signed values. (Previously they could only be accessed from inline
assembler.)
For example,
int mul1(short *a)
{
return a[1]*a[2] + a[3]*a[0];
}
will compile as:
mul1
LDRH a2,[a1,#2]
LDRH a3,[a1,#4]
SMULBB a2,a2,a3
LDRH a3,[a1,#6]
LDRH a1,[a1,#0]
SMLABB a1,a3,a1,a2
MOV pc,lr
As this example illustrates, the compiler loads each short into a separate
register, so it cannot take full advantage of the instructions' ability to
select either half of a register. To give the compiler a hint, you can either
unpack the values from ints manually (using masks and shifts), or use bitfields.
For example:
struct hl
{
signed int l:16,h:16;
};
int mul2(struct hl *a)
{
return a[0].h*a[1].l + a[1].h*a[0].l;
}
will compile as:
mul2
LDR a2,[a1,#0]
LDR a1,[a1,#4]
SMULTB a3,a2,a1
SMLATB a1,a1,a2,a3
MOV pc,lr
Such packing may, however, reduce the performance of other operations on the
packed values.
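The other approach suggested above, unpacking the halves from ints yourself with masks and shifts, might look like the following hypothetical mul3, which computes the same sum as mul2 from two 32-bit words. It is only a sketch: these notes do not say whether the compiler selects SMULTB/SMLATB for this particular form, and the narrowing casts assume the usual two's-complement conversion.

```c
#include <stdint.h>

/* Hypothetical manual-unpacking version of mul2: each word holds a signed
   16-bit value in its low half (l) and another in its high half (h). */
int mul3(const uint32_t *a)
{
    /* Mask off the low half; narrowing to int16_t sign-extends it. */
    int32_t l0 = (int16_t)(a[0] & 0xFFFFu);
    int32_t l1 = (int16_t)(a[1] & 0xFFFFu);
    /* Shift down the high half; narrowing to int16_t sign-extends it. */
    int32_t h0 = (int16_t)(a[0] >> 16);
    int32_t h1 = (int16_t)(a[1] >> 16);

    return h0 * l1 + h1 * l0;
}
```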
-apcs /fpregargs
The calling convention for /fpregargs has been changed to align better with
ARM's later tools. Note that /fpregargs is not normally used under RISC OS.
The changes are:
- FP arguments are never passed in FP registers for variadic functions.
Instead they're passed in integer registers or on the stack. In -pcc mode,
FP registers are not used unless the first argument to a function is
floating. (See ARM's ATPCS documentation for an explanation of this
heuristic.)
- Homogeneous floating-point structure arguments are now passed in
floating-point registers. For example, a structure consisting of 3 floats
(only) will be passed in 3 adjacent floating-point registers. This
includes complex numbers, which will be passed in a pair of registers. If
the appropriate number of registers is not available, then the entire
structure is passed on the stack - a structure cannot be split between FP
registers and the stack.
- Homogeneous floating-point structure returns, using __value_in_regs,
are now returned in F0-F3, regardless of the APCS setting. Complex
numbers are returned in F0 and F1.
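The following sketch shows declarations that the rules above apply to. It is illustrative only: the names vec3, scale3 and tagged are hypothetical, and using __CC_NORCROFT to detect the compiler is an assumption, included so that the __value_in_regs keyword is dropped on other compilers and the code still builds. The comments describe where values would travel under -apcs /fpregargs according to these notes; the C semantics are the same either way.

```c
/* Assumption: __CC_NORCROFT identifies Norcroft C. Elsewhere the
   __value_in_regs keyword is omitted so that the sketch still compiles. */
#ifdef __CC_NORCROFT
#define VALUE_IN_REGS __value_in_regs
#else
#define VALUE_IN_REGS
#endif

/* Homogeneous floating-point structure: every member is a float. */
struct vec3 { float x, y, z; };

/* Under -apcs /fpregargs, v can be passed in three adjacent FP registers,
   or entirely on the stack if three are not available (never split).
   Declared __value_in_regs, the result is returned in FP registers
   starting at F0, regardless of the APCS setting. */
VALUE_IN_REGS struct vec3 scale3(struct vec3 v, float k)
{
    struct vec3 r = { v.x * k, v.y * k, v.z * k };
    return r;
}

/* Not homogeneous (mixes float and int), so the FP-register rule for
   structure arguments does not cover it. */
struct tagged { float value; int tag; };
```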
-arch
The compiler now tunes its output to match processor timing characteristics,
depending on the setting of -cpu or the new option -arch. In particular, the
cycle counts of LDR and MUL instructions are taken into account.
The new command-line argument -arch, when used on its own, is equivalent to
-cpu, except it only accepts architecture names (eg "-arch 5TE"). When both
-cpu and -arch are used simultaneously, the compiler will optimise for the
processor/architecture given by -cpu, but generate code that will run on the
architecture given by -arch.
This allows the user to request optimising for a new processor while not
ruling out the code being run on an older one.
For example:
| default | code optimised for ARM7, runs on ARMv3 |
| -cpu ARM7TDMI | code optimised for ARM7TDMI, runs on ARMv4T |
| -cpu 5 | code optimised for a typical ARMv5 processor, runs on ARMv5 |
| -arch 5 | same as -cpu 5 |
| -arch 3 -cpu XScale | code optimised for XScale but runs on ARMv3 |
| -cpu ARM2 -arch 5TE | code optimised for ARM2, but requires ARMv5TE (nonsensical, but allowed) |
-arch and -cpu can be specified in either order. Only the last -cpu and last
-arch given on the command line are significant.