Page MenuHomePhabricator

[Builtin][ARM] Implement addsf3/__aeabi_fadd for Thumb1
ClosedPublic

Authored by weimingz on Feb 2 2017, 10:52 PM.

Details

Summary

This patch implements addsf3/__aeabi_fadd in asm for Thumb1.
Compared with generic C version (lib/fp_add_impl.inc), it

  1. all constants are materialized instead of loading from constant pool
  2. no stack spills (C version uses 136 bytes stack space)
  3. clz() is called only when necessary. (C version always calls it)

The asm is able to be extended to ARM/Thumb2 and for double fp easily.

In real projects, this asm brings the performance on par with gcc on cortex-m3. It was about 25% behind.

Event Timeline

weimingz created this revision.Feb 2 2017, 10:52 PM

all constants are materialized instead of loading from constant pool
no stack spills (C version uses 136 bytes stack space)
clz() is called only when necessary. (C version always calls it)

If we have to hand-write assembly to efficiently materialize constants, something has gone very wrong. It's better to fix the compiler if we can; please file bugs for these issues.

rengolin edited edge metadata.Feb 3 2017, 1:13 PM

This is thumb 1, we have no quality control or tracking. We can fix the compiler, but that doesn't stop us from optimising a run Tim library with assembly code. The compiler will never be perfect and making this optimal in the compiler will take a lot longer than a patch or two.

Maybe that was worded a bit too strongly; I'm not arguing we shouldn't merge this. We should record the issues we're working around at llvm.org/bugs/ , though.

My previous comment of "136 bytes of stack usage" is inaccurate. I just find the default build doesn't have any -O flag. Is it an issue without a default -O flag?

Internally, we use -Os.
Using -Os, it uses 24 bytes in stack for spilling.

Regarding constants, it uses the following: (some I can figure out easily)
238: 7fffffff .word 0x7fffffff ==> this mask is to get the Abs value. In Asm, we do lsl and lsr. Since we need to do Abs twice, compiler may think this is better. But ld is more expensive.
23c: 007fffff .word 0x007fffff ===> similar mask. Used twice
240: 55555555 .word 0x55555555 ==> no idea
244: 33333333 .word 0x33333333 ==> no idea
248: 0f0f0f0f .word 0x0f0f0f0f ==> no idea
24c: 01010101 .word 0x01010101 ==> no idea
250: 7fc00000 .word 0x7fc00000 ==> Similar mask

On code size, addsf3.c.o (using -Os) vs addsf3.S.o is 604 vs 312 bytes.
For performance, on a armv7 Android device, it's 1.4x faster using my own micro benchmark. I don't have the number on a cortex-m0. But on real project, this asm version let us close the 25% gap against gcc.

asl accepted this revision.Feb 4 2017, 1:26 AM
This revision is now accepted and ready to land.Feb 4 2017, 1:26 AM
This revision was automatically updated to reflect the committed changes.

240: 55555555 .word 0x55555555 ==> no idea

This is getting used by the inlined implementation of ctlz.