This patch implements addsf3/__aeabi_fadd in asm for Thumb1.
Compared with the generic C version (lib/fp_add_impl.inc):
- all constants are materialized with instructions instead of being loaded from a constant pool
- there are no stack spills (the C version uses 136 bytes of stack space)
- clz() is called only when necessary (the C version always calls it)
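To illustrate the clz() point, here is a minimal C sketch of the idea, not the asm in this patch. It assumes a single-precision significand with no guard bits; `renormalize`, `clz32`, and `clz_calls` are hypothetical names used only for illustration, and `__builtin_clz` is the GCC/Clang builtin.

```c
#include <assert.h>
#include <stdint.h>

#define IMPLICIT_BIT (UINT32_C(1) << 23) /* hidden leading 1 of a normal float */

static unsigned clz_calls; /* instrumentation only: counts clz invocations */

/* Stand-in for the clz helper (assumed to be a costly call on Thumb1). */
static int clz32(uint32_t x) {
    clz_calls++;
    return __builtin_clz(x); /* undefined for x == 0 */
}

/* Hypothetical helper: renormalize a nonzero significand (below 2^24)
 * after subtracting magnitudes. Exponent underflow is not handled here. */
static uint32_t renormalize(uint32_t sig, int *exponent) {
    assert(sig != 0);
    if (sig >= IMPLICIT_BIT)
        return sig;             /* fast path: leading 1 still in place, no clz */
    int shift = clz32(sig) - 8; /* 8 == clz of IMPLICIT_BIT */
    *exponent -= shift;
    return sig << shift;
}
```

In this sketch, only results that lost their leading bit through cancellation pay for the clz call; every other case takes the fast path.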
The asm can easily be extended to ARM/Thumb2 and to double-precision floating point.
In real projects, this asm brings performance on par with gcc on Cortex-M3. Previously, performance was about 25% behind.