Patch to fix bug 39967
There are a few further optimizations that can be made, but this is many times better than scalarizing.
These routines take up to 11 cycles; however, the smallest possible case is 7 cycles with two pre-interleaved values, or 4 cycles if a single vmull suffices.
The patch automatically tries to select the proper routine.
twomul is used for multiplies that could not be simplified. It takes 7 instructions and 11 cycles. It looks like so:
    vmovn.i64  topLo, top
    vmovn.i64  botLo, bot
    vrev64.32  bot, bot
    vmul.i32   bot, bot, top
    vpaddl.i32 bot, bot
    vshl.i64   top, bot, #32
    vmlal.u32  top, botLo, topLo
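For reference, here is a minimal sketch of the same sequence in C with NEON intrinsics; the function name and variable names are mine, not part of the patch:

    #include <arm_neon.h>

    /* twomul as intrinsics: per 64-bit lane, top*bot =
     * ((topLo*botHi + topHi*botLo) << 32) + topLo*botLo. */
    static uint64x2_t twomul(uint64x2_t top, uint64x2_t bot)
    {
        uint32x2_t topLo = vmovn_u64(top);   /* vmovn.i64: low halves of top */
        uint32x2_t botLo = vmovn_u64(bot);   /* vmovn.i64: low halves of bot */
        /* vrev64.32: swap the 32-bit halves within each 64-bit lane */
        uint32x4_t rev   = vrev64q_u32(vreinterpretq_u32_u64(bot));
        /* vmul.i32: botHi*topLo and botLo*topHi, side by side */
        uint32x4_t cross = vmulq_u32(rev, vreinterpretq_u32_u64(top));
        uint64x2_t sums  = vpaddlq_u32(cross);  /* vpaddl.i32: sum the cross terms */
        uint64x2_t ret   = vshlq_n_u64(sums, 32); /* vshl.i64: into the high half */
        return vmlal_u32(ret, botLo, topLo);      /* vmlal.u32: add low*low */
    }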
ssemul is simpler, clearer, and more modular. It has the same timing as twomul but takes one more instruction.
However, if anything can be simplified, such as pre-interleaving a load or constant, or removing multiplies on bits that are known to be zero, ssemul ends up being both faster and shorter.
    vmovn.i64 topLo, top
    vshrn.i64 topHi, top, #32
    vmovn.i64 botLo, bot
    vshrn.i64 botHi, bot, #32
    vmull.u32 ret, topLo, botHi
    vmlal.u32 ret, topHi, botLo
    vshl.i64  ret, ret, #32
    vmlal.u32 ret, topLo, botLo
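The same sequence as an intrinsics sketch, again with names of my own choosing rather than the patch's actual output:

    /* ssemul as intrinsics; assumes arm_neon.h as above. */
    static uint64x2_t ssemul(uint64x2_t top, uint64x2_t bot)
    {
        uint32x2_t topLo = vmovn_u64(top);       /* vmovn.i64: low halves  */
        uint32x2_t topHi = vshrn_n_u64(top, 32); /* vshrn.i64: high halves */
        uint32x2_t botLo = vmovn_u64(bot);
        uint32x2_t botHi = vshrn_n_u64(bot, 32);
        uint64x2_t ret   = vmull_u32(topLo, botHi); /* first cross term  */
        ret = vmlal_u32(ret, topHi, botLo);         /* second cross term */
        ret = vshlq_n_u64(ret, 32);                 /* shift into place  */
        return vmlal_u32(ret, topLo, botLo);        /* add low*low       */
    }

Because each half sits in its own register here, dropping a known-zero term just deletes instructions, which is presumably why simplifications favor ssemul.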
Some missing optimizations:
- Masking a uint64x2_t/v2i64 by 0xFFFFFFFF shouldn't actually generate a vand; since the high halves are then known to be zero, it should instead remove one multiply and one vshrn (see the first sketch after this list).
- Constant interleaving is put into two vldr instructions. This might not be the most efficient, as I want a vld1.64. I don't know how adr comes into play, but I think the load could run in parallel with the shifting on the other multiplicand.
- Load interleaving should be implemented. If a multiplicand is used only once and is loaded from a pointer, replacing the vld1.64 with a vld2.32 and using the two uint32x2_t/v2i32 values it produces also avoids the vshrn/vmovn (see the second sketch after this list).
- The cost model is pretty weird with v2i64 in NEON. We should probably fix the "add 4 to the cost" hack, as NEON v2i64 vectors are not as expensive as that huge penalty makes them out to be.
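First sketch, a hypothetical illustration of the masking point (the function is mine, not the patch's): if bot has been masked by 0xFFFFFFFF, its high halves are known zero, so the vand, one vshrn, and one cross multiply should all disappear.

    /* ssemul with bot known to fit in 32 bits per lane. */
    static uint64x2_t ssemul_bot_masked(uint64x2_t top, uint64x2_t bot)
    {
        uint32x2_t topLo = vmovn_u64(top);
        uint32x2_t topHi = vshrn_n_u64(top, 32);
        uint32x2_t botLo = vmovn_u64(bot); /* no vand, no vshrn: high halves are zero */
        uint64x2_t ret   = vmull_u32(topHi, botLo); /* the only surviving cross term */
        ret = vshlq_n_u64(ret, 32);
        return vmlal_u32(ret, topLo, botLo);
    }

If both operands are masked, the cross term vanishes too, leaving the single vmull.u32 mentioned at the top.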
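Second sketch, showing what load interleaving could look like. This assumes little-endian layout, where vld2.32 puts the low words in val[0] and the high words in val[1]; the function and its pointer parameter are illustrative only.

    /* Load a single-use multiplicand with vld2.32 instead of vld1.64,
     * so its halves arrive pre-split and need no vshrn/vmovn. */
    static uint64x2_t ssemul_load(uint64x2_t top, const uint64_t *p)
    {
        uint32x2x2_t bot   = vld2_u32((const uint32_t *)p); /* vld2.32 */
        uint32x2_t   botLo = bot.val[0];  /* even words: low halves  */
        uint32x2_t   botHi = bot.val[1];  /* odd words: high halves  */
        uint32x2_t   topLo = vmovn_u64(top);
        uint32x2_t   topHi = vshrn_n_u64(top, 32);
        uint64x2_t   ret   = vmull_u32(topLo, botHi);
        ret = vmlal_u32(ret, topHi, botLo);
        ret = vshlq_n_u64(ret, 32);
        return vmlal_u32(ret, topLo, botLo);
    }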