If we cannot otherwise use a VMOVimm/VMOVFPimm/VMVNimm, fall back to producing a VDUP(const) rather than a constant pool load. This should at least reduce code size, and it can allow the VDUP to be folded into other instructions.
Looks like an obviously good thing, and I only have one nitpick.
llvm/lib/Target/ARM/ARMISelLowering.cpp | ||
---|---|---|
7681 | You've used VECTOR_REG_CAST where other branches of this code have BITCAST. As far as I can see, either one will work provided the constant is constructed right (e.g. if you wanted to make a v16i8 containing 1,2,3,4,1,2,3,4,... then you might have to vdup 0x01020304 or 0x04030201 depending which cast you wanted to use afterwards). But I don't see any big-endian test to demonstrate it picking the right one. Unless I've missed one, could you add it? |
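To make the two candidate constants concrete, here is a minimal sketch of how the byte lanes 1,2,3,4 map onto a 32-bit VDUP immediate under the two possible lane orders. The helper names are hypothetical and are not taken from ARMISelLowering.cpp; the sketch only illustrates the arithmetic, not which order either cast actually implies.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical helpers for illustration only (not LLVM code).

// Pack four i8 lanes into a 32-bit value with lane 0 in the low-order byte.
inline uint32_t packLane0Low(const std::array<uint8_t, 4> &lanes) {
  return uint32_t(lanes[0]) | (uint32_t(lanes[1]) << 8) |
         (uint32_t(lanes[2]) << 16) | (uint32_t(lanes[3]) << 24);
}

// Pack the same lanes with lane 0 in the high-order byte instead.
inline uint32_t packLane0High(const std::array<uint8_t, 4> &lanes) {
  return (uint32_t(lanes[0]) << 24) | (uint32_t(lanes[1]) << 16) |
         (uint32_t(lanes[2]) << 8) | uint32_t(lanes[3]);
}

// The two constants mentioned in the comment above.
inline void checkPackingExample() {
  assert(packLane0Low({1, 2, 3, 4}) == 0x04030201u);
  assert(packLane0High({1, 2, 3, 4}) == 0x01020304u);
}
```

Whichever cast is used, the constant handed to the VDUP has to be built with the matching order, which is what a big-endian test would pin down.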
Added two new test cases: mov_int8_1234, which builds the i8 vector <1,2,3,4,1,2,3,4,...> you described, and mov_int32_16908546, which is 0x1020102 VDUP'd as an i16.
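As a quick sanity check on the second test name, the sketch below shows why an i32 splat of 16908546 can be expressed as an i16 splat: both 16-bit halves of 0x01020102 are 0x0102. The helper name is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: the 32-bit view of two adjacent i16 lanes that
// both hold the value `v`.
inline uint32_t splat16In32(uint16_t v) {
  return (uint32_t(v) << 16) | uint32_t(v);
}

inline void checkSplat16Example() {
  // 0x0102 duplicated as an i16 reads back as 0x01020102 per 32-bit lane.
  assert(splat16In32(0x0102) == 0x01020102u);
  assert(splat16In32(0x0102) == 16908546u);
}
```

Because the two halves are identical, the result does not depend on which half holds the lower-numbered i16 lane.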
llvm/test/CodeGen/Thumb2/mve-vmovimm.ll | ||
---|---|---|
23–24 | I think this output is right, but it confused me completely for a while and I had to try it in emulation to convince myself! In the middle of a larger function, I think that if you wanted to make this 1,2,3,4,1,2,3,4,... vector and then immediately apply another v16i8 operation to it, you would vdup the same 32-bit constant 0x04030201 regardless of endianness, because the logical 'lane 0' of the vector always occupies the low-order bits. And the reason why the output is different between LE and BE in this context is that the vdup is immediately followed by a function return, which in BE requires an extra vrev due to the vector register PCS. And that function-return vrev has been folded into the constant, which is why it's the other way round here. So, I think this is the right output, but it might benefit from a comment in case the next reader gets as confused as I did! |
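The folding described above can be modelled in a few lines. This is a simplified sketch, not the lowering code: it assumes the return-time vrev reverses the bytes within each 64-bit half of the register (the comment only says "an extra vrev", so the exact variant is an assumption), and it indexes the array by logical i8 lane, with lane 0 in the low-order bits of each 32-bit element.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Bytes16 = std::array<uint8_t, 16>;

// Splat a 32-bit value across a 128-bit vector. Index i is logical i8
// lane i, and lane 0 takes the low-order byte of the value.
inline Bytes16 splat32(uint32_t v) {
  Bytes16 out{};
  for (std::size_t i = 0; i < out.size(); ++i)
    out[i] = uint8_t(v >> (8 * (i % 4)));
  return out;
}

// Assumed model of the return-time reversal: reverse the byte order
// within each 64-bit half of the vector.
inline Bytes16 reverseBytesPer64(Bytes16 v) {
  std::reverse(v.begin(), v.begin() + 8);
  std::reverse(v.begin() + 8, v.end());
  return v;
}

inline void checkFoldedReverse() {
  // Splatting 0x04030201 and then reversing gives the same bytes as
  // splatting 0x01020304 directly, so the reversal folds into the constant.
  assert(reverseBytesPer64(splat32(0x04030201u)) == splat32(0x01020304u));
}
```

Under this model, a reversal within each 32-bit element would give the same result for a splat, so the conclusion does not hinge on the exact vrev width.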