If we cannot otherwise use a VMOVimm/VMOVFPimm/VMVNimm, fall back to producing a VDUP(const) rather than a constant pool load. This at least reduces code size, and it can allow the VDUP to be folded into other instructions.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
Looks like an obviously good thing, and I only have one nitpick.
llvm/lib/Target/ARM/ARMISelLowering.cpp | ||
---|---|---|
7648 | You've used VECTOR_REG_CAST where other branches of this code have BITCAST. As far as I can see, either one will work provided the constant is constructed right (e.g. if you wanted to make a v16i8 containing 1,2,3,4,1,2,3,4,... then you might have to vdup 0x01020304 or 0x04030201 depending which cast you wanted to use afterwards). But I don't see any big-endian test to demonstrate it picking the right one. Unless I've missed one, could you add it? |
Added two new test cases: mov_int8_1234, which builds the i8 vector <1,2,3,4,1,2,3,4,...> as you suggested, and mov_int32_16908546, which is 0x1020102 VDUP'd as an i16.
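For anyone checking the constants in those two tests by hand, here is a minimal sketch of the arithmetic. It is a plain integer model, not LLVM code. The helpers `pack_lanes` and `byte_reverse32` are hypothetical names introduced only for this illustration, and the sketch assumes lane 0 sits in the low-order bits of the packed value.

```python
def pack_lanes(lanes, lane_bits):
    """Pack lanes into one integer with lane 0 in the low-order bits."""
    mask = (1 << lane_bits) - 1
    value = 0
    for index, lane in enumerate(lanes):
        value |= (lane & mask) << (index * lane_bits)
    return value


def byte_reverse32(value):
    """Reverse the four bytes of a 32-bit value."""
    return int.from_bytes(value.to_bytes(4, "little"), "big")


# i8 lanes 1,2,3,4 packed with lane 0 lowest give one candidate constant;
# reversing its bytes gives the other candidate mentioned in the review.
low_first = pack_lanes([1, 2, 3, 4], 8)
high_first = byte_reverse32(low_first)
print(hex(low_first))   # 0x4030201
print(hex(high_first))  # 0x1020304

# The decimal suffix in mov_int32_16908546 is the same value as 0x1020102.
print(hex(16908546))    # 0x1020102
```

Which of the two 32-bit constants the backend should pick depends on the cast applied afterwards, which is exactly what the big-endian tests are meant to pin down.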
llvm/test/CodeGen/Thumb2/mve-vmovimm.ll | ||
---|---|---|
37–39 | I think this output is right, but it confused me completely for a while and I had to try it in emulation to convince myself! In the middle of a larger function, I think that if you wanted to make this 1,2,3,4,1,2,3,4,... vector and then immediately apply another v16i8 operation to it, you would vdup the same 32-bit constant 0x04030201 regardless of endianness, because the logical 'lane 0' of the vector always occupies the low-order bits. And the reason why the output is different between LE and BE in this context is that the vdup is immediately followed by a function return, which in BE requires an extra vrev due to the vector register PCS. And that function-return vrev has been folded into the constant, which is why it's the other way round here. So, I think this is the right output, but it might benefit from a comment in case the next reader gets as confused as I did! |
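A rough model of why the folded constant comes out "the other way round" is sketched below. It is an assumption-laden illustration rather than a description of the actual lowering: the register is modelled as 16 bytes with logical lane 0 first, and the function-return vrev is modelled as a byte reversal within each 64-bit half. The thread does not state which vrev form is used, and the helper names `vdup32` and `reverse_bytes_in_chunks` are hypothetical.

```python
def vdup32(constant):
    """Model a 128-bit register holding four copies of a 32-bit constant.

    The result is a list of 16 byte values, lowest-order byte first, so
    index 0 corresponds to logical lane 0 of a v16i8.
    """
    return list(constant.to_bytes(4, "little")) * 4


def reverse_bytes_in_chunks(reg_bytes, chunk):
    """Reverse the byte order inside each chunk-sized group of bytes."""
    out = []
    for start in range(0, len(reg_bytes), chunk):
        out.extend(reversed(reg_bytes[start:start + chunk]))
    return out


# Mid-function view: lanes 1,2,3,4,... come from vdup of 0x04030201.
lanes = vdup32(0x04030201)

# Assumed return-time reversal (modelled per 64-bit half), applied on top.
after_vrev = reverse_bytes_in_chunks(lanes, 8)

# Folding that reversal into the constant gives a vdup of the byte-swapped value.
folded = vdup32(0x01020304)

print(lanes[:8])             # [1, 2, 3, 4, 1, 2, 3, 4]
print(after_vrev == folded)  # True
```

Under these assumptions the vdup followed by the reversal is indistinguishable from a single vdup of 0x01020304, which matches the reasoning that the BE output differs only because the return-time vrev was folded into the constant.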