This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Add support for efficient bitcast in vector truncate store.
ClosedPublic

Authored by lawben on Apr 14 2023, 2:27 AM.

Details

Summary

Following the changes in D145301, we now also support the efficient bitcast when storing the bool vector. Previously, this was expanded.

Diff Detail

Event Timeline

lawben created this revision.Apr 14 2023, 2:27 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 14 2023, 2:27 AM
Herald added a subscriber: hiraditya. · View Herald Transcript
lawben requested review of this revision.Apr 14 2023, 2:27 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 14 2023, 2:27 AM
lawben retitled this revision from Add support for efficient bitcast in vector truncate store. to [AArch64] Add support for efficient bitcast in vector truncate store..Apr 14 2023, 2:31 AM
dmgreen added inline comments.Apr 18 2023, 12:34 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
1223

v8i1? It doesn't appear that these are necessary though, as the trunc stores are already created and the new code goes through a combine.

You could alternatively add the call to combineVectorCompareAndTruncateStore to lowerSTORE instead of the combine.

llvm/test/CodeGen/AArch64/vec_uaddo.ll
1

Remove this?

llvm/test/CodeGen/AArch64/vec_umulo.ll
1

Remove this?

lawben added inline comments.Apr 18 2023, 3:23 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
1223

regarding moving it to lowerSTORE, before I go down that path and change the code. i originally had it in there and then moved it to the "combine" pass, because the truncating store is legalized _after_ the setcc. so we have lots of compare logic with intermediate XOR etc., which hinders current other passes to fully optimize them out again. we are essentially calling truncate, then sign extend. the sign extend is removed if it knows that numSignBits is equal to the element's size. but we lose this information after legalization . if i have this logic applied before legalization, we don't run into that issue.

I'm happy to move the code and figure our how to get the optimizations to apply again if lowerSTORE is the correct location. if this was just a suggestion, i think leaving it in the combine pass is better. just wanted to get your opinion before implementing it.

Does having sign bits for the vector compare nodes help, as in https://reviews.llvm.org/D148624? I wasn't able to find a good way to test that, the optimizer would always clear away the results for all the examples I tried prior to creating the compare nodes.

Does having sign bits for the vector compare nodes help, as in https://reviews.llvm.org/D148624? I wasn't able to find a good way to test that, the optimizer would always clear away the results for all the examples I tried prior to creating the compare nodes.

Unfortunately, it does not. I'll try to dig into this tomorrow when I have some time. I'm not familiar with how the sign extend is removed but if we have the new method in lowerSTORE, there is an xor with a 0-vector, which I'm not sure passes the known sign bits check. I'll let you know once I have some more insights from digging into it. Maybe I can get this to work,

I dug into this a bit, and your vector compare sing bit patch does the right thing here. But the problem is unrelated and a bit annoying. At the time the sign_extend_inreg that we added is combined (to potentially remove it), there is a build_vector with (1 << vectorElementSize) - 1 bits (to negate via xor). But ComputeNumSignBits breaks here. The vector has 32-bit constants but we only have 8-bit vectors. So it detects 24 sign bits and then kinda gives up. So without changing code in SelectionDAG::ComputeNumSignBits or the negation of vector entries, there is no way to correctly determine the sign bits.

If you have any ideas how to overcome this, please let me know. Otherwise, I'd leave this in performSTORECombine to apply it before legalization (vec != 0 --> (vec == 0) ^ 255 ) breaks it.

I dug into this a bit, and your vector compare sing bit patch does the right thing here. But the problem is unrelated and a bit annoying. At the time the sign_extend_inreg that we added is combined (to potentially remove it), there is a build_vector with (1 << vectorElementSize) - 1 bits (to negate via xor). But ComputeNumSignBits breaks here. The vector has 32-bit constants but we only have 8-bit vectors. So it detects 24 sign bits and then kinda gives up. So without changing code in SelectionDAG::ComputeNumSignBits or the negation of vector entries, there is no way to correctly determine the sign bits.

If you have any ideas how to overcome this, please let me know. Otherwise, I'd leave this in performSTORECombine to apply it before legalization (vec != 0 --> (vec == 0) ^ 255 ) breaks it.

Yeah that sounds OK. There might be a way to fix it by teaching something to truncate constants properly, but many of those things can lead to more and more issues that need to be accounted for. Having this as a combine sounds like it should be fine, they just needs to more careful about illegal types and it looks like those should be accounted for.

I think it can remove the setTruncStoreAction's if it goes through a combine?

lawben updated this revision to Diff 516057.Apr 22 2023, 4:52 AM

Address a few review comments.

@dmgreen If you are concerned about legal types, should I add a check to only perform this before legalization?

@dmgreen If you are concerned about legal types, should I add a check to only perform this before legalization?

It would be the other way around, performing the transform after legalization so that we know all types have been legalized. But that runs into the same issues as doing it during lowering. It is worth making sure there are tests for illegal types with odd number vector lanes and non-power2 integer types.

llvm/test/CodeGen/AArch64/vec-combine-compare-truncate-store.ll
228

Some of these with low vector lanes are starting to look worse than the code before. The fmov/strb could be done on the fp side, but I don't think that would be enough to make then profitable. Is it worth limiting it to >= 4 vector lanes?

lawben updated this revision to Diff 517109.Apr 26 2023, 2:41 AM

Added check for illegal types. I could not get this check to actually fire. The type is legalized before the bitcast in the bitcast lowering case, and it is legalized before the
store becomes a truncating store. So in both cases other checks prevent bad things from happening. But it is probably still fine to keep it as a defensive check in case things
change.

lawben added inline comments.Apr 26 2023, 2:47 AM
llvm/test/CodeGen/AArch64/vec-combine-compare-truncate-store.ll
228

On my M1, it is still faster though. But I agree that this needs a bit of investigation on other ARM CPUs.

Suggestion: In a follow-up patch, with the changes to v16i8 suggested by @efriedma, I'll run a set of benchmarks for the v16i8 and the v2i64 (and other) cases on my M1, Graviton 2, Graviton 3, and a Pi 4 (and possibly an A64FX, but that has terrible NEON performance across the board). I have this setup for another project at the moment anyway. Then this should give us a bit of a wider range of performance characteristics.

So I'd suggest leaving this as is in this patch and than doing a performance-based follow-up patch. Thoughts?

dmgreen accepted this revision.Apr 26 2023, 8:59 AM

Thanks for checking about the performance. LGTM in that case.

llvm/test/CodeGen/AArch64/vec-combine-compare-truncate-store.ll
228

Sure - I was just going from the number of instructions and the extra constant pools. The critical path looks longer (but that might not matter much), and there are less FPR->GPR transfers that will help.

This revision is now accepted and ready to land.Apr 26 2023, 8:59 AM

@dmgreen Thanks for your review. Could you please merge this with "Lawrence Benson <github@lawben.com>".

Oh yes, sorry I keep forgetting. If you have more patches you intended to submit then you could probably ask for commit access. I will commit this one now. Thanks for working on the patch.