This patch is using the new intrinsics introduced in
https://reviews.llvm.org/D141924
to enable vecorization of interleaved accesses for scalable VF.
Targets need to implement a proper cost model for supported operations
to make sure that generated IR can be code generated.
Details
Diff Detail
Event Timeline
llvm/lib/IR/IRBuilder.cpp | ||
---|---|---|
596 | Please use PoisonValue here and whenever possible as we are trying to get rid of UndefValue. |
Thanks for adding this! I'm currently plugging in the hooks for RISC-V and will let you know what I run into.
llvm/lib/IR/IRBuilder.cpp | ||
---|---|---|
587 | Maybe this should be an assertion | |
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | ||
2721 | Need to check that Group->getFactor() == 2 here or that the call to CreateMaskedInterleavedLoad succeeds | |
2809 | Need to check Group->getFactor() == 2 here too |
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | ||
---|---|---|
2751 | It's somehow possible to reach here with a scalable vector type if TII->hasInterleavedLoad returns false. |
llvm/include/llvm/Analysis/TargetTransformInfo.h | ||
---|---|---|
835–839 | I think these functions could actually be joined into 1, something like: supportsInterleaving(VectorType *VecTy, uint32_t Factor, bool IsMasked) I think it is going to be unlikely that target supports store but not load for the same vector type, @paulwalker-arm what do you think? | |
llvm/lib/IR/IRBuilder.cpp | ||
587 | Hi Luke, | |
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | ||
2721 | So my idea was that it would be up to the hasInterleavedLoad function to make sure that it returns true only when Factor is 2, so no extra checks is needed I think. | |
2751 | So it is actually connected by the LV cost model, the LoopVectorizationCostModel::getInterleaveGroupCost is calling TTI.getInterleavedMemoryOpCost which should return invalid cost for factors different than 2. |
llvm/include/llvm/IR/IRBuilder.h | ||
---|---|---|
771 | This is the wrong interface. The IRBuilder interface should provide a way to create the interleave and deinterleave instrinsic calls. That interface should generate shuffles for fixed vectors. Then the calling logic in the vectorizer should worry about emitting the load/store. (That's the existing structure in fact.) | |
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | ||
2987 | The changes to this function are NFC for fixed length vectors, and a generally useful scalable cleanup. Please separate and land this change without the need for further review. This applies *only* to the changes in this function so as to shrink the diff for future review. | |
llvm/test/Transforms/LoopVectorize/sve-interleaved-accesses.ll | ||
1 | This should be in the AArch64 sub-tree, and probably precommited. Depending on your confidence in the AArch64 code, you may want to separate that into it's own review. |
Separated out the bitcast fix and committed. Precommitted tests.
Changed IRBuilder interface to focus on the intrinsics (and fixed-length shuffle equivalents) instead of mixing in loads/stores.
There's a few unit tests which will fail with the new interface -- although we generate the same IR instructions, the order is different. Assuming the interface is suitable I'll update the tests before posting the next patch revision.
llvm/tools/llvm-profdata/CMakeLists.txt | ||
---|---|---|
7 ↗ | (On Diff #511067) | This is due to the fixed-length mask generation code being in VectorUtils. This isn't the only tool affected, though oddly enough I've only observed build failures on X86 and not AArch64 hosts. I've included it as a representative. I would prefer not to make changes to a bunch of cmake files for this, so I'm currently leaning towards either duplicating the mask generation code or moving it into Core. Any preferences from a reviewer? |
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp | ||
---|---|---|
125–127 | Hi @huntergr , I have one question, the interface you proposed looks clean and nice, however it forces code generation for the deinterleaving/interleaving intrinsics to be implemented before merging this patch, am I correct? The reason why I had this option here is that it would allow us to merge this patch before other pieces are implemented. |
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp | ||
---|---|---|
125–127 | Hi @mgabka , We do have code generation for these intrinsics already, they just get lowered to zips/uzips. See D141924. D146218 will match to ld2/st2 where possible (which is what we want), and should perhaps land first. The changes to isLegalInterleavedAccessType will also be needed there, so the next version of this patch can just rely on that. |
llvm/lib/IR/IRBuilder.cpp | ||
---|---|---|
1343 | cast<> | |
1384 | Given this assert, we shouldn't need to pass Factor in here at all. | |
1387 | cast<> | |
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | ||
2730 | Having this be only in the normal load path seems unlikely to be correct. Surely we must also handle masked loads as well? | |
2763–2764 | It looks like you're changing the handling for gaps in the deinterleave. This seems surprising and worth some discussion? |
- Implemented requested changes to utility functions (casts, removing redundant parameter)
- Moved utility functions to VectorUtils instead of IRBuilder; this removes the problem that introduced a dependency on the Analysis component for several tools which don't need the funtionality.
- Updated affected tests.
- Full check output included for the strict fadd tests.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | ||
---|---|---|
2730 | This does handle masked loads -- 'NewLoad = Builder.CreateAligned....' is a standalone statement on the else with no opening brace. I've added a blank line to perhaps make that a little more obvious. Unless there's something else I've missed? | |
2763–2764 | That was the result of a bit of overzealous cleanup on my part when removing some code from the original patch; I missed the 'continue'. Reverted. |
llvm/include/llvm/Analysis/VectorUtils.h | ||
---|---|---|
596 ↗ | (On Diff #513595) | Unless you have plans to reuse these, this is just an implementation detail of the vectorizer. As such, these would be better as static functions in LoopVectorize.cpp |
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | ||
2730 | Yeah, I got confused by the brace style in the code above. | |
2746–2751 | The interface here feels really awkward for fixed length vectors. We have to create this dummy struct type, construct it, destruct it, and we loose the ability to slice out the inactive lanes. I almost wonder if this code would be clearer without the helper function at all. With an explicit version based on scalable type here, we could do a simplified version of this loop with an early return and leave the fixed length codegen unchanged. I'd be tempted to try that and see if the overall code quality looked reasonable. You could also try a lambda which enumerate the active lanes (i.e. doing the shuffle or extract as required), and move the handling of the bitcast and reverse to a callback. This might be too much complexity though. | |
llvm/test/Transforms/LoopVectorize/AArch64/scalable-strict-fadd.ll | ||
1 ↗ | (On Diff #513595) | Please submit a separate change to autogen this file, and then rebase. Same with the other file you switched to autogen. |
- Moved interleaveVectors to a static function in LoopVectorize
- Removed deinterleaveVector, inlined intrinsic creation. This means the shuffles for fixed-length loads won't be changed, though we do end up with a little duplication as a result.
- Precommitted autogen checks for scalable-strict-fadd.ll
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | ||
---|---|---|
14373 | Looks like we can use MinElts and avoid duplicating EC.getKnownMinValue() |
I can't really comment on the AArch64 parts of this, but the LoopVectorizer bits look entirely reasonable to me at this point.
Simplified size checking code, changed assert on interleave factor to occur before creating recipes.
The recipe constructor doesn't actually know the VF, so just confirming that the decision for a given scalable VF is to interleave only for factors of 2 should suffice. The code that performs interleaving is still effectively guarded by asserts in call construction that it has the correct number of arguments.
LGTM, thanks!
llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp | ||
---|---|---|
449 ↗ | (On Diff #527321) | Could you make sure there's a test case for RISCV that covers this case before landing? |
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | ||
---|---|---|
444 | can you pass here Vals directly? |
llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp | ||
---|---|---|
449 ↗ | (On Diff #527321) | I added that check because both interleaved-accesses.ll and strided-accesses.ll (in llvm/test/Transforms/LoopVectorize/RISCV/) crash on the cast below when vplan tries to get the cost of the interleaving group with a scalable VF. Is that sufficient? |
I think these functions could actually be joined into 1, something like:
supportsInterleaving(VectorType *VecTy, uint32_t Factor, bool IsMasked)
I think it is going to be unlikely that target supports store but not load for the same vector type, @paulwalker-arm what do you think?