This is an archive of the discontinued LLVM Phabricator instance.

[VectorCombine] Scalarize vector load/extract.
ClosedPublic

Authored by fhahn on Apr 11 2021, 2:04 PM.

Details

Summary

This patch adds a new combine that tries to scalarize chains of
extractelement (load %ptr), %idx to load (gep %ptr, %idx). This is
profitable when extracting only a few elements out of a large vector.

At the moment, , store (extractelement (load %ptr), %idx), %ptr
operations on large vectors result in huge code in the backend.

This can easily be triggered by using the matrix extension, e.g.
https://clang.godbolt.org/z/qsccPdPf4

This should complement D98240.

Diff Detail

Event Timeline

fhahn created this revision.Apr 11 2021, 2:04 PM
fhahn requested review of this revision.Apr 11 2021, 2:04 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 11 2021, 2:04 PM
lebedev.ri added inline comments.Apr 11 2021, 2:15 PM
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
790–810

Since we would be extracting from the original value,
why do we care if it would have been overwritten sometimes later?
Can't we just emit all loads right where the wide load was?

812

Do we expect that all the CSE have happened by now?

fhahn added inline comments.Apr 11 2021, 2:22 PM
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
790–810

Can't we just emit all loads right where the wide load was?

We could, *if* the indices are all defined before the wide load. That's not the case in the cases I am most interested in.

At the moment the scalar loads are inserted at the point of the extracts, but if the index is defined before the wide load, we could create the narrow loads next to the widened loads. But that may be good as a follow up extensions?

812

Most of it yes, but not for the code that the SLP vectorizer creates. We don't run dedicated CSE later either, just instcombine.

fhahn updated this revision to Diff 336817.Apr 12 2021, 6:31 AM

Updated to match vector loads, so it is easier to re-trigger after store scalarization from D98240.

RKSimon added inline comments.Apr 12 2021, 8:17 AM
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
901

Why are you forcing the alignment ?

fhahn updated this revision to Diff 338506.Apr 19 2021, 6:36 AM

Use commonAlignment between alignment and alignment for scalar type.

fhahn added inline comments.Apr 19 2021, 6:39 AM
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
901

I think we have to update the alignment of the scalar load, because after applying an offset to the pointer it may not be aligned as specified on the original load. I think we should be able to use the common alignment between the alignment on the load and the scalar type. WDYT?

fhahn updated this revision to Diff 344751.May 12 2021, 3:13 AM

rebased after recent changes, restrict to constant indexes known to be valid, for now.

RKSimon added inline comments.May 12 2021, 3:19 AM
llvm/test/Transforms/VectorCombine/X86/load-inseltpoison.ll
645

The arg says its align 1m, but here we're saying align 8?

fhahn added inline comments.May 13 2021, 6:16 AM
llvm/test/Transforms/VectorCombine/X86/load-inseltpoison.ll
645

Hm that's indeed odd. I *think* the align of the load would take precedence and align 1 dereferenceable(16) are probably not needed for the test. But I think we need to choose a conservative alignment for the scalarized load, that's why align 1 is used for the scalar load.

Does that answer your question?

Anyone else got any comments?

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
876

Pull out the magic number (MaxInstChecked = 6 ?)

901

Yes, using a common alignment would make more sense

spatel added inline comments.May 20 2021, 12:26 PM
llvm/test/Transforms/VectorCombine/AArch64/load-extractelement-scalarization.ll
253

Let me know if I'm overlooking it, but I don't see a positive test for this pattern (multiple extracts of a single load).
Is there some vector type variant (float?) of this test that will transform for AArch64?
If not, we might get x86 or some other target to trigger on this test as-is.

fhahn updated this revision to Diff 347010.May 21 2021, 6:29 AM

Updated code to use common alignment of original alignment and the scalar type alignment.

Also rebased on top of 4efb4f674cb6, which adds a new positive test where multiple extracts of the same load are scalarized

fhahn updated this revision to Diff 347014.May 21 2021, 6:36 AM

Use existing MaxInstrsToScan instead of hard-coding the limit to 6.

fhahn added inline comments.May 21 2021, 6:38 AM
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
876

I updated the code to use MaxInstrsToScan which was added by an earlier patch.

901

I updated the code to use the common alignment.

llvm/test/Transforms/VectorCombine/AArch64/load-extractelement-scalarization.ll
253

Yes I think this case was missing. I added load_multiple_extracts_with_constant_idx_profitable to cover that I think.

spatel added inline comments.May 21 2021, 8:08 AM
llvm/test/Transforms/VectorCombine/X86/load-inseltpoison.ll
645

Some of these tests are intentionally bizarre -- blame me :) -- with respect to alignment because the rules are not clear.
Looking at the code in vectorizeLoadInsert(), we have:

// Use the greater of the alignment on the load or its source pointer.
Alignment = std::max(SrcPtr->getPointerAlignment(DL), Alignment);

So on the AVX2 target, this test is going through 2 transforms now instead of just 1:

  1. This patch kicks in 1st and converts the short vector load to scalar load with align 2 -- presumably because that's the datalayout-ABI-specified setting. Should the original load's align 8 attribute override that?
  2. The scalar load is converted to the wider vector load again in vectorizeLoadInsert(), but now the alignment was reduced, so we end up with the test diff.
fhahn updated this revision to Diff 347040.May 21 2021, 8:24 AM

Use alignment from original store when scalarizing a load with index 0.

fhahn added inline comments.May 21 2021, 8:26 AM
llvm/test/Transforms/VectorCombine/X86/load-inseltpoison.ll
645

Thanks for the explanation. I think once we lowered to alignment on the load, it is not really safe to raise it based on the alignment of the pointer type. But if we scalarize with index 0, we load from the original pointer and we can preserve the original alignment. I updated the patch and the AVX2 changes are now gone.

spatel accepted this revision.May 21 2021, 8:51 AM

LGTM

This revision is now accepted and ready to land.May 21 2021, 8:51 AM
This revision was automatically updated to reflect the committed changes.
spatel added inline comments.May 24 2021, 4:42 AM
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
116–119

I didn't check the bot failure, but we dealt with a previous sanitizer failure with an additional predicate here.

fhahn added inline comments.May 24 2021, 4:51 AM
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
116–119

I think the issues was that scalarizeLoadExtract was called after foldSingleElementStore, which may remove instructions. Should be fixed by 4e8c28b6fbec