This is an archive of the discontinued LLVM Phabricator instance.

[AggressiveInstCombine] folding load for constant global patterened arrays and structs by alignment
ClosedPublic

Authored by khei4 on Feb 20 2023, 8:10 PM.

Details

Summary

Fold a LoadInst on a constant aggregate when every load result consistent with the given alignment is equivalent.

This revision will be part 1 of the following stages:

  1. alignment-based (this revision)
  2. GEP-based https://reviews.llvm.org/D146622

  • adds a ConstantFoldLoadFromPatternedAggregate method to AggressiveInstCombine

alive proofs: https://alive2.llvm.org/ce/z/qBGl72
Depends on: https://reviews.llvm.org/D145355
Fixes: https://github.com/rust-lang/rust/issues/107208.

Diff Detail

Event Timeline

khei4 updated this revision to Diff 500393.Feb 25 2023, 1:42 AM
khei4 retitled this revision from (WIP)[ConstantFold][InstSimplify] folding load for constant global patterened arrays and structs to [ConstantFold][InstSimplify] folding load for constant global patterened arrays and structs.

update failed tests

khei4 added a comment.Feb 25 2023, 1:49 AM

Add inline comments for checking and review.

llvm/lib/Analysis/ConstantFolding.cpp
778 ↗(On Diff #500393)

Although I'm not sure it's possible to fold structs and arrays of pointers, I note it here.

813 ↗(On Diff #500393)

Conversion from bytes to the load type seems a bit ugly. Are there any better ways?

llvm/test/CodeGen/AMDGPU/lower-module-lds-via-hybrid.ll
22 ↗(On Diff #500393)

Updated these tests (llvm/test/CodeGen/AMDGPU/lower-module-lds-via-table.ll and llvm/test/CodeGen/AMDGPU/lower-module-lds-via-hybrid.ll) with the following commands:

llvm/utils/update_test_checks.py --opt-binary ./build/bin/opt \
    llvm/test/CodeGen/AMDGPU/lower-module-lds-via-table.ll
llvm/utils/update_llc_test_checks.py --llc-binary ./build/bin/llc \
    llvm/test/CodeGen/AMDGPU/lower-module-lds-via-table.ll

and tweaked the head comments, as the comments in these files request.
Is this OK?

llvm/test/Transforms/InstSimplify/load-patterned-aggregates.ll
52

These tests duplicate the following tests; sorry for the noise.

khei4 added a subscriber: JonChesterfield.

@JonChesterfield Hi. Sorry for the sudden review request. I modified the test files you recently modified, and I don't know how to check whether the diffs in this revision are correct.
Could you take a look at whether these changes are a problem? If you have any concerns about these diffs, please let me know and I'll fix them. :)

khei4 edited the summary of this revision. (Show Details)Feb 25 2023, 6:51 PM
khei4 updated this revision to Diff 500492.Feb 25 2023, 9:27 PM

add pre-increment and overflow check

khei4 updated this revision to Diff 500497.Feb 25 2023, 9:47 PM

early return and format

khei4 planned changes to this revision.EditedFeb 26 2023, 2:38 AM
khei4 removed reviewers: JonChesterfield, arsenm.

@JonChesterfield @arsenm Sorry for being noisy. I noticed my transformation was broken for pointer arrays with ptrtoint applied. I'll fix that.

khei4 updated this revision to Diff 500556.EditedFeb 26 2023, 5:49 AM

Check the ReadDataFromGlobal return value.
Now ready for review.

khei4 edited the summary of this revision. (Show Details)Feb 26 2023, 9:46 PM
nikic added inline comments.Feb 27 2023, 8:05 AM
llvm/lib/Analysis/ConstantFolding.cpp
781 ↗(On Diff #500556)

This looks like a leftover from a previous implementation? I don't think your current one has any limitations regarding the type of the initializer.

786 ↗(On Diff #500556)

If we have a 64K global and are reading 4 byte elements (i32) from it, then we need to check that 16k elements are the same. This is too expensive, as the compile-time regression shows.

We should limit this more aggressively to achieve reasonable compile-times, e.g. to 1K globals.

794 ↗(On Diff #500556)

Rather than ReadDataFromGlobal(), we should use ConstantFoldLoadFromConst() here (or rather in the loop below). This has two advantages: It's going to handle all types (including pointer loads), and it allows an early bailout if not all elements are the same. Your current implementation will always read the entire global, even if it's clear after two elements that it's not uniform.

It will also take care of endianness itself.
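A minimal sketch of what that suggestion could look like (illustrative only: foldUniformLoad, its parameters, and the loop bounds are invented for this example; only ConstantFoldLoadFromConst and the early-bailout idea come from the comment above, and bounds-crossing tail loads would still need the extra care discussed later in the thread):

#include "llvm/Analysis/ConstantFolding.h"
#include "llvm/IR/DataLayout.h"
#include "llvm/IR/GlobalVariable.h"
using namespace llvm;

// Return the single constant every stride-aligned load of GV yields, or
// nullptr. Assumes GV has a definitive initializer and Stride >= 1.
static Constant *foldUniformLoad(GlobalVariable *GV, Type *LoadTy,
                                 uint64_t Stride, const DataLayout &DL) {
  Constant *Init = GV->getInitializer();
  uint64_t GVSize = DL.getTypeAllocSize(Init->getType());
  // Candidate result: the value folded at offset 0.
  Constant *Ca = ConstantFoldLoadFromConst(Init, LoadTy, APInt(64, 0), DL);
  if (!Ca)
    return nullptr;
  // Every other reachable offset must fold to the same constant; Constants
  // are uniqued, so pointer equality is enough.
  for (uint64_t Off = Stride; Off < GVSize; Off += Stride) {
    Constant *Cb = ConstantFoldLoadFromConst(Init, LoadTy, APInt(64, Off), DL);
    if (Cb != Ca)
      return nullptr; // early bailout: not uniform
  }
  return Ca;
}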

798 ↗(On Diff #500556)

If the load alignment is used, you also need to make sure that the global alignment is >= the load alignment. Otherwise it doesn't really tell you anything.

Though I'm not sure we really want to use alignment to determine access stride -- this is pretty unusual. Basing this on the getelementptr stride would be more typical and handle cases that alignment can't (e.g. non-power-of-two stride).
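For illustration (a hypothetical example, not taken from the patch or its tests), a GEP-derived stride of 3 can fold a load that alignment alone never could, because 3 is not a power of two:

@g = constant [4 x { i8, i8, i8 }] [
    { i8, i8, i8 } { i8 1, i8 2, i8 3 }, { i8, i8, i8 } { i8 1, i8 2, i8 3 },
    { i8, i8, i8 } { i8 1, i8 2, i8 3 }, { i8, i8, i8 } { i8 1, i8 2, i8 3 }]

; Every in-bounds %idx reads the first struct field, so the load is always 1.
; The access stride is 3 (the struct's alloc size), which alignment cannot express.
define i8 @first_field(i64 %idx) {
  %p = getelementptr inbounds [4 x { i8, i8, i8 }], ptr @g, i64 0, i64 %idx, i32 0
  %v = load i8, ptr %p, align 1
  ret i8 %v
}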

llvm/lib/Analysis/InstructionSimplify.cpp
6640 ↗(On Diff #500556)

hasUniqueInitializer() is only relevant if you want to modify an initializer, it should not be checked here.

6651 ↗(On Diff #500556)

It might make sense to move this to the else of the below branch. If we know a constant offset, then we don't need to go through this code.

llvm/test/Transforms/InstSimplify/load-patterned-aggregates.ll
0

Rather than hardcoding the data layout, it's better to do something like this:

; RUN: opt < %s -passes=instsimplify -S -data-layout="e" | FileCheck %s --check-prefixes=CHECK,LE
; RUN: opt < %s -passes=instsimplify -S -data-layout="E" | FileCheck %s --check-prefixes=CHECK,BE

This will check both little and big endian.
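As a hypothetical illustration of how the shared and endian-specific prefixes then combine in a test body (not from the actual file; @g and @load_i16 are made up):

@g = constant [2 x i8] [i8 1, i8 2]

define i16 @load_i16() {
; CHECK-LABEL: @load_i16(
; LE: ret i16 513
; BE: ret i16 258
  %v = load i16, ptr @g
  ret i16 %v
}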

nikic added a comment.Feb 27 2023, 8:27 AM

it might be better to move this to aggressive inst-combine or refactor

Based on those numbers, it does seem pretty likely that we'll have to move this into AggressiveInstCombine. That would also give a bit more leeway wrt globals sizes. And I don't think it would really reduce optimization potential in any meaningful way -- this doesn't seem like the kind of fold that benefits from being run 20 times in the optimization pipeline.

khei4 planned changes to this revision.Feb 28 2023, 5:06 AM

Thank you for the review!

llvm/lib/Analysis/ConstantFolding.cpp
781 ↗(On Diff #500556)

I might have been worrying about padding bits, but that's no longer needed if I use ConstantFoldLoadFromConst.

786 ↗(On Diff #500556)

Sounds reasonable. I should have estimated it.

794 ↗(On Diff #500556)

Rather than ReadDataFromGlobal(), we should use ConstantFoldLoadFromConst() here (or rather in the loop below). This has two advantages: It's going to handle all types (including pointer loads), and it allows an early bailout if not all elements are the same. Your current implementation will always read the entire global, even if it's clear after two elements that it's not uniform.
It will also take care of endianness itself.

Thank you! I hadn't noticed that function. That sounds much more efficient than the current approach of reading the whole global variable.

798 ↗(On Diff #500556)

If the load alignment is used, you also need to make sure that the global alignment is >= the load alignment. Otherwise it doesn't really tell you anything.

Right. I'll handle that case.

Basing this on the getelementptr stride would be more typical and handle cases that alignment can't (e.g. non-power-of-two stride)

OK, I am a little confused. Overall, I thought I didn't need to care about the cases where Alive2 decides any value could be returned. Even though I didn't handle them, could an undef value be returned?

  • if the global variable's alignment > the load's alignment, then whatever offset pointer is given to the load, any value could be returned.

https://alive2.llvm.org/ce/z/5_EDLB

  • even if the global variable's alignment <= the load's alignment, if the offset is not a multiple of the load's alignment, then any value could be returned.

https://alive2.llvm.org/ce/z/Qnon4K

Based on these assumptions, I used the alignment to calculate the possibly valid accessible offsets. Are these assumptions and Alive2 results correct?
I might have misunderstood or too broadly interpreted the Alive2 results.

But I also feel it's good to compute the stride from the GEP and use the larger of the two. A possible minimum stride can be obtained by calculating the greatest common divisor of the GEP type sizes, so it's simpler than I expected.
(This might be out of scope; it might be better not to use the GEP indices. I saw your proposal.)

llvm/lib/Analysis/InstructionSimplify.cpp
6640 ↗(On Diff #500556)

Yeah, you're also right here. I removed it. Thanks!

6651 ↗(On Diff #500556)

Sounds reasonable. I'll fix.

llvm/test/Transforms/InstSimplify/load-patterned-aggregates.ll
0

Cool! Thanks!

khei4 updated this revision to Diff 501805.Mar 2 2023, 1:35 AM

Applied feedback

Currently pushed and waiting to see the compile-time regressions, expecting an improvement. ;)

https://llvm-compile-time-tracker.com/

khei4 added a comment.Mar 2 2023, 1:38 AM

Leaving some comments to consider.

llvm/lib/Analysis/ConstantFolding.cpp
773 ↗(On Diff #501805)

Would it be better to implement this on GEPOperator?

798 ↗(On Diff #500556)

Is the following an example of a foldable case that can't be folded by looking only at the GEP stride?
https://alive2.llvm.org/ce/z/9cVcee

khei4 planned changes to this revision.EditedMar 2 2023, 2:38 AM

it might be better to move this to aggressive inst-combine or refactor

Based on those numbers, it does seem pretty likely that we'll have to move this into AggressiveInstCombine. That would also give a bit more leeway wrt globals sizes. And I don't think it would really reduce optimization potential in any meaningful way -- this doesn't seem like the kind of fold that benefits from being run 20 times in the optimization pipeline.

https://llvm-compile-time-tracker.com/compare.php?from=6f3baf43820680841b0daebdde2c78b43175444b&to=c08adfbed285199cd839da40419f918ac7a863b1&stat=instructions:u

@nikic
Thanks to your feedback the regression has relaxed, but it still seems too large to keep this here. I'll move it to AggressiveInstCombine.

khei4 retitled this revision from [ConstantFold][InstSimplify] folding load for constant global patterened arrays and structs to [AggressiveInstCombine] folding load for constant global patterened arrays and structs.Mar 5 2023, 10:44 PM
khei4 edited the summary of this revision. (Show Details)
khei4 edited the summary of this revision. (Show Details)Mar 5 2023, 11:14 PM
khei4 updated this revision to Diff 502532.Mar 5 2023, 11:33 PM

move folding to AggressiveInstCombine

khei4 updated this revision to Diff 502556.Mar 6 2023, 1:36 AM

change the size bounds to 4K and add comments

llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
904

According to this Alive2 result https://alive2.llvm.org/ce/z/5_EDLB,

when the load alignment is bigger, it seems to be fine to return a poison value:

if (GV->getAlign().valueOrOne().value() < LoadAlign) {
  I.replaceAllUsesWith(PoisonValue::get(LoadTy));
  return true;
}

but the following test then fails (check-llvm, targeting only X86):
llvm/test/Transforms/PhaseOrdering/X86/nancvt.ll

Is this Alive2 result wrong?

nikic added inline comments.Mar 8 2023, 1:00 PM
llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
876

This isn't the right way to compute the GEP stride: GEP works on alloc sizes rather than store sizes, and can have multiple indices. You can use the collectOffset() methods to handle all the cases, and then GreatestCommonDivisor() to calculate the GCD.
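A rough sketch of that approach (illustrative only: gepStrideGCD and the surrounding structure are invented for this example; GEPOperator::collectOffset() and APIntOps::GreatestCommonDivisor() are the APIs named above, and the inbounds/overflow caveats discussed later in the thread are ignored here):

#include "llvm/ADT/APInt.h"
#include "llvm/ADT/MapVector.h"
#include "llvm/IR/DataLayout.h"
#include "llvm/IR/Operator.h"
#include <optional>
using namespace llvm;

// Walk the GEP chain and fold every variable-index scale into a single GCD;
// constant offsets accumulate into ConstOffset as collectOffset() runs.
static APInt gepStrideGCD(Value *PtrOp, const DataLayout &DL) {
  unsigned BW = DL.getIndexTypeSizeInBits(PtrOp->getType());
  APInt ConstOffset(BW, 0);
  std::optional<APInt> Stride;
  while (auto *GEP = dyn_cast<GEPOperator>(PtrOp)) {
    MapVector<Value *, APInt> VarOffsets;
    if (!GEP->collectOffset(DL, BW, VarOffsets, ConstOffset))
      break;
    for (auto &VO : VarOffsets) {
      APInt Scale = VO.second.abs();
      Stride = Stride ? APIntOps::GreatestCommonDivisor(*Stride, Scale) : Scale;
    }
    PtrOp = GEP->getPointerOperand();
  }
  // A stride of 1 is always conservatively correct when nothing was found.
  return Stride ? *Stride : APInt(BW, 1);
}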

894

This needs to bail out for volatile loads.

901

This is still checking for unique initializers -- should only check definitive initializer.

909

Even if we can't use the alignment, we can still use the GEP stride.

921

Can just omit the Offset argument here (it's zero by default).

929

Do we need the LoadSize adjustment here? Can we instead iterate to ByteOffset < GVSize?

The current LoadSize calculation is not right for pointers, but it seems like we can just drop it entirely.

933

It's okay to just use 64 as the APInt size in this context. You are currently using the size of the initializer, which will make for a *very* wide integer...

khei4 updated this revision to Diff 503698.Mar 9 2023, 1:47 AM

apply feedback.

khei4 updated this revision to Diff 503699.Mar 9 2023, 1:50 AM

update tests

khei4 planned changes to this revision.EditedMar 9 2023, 1:53 AM

Thank you for the review! I'm sorry about the messy changes!
I also added pointer-array and constant-offset tests in https://reviews.llvm.org/D145355.

I'm now wondering how to handle pointers, as well as the mentioned behavior for bounds-crossing loads and load alignment larger than the global variable's alignment. I'll figure it out!

llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
876

You can use the collectOffset() methods to handle all the cases, and then GreatestCommonDivisor() to calculate the GCD.

These methods save me! thanks!

901

Thanks for saying again...

908–910

The current implementation cannot read pointer arrays.

909

Done.
(Just for my understanding, I'll clarify the case of larger load alignment on Discord.)

909–910

Currently

929

Do we need the LoadSize adjustment here? Can we instead iterate to ByteOffset < GVSize?

I think bounds-crossing loads cannot be excluded without LoadSize,
as in https://alive2.llvm.org/ce/z/jYsLBk.
Sorry, I'm not confident about the semantics here; I'll ask on Discord.

933

Thanks! It's embarrassing...

khei4 updated this revision to Diff 505374.Mar 14 2023, 11:03 PM

refactoring

  • remove LoadSize
  • rename variables
  • fix intermediate APInt bit size to GVSize * 8

now ready for review again

llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
871–873

Is there any better way to bitcast an APInt?

929

I removed LoadSize in the end.

khei4 updated this revision to Diff 505377.Mar 14 2023, 11:12 PM

fix awkward indent

khei4 planned changes to this revision.Mar 16 2023, 2:16 AM

Sorry, I noticed a test failed.

khei4 updated this revision to Diff 505733.Mar 16 2023, 2:20 AM

fix wrong argument

nikic requested changes to this revision.Mar 19 2023, 7:54 AM

Extracting the stride from GEPs turned out to be trickier than I expected...

llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
834

Should add a negative test using volatile.

845

This comment looks out of place, should be further down?

849

64K -> 4K

851

uint64_t

857

This is again creating a very, very wide integer :) The correct size to use here is DL.getIndexTypeSizeInBits(PtrOp->getType()) as you already do in the loop. In that case you also don't need the separate COff variable, you can let collectOffset() directly add onto this one.

859

This looks somewhat dubious -- I assume the intention here is that this is a "no-op" value for the first GreatestCommonDivisor call?

I think it would be clearer to make Stride an std::optional and then set it the first time, and use GCD for later iterations. (After the loop, if Stride is not set, it's okay to just bail out: The constant-offset case is already handled by other code, so we needn't bother with it.)

882

indeces -> indices

884

Hm, what happens if ConstOffset itself is negative? Would urem produce correct results in that case? Let's say ConstOffset = -3 and Stride = 3. Let's assume 8-bit address space to keep things simple, then ConstOffset = 253 in unsigned interpretation and 253 % 3 == 1.

Also I just realized another problem here: Just using the GCD isn't right if the GEP is not inbounds (i.e. multiplication can overflow) and the stride is not a power of two.

I believe the correct way to handle these cases is implemented in BasicAA, in particular https://github.com/llvm/llvm-project/blob/47aa1fe376a477939a1991ffe37504124af25f52/llvm/lib/Analysis/BasicAliasAnalysis.cpp#L1135-L1138 to only keep a power of two factor for non-inbounds and https://github.com/llvm/llvm-project/blob/47aa1fe376a477939a1991ffe37504124af25f52/llvm/lib/Analysis/BasicAliasAnalysis.cpp#L1168-L1170 for the final modulus.
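A sketch of how those two ideas might translate (paraphrasing, not copying, the BasicAA code linked above; the helper names and structure are invented for this example):

#include "llvm/ADT/APInt.h"
using namespace llvm;

// Fold one GEP scale into the running stride GCD. Without inbounds the
// multiplication may wrap, and wrapping only preserves the power-of-two
// part of the factor.
static APInt foldScaleIntoStride(APInt Stride, APInt Scale, bool IsInBounds) {
  if (Scale.isZero())
    return Stride;
  APInt ScaleForGCD = Scale.abs();
  if (!IsInBounds)
    ScaleForGCD = APInt::getOneBitSet(Scale.getBitWidth(), Scale.countr_zero());
  return APIntOps::GreatestCommonDivisor(Stride, ScaleForGCD);
}

// Final modulus: reduce a possibly negative constant offset into [0, Stride)
// so it can serve as the scan's starting offset.
static APInt normalizedOffset(const APInt &ConstOffset, const APInt &Stride) {
  APInt Rem = ConstOffset.srem(Stride);
  if (Rem.isNegative())
    Rem += Stride;
  return Rem;
}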

885

A correctness check missing here is that after looking through all the GEPs, you must arrive back at GV. Otherwise there might be some kind of other operation sitting in between which preserves the underlying object, but where we can't determine the offset. As a test case, one could use a call to llvm.ptrmask for example.
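A hypothetical negative test along those lines (illustrative only, not the patch's actual test; the fold should bail because llvm.ptrmask interrupts the GEP chain back to the global):

declare ptr @llvm.ptrmask.p0.i64(ptr, i64)

@cst = constant [4 x i8] [i8 0, i8 1, i8 2, i8 3]

; llvm.ptrmask preserves the underlying object but hides the offset, so the
; walk never arrives back at @cst and the load must be left alone.
define i8 @masked(i64 %idx, i64 %mask) {
  %g = getelementptr inbounds [4 x i8], ptr @cst, i64 0, i64 %idx
  %m = call ptr @llvm.ptrmask.p0.i64(ptr %g, i64 %mask)
  %v = load i8, ptr %m
  ret i8 %v
}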

894

This probably also needs to check that ConstOffset is zero? I don't think using load alignment as stride is right with a non-zero start offset.

898

It would be a bit simpler to keep using (and adding to) ConstOffset here, as we need an APInt anyway.

This revision now requires changes to proceed.Mar 19 2023, 7:54 AM
nikic added a comment.Mar 19 2023, 8:03 AM

Extracting the stride from GEPs turned out to be trickier than I expected...

If you like, we can go back to just the alignment handling for this patch, and then add the GEP handling in a separate one on top. Either way works for me.

khei4 planned changes to this revision.Mar 20 2023, 12:09 AM

Thank you for a lot of good catches!

Extracting the stride from GEPs turned out to be trickier than I expected...

If you like, we can go back to just the alignment handling for this patch, and then add the GEP handling in a separate one on top. Either way works for me.

Sounds good to me!
Then I want to separate this into the following:

  1. alignment-based (this revision)
  2. (all-)inbounds GEP index-based
  3. GEP index-based, including non-inbounds

TBH, I feel the non-inbounds GEP handling is a bit of a handful, but I'll read BasicAA's implementation and try it first!

llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
857

Oh... I confused the offset with the GV byte size! You're right!

859

This looks somewhat dubious -- I assume the intention here is that this is a "no-op" value for the first GreatestCommonDivisor call?

Yes!

I think it would be clearer to make Stride a std::optional and then set it the first time,

Sounds reasonable, I'll use that!

884

Ah, I mistakenly assumed inbounds! Without inbounds, ConstantOffset could be negative and could overflow!

885

Good catch! I didn't know about operations that preserve the underlying object! I'll add tests!

894

Good catch! Yeah, we should consider the GEP-stride-based and alignment-based cases separately for ConstOffset as well.

khei4 updated this revision to Diff 506490.Mar 20 2023, 1:15 AM
khei4 retitled this revision from [AggressiveInstCombine] folding load for constant global patterened arrays and structs to (WIP) [AggressiveInstCombine] folding load for constant global patterened arrays and structs.
khei4 edited the summary of this revision. (Show Details)

update summary and address simple fixes.

khei4 updated this revision to Diff 506516.Mar 20 2023, 3:03 AM

Address:

  • ptrmask'd pointer failures
  • non-inbounds GEPs

and refactoring:

  • use APInt in the loop
  • fix the BitWidth for the APInt
  • fix a typo
nikic requested changes to this revision.Mar 21 2023, 7:33 AM

Nearly there, I think...

llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
832

Unnecessary/unusual newline.

839

getPointerOperand() is a bit more elegant.

846

What you want to do is after the GEPOperator loop, check that the final PtrOpV is GV. This makes sure there were only GEPs.

855

"to avoid scanning too much memory" maybe -- we're not really allocating anything here.

865

This shouldn't be needed when using std::optional.

870

I don't really get what this comment is trying to say. The reason why non-inbounds GEP is tricky to handle is that there is an implicit modulo the address space size, which makes it harder to calculate the GCD.

872

You should probably extract the whole "get stride of GEP" logic into a separate function. That way, all these early bailouts don't apply for the alignment case (where we don't care about inbounds).

This revision now requires changes to proceed.Mar 21 2023, 7:33 AM
khei4 planned changes to this revision.EditedMar 21 2023, 9:55 PM

Thank you for the review! I may fuse the GEP-related plans I proposed, because I might have misunderstood non-inbounds GEPs with correct alignment!

llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
846

Thanks! I'll check that before and after the loop!

865

I might have missed some patterns, but I guess it's necessary because the (self-referencing) GCD calculation needs its initial value to be the first type size. :)

Stride = APIntOps::GreatestCommonDivisor(
    Stride,
    APInt(ConstOffset.getBitWidth(), IndexTypeSize.getZExtValue()));
872

(With the above comment:)
Ah, OK, I might have misunderstood the behavior of a load with non-inbounds GEP indices but correct alignment!

khei4 retitled this revision from (WIP) [AggressiveInstCombine] folding load for constant global patterened arrays and structs to (WIP) [AggressiveInstCombine] folding load for constant global patterened arrays and structs by alignment.Mar 22 2023, 1:05 AM
khei4 edited the summary of this revision. (Show Details)Mar 22 2023, 1:23 AM
khei4 updated this revision to Diff 507300.Mar 22 2023, 3:30 AM
khei4 retitled this revision from (WIP) [AggressiveInstCombine] folding load for constant global patterened arrays and structs by alignment to [AggressiveInstCombine] folding load for constant global patterened arrays and structs by alignment.

Rebase to only the alignment-based analysis; the GEP-based revision will be created separately.

khei4 updated this revision to Diff 507309.Mar 22 2023, 3:59 AM

make ConstOffset non-optional

nikic added inline comments.Mar 22 2023, 4:09 AM
llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
861

These isa checks aren't needed in this patch: For the alignment case, we don't care what the operand is.

llvm/test/Transforms/AggressiveInstCombine/patterned-load.ll
86

I don't get how this one folds with just this patch. If align 1 we have stride 1, but constarray1 needs a stride of 2, no?

Allen added a subscriber: Allen.Mar 22 2023, 4:38 AM
khei4 added inline comments.Mar 22 2023, 4:48 AM
llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
861

It seems I didn't get the point of the ptrmask concern! (Precisely, it seems we need to check the alignment-based and GEP-based cases separately for completeness.) Thanks!

llvm/test/Transforms/AggressiveInstCombine/patterned-load.ll
86

Ah, sorry, it seems I just forgot to commit the test changes...

khei4 updated this revision to Diff 507312.Mar 22 2023, 4:51 AM

update tests, remove redundant isa check

nikic added inline comments.Mar 22 2023, 7:02 AM
llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
856–881

Some simplifications.

I think as far as foldPatternedLoads is concerned, we don't need the std::optional around Stride, because a Stride of 1 is always conservatively correct. It's only relevant internally while calculating GEP stride in the next patch.

857
khei4 updated this revision to Diff 507385.Mar 22 2023, 9:09 AM

apply suggestions

khei4 added inline comments.Mar 22 2023, 9:12 AM
llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
856–881

Thanks! Yeah, simplicity for this patch matters!

nikic accepted this revision.Mar 23 2023, 4:40 AM

LGTM

This revision is now accepted and ready to land.Mar 23 2023, 4:40 AM
khei4 updated this revision to Diff 507735.Mar 23 2023, 7:28 AM

fix wrong arrow operator