This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Use SSAUpdater in PromoteAlloca
ClosedPublic

Authored by Pierre-vh on Jun 12 2023, 6:06 AM.

Details

Summary

This allows PromoteAlloca to no longer rely on a second SROA run to remove the alloca completely; it does the full transformation directly.

Note that PromoteAlloca still relies on SROA running first to
canonicalize the IR. For instance, PromoteAlloca will no longer handle aggregate types because those should be simplified by SROA before reaching the pass.

Diff Detail

Event Timeline

Pierre-vh created this revision.Jun 12 2023, 6:06 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 12 2023, 6:06 AM
Pierre-vh requested review of this revision.Jun 12 2023, 6:06 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 12 2023, 6:06 AM

A follow-up patch should remove the SROA run from the pass pipeline.

llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
144

I think this is an aggressive statement; why not just preserve the DominatorTree? (Also, I'd expect that to be implied by setPreservesCFG.)

557–559

I don't understand why you would need to do this. You aren't moving any instructions around, so the dominance properties should be implied by the original values.

683

use_empty

Pierre-vh marked 2 inline comments as done.Jun 13 2023, 12:18 AM
Pierre-vh added inline comments.
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
557–559

If I don't sort by dominance, I get multiple crashes in promote-alloca-array-aggregate.ll. It seems mostly related to memcpy, which is why I put a TODO saying the lowering may be incorrect. I'm still not sure why.

I agree that requiring the DT for this is annoying, I'll try to think of ways to avoid it.

Address some comments

Pierre-vh updated this revision to Diff 530807.Jun 13 2023, 1:15 AM
Pierre-vh marked an inline comment as done.

Remove domtree dependency

I just came across this when trying to figure out what your issue is:

/// Helper class for promoting a collection of loads and stores into SSA
/// Form using the SSAUpdater.
///
/// This handles complexities that SSAUpdater doesn't, such as multiple loads
/// and stores in one block.
///
/// Clients of this class are expected to subclass this and implement the
/// virtual methods.
class LoadAndStorePromoter {

If you move the vector insert/extract details into an implementation of this, do the dominance issues disappear?

llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
365–382

This whole thing feels wrong to me. You shouldn't need to figure out any ordering, much less by using comesBefore repeatedly

Pierre-vh added inline comments.Jun 19 2023, 4:58 AM
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
365–382

I think that the issue was introduced because I'm looking through things such as bitcasts, which means that we can have IR like this:

bb1:
  %x = alloca ...
  %cast.x = cast %x to ...

bb2: 
  store ... %x
  store ... %cast.x

Then, when I iterate the users of the alloca and build the worklist, the worklist may have the store to %cast.x before the store to %x, so I need to sort the worklist to use SSAUpdater.

I'm not sure how LoadStorePromoter would help, I've looked at it but it seems to be for a different purpose

arsenm added inline comments.Jun 19 2023, 5:46 AM
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
365–382

I applied your patch and not that many tests fail. The one I'm looking at is promote_memcpy_inline_aggr. This hits the multiple-loads-and-stores-in-the-same-block case that the LoadAndStorePromoter comment says the base SSAUpdater does not handle. I think this is probably the correct tool; you just have the added complexity that we're coercing the type of the accesses to vector instead of reusing the original.

arsenm added inline comments.Jun 19 2023, 6:01 AM
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
562

I think this is subtly wrong and should actually be undef. Alloca memory is currently initialized with undef, not poison.

Pierre-vh updated this revision to Diff 532835.Jun 20 2023, 2:25 AM
Pierre-vh marked 3 inline comments as done.

Address some comments

Pierre-vh added inline comments.Jun 20 2023, 2:25 AM
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
365–382

LoadAndStorePromoter would need some modifications to work here, or I would need to first rewrite all of the loads/stores to use vectors and then run PromoteAlloca later to do the promotion. For instance, LoadAndStorePromoter initializes the SSAUpdater with the type of the first load/store.

I don't mind using it if you think it's better to first rewrite all the loads/stores and then run LoadAndStorePromoter, but IMO it's really not necessary; we'd make things more complex (and less direct) just to avoid a 5-line function (I heavily simplified it).

arsenm added inline comments.Jun 21 2023, 5:19 PM
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
365–382

I think the real question is how does LoadAndStorePromoter deal with multiple loads and stores in the same block? I doubt it's using a worklist sort. Can you use whatever technique it's using?

Pierre-vh added inline comments.Jun 22 2023, 12:20 AM
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
365–382

It uses a system of buckets

// First step: bucket up uses of the alloca by the block they occur in.
// This is important because we have to handle multiple defs/uses in a block
// ourselves: SSAUpdater is purely for cross-block references.
DenseMap<BasicBlock *, TinyPtrVector<Instruction *>> UsesByBlock;

And if, when promoting an inst, it sees there's more than one inst in the bucket, it does a linear scan of the basic block to find & promote all users.

I used something similar at first but found it more complicated, when in this case a simple sorted worklist does the trick. The only really important part is that users in a given basic block are seen in program order; there are multiple ways to achieve that.

If you're more comfortable with that approach then I can use it too, and just simplify it as much as possible for our use case

arsenm requested changes to this revision.Jun 26 2023, 11:12 AM
arsenm added inline comments.
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
365–382

I think that makes more sense, sorting isn't ideal

This revision now requires changes to proceed.Jun 26 2023, 11:12 AM
Pierre-vh updated this revision to Diff 534867.Jun 27 2023, 1:17 AM

Use logic similar to LoadAndStorePromoter

arsenm accepted this revision.Jun 27 2023, 3:30 PM
arsenm added inline comments.
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
399

Capitalize

This revision is now accepted and ready to land.Jun 27 2023, 3:30 PM
This revision was automatically updated to reflect the committed changes.
Pierre-vh marked an inline comment as done.
jplehr added a subscriber: jplehr.Jun 28 2023, 1:31 AM

Hey Pierre, I believe this broke the AMDGPU OpenMP buildbot https://lab.llvm.org/buildbot/#/builders/193

Looking into it

Pierre-vh reopened this revision.Jun 28 2023, 2:14 AM

complex_reduction.cpp fails on GFX9 in openmp tests: https://lab.llvm.org/buildbot/#/builders/193/builds/33759

This revision is now accepted and ready to land.Jun 28 2023, 2:14 AM
Pierre-vh planned changes to this revision.Jun 28 2023, 2:15 AM

I think I figured out why the failure was happening, and it's due to improper handling of Load/Store promotion.
However, LoadAndStorePromoter can't save us here because we also handle partial loads/stores, i.e. we can load/store a single element of the vector. It complicates things quite a bit and we can't get away with deferring loads/stores every time; we'll need to keep track of lanes and make decisions based on which lane is being hit.

I'm working on a fix but it might take a bit of time to get right

Fix openmp tests

In the end I didn't need to do anything complicated with lanes and such. I just had to use the same approach as LoadAndStorePromoter, which is to defer the lowering of live-in loads to a second pass.
Now, some dummy loads are also inserted during the first pass for instructions that must be promoted right away but also need to know the current vector value.

It leads to less folding than I'd like, but it's a lot more correct.

This revision is now accepted and ready to land.Jun 29 2023, 12:29 AM
Pierre-vh requested review of this revision.Jun 29 2023, 12:30 AM
Pierre-vh planned changes to this revision.Jun 29 2023, 5:32 AM

There is still one more failure to debug in some app; an assertion in castIsValid can sometimes trip. I'm trying to get a testcase.

It leads to less folding than I'd like, but it's a lot more correct.

Do you have an example of this?

Look in promote-alloca-loadstores.ll: in the first testcase, one of the insertelements is useless (it gets overwritten right away).
It's a very minor thing, so I don't think it's a big issue.

Pierre-vh updated this revision to Diff 538990.Jul 11 2023, 2:57 AM

Fix ptr load/stores issues + add test

arsenm added inline comments.Jul 11 2023, 5:21 AM
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
415

Doesn't consider vector of pointers

450

Specifically use PtrToInt?

llvm/test/CodeGen/AMDGPU/promote-alloca-loadstores.ll
89 ↗(On Diff #538990)

Add some load/store vector of pointer cases. Also mix different pointer sizes

Pierre-vh updated this revision to Diff 539976.Jul 13 2023, 5:39 AM
Pierre-vh marked an inline comment as done.

Add 32 bit pointer test + mixed test

Pierre-vh marked an inline comment as done.Jul 13 2023, 5:39 AM
Pierre-vh added inline comments.
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
415

What case do you have in mind? This code path is just for loading the full vector. I initially had a test for a <1 x ptr> vector but we don't vectorize under 2 elements so it never worked.

arsenm added inline comments.Jul 13 2023, 6:45 AM
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
415

store <2 x ptr>
load <4 x ptr addrspace(3)>

things like that

Pierre-vh updated this revision to Diff 540034.Jul 13 2023, 7:33 AM

Good catch on the ptr vector type; I wasn't able to come up with a test for it earlier so I thought it was fine, but it was indeed crashing.
It's fixed now and I added a testcase for it.

arsenm added inline comments.Jul 13 2023, 8:34 AM
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
487

no else after return

llvm/test/CodeGen/AMDGPU/fix-frame-reg-in-custom-csr-spills.ll
69

Why did this change? It only uses volatile accesses

llvm/test/CodeGen/AMDGPU/promote-alloca-loadstores.ll
127 ↗(On Diff #540034)

There was a recent bug filed that amounts to not handling this (it didn't use pointers, but just different sized vectors)

Pierre-vh marked 3 inline comments as done.Jul 14 2023, 12:10 AM
Pierre-vh added inline comments.
llvm/test/CodeGen/AMDGPU/fix-frame-reg-in-custom-csr-spills.ll
69

We support non-simple accesses of the whole vector, it's volatile accesses of a single element that we don't support

llvm/test/CodeGen/AMDGPU/promote-alloca-loadstores.ll
127 ↗(On Diff #540034)

Yes, I just saw it. I'd rather fix this in a separate patch; this patch is already quite large, and if I do too much in it I'm afraid it'll make potential issues harder to track down.

I think we just need to use something other than isBitOrNoopPointerCastable. It's too limited because it doesn't take into account that we can use an intermediate cast (like the cast to int for ptr -> vec).

Pierre-vh marked 2 inline comments as done.

Remove elses after returns

Pierre-vh added inline comments.Jul 20 2023, 7:08 AM
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
439

This will crash if we load/store a value at an offset and the access type is the same size as the alloca.
Should we also check that the index is zero here?

arsenm added inline comments.Jul 20 2023, 7:47 AM
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
439

Probably; we need to not break on out-of-bounds accesses.

Pierre-vh marked an inline comment as done.Jul 24 2023, 12:14 AM
Pierre-vh added inline comments.
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
439

Oops, wrong place; it already does that (&Ptr == &Alloca). It's something in D155699.

arsenm added inline comments.Jul 24 2023, 6:14 AM
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
391

Use poison for the filler

473

You already have the cast<MemTransferInst> as MTI

503

Dead break

Pierre-vh updated this revision to Diff 543512.Jul 24 2023, 6:23 AM
Pierre-vh marked 4 inline comments as done.

Comments

arsenm accepted this revision.Jul 24 2023, 6:26 AM
This revision is now accepted and ready to land.Jul 24 2023, 6:26 AM
This revision was landed with ongoing or failed builds.Jul 24 2023, 10:45 PM
This revision was automatically updated to reflect the committed changes.