This is an archive of the discontinued LLVM Phabricator instance.

[LoopVectorizer] Lower uniform loads as a single load (instead of relying on CSE)
ClosedPublic

Authored by reames on Nov 12 2020, 6:40 PM.

Details

Summary

A uniform load is one which loads from a uniform address across all lanes. As currently implemented, we cost model such loads as if we did a single scalar load + a broadcast, but the actual lowering replicates the load once per lane.

This change tweaks the lowering to use the REPLICATE strategy by marking such loads (and the computation leading to their memory operand) as uniform after vectorization. This is a useful change in itself, but it's real purpose is to pave the way for a following change which will generalize our uniformity logic.

Diff Detail

Event Timeline

reames created this revision.Nov 12 2020, 6:40 PM
Herald added a project: Restricted Project. · View Herald TranscriptNov 12 2020, 6:40 PM
reames requested review of this revision.Nov 12 2020, 6:40 PM
reames updated this revision to Diff 305029.Nov 12 2020, 10:16 PM
reames edited the summary of this revision. (Show Details)

Rebase w/over landed tests.

As part of that, include a slight broadening of scope. For some reason, the code wasn't considering any operand which wasn't an instruction to be uniform. Since I happened to write my test with memory addresses as arguments, this fell out.

reames updated this revision to Diff 305030.Nov 12 2020, 10:19 PM

Remove stray debug output and an extra bit of whitespace.

fhahn added inline comments.Nov 14 2020, 12:17 PM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
2667

The VPRecplicateRecipe contains a IsUniform flag. I think it should be possible to pass the flag through from the recipe to scalarizeInstruction. Ideally the recipes should contain all information required for code-generation to avoid having to tie code generation directly to the cost-model.

5079

nit: message for assert.

6262

nit: message for assert?

6441

this seems an unrelated refactoring, which could be split out and committed independently?

There's D68831 which is potentially related and tries to extend the logic to consider loop-invariant ops as uniform. I suppose I need to rebase it again....

reames added inline comments.Nov 14 2020, 7:54 PM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
2667

I think you're misreading the code slightly. This isn't checking whether the recipe for the instruction being scalarized is uniform. It's checking whether the *input* to the instruction is uniform.

It seems like overkill to record a per operand uniform flag in the recipe?

6262

Aside from being pedantic, why? I'm happy to comply, but I don't really see any value in cases like this where it's obvious from context.

6441

Happy to do so since you seem to think the interface made sense. I'm very new to vectorizer code and didn't want to jump to that conclusion. :)

fhahn added inline comments.Nov 15 2020, 9:54 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
2667

I think you're misreading the code slightly. This isn't checking whether the recipe for the instruction being scalarized is uniform. It's checking whether the *input* to the instruction is uniform.

Indeed, I was thinking about something slightly differently. The underlying point about ideally not having this code depend on the cost-model, but the information in the corresponding VPlan still applies.

It seems like overkill to record a per operand uniform flag in the recipe?

Yes, but fortunately I don't think that is necessary, because I think all required information is already encoded in the operands in VPlan: they should be uniform VPReplicateRecipes. Making this information accessible here is currently work-in-progress. With D91500 which I just put up, you should be able to check if User.getOperand(op) is a VPReplicateRecipe and if so, IsUniform also needs to be true.

I think it would make sense to take it one step further, and use this logic directly in VPTransformState::get: when requesting a particular lane for a uniform VPValue, we should always be able to just return lane 0 and other callers potentially could also benefit. With D91501, the changes to scalarizeInstruction should not be needed.

6262

I agree this one is borderline and a message like "must be called with a uniform memory instruction" probably does not add too much. It might make it slightly more explicit why this assertion is here for people not familiar with the code. I don't really mind either way :)

anna added inline comments.Nov 16 2020, 1:31 PM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
2667

Frankly, I'm finding the code written here in review easier to read :) Mainly because my context with VPlan is limited. Do we need to block this change on two reviews landing? I'm okay either way - but we can also land this and then "clean this up" once D91500 and D91501 lands?

I think it would make sense to take it one step further, and use this logic directly in VPTransformState::get: when requesting a particular lane for a uniform VPValue, we should always be able to just return lane 0 and other callers potentially could also benefit.

@fhahn Could you pls clarify further what you mean by "other callers potentially could also benefit" ?

5039

Could this be turned into an assert and landed separately? I see all callers of addToWorklistIfAllowed either already checks for outOfScope or the instructions is already checked to be in loop.

fhahn added inline comments.Nov 18 2020, 12:28 PM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
2667

Frankly, I'm finding the code written here in review easier to read :) Mainly because my context with VPlan is limited. Do we need to block this change on two reviews landing? I'm okay either way - but we can also land this and then "clean this up" once D91500 and D91501 lands?

The linked patches are 'more' code yes, but most of the code is part of current in-progress improvements and the uniform handling is mostly a nice side benefit of VPlanization.

We don't necessarily need to block this change, I just wanted to note that this is a step backwards in terms of the general direction (quite a bit of work was spent on moving any cost-model references out of code generation). It's just a small step though and we have a clear path forward to resolve this.

@fhahn Could you pls clarify further what you mean by "other callers potentially could also benefit" ?

There might be other places that request individual lanes for uniform values. Those would also benefit from only getting the first lane.

reames added inline comments.Nov 22 2020, 4:38 PM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
2667

I've gone ahead and approved your D91501. I don't think it fully solves the problem, but I see little harm in it.

Your proposed approach creates a coupling between the lowering strategy chosen and the semantic property of uniformity. To some extend that might be unavoidable, but can we do better?

As an example case, consider an udiv(add X, Y), Z) where all inputs are loop invariant. If X and Y already have vector uses, we might reasonable decide that a widened vector add is the best lowering for this uniform expression. (Not sure if we do today, but it's a reasonable choice.) When considering the udiv, if we decide to replicate it (because we know it's uniform and expensive), we want to be able to extra lane 0 regardless of the lowering strategy we chose for the add.

Another approach is to directly skip the cost model and ask the legality question. That works for the example I just gave, but *doesn't* work with the way we currently handle GEPs feeding wideable loads - which is entirely handled in costing.

I suspect we need to move the GEP detection out of the cost model, but I don't fully understand the broader implications of that.

reames added inline comments.Nov 22 2020, 4:52 PM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5039

The check is needed for case where a uniform pointer is defined outside the loop but used inside of it. I could move the check to the caller, but the placement seems to make more sense here?

reames added inline comments.Nov 22 2020, 4:54 PM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
2667

We don't necessarily need to block this change, I just wanted to note that this is a step backwards in terms of the general direction (quite a bit of work was spent on moving any cost-model references out of code generation). It's just a small step though and we have a clear path forward to resolve this.

Could I get an LGTM then?

I'm happy to help evolve towards a good overall design, but it's really hard to make progress when the first patch which has a functional effect gets blocked in review.

fhahn accepted this revision.Nov 23 2020, 1:22 PM
fhahn added a reviewer: Ayal.
fhahn added a subscriber: Ayal.

LGTM, thanks. I also added @Ayal, who might also have some thoughts. It would probably be best to wait a few days with committing, in case there are additional comments.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
2667

Your proposed approach creates a coupling between the lowering strategy chosen and the semantic property of uniformity. To some extend that might be unavoidable, but can we do better?

Yes, the VPlan changes I put up just ask the same question in a different way. I think we should be able to improve the code generated for your example, but that should probably happen before codegen, e.g. by widening.

Another approach is to directly skip the cost model and ask the legality question. That works for the example I just gave, but *doesn't* work with the way we currently handle GEPs feeding wideable loads - which is entirely handled in costing.

Asking legality here comes with the same problematic coupling which hinders modularizing transforms via VPlan I think.

This revision is now accepted and ready to land.Nov 23 2020, 1:22 PM