This is an archive of the discontinued LLVM Phabricator instance.

[LV] Explicily lower uniform load as single load
AbandonedPublic

Authored by reames on Aug 29 2022, 3:11 PM.

Details

Summary

This replaces the "scalarize then clone-as-uniform" lowering for loads of uniform values with a dedicated mode of memory widening for this case. This avoids needing to relying on instcombine/GVN to clean up after redundant loads.

Note: I plan to do stores too. This patch includes some of the API plumbing on the store side, but doesn't actually use it yet. I thought the consistency in the API was worth a bit of temporarily nop code.

Diff Detail

Event Timeline

reames created this revision.Aug 29 2022, 3:11 PM
reames requested review of this revision.Aug 29 2022, 3:11 PM
Herald added a project: Restricted Project. · View Herald TranscriptAug 29 2022, 3:11 PM

I think this looks like a nice, logical improvement. Having a new uniform widening decision seems sensible. I have a few mostly minor comments, but I was a little surprised by a couple of test changes.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
4262

nit: White space change.

4712

I'm not sure what the "if legal" bit refers to here? If we're treating this as uniform then it's because Legal->isUniformMemOp has returned true. I think you could just say:

// Only try to scalarize the uniform memop itself if we're not using the direct lowering ...
9746

Not sure what this comment means?

9783

I think we normally write this as

Value *Addr = State.get(getAddr(), VPIteration(0, 0));

which is what we've done elsewhere.

9785

Looks like this can fit on the end of the line above?

9786

Not sure I understand this comment? Why does this involve a reverse shuffle?

llvm/lib/Transforms/Vectorize/VPlan.h
1774

nit: I think this should have a simple comment like the functions above.

llvm/test/Transforms/LoopVectorize/X86/cost-model-assert.ll
26

This looks a bit unexpected. Do you know why the minimum iteration check has changed here from 8 to 4? It sounds like we've made a different choice of VF and/or interleave count (IC). I can only assume that the cost of some instructions has changed?

llvm/test/Transforms/LoopVectorize/first-order-recurrence-sink-replicate-region.ll
476

Do you know why this has changed? As far as I understand it, 'CLONE' literally means clone the scalar instruction once, whereas REPLICATE means insert copies into each lane of the vector. So in a sense we're now doing more loads than before I think?

This replaces the "scalarize then clone-as-uniform" lowering for loads of uniform values with a dedicated mode of memory widening for this case. This avoids needing to relying on instcombine/GVN to clean up after redundant loads.

IIUC the main issue is that VPReplicateRecipe::isUniform means 'uniform-per-part/uniform-per-VF', not 'uniform-across-all-parts'. Unfortunately the terminology is not really used very consistently across different parts of LV. We should be able to catch some cases through improving lowering for uniform VPReplicateRecipes directly: D133019.

If we want to use isUniformMemOp to determine the uniform-across-all-parts property, I think it would make sense to extend VPReplicateRecipe to distinguish between our 2 versions of uniformity. AFAIK VPWidenXXX naming scheme is meant to indicate that the recipe will be widened (i.e. a wide vector instruction will be generated), but for uniform loads we only generate a single scalar instruction).

reames abandoned this revision.Aug 31 2022, 1:20 PM

This replaces the "scalarize then clone-as-uniform" lowering for loads of uniform values with a dedicated mode of memory widening for this case. This avoids needing to relying on instcombine/GVN to clean up after redundant loads.

IIUC the main issue is that VPReplicateRecipe::isUniform means 'uniform-per-part/uniform-per-VF', not 'uniform-across-all-parts'. Unfortunately the terminology is not really used very consistently across different parts of LV. We should be able to catch some cases through improving lowering for uniform VPReplicateRecipes directly: D133019.

I went ahead and accepted that one. I don't really care which approach we use here. This one felt slightly cleaner to me when considering store handling, but a) we can revisit if desired when adding stores, and b) incremental progress is good.

If we want to use isUniformMemOp to determine the uniform-across-all-parts property, I think it would make sense to extend VPReplicateRecipe to distinguish between our 2 versions of uniformity. AFAIK VPWidenXXX naming scheme is meant to indicate that the recipe will be widened (i.e. a wide vector instruction will be generated), but for uniform loads we only generate a single scalar instruction).

The result of the operation is still the widened vector type. It just happens to come from a splat of a single load. That would be my argument for this approach over yours, but as I said, I really don't care.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
4712

We can't scalarize a scalable uniform store if the value being stored is not loop invariant. Because we don't know which lane to store, and the general scalarization support does not handle predicated scalable vectors.

9746

Stray change. This is the token I use for easy local search, should have been removed before posting.

9786

I coped the comment and forgot to update it. Will fix.

llvm/test/Transforms/LoopVectorize/X86/cost-model-assert.ll
26

Exactly. If I remember correctly, this is because uniform instructions are never considered possible to scalarize, and I didn't add the same check for the CM_Uniform. I'd decided it didn't really matter since this test wasn't about codegen (from the name), and was instead a no-crash test.

llvm/test/Transforms/LoopVectorize/first-order-recurrence-sink-replicate-region.ll
476

This one is definitely the profitable to scalarize point explained just above. I dug into this one in some depth.