This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Break-up large PHIs for DAGISel
ClosedPublic

Authored by Pierre-vh on Feb 10 2023, 5:23 AM.

Details

Summary

DAGISel uses CopyToReg/CopyFromReg to lower PHI nodes. With large PHIs, this can result in poor codegen.
This is because it introduces a need to have a build_vector before copying the PHI value, and that build_vector may have many undef elements. This can cause very high register pressure and abnormal stack usage in some cases.

This scalarization/phi "break-up" can be easily tuned/disabled through CL options in case it's not beneficial for some users.
It's also only enabled for DAGIsel and GlobalISel handles PHIs much better (as it works on the whole function).

This can both scalarize (break a vector into its elements) and simplify (break a vector into smaller, more manageable subvectors) PHIs.

Fixes SWDEV-321581

Diff Detail

Event Timeline

Pierre-vh created this revision.Feb 10 2023, 5:23 AM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 10 2023, 5:23 AM
Pierre-vh requested review of this revision.Feb 10 2023, 5:23 AM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 10 2023, 5:23 AM

Also could use a switch test where the same successor phi needs to list the same predecessor twice

llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
1360

A few random formatting changes

1408

With the way fallback works, we can end up using the DAG and wouldn’t have used this. That’s probably ok though

1417

Power of 2 shouldn’t matter

llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-scalarize-large-phis.ll
612 ↗(On Diff #496438)

Should also include some FP, 8-bit and 16-bit element cases. We’d be bette riff breaking those into 32 or 64 bit pieces than full scalarization

Pierre-vh added inline comments.Feb 10 2023, 6:23 AM
llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
1360

Is it fine if I run clang-format on this file and commit the result directly to get rid of them?

arsenm added inline comments.Feb 10 2023, 6:25 AM
llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
1360

If you use git clang-format it only touches diffed lines

Pierre-vh added inline comments.Feb 10 2023, 7:22 AM
llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-scalarize-large-phis.ll
612 ↗(On Diff #496438)

Should I also lower the threshold for this to kick in? Else 8 bit vectors won't be interesting unless they're >32 elts

arsenm added inline comments.Feb 10 2023, 7:26 AM
llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-scalarize-large-phis.ll
612 ↗(On Diff #496438)

We only really have 32-bit registers. Everything wider is a bit unnatural. GlobalISel breaks everything into 32 or 64 pieces. There may be second order benefits from a compiler machinery perspective for preserving larger pieces. There’s room for experimentation

Pierre-vh updated this revision to Diff 496880.Feb 13 2023, 2:09 AM
Pierre-vh marked 6 inline comments as done.

I kept the threshold at 256 for now because lower values make the transform kick more often and it's not always clear to me whether it's a positive or negative thing. Some tests have more instructions, some have a bit less, etc.
I think 256 is good enough to get started, we can always tweak it in a separate commit if benchmarking proves that it's useful. What do you think?

Also the following tests are currently broken:

  • test_mfma_loop_32xfloat in acc-ldst.ll: We get accvgpr read/writes instead of a load to agpr directly.
    • One incoming value from a load
    • One incoming value from the call to mfma later in the same block (loop)
  • test_vccnz_ifcvt_triangle256 in early-if-convert.ll: s_branch_vccnz is not matched anymore.
    • One incoming value from a load
    • One incoming value from a vector add
  • test_mfma_loop_zeroinit in mfma-loop.ll : accvgpr copies inside the loop
    • One incoming value is a zeroinit
    • One incoming value is from call to mfma later in the same block (loop)

I think we need an additional heuristic but I'm not sure what. Maybe we should not transform when one of the value comes from a load (we know there won't be an undef so it should be good) or comes from the same block (loop)?
I'm also not sure if extract/insert subvector with a constant operand (in the case of zeroinit) gets folded out; do I need to special-case it?

Pierre-vh retitled this revision from [AMDGPU] Scalarize some large PHIs for DAGISel to [AMDGPU] Break-up large PHIs for DAGISel.Feb 13 2023, 2:09 AM
Pierre-vh edited the summary of this revision. (Show Details)
Pierre-vh updated this revision to Diff 496949.Feb 13 2023, 6:05 AM
Pierre-vh edited the summary of this revision. (Show Details)

Add an optimization for Constant incoming values.
This allows us to avoid emitting a extract.vector of a zeroinit/constant vector.

I kept the threshold at 256 for now because lower values make the transform kick more often and it's not always clear to me whether it's a positive or negative thing. Some tests have more instructions, some have a bit less, etc.
I think 256 is good enough to get started, we can always tweak it in a separate commit if benchmarking proves that it's useful. What do you think?

Also the following tests are currently broken:

  • test_mfma_loop_32xfloat in acc-ldst.ll: We get accvgpr read/writes instead of a load to agpr directly.

Ideally this is the kind of case RegBankSelect is supposed to solve

I think we need an additional heuristic but I'm not sure what. Maybe we should not transform when one of the value comes from a load (we know there won't be an undef so it should be good) or comes from the same block (loop)?
I'm also not sure if extract/insert subvector with a constant operand (in the case of zeroinit) gets folded out; do I need to special-case it?

Really this is a register bank select problem. Fundamentally we always want 32-bit phis. Any regression from not using 32-bit phis is theoretically a failure with some other optimization

arsenm added inline comments.Feb 13 2023, 5:12 PM
llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
1415

This must be a new intrinsic, I didn't know it existed. Usually you would do a shufflevector to get the reduced elements

1438

Don't need = {}

Pierre-vh marked an inline comment as done.EditedFeb 14 2023, 12:14 AM

I kept the threshold at 256 for now because lower values make the transform kick more often and it's not always clear to me whether it's a positive or negative thing. Some tests have more instructions, some have a bit less, etc.
I think 256 is good enough to get started, we can always tweak it in a separate commit if benchmarking proves that it's useful. What do you think?

Also the following tests are currently broken:

  • test_mfma_loop_32xfloat in acc-ldst.ll: We get accvgpr read/writes instead of a load to agpr directly.

Ideally this is the kind of case RegBankSelect is supposed to solve

I think we need an additional heuristic but I'm not sure what. Maybe we should not transform when one of the value comes from a load (we know there won't be an undef so it should be good) or comes from the same block (loop)?
I'm also not sure if extract/insert subvector with a constant operand (in the case of zeroinit) gets folded out; do I need to special-case it?

Really this is a register bank select problem. Fundamentally we always want 32-bit phis. Any regression from not using 32-bit phis is theoretically a failure with some other optimization

In this case I think it's a SIFoldOperand issue (or something strange happening higher that blocks SIFoldOperands). It's not folding the AGPR copies away.
If PHIs aren't broken, we get the areg copy in the entry block:

%181:vreg_1024_align2 = REG_SEQUENCE (..vreg128s)
%1:vreg_1024_align2 = COPY %181:vreg_1024_align2
%243:areg_1024_align2 = COPY %1:vreg_1024_align2, implicit $exec

But when PHIs are broken, we get it in the loop closer to V_MFMA and after the PHIs. I suspect SIFoldOperand doesn't handle the PHI node to "propagate" the areg class up until it reaches the store.

foldOperand: %122.sub0:areg_1024_align2
  %34:vgpr_32 = PHI %1:vgpr_32, %bb.0, %67:vgpr_32, %bb.1

I'm not sure how easy it is to make SIFoldOperand handle PHIs like that.
Or maybe the fix should be earlier in the pipeline (so aregs are assigned better)? There might be an opportunity to fix this in FixSGPRCopies:

%119:sgpr_1024 = REG_SEQUENCE (vgpr phi nodes)...
%123:areg_1024_align2 = COPY %119:sgpr_1024

We don't seem to fold sgpr -> agpr copies like we fold sgpr -> vgpr

llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
1415

If shufflevector has better handling and gets folded out for sure then it might be simpler to use it.
Do I just do a shufflevector with the same operand twice + the mask of the indexes I want?

1438
warning: missing field 'IncomingValues' initializer [-Wmissing-field-initializers]

It's needed because I use brace initializers

Pierre-vh marked an inline comment as done.

Use unary shufflevector instead of extract_vector

arsenm added inline comments.Feb 14 2023, 5:23 AM
llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
1415

You can use poison for the second operand

Pierre-vh updated this revision to Diff 497614.Feb 15 2023, 3:15 AM
  • Add names to values created by the transform (makes spotting issues with it easier)
  • Add comment to early-if-convert
foad added a comment.Feb 28 2023, 2:12 AM

This is because it introduces a need to have a build_vector before copying the PHI value, and that build_vector may have many undef elements. This can cause very high register pressure and abnormal stack usage in some cases.

Do you have any insight into *why* undef elements in build_vector would cause high register pressure? Is there any chance of fixing this elsewhere, by teaching other parts of the compiler to handle undef elements better?

This is because it introduces a need to have a build_vector before copying the PHI value, and that build_vector may have many undef elements. This can cause very high register pressure and abnormal stack usage in some cases.

Do you have any insight into *why* undef elements in build_vector would cause high register pressure? Is there any chance of fixing this elsewhere, by teaching other parts of the compiler to handle undef elements better?

In the particular case I was looking at, we had a 32xi32 vector (from 11xF64 legalized) with 10 undef elements. As the final operation of the BB was a BUILD_VECTOR + CopyToReg, it forced 22 values to be alive until that point and did a wasteful 32x32 copy for this 11xf64 vector. With this change, the undef values can be optimized out and the intermediate values (elements) no longer need to be all alive at the same point. This means they can be reordered and scratch usage goes back down to zero.

Before this patch, I had looked at optimizing this at the DAG level but it's too much effort to do there. As Matt put it earlier, PHIs/Copies larger than 32 bits are "unnatural" on AMDGPU anyway so there's no real benefit in keep around large PHIs.

foad added a comment.Feb 28 2023, 2:52 AM

This is because it introduces a need to have a build_vector before copying the PHI value, and that build_vector may have many undef elements. This can cause very high register pressure and abnormal stack usage in some cases.

Do you have any insight into *why* undef elements in build_vector would cause high register pressure? Is there any chance of fixing this elsewhere, by teaching other parts of the compiler to handle undef elements better?

In the particular case I was looking at, we had a 32xi32 vector (from 11xF64 legalized) with 10 undef elements. As the final operation of the BB was a BUILD_VECTOR + CopyToReg, it forced 22 values to be alive until that point and did a wasteful 32x32 copy for this 11xf64 vector. With this change, the undef values can be optimized out and the intermediate values (elements) no longer need to be all alive at the same point. This means they can be reordered and scratch usage goes back down to zero.

Before this patch, I had looked at optimizing this at the DAG level but it's too much effort to do there. As Matt put it earlier, PHIs/Copies larger than 32 bits are "unnatural" on AMDGPU anyway so there's no real benefit in keep around large PHIs.

Then why can't you scalarize them all? Why do we need yet another weird threshold in the backend?

Scalarizing everything may have some unintended side effects so that's why I left it at 256 for now. I'll check what happens when I lower it to 32 and report back.

Pierre-vh updated this revision to Diff 501080.Feb 28 2023, 3:59 AM
  • Reduce threshold to 32.
    • Lots more test changes. I disabled the transform in some of them to (hopefully) preserve the test's intent better.
    • loop-live-out-copy-undef-subrange.ll takes the biggest regression. I think we may not need to do this transform if an incoming value is an arg? Args will be copied anyway so we don't really gain anything in those cases. Thoughts?
  • Preserve debug info better
Pierre-vh updated this revision to Diff 501433.Mar 1 2023, 2:15 AM

Remove mention of "power of two" sizes

Ping - can this land?

kzhuravl accepted this revision.Mar 27 2023, 6:54 PM

lgtm, thanks!

This revision is now accepted and ready to land.Mar 27 2023, 6:54 PM
This revision was landed with ongoing or failed builds.Mar 28 2023, 12:38 AM
This revision was automatically updated to reflect the committed changes.