This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Run unmerge combines post regbankselect
ClosedPublic

Authored by Pierre-vh on Jan 20 2023, 1:40 AM.

Details

Summary

RegBankSelect can insert G_UNMERGE_VALUES in a lot of places which
left us with a lot of unmerge/merge pairs that could be simplified.
These often got in the way of pattern matching and made codegen
worse.

This patch:

  • Makes the necessary changes to the merge/unmerge combines so they can run post RegBankSelect
  • Adds relevant unmerge combines to the list of RegBankSelect combines for AMDGPU
  • Updates some tablegen patterns that were missing explicit cross-regbank copies (V_BFI patterns were causing constant bus violations with this change).

This seems to be mostly beneficial for code quality.

Diff Detail

Event Timeline

Pierre-vh created this revision.Jan 20 2023, 1:40 AM
Herald added a project: Restricted Project. · View Herald TranscriptJan 20 2023, 1:40 AM
Pierre-vh requested review of this revision.Jan 20 2023, 1:40 AM
Herald added a project: Restricted Project. · View Herald TranscriptJan 20 2023, 1:40 AM
arsenm added inline comments.Jan 20 2023, 5:52 AM
llvm/lib/CodeGen/GlobalISel/CombinerHelper.cpp
1767–1775

I'm pretty sure we have a helper for this already (at least the artifact combiner handles this already)

1771–1774

Should go through one pair of getRegClassOrRegBank calls

llvm/lib/Target/AMDGPU/SIInstructions.td
2050–2053

I think this needs to go off the a predicate. If we have to generate so many copies it's potentially worse than matching the pattern

arsenm added inline comments.Jan 20 2023, 9:22 AM
llvm/lib/CodeGen/GlobalISel/CombinerHelper.cpp
1767–1775

What I was thinking of was canReplaceReg. We should have another flavor that only accepts virtual registers and inserts the copy if needed

Pierre-vh marked an inline comment as done.

Comments

llvm/lib/CodeGen/GlobalISel/CombinerHelper.cpp
1767–1775

It's 3 lines now, do we need to create a helper?
If yes, what should we name it? Do we just create an overload of canReplaceReg that takes a Builder?

llvm/lib/Target/AMDGPU/SIInstructions.td
2050–2053

Do you mean they should be on the matching part (DivergentBinFrag<...) and not on the output pattern?
For now this doesn't look like it really has cases where too many copies are emitted, I think SIFoldOperand will fold most of them.

I don't remember how we want to handle this in GISel. Do we want to be aware of constant-bus limitations at matching time, or ignore it/always respect it by greedily inserting copies and cleaning it up later (with SIFoldOperands and an eventual improved version)
To me the second strategy looks simpler and just as powerful but I don't quite remember if we made a decision there

FWIW, I was thinking about adding a "Finalizer" method to GISel Pattern Matching to allow targets to call some C++ code before the pattern is applied/instructions are built so it can "veto" the pattern if needed and fail if it's non-profitable for instance. I didn't make an RFC yet but if it sounds like something that would be useful I can make one in the coming days/week.

Pierre-vh updated this revision to Diff 495742.Feb 8 2023, 12:10 AM

Rebase + ping review

arsenm added inline comments.Feb 9 2023, 6:03 AM
llvm/lib/CodeGen/GlobalISel/CombinerHelper.cpp
1767–1775

I guess not. However, I’m not sure how scalable this strategy will be as we add more regbank combines. I guess we can go with this for now and see how it works out

llvm/lib/Target/AMDGPU/SIInstructions.td
2050–2053

It's partially an open question for how to handle the constant bus problem. The current strategy is supposed to be let regbankselect aggressively emit copies to VGPR up front so it's impossible to violate, which SIFoldOperands can clean up.

In the case of patterns, I think it would be worse if we had to manually write all of them out in C++ to handle them in SIFoldOperands. Selection patterns should be applying logic to avoid violating it. The finalizer sounds like the same as the current arbitrary code predicates?

Pierre-vh marked an inline comment as done.Feb 9 2023, 6:17 AM
Pierre-vh added inline comments.
llvm/lib/Target/AMDGPU/SIInstructions.td
2050–2053

So what I did here follows the current strategy, right? It aggressively copies to VGPRs and lets SIFoldOperands clean up. I suppose an alternative could be to add some PatFrag(s) with GISel code with a heuristic to prevent matching if more than X copies cannot be folded but it feels fragile (because we'd need an arbitrary limit). What do you think?

I also don't think there's a test case here where matching BFI leads to worse code due to too many copies. I think the odds of most of the copies being folded-out are pretty good. There's sometimes one left but it's better than having 3 or 4 instructions to do what BFI can do in one.

arsenm accepted this revision.Feb 9 2023, 6:20 AM

Does the instruction matcher try to optimize the case where the class is already VGPR and avoid introducing a new instruction?

This revision is now accepted and ready to land.Feb 9 2023, 6:20 AM

Does the instruction matcher try to optimize the case where the class is already VGPR and avoid introducing a new instruction?

It doesn't look like it does at first glance (it just emits 3 BuildMI/AddTempRegister sequence). The copies are very likely folded out later but I'm not sure where

This revision was automatically updated to reflect the committed changes.

Does the instruction matcher try to optimize the case where the class is already VGPR and avoid introducing a new instruction?

It doesn't look like it does at first glance (it just emits 3 BuildMI/AddTempRegister sequence). The copies are very likely folded out later but I'm not sure where

I’m sure they are folded out by PeepholeOpt, but as a compile time optimization we should try to avoid creating unnecessary instructions