This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU/GlobalISel: Combine zext(trunc x) to x after RegBankSelect
ClosedPublic

Authored by Petar.Avramovic on Jan 26 2021, 3:12 AM.

Details

Summary

RegBankSelect creates zext and trunc when it selects banks for uniform i1.
Add zext_trunc_fold from generic combiner to post RegBankSelect combiner.

Diff Detail

Event Timeline

Petar.Avramovic requested review of this revision.Jan 26 2021, 3:12 AM
Herald added a project: Restricted Project. · View Herald TranscriptJan 26 2021, 3:12 AM

This should also apply to conditional branches, but the user doesn't actually matter. This is really a generic combine for (trunc (bool_ext_type bool_producer))

llvm/lib/Target/AMDGPU/AMDGPUCombine.td
44

The select part here isn't essential, but the matcher here wants a specific opcode. I guess you could bypass the generated combine matcher and just call this combine in a switch, or at least add the relevant G_BRCOND user

llvm/lib/Target/AMDGPU/AMDGPURegBankCombiner.cpp
90 ↗(On Diff #319258)

Changing the instruction without notifying the observer

llvm/test/CodeGen/AMDGPU/GlobalISel/combine-uniform-icmp-select.mir
37–62 ↗(On Diff #319258)

Most of these instructions aren't relevant to the combine. You can also directly emit copies from 64-bit SGPRs even though we emit them normally as separate 32-bit copies

116–142 ↗(On Diff #319258)

Ditto

foad added inline comments.Jan 26 2021, 8:10 AM
llvm/lib/Target/AMDGPU/AMDGPURegBankCombiner.cpp
86 ↗(On Diff #319258)

Can you call the variable ICmp instead of ICMP?

foad added inline comments.Jan 29 2021, 3:02 AM
llvm/lib/Target/AMDGPU/AMDGPUCombine.td
40

You don't need AMDGPURegBankCombinerHelper::UniformICmpSelectMatchInfo at all. You can just use Register. See D95645 for an example.

Petar.Avramovic retitled this revision from AMDGPU/GlobalISel: Combine uniform icmp with select to AMDGPU/GlobalISel: Combine uniform icmp with one use.
Petar.Avramovic edited the summary of this revision. (Show Details)

Addressed review comments.

arsenm added inline comments.Feb 1 2021, 3:58 PM
llvm/lib/Target/AMDGPU/AMDGPURegBankCombiner.cpp
76–77 ↗(On Diff #320478)

I don't think this really needs a one use check. Consider the case where another use already exists without the intermediate casts:

%x:s32 = G_ICMP
%y:s1 = G_TRUNC %x
%z:s32 = G_ZEXT %y
%select0 = G_SELECT %x
%select1 = G_SELECT %z

foad added a comment.Feb 2 2021, 4:09 AM

I don't understand why this needs to be AMDGPU-specific, and why it only works for uniform values, and why it only works inside G_SELECT and G_BRCOND.

Can't you have a generic combine that simplifies (zext (trunc x)) -> x if the types match and the high bits of x are known to be zero?

Petar.Avramovic added inline comments.Feb 2 2021, 4:40 AM
llvm/lib/Target/AMDGPU/AMDGPURegBankCombiner.cpp
76–77 ↗(On Diff #320478)

I wanted to keep this as simple as possible and cover most common case (single use).
Can we get mir like that? Regbankselect independently legalizes sgpr icmp and select. ICMP gets followed by G_TRUNC and G_SELECT has to G_ZEXT input condition. I would expect something like this:

%x:s32 = G_ICMP
%y:s1 = G_TRUNC %x
...
%z:s32 = G_ZEXT %y
%select1 = G_SELECT %z
...
%w:s32 = G_ZEXT %y
%select0 = G_SELECT %w

Maybe we could cover this case for select, are there more?
Afaik only trunc can end up having multiple uses. So we have to check G_TRUNC uses and find the one that Helper.dominates other uses. If this happens to be our G_ZEXT then move icmp before select and trunc after select, like this:

...
%z:s32 = G_ZEXT %y //dead
%x:s32 = G_ICMP
%select1 = G_SELECT %x
%y:s1 = G_TRUNC %x
...
%w:s32 = G_ZEXT %y
%select0 = G_SELECT %w
llvm/test/CodeGen/AMDGPU/GlobalISel/combine-move-uniform-icmp-with-one-use.mir
34–35 ↗(On Diff #320478)

Combine targets specific case when there are instructions between uniform G_ICMP and G_SELECT/G_BRCOND. Zext and trunc are there but combining them has no effect on having to temporary save scc. We have to move icmp. Also since Zext and trunc have no effect on selected instruction we don't move them and just leave them to be removed as dead instructions.

arsenm added inline comments.Feb 2 2021, 7:07 AM
llvm/lib/Target/AMDGPU/AMDGPURegBankCombiner.cpp
76–77 ↗(On Diff #320478)

If the builder is CSEing, you could end up with the same trunc used multiple times

Petar.Avramovic retitled this revision from AMDGPU/GlobalISel: Combine uniform icmp with one use to AMDGPU/GlobalISel: Combine uniform icmp.
Petar.Avramovic edited the summary of this revision. (Show Details)

Handle some cases with many uses. Adding icmp fold without move for the case when we can't move icmp because code looks nicer in the case with more than one use.

Petar.Avramovic added inline comments.Feb 2 2021, 8:06 AM
llvm/test/CodeGen/AMDGPU/GlobalISel/move-uniform-icmp.ll
24–26 ↗(On Diff #320792)

This looks pretty much same as before without uniform_icmp combine.

arsenm added inline comments.Feb 3 2021, 10:13 AM
llvm/lib/Target/AMDGPU/AMDGPURegBankCombiner.cpp
123–126 ↗(On Diff #320792)

I don't think this should be trying to find defs and move them. If we're CSEing, just creating the instruction you need would get the desired result

Petar.Avramovic edited the summary of this revision. (Show Details)

Use zext_trunc_fold from generic combiner to separately fold all cases of zext(trunc x) -> x made by regbankselect.
icmp move before select/brcond has to be aware of current state of MF since we run combines top-down and instructions (trunc) can be left without uses (zext was deleted by zext_trunc_fold)

arsenm added inline comments.Feb 11 2021, 3:24 PM
llvm/lib/Target/AMDGPU/AMDGPURegBankCombiner.cpp
72 ↗(On Diff #322084)

Opcode check first?

92 ↗(On Diff #322084)

You shouldn't need to check dominance (and I don't see how this would ever not be the case)

103 ↗(On Diff #322084)

This isn't changing the operands anymore? (I also think just creating the new instruction with the new operand is cleaner than modifying in place, doing it that way should fix the multiple use case too)

103–107 ↗(On Diff #322084)

This really shouldn't be trying to move instructions. Do you even really need to erase the trunc? If it's dead it will be removed already

llvm/lib/Target/AMDGPU/AMDGPURegBankCombiner.cpp
92 ↗(On Diff #322084)

Trunc could have a use before the MI so we cant move icmp and trunc past that use.

103 ↗(On Diff #322084)

zext_trunc_fold changes operands. This now only moves icmp (and trunc for multiple use).

103–107 ↗(On Diff #322084)

Do you even really need to erase the trunc?

ICmp move breaks ssa in mir, trunc uses icmp before it is defined. I meant to delete trunc because this is the place we broke ssa, and we are aware of it. Leaving it to be deleted by something that eliminates dead instructions should work fine (I don't expect anything else to check where uses of this trunc are defined).

This really shouldn't be trying to move instructions.

What do you suggest, making new icmp (and trunc) or something else?

llvm/test/CodeGen/AMDGPU/GlobalISel/combine-move-uniform-icmp.mir
203 ↗(On Diff #322084)

Trunc use before select.

213 ↗(On Diff #322084)

After zext_trunc fold this select uses %17 G_ICMP instead of %22 G_ZEXT but we can't move icmp because %11:sgpr(s1) = G_TRUNC %17(s32) uses icmp and %18:sgpr(s32) = G_ANYEXT %11(s1) above uses trunc.

arsenm added inline comments.Feb 18 2021, 3:20 PM
llvm/lib/Target/AMDGPU/AMDGPURegBankCombiner.cpp
103–107 ↗(On Diff #322084)

Yes, if you recreate the desired instruction it should automatically CSE as you need by the builder

Actually, why is this patch necessary? The ZEXT+TRUNC handling takes care of this already?

Actually, why is this patch necessary? The ZEXT+TRUNC handling takes care of this already?

Oh right, the problem here is actually the intermediate SCC copies produced as a selection artifact.

I'm not sure treating this as a combine is the correct way to go about this. The DAG handles this with a scheduler to minimize physical register liveranges. We might be better treating this off as a scheduling issue for after selection, when we directly see the SCC defs.

foad added a comment.Feb 23 2021, 5:51 AM

Actually, why is this patch necessary? The ZEXT+TRUNC handling takes care of this already?

Oh right, the problem here is actually the intermediate SCC copies produced as a selection artifact.

I'm not sure treating this as a combine is the correct way to go about this. The DAG handles this with a scheduler to minimize physical register liveranges. We might be better treating this off as a scheduling issue for after selection, when we directly see the SCC defs.

For the case I looked at (test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.end.cf.i64.ll) just running zext_trunc_fold as a post-regbankselect combine was enough to get rid of the SCC copies. So perhaps we should commit that first, and then worry about how to handle the remaining cases?

Actually, why is this patch necessary? The ZEXT+TRUNC handling takes care of this already?

Oh right, the problem here is actually the intermediate SCC copies produced as a selection artifact.

I'm not sure treating this as a combine is the correct way to go about this. The DAG handles this with a scheduler to minimize physical register liveranges. We might be better treating this off as a scheduling issue for after selection, when we directly see the SCC defs.

For the case I looked at (test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.end.cf.i64.ll) just running zext_trunc_fold as a post-regbankselect combine was enough to get rid of the SCC copies. So perhaps we should commit that first, and then worry about how to handle the remaining cases?

Yes, these are unrelated problems

Petar.Avramovic retitled this revision from AMDGPU/GlobalISel: Combine uniform icmp to AMDGPU/GlobalISel: Combine zext(trunc x) to x after RegBankSelect.
Petar.Avramovic edited the summary of this revision. (Show Details)

Dropping icmp move for from this patch. Leaving zext_trunc_fold.
Zext is selected into AND with 1. zext_trunc_fold results in getting rid of the SCC copies when zext was the only instruction between icmp and select/branch.

foad added a comment.Feb 23 2021, 7:44 AM

Looks good. I'm not sure we actually need to introduce the helper state class in this patch, do we? But I'll guess we'll need it later.

There is no need for helper state class.

foad accepted this revision.Feb 23 2021, 8:14 AM
This revision is now accepted and ready to land.Feb 23 2021, 8:14 AM
This revision was landed with ongoing or failed builds.Mar 4 2021, 6:06 AM
This revision was automatically updated to reflect the committed changes.