This is an archive of the discontinued LLVM Phabricator instance.

[GlobalISel] LegalizationArtifactCombiner: Elide redundant G_AND
ClosedPublic

Authored by tobias-stadler on Aug 29 2023, 2:18 PM.

Details

Summary

The legalizer currently generates lots of G_AND artifacts.
For example between boolean uses and defs there is always a G_AND with a mask of 1, but when the target uses ZeroOrOneBooleanContents, this is unnecessary.
Currently these artifacts have to be removed using post-legalize combines.
Omitting these artifacts at their source in the artifact combiner has a few advantages:

  • We know that the emitted G_AND is very likely to be useless, so our KnownBits call is likely worth it.
  • The G_AND and G_CONSTANT can interrupt e.g. G_UADDE/... sequences generated during legalization of wide adds which makes it harder to detect these sequences in the instruction selector (e.g. useful to prevent unnecessary reloading of AArch64 NZCV register).
  • This cleans up a lot of legalizer output and even improves compilation-times.

AArch64 CTMark geomean: O0 -5.6% size..text; O0 and O3 ~-0.9% compilation-time (instruction count).

Since this introduces KnownBits into code-paths used by O0, I reduced the default recursion depth.
This doesn't seem to make a difference in CTMark, but should prevent excessive recursive calls in the worst case.

Diff Detail

Event Timeline

tobias-stadler created this revision.Aug 29 2023, 2:18 PM
Herald added a project: Restricted Project. · View Herald TranscriptAug 29 2023, 2:18 PM
tobias-stadler requested review of this revision.Aug 29 2023, 2:18 PM
Herald added a project: Restricted Project. · View Herald TranscriptAug 29 2023, 2:18 PM
tobias-stadler edited the summary of this revision. (Show Details)Aug 29 2023, 9:23 PM

We only emit the mask G_CONSTANT when necessary. Even when the G_AND is combined away later, the constant sometimes ends up being reused by other instructions instead of becoming dead.

I'm a bit confused by how this could happen. Does this happen with optimizations?

The wording suggests that a later transform needs a G_AND, and probably the CSEMIRBuilder returns a reference to dead one? Why would other instructions need a dead instruction?

arsenm added inline comments.Aug 30 2023, 6:24 AM
llvm/include/llvm/CodeGen/GlobalISel/LegalizationArtifactCombiner.h
143

How much of the benefit is there from only handling the constant case?

We only emit the mask G_CONSTANT when necessary. Even when the G_AND is combined away later, the constant sometimes ends up being reused by other instructions instead of becoming dead.

I'm a bit confused by how this could happen. Does this happen with optimizations?

The wording suggests that a later transform needs a G_AND, and probably the CSEMIRBuilder returns a reference to dead one? Why would other instructions need a dead instruction?

It happens at O3. You can take a look at https://godbolt.org/z/G97Kv6bza (this is taken from one of the Atomics tests). Maybe "reused" is the wrong wording. %47 is the mask constant emitted for the G_AND, but is then also used in various other places right after the legalizer. I assume this is intended CSE behavior, but I'd prefer this not to happen between G_UADDE sequences, so that these patterns can easily be detected by the selector again. Getting rid of the useless G_ANDs everywhere seemed like the cleanest solution and happens to also be a win in every CTMark metric.
Some more context: When I implemented wide adds in the selector, I opted to prevent unnecessary NZCV reloads by looking at the previous instruction to determine if NZCV will already be set correctly. This seems like the a good solution because it doesn't require custom legalization or changes to the way we handle flags (I think X86 does copies to and from the flags register and let's RegAlloc do the heavy lifting). The example also hits another issue though: RegBankAlloc decides to move one of the vRegs to fpr and inserts a copy, which also inhibits the optimization. I will have to look into if I missed some more edge cases like this. I think in the worst case we can always do some more cleanup in PostSelectOptimize, since we already do liveness analysis there. Though I'd prefer to make legalized G_UADDE sequences reliably detectable by the selector, so that we can still get good codegen without going the X86 route or resorting to additional optimizations.

tobias-stadler added inline comments.Aug 30 2023, 6:44 AM
llvm/include/llvm/CodeGen/GlobalISel/LegalizationArtifactCombiner.h
143

I'm not quite sure what you mean by "constant case". Do you mean after a single computeKnownBitsImpl calll? I can try that by reducing MaxDepth even more and will report. I believe most of the benefit comes from the boolean case. I'm not aware of a method to determine if we can ignore the AND other than how KnownBits does it using BooleanContents for applicable instructions, in which case I wouldn't want to duplicate this functionality here. Maybe we can add a MaxDepth override for cases like this?

tobias-stadler added inline comments.Aug 30 2023, 7:28 AM
llvm/include/llvm/CodeGen/GlobalISel/LegalizationArtifactCombiner.h
143

MaxDepth = 1 compared to MaxDepth = 6 doesn't seem to measurably affect compile-time, but geomean size..text goes from -5.6% to -5.5%. I don't think there's much to gain by adding extra complexity here just too avoid KnownBits.

We only emit the mask G_CONSTANT when necessary. Even when the G_AND is combined away later, the constant sometimes ends up being reused by other instructions instead of becoming dead.

I'm a bit confused by how this could happen. Does this happen with optimizations?

The wording suggests that a later transform needs a G_AND, and probably the CSEMIRBuilder returns a reference to dead one? Why would other instructions need a dead instruction?

It happens at O3. You can take a look at https://godbolt.org/z/G97Kv6bza (this is taken from one of the Atomics tests). Maybe "reused" is the wrong wording. %47 is the mask constant emitted for the G_AND, but is then also used in various other places right after the legalizer. I assume this is intended CSE behavior, but I'd prefer this not to happen between G_UADDE sequences, so that these patterns can easily be detected by the selector again. Getting rid of the useless G_ANDs everywhere seemed like the cleanest solution and happens to also be a win in every CTMark metric.
Some more context: When I implemented wide adds in the selector, I opted to prevent unnecessary NZCV reloads by looking at the previous instruction to determine if NZCV will already be set correctly. This seems like the a good solution because it doesn't require custom legalization or changes to the way we handle flags (I think X86 does copies to and from the flags register and let's RegAlloc do the heavy lifting). The example also hits another issue though: RegBankAlloc decides to move one of the vRegs to fpr and inserts a copy, which also inhibits the optimization. I will have to look into if I missed some more edge cases like this. I think in the worst case we can always do some more cleanup in PostSelectOptimize, since we already do liveness analysis there. Though I'd prefer to make legalized G_UADDE sequences reliably detectable by the selector, so that we can still get good codegen without going the X86 route or resorting to additional optimizations.

I see what you mean now. Is there any code size impact with optimizations (-Os or -O3)? I expect there to be none but just want to check.

arsenm added inline comments.Aug 30 2023, 7:55 AM
llvm/include/llvm/CodeGen/GlobalISel/LegalizationArtifactCombiner.h
143

I mean by the input is a literal G_CONSTANT or copy of G_CONSTANT

I see what you mean now. Is there any code size impact with optimizations (-Os or -O3)? I expect there to be none but just want to check.

"size" is completely unchanged on -Os and -O3.
Very minor improvements on -Os and -O3 "size..text" on ~half of the benchmarks, no difference on the others. Biggest difference on lencod -O3 682928.00->682860.00.

llvm/include/llvm/CodeGen/GlobalISel/LegalizationArtifactCombiner.h
143

I tried it out. Only checking when lookThroughCopyInstrs(AndSrc) is a G_CONSTANT brings no benefit at all. A simple example of G_ZEXT(G_TRUNC(G_CONSTANT)) is already combined away by the other combines in the artifact combiner. The only difference between me directly checking for G_CONSTANT in this combine is that we end up with one less stray COPY.

tobias-stadler added inline comments.Aug 30 2023, 4:51 PM
llvm/include/llvm/CodeGen/GlobalISel/LegalizationArtifactCombiner.h
143

Sorry, I didn't phrase this clearly. I first ran the "constant-case-only" version against O0 CTMark, which didn't show any difference in code size and compile-time compared to the current main branch. I then ran a simple hand-coded G_ZEXT(G_TRUNC(G_CONSTANT)) example against both versions to make sure the additional check did the right thing, which confirmed that there shouldn't be much of a difference in the first place. I couldn't think of and haven't seen examples in the tests where a G_CONSTANT is truncated and then extended again in such a way that this combine would emit better code than the other combines (apart from removing some COPY instructions, which also happens in some of the test differences).

tobias-stadler edited the summary of this revision. (Show Details)

Rebase.

@arsenm @aemerson Hi, do you have any further comments on this or are you ok with this change?

aemerson accepted this revision.Sep 27 2023, 3:30 PM

@arsenm @aemerson Hi, do you have any further comments on this or are you ok with this change?

I think the benefits outweigh the costs in this case. I would just put a comment in the code explaining exactly why this combine is there, since someone may try to remove it in future if they don't notice any impact on final codegen.

This revision is now accepted and ready to land.Sep 27 2023, 3:30 PM

Add suggested comment + Rebase.

Fix broken test.

tobias-stadler reopened this revision.Nov 1 2023, 12:00 PM

This was reverted due to an issue fixed by https://github.com/llvm/llvm-project/pull/68840 and should now be ready to reland.

This revision is now accepted and ready to land.Nov 1 2023, 12:00 PM

Rebase, fix tests, and rerun CI before relanding.