This is an archive of the discontinued LLVM Phabricator instance.

[AArch64][GlobalISel] Fold 64-bit cmps with 64-bit adds
ClosedPublic

Authored by paquette on Oct 4 2021, 12:09 PM.

Details

Summary

G_ICMP is selected to an arithmetic overflow op (ADDS/SUBS/etc) with a dead destination + a CSINC instruction.

We have a fold which allows us to combine 32-bit adds with G_ICMP.

The problem with G_ICMP is that we model it as always having a 32-bit destination even though it can be a 64-bit operation. So, we were missing some opportunities for 64-bit folds.

This patch teaches the fold to recognize 64-bit G_ICMPs + refactors some of the code surrounding CSINC accordingly.

(Later down the line, I think we should probably change the way we handle G_ICMP in general.)

Diff Detail

Event Timeline

paquette created this revision.Oct 4 2021, 12:09 PM
paquette requested review of this revision.Oct 4 2021, 12:09 PM
Herald added a project: Restricted Project. · View Herald TranscriptOct 4 2021, 12:09 PM

What happens if we legalize ICMPs by promoting the dest to be at least as wide as the source? I'm finding it hard to follow the recent changes to work around this.

I ended up hitting some code size regressions, but I can give it another shot. Would prefer that approach.

So I've been trying to use the "widen G_ICMP to 64 bits" approach instead. Like I mentioned before, there were some code size regressions on CTMark with this approach. I've managed to get them down to just these 3 at -Os:

test-suite :: CTMark/SPASS/SPASS.test          412964       413024       0.0%
test-suite...TMark/7zip/7zip-benchmark.test    574664       574692       0.0%
test-suite...-typeset/consumer-typeset.test    418756       418764       0.0%

Overall, the code size change for CTMark -Os looks like this:

Program                                        outputPUlsd6 outputN4L9tx diff 
 test-suite :: CTMark/SPASS/SPASS.test          412964       413024       0.0%
 test-suite...TMark/7zip/7zip-benchmark.test    574664       574692       0.0%
 test-suite...-typeset/consumer-typeset.test    418756       418764       0.0%
 test-suite :: CTMark/Bullet/bullet.test        475356       475360       0.0%
 test-suite :: CTMark/kimwitu++/kc.test         435104       435104       0.0%
 test-suite...ark/tramp3d-v4/tramp3d-v4.test    368124       368124       0.0%
 test-suite :: CTMark/lencod/lencod.test        429860       429856      -0.0%
 test-suite...:: CTMark/ClamAV/clamscan.test    382552       382548      -0.0%
 test-suite...Mark/mafft/pairlocalalign.test    249104       249100      -0.0%
 test-suite...:: CTMark/sqlite3/sqlite3.test    286196       286176      -0.0%

A few things I've observed here:

  1. Originally, I assumed the G_ANYEXT->G_ZEXT combine would only would be useful for legalization-related stuff. Turns out we end up with situations like this too:
%s32_cmp = G_ICMP ...
%s64_ext = G_ANYEXT %s32_cmp
%s64_and = G_AND %s64_ext

The combine helps with this situation, which is unrelated to legalization. Probably worth keeping.

  1. It seems like we need to do a similar transformation to the one in this patch regardless, because we also want to handle the following situation:
%s64_cmp = G_ICMP ...
%s32_cmp = G_TRUNC %s64_cmp
%s32_add = G_ADD %s32_cmp ...

In this situation we want a 64-bit cmp instruction, but a 32-bit csinc instruction.

This is fine because the result of the cmp is dead; csinc just looks at the flags and increments.

So, similar to the transformation in this patch, we still have to walk through something. In this case, the MatchCmp function above ends up looking something like this:

auto MatchCmp = [&](Register Reg) -> MachineInstr * {
  if (!MRI.hasOneNonDBGUse(Reg))
    return nullptr;
  MachineInstr *Def = getDefIgnoringCopies(Reg, MRI);
  unsigned Opc = Def->getOpcode();
  if (Opc == TargetOpcode::G_TRUNC)
    return getOpcodeDef(TargetOpcode::G_ICMP, Def->getOperand(1).getReg(),
                        MRI);
  if (Opc != TargetOpcode::G_ICMP)
    return nullptr;
  return Def;
};
  1. The code size regressions don't really make much sense to me. I'm seeing things like
9	        b.lt    0x10006b3c8
30	        add     x16, x0, x15, lsl #5
31	        ldr     x17, [x16, #8]
32	        cmp     x17, #0
33	        cset    x16, eq
34	        cbnz    x17, 0x10006b39c

appearing rather than

b.ge    0x10006b39c

in the disassembled output. I'm assuming that this is a later pass' work; I can't find anywhere relating to branches in the selector that assumes we have a 32-bit compare.

Anyway, personally, I'm starting to think this patch is probably the simplest approach for this problem and maybe not as hacky as I initially thought. It turns out the apparently-cleaner approach involves a lot of the same work, but involves changing a lot more of the selector and potentially breaking assumptions later down the line.

aemerson accepted this revision.Oct 5 2021, 11:29 PM

LGTM. Thanks for at least investigating it.

This revision is now accepted and ready to land.Oct 5 2021, 11:29 PM
Matt added a subscriber: Matt.Oct 6 2021, 4:48 AM
This revision was automatically updated to reflect the committed changes.