This is an archive of the discontinued LLVM Phabricator instance.

[X86] Canonicalize SGT/UGT compares with constants to use SGE/UGE to reduce the number of EFLAGs reads. (PR48760)
ClosedPublic

Authored by RKSimon on Apr 22 2021, 8:21 AM.

Details

Summary

This demonstrates a possible fix for PR48760 - for compares with constants, canonicalize the SGT/UGT condition code to use SGE/UGE which should reduce the number of EFLAGs bits we need to read.

As discussed on PR48760, some EFLAG bits are treated independently which can require additional uops to merge together for certain CMOVcc/SETcc/etc. modes.

I've limited this to cases where the constant increment doesn't result in a larger encoding or additional i64 constant materializations.

Diff Detail

Event Timeline

RKSimon created this revision.Apr 22 2021, 8:21 AM
RKSimon requested review of this revision.Apr 22 2021, 8:21 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 22 2021, 8:21 AM
RKSimon updated this revision to Diff 339996.Apr 23 2021, 6:04 AM
RKSimon edited the summary of this revision. (Show Details)

Address the atomic flags regressions by implementing a similar CC tweak in combineSetCCAtomicArith - I'm happy to pull the atomic change out into another patch although not all paths would be tested until this patch had landed.

Some thoughts.

Some min/max-like patterns lose out because the comparison constant is now different from the select value - do we have any opinions on how critical this is?

This does seem not ideal. Can we even check that the cmp has a select with such an operand by now?

llvm/lib/Target/X86/X86ISelLowering.cpp
23462–23463

Why do we handle inverse case of decrementing in combineSetCCAtomicArith() but not here?

23465–23470

This comment reads as-if it's about the previous condition.
I'd suggest adding something like "note that we can't change GT to GE for tautological comparisons,
where incrementing the limit would cause an overflow" before if (!Op1Val.isNullValue() &&

llvm/test/CodeGen/X86/sdiv_fix_sat.ll
474–480

This is the clamp regression i guess?

631–632

same elesewhere in the file

Some thoughts.

Some min/max-like patterns lose out because the comparison constant is now different from the select value - do we have any opinions on how critical this is?

This does seem not ideal. Can we even check that the cmp has a select with such an operand by now?

Can we fix that in combineSelect where we canonicalize min/max patterns to use sign flag?

RKSimon planned changes to this revision.Jun 3 2021, 4:29 AM
RKSimon updated this revision to Diff 352134.Jun 15 2021, 7:38 AM

Add SETLT/SETULT handling

RKSimon retitled this revision from [X86] Canonicalize SGT/UGT compares with constants to use SGE/UGE to reduce the number of EFLAGs reads. (PR48760) to [X86] Canonicalize LT/GT compares with constants to use LE/GE to reduce the number of EFLAGs reads. (PR48760).Jun 15 2021, 12:48 PM
RKSimon edited the summary of this revision. (Show Details)
lebedev.ri edited the summary of this revision. (Show Details)Jun 15 2021, 1:10 PM

Add SETLT/SETULT handling

Thanks.
I think this looks ok to me.
What about the regressions?

llvm/lib/Target/X86/X86ISelLowering.cpp
42074

APInt::isMinValue()?

pengfei added inline comments.Jun 15 2021, 8:13 PM
llvm/test/CodeGen/X86/lack-of-signed-truncation-check.ll
601–602

Is it possible that we happen to exceed IMM16?

RKSimon updated this revision to Diff 352693.Jun 17 2021, 6:07 AM
RKSimon edited the summary of this revision. (Show Details)

Use APInt::isMinValue() for consistency

Added fold for select(seteq(X,Y),A,select(setgt(X,Y), A, B)) -> select(setge(X,Y), A, B) folds for SETCC after type legalization - this could be added separately if required.

This just leaves a few regressions where we create 2 64-bit constants instead of 1

RKSimon marked an inline comment as done.Jun 17 2021, 6:08 AM
RKSimon added inline comments.
llvm/test/CodeGen/X86/lack-of-signed-truncation-check.ll
601–602

No - we check for min/max values before decrementing/incrementing to ensure we don't wrap the value.

RKSimon added inline comments.Jun 17 2021, 6:26 AM
llvm/test/CodeGen/X86/ctpop-combine.ll
45 ↗(On Diff #352693)

Regression - we go from a CF read to CF+ZF reads - it looks in most cases the LT->LE conversions are going to cause this - I think I'm going to just handle the GT->GE cases initially.

RKSimon planned changes to this revision.Jun 17 2021, 7:08 AM

The equivalent less-than fold to reduce EFLAGS dependency is SLE/ULE -> SLT/ULT - which we don't typically need to do because TargetLowering.SimplifySetCC already canonicalizes to that - sorry I was on automatic when was asked to support SLT/ULT (and it'd been far too long since I looked at the patch...)

pengfei added inline comments.Jun 17 2021, 7:09 AM
llvm/test/CodeGen/X86/lack-of-signed-truncation-check.ll
601–602

I just saw you check int8 and int32. But I cannot create a case for the int16 boundary value due to its promoted to int32.

RKSimon added inline comments.Jun 17 2021, 7:27 AM
llvm/test/CodeGen/X86/lack-of-signed-truncation-check.ll
601–602

Yes i8 and i32 immediates are special cases because the width of the immediate might not match the width of the operand type - but if we're using an i16 immediate on i32/i64 it will always be extended to i32 immediate.

RKSimon updated this revision to Diff 352738.Jun 17 2021, 8:21 AM
RKSimon retitled this revision from [X86] Canonicalize LT/GT compares with constants to use LE/GE to reduce the number of EFLAGs reads. (PR48760) to [X86] Canonicalize SGT/UGT compares with constants to use SGE/UGE to reduce the number of EFLAGs reads. (PR48760).
RKSimon edited the summary of this revision. (Show Details)

Remove the unnecessary SLT/ULT handling + avoid the remaining i64 immediate comparison regressions.

RKSimon added inline comments.Jun 17 2021, 8:23 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
41891

This can be pushed first which should give us most of the sdiv_fix_sat.ll + udiv_fix_sat.ll codegen improvements

craig.topper added inline comments.Jun 17 2021, 8:34 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
23472

Should this be inside emitFlagsForSetcc? That way setcc+brcond works. LowerBRCOND doesn't call LowerSETCC.

RKSimon added inline comments.Jun 17 2021, 8:41 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
23472

Yes, I was hoping to move it inside TranslateX86CC later (see TODO) - as it felt like this caused far too many diffs as an initial patch - I'll can try again now that the SLT/ULT handling has been dropped.

RKSimon marked an inline comment as not done.Jun 17 2021, 9:37 AM
RKSimon added inline comments.
llvm/lib/Target/X86/X86ISelLowering.cpp
23472

Yes there's a huge amount of test case churn that I'd prefer to do as a follow up patch for review - and just stick with the TODO while we handle the select/cmov cases first.

craig.topper added inline comments.Jun 17 2021, 9:40 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
23472

Ok.

RKSimon marked an inline comment as not done.Jun 21 2021, 4:43 AM

Does anyone have any more comments?

What about the case where the active bit width decreases, e.g. -129 is i9, incremented it is -128, which is i8.

llvm/test/CodeGen/X86/sdiv_fix_sat.ll
474–480

And it's still here. Do we believe that the addition materialization cost is hidden
by the cmp improvement?
Can't we not do this if the SETCC is used by a SELECT with one hand matching the unchanged immediate?

RKSimon updated this revision to Diff 354209.Jun 24 2021, 4:26 AM

rebase and permit reduction from i32 to i8 immediate encoding

lebedev.ri added inline comments.Jun 24 2021, 4:51 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
23476

Comment maybe needs updatind.

23486–23487

I've spent way too much time trying to understand this block,
which means this is too compilcated and may or may not be incorrect.
I would suggest something like

if(Op1ValPlusOne.isSignedIntN(32) && (!Op1Val.isSignedIntN(8) || Op1ValPlusOne.isSignedIntN(8))

(isSignedIntN(N+1) guarantees that isSignedIntN(N))

lebedev.ri added inline comments.Jun 24 2021, 4:58 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
23481

Actually, what about: -2147483649 is i33, incremented it is -2147483648, i32.

RKSimon updated this revision to Diff 355206.Jun 29 2021, 6:21 AM

Cleaned up the isSignedIntN logic

RKSimon added inline comments.Jun 29 2021, 6:25 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
23481

Sorry - missed this when I was getting back to speed - I'll add a test

RKSimon updated this revision to Diff 355221.Jun 29 2021, 6:51 AM

rebase with the sgt i33 -> sge i32 test

lebedev.ri accepted this revision.Jun 29 2021, 7:03 AM

LGTM, thank you, sorry for so many back and forth here.
Any other comments?

llvm/lib/Target/X86/X86ISelLowering.cpp
23476

Still needs updating

This revision is now accepted and ready to land.Jun 29 2021, 7:03 AM
spatel added inline comments.Jun 29 2021, 4:55 PM
llvm/lib/Target/X86/X86ISelLowering.cpp
23473

I didn't make the connection from SGT -> SGE to SETG -> SETGE to ZF/OF/SF or SETA -> SETAE to ZF/CF and then to an actual perf difference until re-reading the comments in PR48760 and opening the x86 manual...and I'm still not entirely clear about it. :) This deserves more explanation in the code comment.

Maybe something like:
The "GE" conditions map to less EFLAGS bits than their "GT" counterparts. Specifically, the GE conditions don't read the ZF. This may translate to less uops depending on uarch implementation.