This is an archive of the discontinued LLVM Phabricator instance.

[Peephole-Opt][X86] Enhance peephole opt to see through SUBREG_TO_REG (following AND) and eliminates redundant TEST instruction.
ClosedPublic

Authored by mingmingl on Apr 20 2022, 12:38 PM.

Details

Summary

The goal of this patch is to optimize away the test instruction in the following pattern

%reg = and ... EFLAGS set
...
EFLAGS not changed.
%extended_reg = subreg_to_reg 0, %reg, %subreg.sub_index. EFLAGS not changed
testrr %extended_reg, %extended_reg, implicit-def $eflags.
This could be optimized away.

Before this patch, the optimization wasn't carried out, since extended_reg is defined by SUBREG_TO_REG, which is not handled by isDefConvertible in X86InstrInfo.cpp.

The implementation optimize away the test if the bits used by TEST readers are already set by AND.

Diff Detail

Event Timeline

mingmingl created this revision.Apr 20 2022, 12:38 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 20 2022, 12:38 PM
mingmingl requested review of this revision.Apr 20 2022, 12:38 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 20 2022, 12:38 PM

I suspect there is something else going on.

The following case works fine without the patch.

define dso_local noundef i64 @_Z3fooPlll(ptr nocapture noundef %0, i64 noundef %1, i64 noundef %2) local_unnamed_addr #0 {

%4 = load i64, ptr %0, align 8
%5 = and i64 %4, 3
%6 = icmp eq i64 %5, 0
%7 = select i1 %6, i64 %1, i64 %2
store i64 %7, ptr %0, align 8
ret i64 %5

}

However a simple change below to replace the operand %2 in the select with %5 (output of and operation), then it stops working.

define dso_local noundef i64 @_Z3fooPlll(ptr nocapture noundef %0, i64 noundef %1, i64 noundef %2) local_unnamed_addr #0 {

%4 = load i64, ptr %0, align 8
%5 = and i64 %4, 3
%6 = icmp eq i64 %5, 0
%7 = select i1 %6, i64 %1, i64 %5
store i64 %7, ptr %0, align 8
ret i64 %5

}

Can you investigate the difference? The behavior difference on ARM is also worth comparing with.

craig.topper added inline comments.
llvm/test/CodeGen/X86/peephole-test-after-add.mir
30

Are the other basic blocks important?

108

Please use update_mir_test_checks.py

craig.topper added inline comments.Apr 20 2022, 4:47 PM
llvm/lib/Target/X86/X86InstrInfo.cpp
986

SUBREG_TO_REG is only used for i32->i64 conversions. So the only TEST it could ever be is TEST64rr.

997

This is making a copy of the a MachineOperand. Can you use a const reference?

999

It's not necessary.

1010

This is making a copy of the MachineOperand

1015

Drop curly braces

1020

Variable names should be capitalized

1058

Another copy

4317

AND/OR/XOR set the OF flag to 0.

4318

If you look through SUBREG_TO_REG then the width changed. The TEST64rr would have calculated the sign flag from bit 63(which is 0 due to the SUBREG_TO_REG), but the AND32 calculated it from bit 31 which is unknown.

I suspect there is something else going on.

The following case works fine without the patch.

define dso_local noundef i64 @_Z3fooPlll(ptr nocapture noundef %0, i64 noundef %1, i64 noundef %2) local_unnamed_addr #0 {

%4 = load i64, ptr %0, align 8
%5 = and i64 %4, 3
%6 = icmp eq i64 %5, 0
%7 = select i1 %6, i64 %1, i64 %2
store i64 %7, ptr %0, align 8
ret i64 %5

}

However a simple change below to replace the operand %2 in the select with %5 (output of and operation), then it stops working.

define dso_local noundef i64 @_Z3fooPlll(ptr nocapture noundef %0, i64 noundef %1, i64 noundef %2) local_unnamed_addr #0 {

%4 = load i64, ptr %0, align 8
%5 = and i64 %4, 3
%6 = icmp eq i64 %5, 0
%7 = select i1 %6, i64 %1, i64 %5
store i64 %7, ptr %0, align 8
ret i64 %5

}

Can you investigate the difference? The behavior difference on ARM is also worth comparing with.

There's a peephole in X86DAGToDAGISel::Select and PostProcessISelDAG that both fail if the AND has users other than the TEST.

davidxl edited reviewers, added: craig.topper; removed: davidxl.Apr 20 2022, 9:52 PM
davidxl added a subscriber: davidxl.

I suspect there is something else going on.

The following case works fine without the patch.

define dso_local noundef i64 @_Z3fooPlll(ptr nocapture noundef %0, i64 noundef %1, i64 noundef %2) local_unnamed_addr #0 {

%4 = load i64, ptr %0, align 8
%5 = and i64 %4, 3
%6 = icmp eq i64 %5, 0
%7 = select i1 %6, i64 %1, i64 %2
store i64 %7, ptr %0, align 8
ret i64 %5

}

However a simple change below to replace the operand %2 in the select with %5 (output of and operation), then it stops working.

define dso_local noundef i64 @_Z3fooPlll(ptr nocapture noundef %0, i64 noundef %1, i64 noundef %2) local_unnamed_addr #0 {

%4 = load i64, ptr %0, align 8
%5 = and i64 %4, 3
%6 = icmp eq i64 %5, 0
%7 = select i1 %6, i64 %1, i64 %5
store i64 %7, ptr %0, align 8
ret i64 %5

}

Can you investigate the difference? The behavior difference on ARM is also worth comparing with.

There's a peephole in X86DAGToDAGISel::Select and PostProcessISelDAG that both fail if the AND has users other than the TEST.

thanks. That explains it.

For this case, the subreg_to_reg is indeed the reason blocking the peephole-opt from cleaning it up (as can be seen in the case if the AND operand is changed to 0x1ffffffff where subreg_to_reg is not generated)

llvm/lib/Target/X86/X86InstrInfo.cpp
1043

Is there need to handle OR and XOR here? It seems we won't see code patterns with OR followed by subreg_to_reg.

mingmingl updated this revision to Diff 424119.Apr 21 2022, 1:03 AM
mingmingl marked 12 inline comments as done.

Resolve the comments.

mingmingl added inline comments.Apr 21 2022, 1:04 AM
llvm/lib/Target/X86/X86InstrInfo.cpp
986

It looks like when SUBREG_TO_REG is used in SSE, the result type is not 64 bit [1]. So revised this helper function to

  1. only handle the case when CmpInstr is TEST64rr and SUBREG_TO_REG is using sub_32bit as subregister index
  2. returns false early for the rest of cases.

[1] https://github.com/llvm/llvm-project/blob/334522ca58aa6d6703405e479de8f5bf310ddbe2/llvm/lib/Target/X86/X86InstrSSE.td#L289

1043

(After reading this, it occurred to me existing definitions in td files in llvm/lib/Target/X86 are useful to find out how operands of SUBREG_TO-REG are generated).

Revised to handle AND only, considering

  1. SUBREG_TO_REG almost always use 0 as immediate number.
  2. it's hard to know the SF bit for OR, or XOR given 0 is the immediate number.

p.s. It seems [1] is where pattern (add64 -> and32 + subreg_to_reg) is given a higher priority, and I assume removing this from td and looking at the result could verify.

[1] https://github.com/llvm/llvm-project/blob/334522ca58aa6d6703405e479de8f5bf310ddbe2/llvm/lib/Target/X86/X86InstrCompiler.td#L1619

4317

ah.. thanks for the catch! corrected it.

4318

thanks for the catch!

Fix it by adding helper function isSFZero, and only optimize away the TEST64rr when SF is known to be zero.

llvm/test/CodeGen/X86/peephole-test-after-add.mir
30

Merge two IR functions into one file and by using the example from David.

I suspect there is something else going on.

The following case works fine without the patch.

define dso_local noundef i64 @_Z3fooPlll(ptr nocapture noundef %0, i64 noundef %1, i64 noundef %2) local_unnamed_addr #0 {

%4 = load i64, ptr %0, align 8
%5 = and i64 %4, 3
%6 = icmp eq i64 %5, 0
%7 = select i1 %6, i64 %1, i64 %2
store i64 %7, ptr %0, align 8
ret i64 %5

}

However a simple change below to replace the operand %2 in the select with %5 (output of and operation), then it stops working.

define dso_local noundef i64 @_Z3fooPlll(ptr nocapture noundef %0, i64 noundef %1, i64 noundef %2) local_unnamed_addr #0 {

%4 = load i64, ptr %0, align 8
%5 = and i64 %4, 3
%6 = icmp eq i64 %5, 0
%7 = select i1 %6, i64 %1, i64 %5
store i64 %7, ptr %0, align 8
ret i64 %5

}

Can you investigate the difference? The behavior difference on ARM is also worth comparing with.

There's a peephole in X86DAGToDAGISel::Select and PostProcessISelDAG that both fail if the AND has users other than the TEST.

thanks. That explains it.

For this case, the subreg_to_reg is indeed the reason blocking the peephole-opt from cleaning it up (as can be seen in the case if the AND operand is changed to 0x1ffffffff where subreg_to_reg is not generated)

Besides peephole-opt in X86ISelDAGToDAG that looks for (X86cmp (and $op, $imm), 0)[1], another significance is in legalization

Using llc -O3 -debug -debug-only=isel,x86-isel,selectiondag,legalizedag,peephole-opt,x86-instr-info <file.ll> gives the following log respectively

To elaborate,

Tracking the implementation, X86TargetLowering::emitFlagsForSetcc has different paths

  • when AND has one use (around line 24366), no cmp is emit
  • for the bad case, will call EmitCmp around line 24426

(X86ISelLowering.cpp is too large to display in Github, or generate a permanent link, so add the line number)

[1] https://github.com/llvm/llvm-project/blob/360d44e86defea94fb5608765fbdbfdb2a36f4c6/llvm/lib/Target/X86/X86ISelDAGToDAG.cpp#L5608

mingmingl updated this revision to Diff 424124.Apr 21 2022, 1:23 AM

resolve a build failure due to conflicting function name (same return type, different argument list)

craig.topper added inline comments.Apr 25 2022, 9:16 AM
llvm/lib/Target/X86/X86InstrInfo.cpp
980

iz->is

mingmingl updated this revision to Diff 424945.Apr 25 2022, 9:43 AM
mingmingl marked an inline comment as done.
mingmingl edited the summary of this revision. (Show Details)

Resolve the comment.

resolve comments

Remove new lines added by editor.

Mix two local branches.

Corrected it.

a gentle ping!~

craig.topper added inline comments.Apr 27 2022, 7:02 PM
llvm/lib/Target/X86/X86InstrInfo.cpp
970

I don't know if this is necessary. If the SUBREG_TO_REG was used, the sign bit of the full 64 bit value was provably zero during SelectionDAG. I hope any branch, setcc, cmov would have been folded away based on it.

Can't we just set NoSignFlag=true?

mingmingl marked an inline comment as done.Apr 28 2022, 12:38 PM

thanks for reviews!

llvm/lib/Target/X86/X86InstrInfo.cpp
970

isSFZero is used to tell the difference of following two scenarios, and (if I read correctly) NoSignFlag (as well as ClearsOverflowFlag) indicate whether the EFLAG bit is set (as opposed to the value of bit), so for AND operation, we want to set NoSignFlag to false.

AND32ri8 reg, 0x9  # signed bit is 1, so after sign extension, the SF bit could be 1 or 0 depending on reg value
SUBREG_TO_REG 0, extended_reg, reg , sub_32bit
TEST64rr extended_reg, extended_reg,  # SF bit is 0 since higher 32 bit of extended_reg is 0
AND32ri8 reg, 0x7  # signed bit is 0, so after sign extension, the SF bit is known to be 0
SUBREG_TO_REG 0, extended_reg, reg , sub_32bit
TEST64rr extended_reg, extended_reg,  # SF bit is 0 since higher 32 bit of extended_reg is 0

We want to optimize away test for 2) but keep test for 1).

Let me know if I miss anything!

craig.topper added inline comments.Apr 28 2022, 12:55 PM
llvm/lib/Target/X86/X86InstrInfo.cpp
970

If the instructions that use the flags from the TEST don't look at the SF flag then it doesn't matter. Setting NoSignFlag = true will disable the optimization if the SF flag is used. Your motivating test case doesn't use the sign flag so this should be sufficient.

mingmingl updated this revision to Diff 425900.Apr 28 2022, 2:07 PM
mingmingl marked an inline comment as done.

Set NoSignFlag to false (to poison SF bit for readers), and get rid of isSFZero helper function.

mingmingl marked an inline comment as done.Apr 28 2022, 2:08 PM
mingmingl added inline comments.
llvm/lib/Target/X86/X86InstrInfo.cpp
970

I see your point now. Updated. PTAL.

mingmingl updated this revision to Diff 425906.Apr 28 2022, 2:25 PM
mingmingl marked an inline comment as done.

Correct the auto-format a little bit

Before

// The reason to poison SF bit is that SF bit value could be different
// in the `AND` and `TEST` operation
//
// dest_reg = AND32ri8 src_reg, #imm  # signed bit is not known without
// peeking at #imm. extended_dest_reg = SUBREG_TO_REG 0, dest_reg, sub_32bit

After

// The reason to poison SF bit is that SF bit value could be different
// in the `AND` and `TEST` operation; signed bit is not known for `AND`
// without peeking at #imm, and is known to be 0 as a result of
// `TEST64rr`.
// dest_reg = AND32ri8 src_reg, #imm
// extended_dest_reg = SUBREG_TO_REG 0, dest_reg, sub_32bit
// TEST64rr extended_dest_reg, extended_dest_reg,

A gentle ping!~

skan added a comment.May 7 2022, 7:50 AM

As to the desciption, maybe it should not be called a heuristic optimization?

llvm/lib/Target/X86/X86InstrInfo.cpp
970

Remove inline?

988–993

Could we remove these assert? I think they're not useful.

1004

Useless assert

1026

And32->and32

1027–1031

The format of comments here seems weird?

1032

And32->AND32

1048–1050

"The optimizatin will be disasbled"
I think this is not covered by your test yet?

1054

Why don't we peek at imm here? It's possible to set NoSignFlag to false.

1056–1058

Could you uniform the instruction name in your comments? You use SUBREG_TO_REG here while subreg_to_reg somewhere.

skan added inline comments.May 7 2022, 7:51 AM
llvm/test/CodeGen/X86/peephole-test-after-add.mir
2

Minor suggestion -o - %s -> < %s

skan added inline comments.May 7 2022, 7:59 AM
llvm/lib/Target/X86/X86InstrInfo.cpp
1054

Never mind. I missed craig's comments

skan added inline comments.May 9 2022, 4:52 AM
llvm/test/CodeGen/X86/peephole-test-after-add.mir
2

Never mind. I didn't realize that "< %s" does not work for "--run-pass".

craig.topper added inline comments.May 9 2022, 9:01 AM
llvm/test/CodeGen/X86/peephole-test-after-add.mir
2

I think it doesn't work for .mir input because llc uses the extension of the file name is used to determine file type. If you use re-direction llc can't see the extension so just assumes .ll.

mingmingl updated this revision to Diff 428205.May 9 2022, 2:39 PM
mingmingl marked 13 inline comments as done.
mingmingl edited the summary of this revision. (Show Details)

Address the style comments, and add a regression test to gate the behavior that TEST is retained if SF bit is read.

llvm/lib/Target/X86/X86InstrInfo.cpp
970

Removed inline.

For my understanding, is it because of lengthy function body and long argument list that inline hint doesn't make much sense here?

988–993

Could we remove these assert? I think they're not useful.

Removed the second assert(CmpValDefInstr.getNumOperands() == 4).

I wonder if the 1st assert (assert(MRI->getVRegDef(CmpInstr.getOperand(0).getReg()) == &CmpValDefInstr)) helps in the sense that

  1. the information asserted (def-use between this two instructions) are not explicitly mentioned in any comment
  2. this information is important for the correctness

Nevertheless kept the 1st assert and added comment. Let me know what I miss.

1004

Since the function body getImm() or getReg() has asserts, it makes sense to just rely on that to give enough debugging information.

So removed three asserts (on MO3, MO2, MO1)

1027–1031

thanks for the catch! Fixed it.

1048–1050

it makes sense to add a test.

Added test_not_erased_when_sf_used in llvm/test/CodeGen/X86/peephole-test-after-add.mir, to cover the behavior that TEST64rr is retained if SF bit is used.

1054

Yeah I was initially thinking of peeking at imm, and later (upon Craig's comment) realized there is another way to do this.

1056–1058

Unified the instruction name in this patch.

p.s. I realized this sequence of instructions should be better placed around line 1024, so merge it with the sequence of instructions right after if (X86::isAND(VregDefInstr->getOpcode())) {

llvm/test/CodeGen/X86/peephole-test-after-add.mir
2

thanks for the discussion! I'll keep it the current way and reply first, before getting a better idea of the difference (between -o - %s and %s in this command) myself.

As to the desciption, maybe it should not be called a heuristic optimization?

modified summary with diff

  1. removed heurstic
  2. optimize test away if EFLAG bits used as a result of test are already set by AND.
mingmingl updated this revision to Diff 428251.May 9 2022, 5:59 PM

[Only two minor style changes]
Rename the .ll file in llvm/test/CodeGen/X86/peephole-test-after-add.mir, and remove braces on simple single statement bodies of if-else statements.

skan added inline comments.May 9 2022, 7:48 PM
llvm/lib/Target/X86/X86InstrInfo.cpp
970

Yes. In fact, I think it's useless to add inline to functions w/ internal linkage for non-peformance critical code. Compiler can decide whether it should be inlined.

995

What does the "Caller guranteed" mean here? Does it say that TEST64rr is the user of SUBREG_TO_REG?

1031

const MachineInstr &

mingmingl updated this revision to Diff 428279.May 9 2022, 8:40 PM
mingmingl marked an inline comment as done.

Replied the comments.

Besides, in findRedundantFlagInstr, change from

  const MachineOperand& MO1 = CmpValDefInstr.getOperand(1);
// As seen in X86 td files, MO1 is typically
  // 0.
  if (.MO1.getImm() != 0)
    return false;

to

// As seen in X86 td files, CmpValDefInstr.getOperand(1).getImm() is typically
  // 0.
  if (CmpValDefInstr.getOperand(1).getImm() != 0)
    return false;

Same for MO2 and MO3.

skan accepted this revision.May 9 2022, 8:49 PM

LGTM!

This revision is now accepted and ready to land.May 9 2022, 8:49 PM

(I forgot to click "reply" to send out the inline comments... but going to do it now)

Thanks for all reviews and inputs!

Going to hold it for a while since other reviewers might have comments as well.

llvm/lib/Target/X86/X86InstrInfo.cpp
970

Compiler can decide whether it should be inlined.

This is true.

995

Does it say that TEST64rr is the user of SUBREG_TO_REG?

Yes this is correct. I think this assertion might prevent inadvertent misuse of findRedundantFlagInstr, yet I agree it's probably premature at this point.

Let me know if it's more idiomatic to just remove this assertion.

davidxl added inline comments.May 9 2022, 9:57 PM
llvm/test/CodeGen/X86/peephole-test-after-add.mir
24

The and instruction clears SF and upper bits of the value which means the test instruction later will also produce SF == 0. In other words, in this case, not only the test instruction, but also the CMOVErr can be optimized into a copy.

(this can be looked at in a follow up patch if it is the case).

craig.topper added inline comments.May 9 2022, 11:09 PM
llvm/test/CodeGen/X86/peephole-test-after-add.mir
24

Do we not figure that out in SelectionDAG?

mingmingl marked 2 inline comments as done.May 10 2022, 11:33 AM
mingmingl added inline comments.
llvm/test/CodeGen/X86/peephole-test-after-add.mir
24

The and instruction clears SF and upper bits of the value which means the test instruction later will also produce SF == 0. In other words, in this case, not only the test instruction, but also the CMOVErr can be optimized into a copy.

(this can be looked at in a follow up patch if it is the case).

Added a FIXME around line 1052 (inside findRedundantFlagInstr) to follow up on this.

Do we not figure that out in SelectionDAG?

Not sure if we could figure it out in SelectionDAG step. There is TargetInstrInfo::optimizeSelect virtual method for peephole-opt, and an implementation (ARMBaseInstrInfo::optimizeSelect) [1]. Will look into which pass to put this optimization later.

[1] https://github.com/llvm/llvm-project/blob/0353c2c996c5863463c356de97c9852f9330ed11/llvm/lib/Target/ARM/ARMBaseInstrInfo.cpp#L2321

mingmingl marked an inline comment as done.

Add FIXME for a follow-up optimization.

craig.topper added inline comments.May 10 2022, 11:44 AM
llvm/test/CodeGen/X86/peephole-test-after-add.mir
24

I thought DAGCombiner for ISD::SETCC should be able to figure out that an AND with 3 guarantees the sign bit is 0, but it appears it does not. InstCombine does optimize it though.

mingmingl marked an inline comment as done.May 10 2022, 12:07 PM
mingmingl added inline comments.
llvm/test/CodeGen/X86/peephole-test-after-add.mir
24

Got it. DAGCombiner sounds a right place to look at a sequence of instructions and optimize.

This revision was landed with ongoing or failed builds.May 10 2022, 12:39 PM
This revision was automatically updated to reflect the committed changes.
mingmingl marked an inline comment as done.
This revision is now accepted and ready to land.May 10 2022, 2:29 PM
mingmingl updated this revision to Diff 428500.May 10 2022, 2:31 PM

Reapply with a descriptive commit message.

mingmingl added a subscriber: reames.

Please put [X86] in the title.

mingmingl retitled this revision from [Peephole-Opt] For one kind of test-after-add pattern, eliminates test if it's correct to do so. to [Peephole-Opt][X86] Enhance peephole opt to see through SUBREG_TO_REG (following AND) and eliminates redundant TEST instruction..May 10 2022, 2:44 PM
This revision was landed with ongoing or failed builds.May 10 2022, 3:57 PM
This revision was automatically updated to reflect the committed changes.