Page MenuHomePhabricator

[X86] Performing DAG pruning before selection of LEA instructions.
Needs RevisionPublic

Authored by jbhateja on Jul 29 2018, 10:46 AM.

Details

Summary

Pruning DAG patterns to facilitate expression folding involving SUB instructions
during addressing mode based complex pattern matching.

Fixes Bug 37939 - Missed pattern B+(-C)*A

Alive Link:
https://rise4fun.com/Alive/OcH

Diff Detail

Event Timeline

jbhateja created this revision.Jul 29 2018, 10:46 AM
craig.topper added inline comments.
lib/Target/X86/X86ISelDAGToDAG.cpp
712

Why are we matching the pattern in reverse? Normally we would look for a sub followed by a shift. Why are we starting with a shift and looking backwards?

lebedev.ri added inline comments.
lib/Target/X86/X86ISelDAGToDAG.cpp
703–707

I suspect this comment should be at the call site.
The comment *here* should ideally explain what the *function* does, more specifically.

712

Early return?

720

Early return?

729

It would be great of there was a comment before each of these two cases showing which pattern it transforms into what.

jbhateja added inline comments.Jul 29 2018, 11:10 AM
lib/Target/X86/X86ISelDAGToDAG.cpp
712

Following DAG transformation is being done in this revision

%1 = SHL X , 2
%2 = SUB Y , %1
       |
       V
%1 = SUB 0 , X
%2 = SHL %1 , 2  -> A 
%3 = ADD Y , %2  -> B

We are trying to remap SHL + SUB pattern so that it can be consumed easily, transformation is currently not kicking in for any SUB patterns but only for case where SUB is a USER of SHL.

craig.topper added inline comments.Jul 29 2018, 11:17 AM
lib/Target/X86/X86ISelDAGToDAG.cpp
712

But why can't you look for subtracts that have shl as an operand? That's how nearly every DAG pattern match is written. I'm trying to understand why this one is different. A call to use_begin is pretty unusual the DAG.

jbhateja updated this revision to Diff 157895.Jul 29 2018, 12:45 PM
  • [X86] Review comments incorporation for patch D49966
jbhateja marked 7 inline comments as done.Jul 29 2018, 12:46 PM
jbhateja added reviewers: craig.topper, lebedev.ri.
jbhateja updated this revision to Diff 157896.Jul 29 2018, 12:56 PM
  • [X86] Correcting a comment in patch D49966
lebedev.ri added inline comments.Jul 29 2018, 1:53 PM
lib/Target/X86/X86ISelDAGToDAG.cpp
705–707

These can be moved to after the lambda.

708

But this doesn't check that it's *argument* is ValidConstantShift.
It checks that the getOperand(1).getNode() of the argument is ValidConstantShift.

How about you sink the Opr0.getOpcode() == ISD::SHL check into the lambda, and rename the lambda somehow?

710–711

Is this sufficient to limit this to scalars only?
Also, should this be limited to only some specific EVT's?

716

Are there other patterns with different root nodes possible? If not i'd early-return.

717–718

General naming comment: it seems, these would normally be named N0 and N1, because they are 0'th and 1'th operands of N.

744

Any thoughts on whether the shl should be one-use?

jbhateja updated this revision to Diff 157910.Jul 29 2018, 8:43 PM
  • [X86] Review comment resolutions for patch D49966
jbhateja marked 5 inline comments as done.Jul 29 2018, 8:43 PM
craig.topper added inline comments.Jul 29 2018, 10:04 PM
lib/Target/X86/X86ISelDAGToDAG.cpp
747

You can't call ReplaceAllUses with the caller holding an iterator pointing to the node after N. The replacement can trigger a CSE update that can invalidate the iterator.

You eventually need to delete N and the subtract once you've replaced it. I'm not sure anything will remove it before it goes through instruction selection. But the iterator in the caller makes that hard. You can't delete N until the iterator is moved past it. But you can't delete the subtract instruction because the iterator might be pointing at it. So I think if this optimization makes any changes, you need to call RemoveDeadNodes after the loop. Since that would be the only time it's safe to remove both.

765

Your new code needs to do the --I, Replace, ++I, Delete that's done here. Otherwise I might get invalidated by the Delete.

This comment was removed by xbolva00.
RKSimon added inline comments.Jul 30 2018, 9:10 AM
test/CodeGen/X86/lea-opt.ll
312

Are these attributes necessary: dso_local, local_unnamed_addr + #0? I think you might just need nounwind ?

jbhateja updated this revision to Diff 158025.Jul 30 2018, 11:30 AM
  • [X86] Review comments incorporation for patch D49966
RKSimon added inline comments.Jul 31 2018, 2:53 AM
test/CodeGen/X86/lea-opt.ll
311

Please can you give the functions more useful (short, descriptive) names, commit the tests to trunk with current codegen and then update this patch to show the codegen diff

jbhateja updated this revision to Diff 158460.Jul 31 2018, 9:23 PM
  • [X86] Review comments resolution for patch D49966
  • Patch rebase.
lebedev.ri edited the summary of this revision. (Show Details)Jul 31 2018, 11:38 PM
lebedev.ri added inline comments.Jul 31 2018, 11:42 PM
lib/Target/X86/X86ISelDAGToDAG.cpp
725–728

This turned out hard to read. Can you replace it with

%o0 = shl i8 %x, C   ->  %n0 = sub i8 0, %y
%r = sub i8 %o0, %y      %n1 = shl i8 %x, C
                         %r = add i8 %n1, %n0
734–737
%o0 = shl i8 %x, C   ->  %n0 = sub i8 0, %x
%r = sub i8 %y, %o0      %n1 = shl i8 %n0, C
                         %r = add i8 %y, %n1
jbhateja edited the summary of this revision. (Show Details)Aug 1 2018, 3:25 AM

We do actually want to see that alive link though.

jbhateja updated this revision to Diff 158791.Aug 2 2018, 9:59 AM
jbhateja edited the summary of this revision. (Show Details)
  • [X86] Limiting optimization to i32 and i64 types
jbhateja added inline comments.Aug 2 2018, 10:15 AM
lib/Target/X86/X86ISelDAGToDAG.cpp
725–728

Optimization has been limited to i32 and i16, i8 optimization is correct but LEA extraction is being done only for 32 and 64 bits types. On way is to promote i8 types to i32 type which will trigger LEA selection and then truncate result to i8 , this is what is being done for i16.

/// Return true if the target has native support for the specified value type
/// and it is 'desirable' to use the type for the given node type. e.g. On x86
/// i16 is legal, but undesirable since i16 instruction encodings are longer and
/// some i16 instructions are slow.
bool X86TargetLowering::isTypeDesirableForOp(unsigned Opc, EVT VT) const {
  if (!isTypeLegal(VT))
    return false;

  // There are no vXi8 shifts.
  if (Opc == ISD::SHL && VT.isVector() && VT.getVectorElementType() == MVT::i8)
    return false;

**  if (VT != MVT::i16)
    return true;**

Ping reviewers, can put this under Size oriented optimization.

jbhateja edited the summary of this revision. (Show Details)Aug 8 2018, 9:54 AM

Ping reviewers, can put this under Size oriented optimization.

Please check Roman's comment too
https://bugs.llvm.org/show_bug.cgi?id=37939#c4

lebedev.ri added inline comments.Aug 16 2018, 3:00 AM
lib/Target/X86/X86ISelDAGToDAG.cpp
706–707

Can you add a comment *here in the code* as to why this restriction exists?

712

I'm not sure why you need .getNode()?
Elsewhere isa<ConstantSDNode>(SDValue) works fine.

713–716

Can you add a comment before the if as to why the limit is 3?

720–721
  1. This is the only use of Opcode variable. Inline it into the usage?
  2. Why not early return? It took me a moment to understand that we indeed don't have an else if for this if.
774

Hm, why do we want to only do this in -Os/-Oz?

xbolva00 added inline comments.Aug 16 2018, 3:23 AM
lib/Target/X86/X86ISelDAGToDAG.cpp
774

I agree with Roman, we should do it always.

lebedev.ri requested changes to this revision.Aug 30 2018, 10:54 AM
This revision now requires changes to proceed.Aug 30 2018, 10:54 AM
xbolva00 added a comment.EditedOct 18 2018, 10:18 AM

Also we should avoid cases like
mov eax, dword ptr [rsi]
lea eax, [rax + rdi]

and use add instruction like gcc:
add edi, DWORD PTR [rsi]

https://godbolt.org/z/KQAOV-

cc @craig.topper

Also we should avoid cases like
mov eax, dword ptr [rsi]
lea eax, [rax + rdi]

and use add instruction like gcc:
add edi, DWORD PTR [rsi]

https://godbolt.org/z/KQAOV-

cc @craig.topper

Please file a bug. That's a completely different issue.

Does this handle also e.g.

int foo(int i,int c) {

c *= 2*i+1;
return c;

}

? We miss lea here..

Does this handle also e.g.

int foo(int i,int c) {

c *= 2*i+1;
return c;

}

? We miss lea here..

Do you have a god bolt link? I’m seeing an lea for that except on Intel CPUs that have a slow 3 src lea instruction. On those CPUs we issue an LEA, ADD, IMUL

That behavior is intentional. The X86FixupLEAs.cpp pass changed from the gcc/icc code to LEA+ADD because "lea eax, [rdi+1+rdi]" is a 3 cycle instruction with 1 cycle reciprocal throughput on all Intel Core CPUs from Sandy Bridge to present. "lea eax, [rdi+rdi]" is 1 cycle latency with 0.5 reciprocal throughput. And "add eax, 1" is 1 cycle with a reciprocal throughput of 0.33 or 0.25. So the 2 instructions should be better performing than the single LEA. Though it is bad that -Os doesn't disable this optimization.