This is an archive of the discontinued LLVM Phabricator instance.

Add a machine function pass to convert binop(phi(constants), v) to phi(binop)
Needs ReviewPublic

Authored by Carrot on Feb 15 2022, 10:55 PM.

Details

Summary

This is a Machine Function version of D116058, also a conservative version of D37467 without critical edge splitting.

This pass does the transformation

  bb.0:
    ...
    %3:gr32 = MOV32ri 7
    JCC_1 %bb.2, 5, implicit $eflags

  bb.1:
    %5:gr32 = MOV32ri 11

  bb.2:
    %0:gr32 = PHI %3:gr32, %bb.0, %5:gr32, %bb.1                         // At most one non-const operand
    %6:gr32 = ADD32rr %2:gr32(tied-def 0), %0:gr32, implicit-def dead $eflags
    ...

=>

  bb.0:
    ...
    %7:gr32 = ADD32ri8 %2:gr32(tied-def 0), 7, implicit-def dead $eflags
    JCC_1 %bb.2, 5, implicit $eflags

  bb.1:
    %8:gr32 = ADD32ri8 %2:gr32(tied-def 0), 11, implicit-def dead $eflags

  bb.2:
    %6:gr32 = PHI %7:gr32, %bb.0, %8:gr32, %bb.1
    ...

So we can reduce both static and dynamic instructions. If there is one non-const operand, we can still do the transformation without code size improvement, but still have performance improvement.

Test case dup-phi-users1.ll is from D116058 and dup-phi-users2.ll is from D37467.

Diff Detail

Event Timeline

Carrot created this revision.Feb 15 2022, 10:55 PM
Carrot requested review of this revision.Feb 15 2022, 10:55 PM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 15 2022, 10:55 PM

I suppose this is the right place for this transformation, rather than in middle-end.
Is this the same transform as D117110? Some enhancements are missing there, but do we need both?

My first question here would be whether it is preferable to do this as a machine-pass, or as a backend IR pass (which runs post-LSR at least, possibly part of CGP). Do you have any thoughts about what the trade-offs between those two options would be?

I suppose this is the right place for this transformation, rather than in middle-end.
Is this the same transform as D117110? Some enhancements are missing there, but do we need both?

D117110 requires two phis, it works on code binop(phi1, phi2).
This patch covers more cases, it works on instructions binop(phi, v).

I think D117110 is still required for reasons:

  • Earlier transformation can also benefit other optimizations. For example in test case fold-phi-of-binary-op.ll of D115914, if the or(phi, phi) can be optimized in the mid end, tailcall can be generated by SelectionDAG. It we do the transformation in back end, we will miss the tailcall chance.
  • D117110 works for all targets. This patch needs a target implementation of canFoldImmediate.

My first question here would be whether it is preferable to do this as a machine-pass, or as a backend IR pass (which runs post-LSR at least, possibly part of CGP). Do you have any thoughts about what the trade-offs between those two options would be?

@lebedev.ri mentioned there is no immediate materialization instructions at IR. These instructions are explicitly generated in Machine IR. For CGP there is comment "It should eventually be removed." in the file, so my understanding is we should not add too many things in it if possible. I didn't thought too much about other places in backend LLVM IR because I thought most of them are lowering and expanding work that should be done before SelectionDAG.

Carrot updated this revision to Diff 411289.Feb 24 2022, 7:09 PM

Reformat.

Herald added a project: Restricted Project. · View Herald TranscriptMar 7 2022, 8:32 PM

Sorry, i don't have useful feedback here.

Any comments from other reviewers?

Adding more potential machine pass reviewers.
Was the question from @nikic (about making this an IR pass) answered?

Adding more potential machine pass reviewers.
Was the question from @nikic (about making this an IR pass) answered?

It was answered in https://reviews.llvm.org/D119916#3327325.

lebedev.ri resigned from this revision.Jan 12 2023, 5:30 PM

This review may be stuck/dead, consider abandoning if no longer relevant.
Removing myself as reviewer in attempt to clean dashboard.

An example this patch can improve: https://godbolt.org/z/qeq7TdrjE

doing this at the MIR level seems slightly better conceptually, but I think an MIR pass is more complicated than recognizing the pattern at the IR level (aka you might get a review faster with an IR pass)

Hi @Carrot,

I can help get the patch reviewed if it is still relevant.
Just one high-level remark: the changes in the peephole optimizer should be a separate patch, right?

Cheers,
-Quentin

Hi Quentin, thank you for your help! This optimization is still relevant to us.

This optimization calls target function FoldImmediate, it is also called by peephole optimization. X86 didn't implement it previously. After adding this function, it impacts peephole at the same time, some of them are regressions, so I changed peephole to avoid them.

Can the peephole part stand by itself?

Can the peephole part stand by itself?

It can, but it's difficult to write a test case for it. At least I need to add X86's FoldImmediate with it together.

Partial review

llvm/include/llvm/CodeGen/TargetInstrInfo.h
1670

Could you add a comment explaining the relationship between UseReg and the other arguments?

I'm guessing UseReg is what defines ImmVal.
In other words, putting all the arguments together, we're looking at a pattern like this:

UseReg = loadImm ImmVal
... = UseMI(..., UseReg)

While at it, would be worth adding a \pre UseReg is used in UseMI.

1670

Side remarks.
I wonder if we should acknowledge here that TTI also has isLegalXXXImmediate and how this is different.

The only reason I mention it is because when I first look at this patch the new API looked familiar and I was wondering if we were re-implementing something that already exists.

llvm/lib/CodeGen/DupConstPhiUsers.cpp
11

You mentioned in the description of the PR that this pass does both code size and performance improvements.
May be worth saying that here again.

I haven't read the rest of the patch yet, but it would be good to describe the cost model as well somewhere.
E.g., what would happen with something like:

a = ldimm cst1
if ()
  b = ldimm cst2
c = phi(a, b)
= add ..., c

What prevents us for having 2 adds on the path a -> b -> c, which I believe would be a perf regression.

127

Function names should start with lower case.

Carrot updated this revision to Diff 544192.Jul 25 2023, 8:32 PM
Carrot marked 3 inline comments as done.
Carrot added inline comments.Jul 25 2023, 8:33 PM
llvm/lib/CodeGen/DupConstPhiUsers.cpp
11

The main cost model checking and description is in the for loop of function FindCandidateInfo.

This patch will not transform this case because critical edge (a -> c) is not allowed, it is also explained in FindCandidateInfo.

qcolombet added inline comments.Jul 26 2023, 1:54 AM
llvm/lib/CodeGen/DupConstPhiUsers.cpp
11

Okay I'll take a closer look, but here a -> c is not a critical edge.

qcolombet added inline comments.Jul 26 2023, 3:59 AM
llvm/lib/CodeGen/DupConstPhiUsers.cpp
11

Scratch that 🙃

llvm/test/CodeGen/X86/lrshrink.ll