Page MenuHomePhabricator

[PowerPC] Remove redundant load immediate instructions
ClosedPublic

Authored by Yi-Hong.Lyu on Jul 4 2019, 12:59 PM.

Details

Summary

Currently PowerPC backend emits code like this:

r3 = li 0
std r3, 264(r1)
r3 = li 0
std r3, 272(r1)

This patch fixes that and other cases where a register already contains a value that is loaded so we will get:

r3 = li 0
std r3, 264(r1)
std r3, 272(r1)

Diff Detail

Repository
rL LLVM

Event Timeline

Yi-Hong.Lyu created this revision.Jul 4 2019, 12:59 PM
steven.zhang added inline comments.
llvm/test/CodeGen/PowerPC/remove-redundant-load-imm.mir
5 ↗(On Diff #208077)

Could you show me the LLVM IR that produce these two redundant LI ? So that, we could figure out if they can be avoided instead of removing it in the peephole.

nemanjai added inline comments.Jul 5 2019, 11:17 AM
llvm/test/CodeGen/PowerPC/remove-redundant-load-imm.mir
5 ↗(On Diff #208077)

More or less any IR that has multiple PHI nodes that have a zero coming in from the same block and the register has to be spilled.
Perhaps we should add a test case such as that (with an inline asm call that clobbers registers thereby requiring spills).

Is this really a PowerPC-specific problem? Do we just want to have a general post-RA cleanup that checks for isMoveImmediate() and does this cleanup?

steven.zhang added inline comments.Jul 10 2019, 7:17 PM
llvm/test/CodeGen/PowerPC/remove-redundant-load-imm.mir
5 ↗(On Diff #208077)

Yeah, at least one case to indicate the scenario that produce this pattern,

Added LLVM IR testcases as requested.

Yi-Hong.Lyu marked an inline comment as done.Jul 17 2019, 7:58 AM
Yi-Hong.Lyu added a subscriber: t.p.northover.
Yi-Hong.Lyu added inline comments.
llvm/test/CodeGen/PowerPC/remove-redundant-load-imm.mir
5 ↗(On Diff #208077)

There are two test cases in the added remove-redundant-load-imm.ll. The first one only has redundancy on PowerPC, the second one has redundancy for both PowerPC and armv8 (not arm64):

$ cat test1.ll
target datalayout = "e-m:e-i64:64-n32:64"
target triple = "powerpc64le-unknown-linux-gnu"

define void @hoge(i1 %arg7) {
bb:
  br label %bb10

bb10:                                             ; preds = %bb
  call void @barney.88(i1 %arg7, i32* null)
  ret void
}

declare void @barney.88(i1, i32*)

$ $ $ORI_BIN/llc -O3 -filetype=asm test1.ll -o -
...
# %bb.0:                                # %bb
        mflr 0
        andi. 3, 3, 1
        std 0, 16(1)
        stdu 1, -32(1)
        .cfi_def_cfa_offset 32
        .cfi_offset lr, 16
        li 3, 1
        li 4, 0
        isel 3, 3, 4, 1
        li 4, 0    # redundant load immediate
        bl barney.88
        nop
        addi 1, 1, 32
        ld 0, 16(1)
        mtlr 0
        blr
        .long   0
        .quad   0
...

$ $OPT_BIN/llc -O3 -filetype=asm test1.ll -o -
...
# %bb.0:                                # %bb
        mflr 0
        andi. 3, 3, 1
        std 0, 16(1)
        stdu 1, -32(1)
        .cfi_def_cfa_offset 32
        .cfi_offset lr, 16
        li 3, 1
        li 4, 0
        isel 3, 3, 4, 1
        bl barney.88
        nop
        addi 1, 1, 32
        ld 0, 16(1)
        mtlr 0
        blr
        .long   0
        .quad   0
...
$ cat test2.ll
target datalayout = "e-m:e-i64:64-n32:64"
target triple = "powerpc64le-unknown-linux-gnu"

@global.6 = external global i32*

declare void @barney.94(i8*, i32)

define void @0() {
  store i32* null, i32** @global.6
  call void @barney.94(i8* undef, i32 0)
  unreachable
}

$ $ORI_BIN/llc -O3 -filetype=asm test2.ll -o -
...
# %bb.0:
        mflr 0
        std 0, 16(1)
        stdu 1, -32(1)
        .cfi_def_cfa_offset 32
        .cfi_offset lr, 16
        addis 3, 2, .LC0@toc@ha
        li 4, 0
        ld 3, .LC0@toc@l(3)
        std 4, 0(3)
        li 4, 0    # redundant load immediate
        bl barney.94
        nop
        .long   0
        .quad   0
...

$ $OPT_BIN/llc -O3 -filetype=asm test2.ll -o -
...
# %bb.0:
        mflr 0
        std 0, 16(1)
        stdu 1, -32(1)
        .cfi_def_cfa_offset 32
        .cfi_offset lr, 16
        addis 3, 2, .LC0@toc@ha
        li 4, 0
        ld 3, .LC0@toc@l(3)
        std 4, 0(3)
        bl barney.94
        nop
        .long   0
        .quad   0
...

$ $ORI_BIN/llc -O3 -mtriple=armv8-eabi -filetype=asm test2.ll -o -
...
@ %bb.0:
        .save   {r11, lr}
        push    {r11, lr}
        movw    r0, :lower16:global.6
        mov     r1, #0
        movt    r0, :upper16:global.6
        str     r1, [r0]
        mov     r1, #0    @ redundant load immediate
        bl      barney.94
...

Note.

  1. $ORI_BIN/llc is the llc without patch and $OPT_BIN/llc is the llc with patch.
  2. We are evaluating the feasibility of platform independent implementation as Hal suggested and @t.p.northover for any input
stefanp accepted this revision.Jul 19 2019, 2:40 PM
stefanp added a subscriber: stefanp.

I only have a number of very minor comments. Overall I think this looks good.
I think you can fix the comments when you commit.

LGTM.

llvm/lib/Target/PowerPC/PPCPreEmitPeephole.cpp
61 ↗(On Diff #210325)

nit:
use -> used

67 ↗(On Diff #210325)

nit:
Wording.
DeadOrKillToUnset is a pointer to the previous operand has kill/dead flag.
I think this next one sounds a little better:
DeadOrKillToUnset is a pointer to the previous operand that had the kill/dead flag set.

Similarly:
It tracks from the def register of BBI, use registers of AfterBBIs and def registers of AfterBBIs.
Sounds better:
It keeps track of the def register of BBI, the use registers of AfterBBIs and the def registers of AfterBBIs.

The above are just suggestions. If you think another wording sounds better you can use that.

76 ↗(On Diff #210325)

I think it would be easier to read this loop if you just used MBB.instrs() and went through the range that way.
The only place you actually need BBE is as the loop bounds for the inner loop and there you can use MBB.instr_end() directly.

86 ↗(On Diff #210325)

nit:
which operand is a relocation to where the operand is a relocation

106 ↗(On Diff #210325)

So here you can probably just use AfterBBI != MBB.instr_end(). I know that technically saving the value once and using it in the loop is slightly more efficient in terms of performance but I tend to prefer code that is easier to read when the performance difference is really small like this. However, having said that, others might prefer the other way around so if others (you or other reviewers) prefer the other way around we can do that too.

157 ↗(On Diff #210325)

You can use NumRemovedInPreEmit += InstrsToErase.size() outside of the loop.

159 ↗(On Diff #210325)

nit:
How about this?
return !InstrsToErase.empty()

This revision is now accepted and ready to land.Jul 19 2019, 2:40 PM

Address Stefan's review comments

Yi-Hong.Lyu marked 7 inline comments as done.Jul 23 2019, 10:31 AM
This revision was automatically updated to reflect the committed changes.