This is an archive of the discontinued LLVM Phabricator instance.

[RISCV] Add an MIR pass to replace redundant sext.w instructions with copies.
ClosedPublic

Authored by craig.topper on Dec 29 2021, 11:42 PM.

Details

Summary

Function calls and compare instructions tend to cause sext.w
instructions to be inserted. If we make good use of W instructions,
these operations can often end up being redundant. We don't always
detect these during SelectionDAG due to things like phis. There also
some cases caused by failure to turn extload into sextload in
SelectionDAG. extload selects to LW allowing later sext.ws to become
redundant.

This patch adds a pass that examines the input of sext.w instructions trying
to determine if it is already sign extended. Either by finding a
W instruction, other instructions that produce a sign extended result,
or looking through instructions that propagate sign bits. It uses
a worklist and visited set to search as far back as necessary.

Diff Detail

Event Timeline

craig.topper created this revision.Dec 29 2021, 11:42 PM
craig.topper requested review of this revision.Dec 29 2021, 11:42 PM
Herald added a project: Restricted Project. · View Herald TranscriptDec 29 2021, 11:42 PM
Herald added a subscriber: MaskRay. · View Herald Transcript
asb added a comment.Dec 30 2021, 3:22 AM

This is a nice improvement. I'm seeing a few cases where additional moves are generated when this pass is enabled. Presumably this is exposing a problem elsewhere (or just a case where the register allocator is slightly less good) - but I'm wondering if you have looked at this at all?

e.g. 20020529-1.c from the GCC torture suite:

--- a/output_rv64imafdc_lp64_O3/20020529-1.s
+++ b/output_rv64imafdc_lp64_O3/20020529-1.s
@@ -39,9 +39,10 @@ foo:                                    # @foo
        bnez    a0, .LBB0_5
 # %bb.9:                                # %if.end8.us.us
                                         #   in Loop: Header=BB0_8 Depth=1
-       sext.w  a2, a4
-       addiw   a4, a4, 1
-       bne     a2, a1, .LBB0_8
+       addiw   a2, a4, 1
+       mv      a3, a4
+       mv      a4, a2
+       bne     a3, a1, .LBB0_8
 .LBB0_10:                               # %if.then
        lui     a0, %hi(f1.beenhere)
        li      a1, 2
llvm/lib/Target/RISCV/RISCVSExtWRemoval.cpp
62

Is it just the AMO operations missing?

Are the missing instructions not listed due to further work being needed, or just lack of test coverage?

llvm/lib/Target/RISCV/RISCVTargetMachine.cpp
202

Only add the pass if targeting RV64?

craig.topper added inline comments.Dec 30 2021, 8:37 AM
llvm/lib/Target/RISCV/RISCVSExtWRemoval.cpp
62

The bitmanip instructions that weren’t up for ratification are also missing. Partly because I wasn’t sure if I we’d just have to delete it someday.

Only add the pass on RV64 target.

craig.topper added a comment.EditedDec 30 2021, 10:17 AM

This is a nice improvement. I'm seeing a few cases where additional moves are generated when this pass is enabled. Presumably this is exposing a problem elsewhere (or just a case where the register allocator is slightly less good) - but I'm wondering if you have looked at this at all?

e.g. 20020529-1.c from the GCC torture suite:

--- a/output_rv64imafdc_lp64_O3/20020529-1.s
+++ b/output_rv64imafdc_lp64_O3/20020529-1.s
@@ -39,9 +39,10 @@ foo:                                    # @foo
        bnez    a0, .LBB0_5
 # %bb.9:                                # %if.end8.us.us
                                         #   in Loop: Header=BB0_8 Depth=1
-       sext.w  a2, a4
-       addiw   a4, a4, 1
-       bne     a2, a1, .LBB0_8
+       addiw   a2, a4, 1
+       mv      a3, a4
+       mv      a4, a2
+       bne     a3, a1, .LBB0_8
 .LBB0_10:                               # %if.then
        lui     a0, %hi(f1.beenhere)
        li      a1, 2

This looks like a bad scheduling decision. The NoSchedModel gave a latency of 1 for the sext.w, but a latency of 0 for the COPY that replaced it. With the latency of 1 the sext.w got scheduled before the addiw, with the latency of 0 the copy got scheduled after.

With -mcpu=rocket-rv64 the COPY gets scheduled before.

This is a nice improvement. I'm seeing a few cases where additional moves are generated when this pass is enabled. Presumably this is exposing a problem elsewhere (or just a case where the register allocator is slightly less good) - but I'm wondering if you have looked at this at all?

e.g. 20020529-1.c from the GCC torture suite:

--- a/output_rv64imafdc_lp64_O3/20020529-1.s
+++ b/output_rv64imafdc_lp64_O3/20020529-1.s
@@ -39,9 +39,10 @@ foo:                                    # @foo
        bnez    a0, .LBB0_5
 # %bb.9:                                # %if.end8.us.us
                                         #   in Loop: Header=BB0_8 Depth=1
-       sext.w  a2, a4
-       addiw   a4, a4, 1
-       bne     a2, a1, .LBB0_8
+       addiw   a2, a4, 1
+       mv      a3, a4
+       mv      a4, a2
+       bne     a3, a1, .LBB0_8
 .LBB0_10:                               # %if.then
        lui     a0, %hi(f1.beenhere)
        li      a1, 2

This looks like a bad scheduling decision. The NoSchedModel gave a latency of 1 for the sext.w, but a latency of 0 for the COPY that replaced it. With the latency of 1 the sext.w got scheduled before the addiw, with the latency of 0 the copy got scheduled after.

With -mcpu=rocket-rv64 the COPY gets scheduled before.

This is also caused by loop strength reduce not converting the loop exit condition to use the post inc value. This may be related to i32 not being a native type in datalayout. See also https://lists.llvm.org/pipermail/llvm-dev/2021-December/154474.html

In D116397#3214159, @craig.topper wrote:Does it possible to just propagate the register IFF there is only one def (no phi)?

This is a nice improvement. I'm seeing a few cases where additional moves are generated when this pass is enabled. Presumably this is exposing a problem elsewhere (or just a case where the register allocator is slightly less good) - but I'm wondering if you have looked at this at all?

e.g. 20020529-1.c from the GCC torture suite:

--- a/output_rv64imafdc_lp64_O3/20020529-1.s
+++ b/output_rv64imafdc_lp64_O3/20020529-1.s
@@ -39,9 +39,10 @@ foo:                                    # @foo
        bnez    a0, .LBB0_5
 # %bb.9:                                # %if.end8.us.us
                                         #   in Loop: Header=BB0_8 Depth=1
-       sext.w  a2, a4
-       addiw   a4, a4, 1
-       bne     a2, a1, .LBB0_8
+       addiw   a2, a4, 1
+       mv      a3, a4
+       mv      a4, a2
+       bne     a3, a1, .LBB0_8
 .LBB0_10:                               # %if.then
        lui     a0, %hi(f1.beenhere)
        li      a1, 2

This looks like a bad scheduling decision. The NoSchedModel gave a latency of 1 for the sext.w, but a latency of 0 for the COPY that replaced it. With the latency of 1 the sext.w got scheduled before the addiw, with the latency of 0 the copy got scheduled after.

With -mcpu=rocket-rv64 the COPY gets scheduled before.

This is also caused by loop strength reduce not converting the loop exit condition to use the post inc value. This may be related to i32 not being a native type in datalayout. See also https://lists.llvm.org/pipermail/llvm-dev/2021-December/154474.html

Does it possible to just propagate the register IFF there is only one def (no phi) rather than insert COPY? so that the extra mv can gone?

Oh I take a closer look, one def (no phi) or not should not matter, I tried to add MachineCopyPropagation but not work since it require non-SSA, in theory we could remove that by a simple copy propagation after this pass.

Before:

bb.5.if.end.us.us:
; predecessors: %bb.4, %bb.6
  successors: %bb.6(0x7ffff800), %bb.11(0x00000800); %bb.6(100.00%), %bb.11(0.00%)

  %3:gpr = PHI %1:gpr, %bb.4, %4:gpr, %bb.6
  %23:gpr = COPY $x0
  BNE %22:gpr, %23:gpr, %bb.11
  PseudoBR %bb.6

bb.6.if.end8.us.us:
; predecessors: %bb.5
  successors: %bb.10(0x04000000), %bb.5(0x7c000000); %bb.10(3.12%), %bb.5(96.88%)

  %4:gpr = ADDIW %3:gpr, 1
  %25:gpr = ADDIW %3:gpr, 0
  BEQ killed %25:gpr, %21:gpr, %bb.10
  PseudoBR %bb.5

After:

bb.5.if.end.us.us:
; predecessors: %bb.4, %bb.6 
  successors: %bb.6(0x7ffff800), %bb.11(0x00000800); %bb.6(100.00%), %bb.11(0.00%)

  %3:gpr = PHI %1:gpr, %bb.4, %4:gpr, %bb.6 
  %23:gpr = COPY $x0
  BNE %22:gpr, %23:gpr, %bb.11
  PseudoBR %bb.6 

bb.6.if.end8.us.us:
; predecessors: %bb.5 
  successors: %bb.10(0x04000000), %bb.5(0x7c000000); %bb.10(3.12%), %bb.5(96.88%)

  %4:gpr = ADDIW %3:gpr, 1
  %25:gpr = COPY %3:gpr
  BEQ killed %25:gpr, %21:gpr, %bb.10
  PseudoBR %bb.5

Merge the source and destination register instead of inserting a copy.

Constrain the reg class before merging the registers. If we can't constrain, don't replace the sext.w.

Quick patch for that, but seems also change the BB layout of 20020529-1...

diff --git a/llvm/lib/Target/RISCV/RISCVSExtWRemoval.cpp b/llvm/lib/Target/RISCV/RISCVSExtWRemoval.cpp
index acd79b4ab742..e8774a033ca3 100644
--- a/llvm/lib/Target/RISCV/RISCVSExtWRemoval.cpp
+++ b/llvm/lib/Target/RISCV/RISCVSExtWRemoval.cpp
@@ -251,9 +251,19 @@ bool RISCVSExtWRemoval::runOnMachineFunction(MachineFunction &MF) {
         continue;
 
       LLVM_DEBUG(dbgs() << "Removing redundant sign-extension\n");
-      BuildMI(MBB, MI, MI->getDebugLoc(), TII->get(RISCV::COPY),
-              MI->getOperand(0).getReg())
-          .add(MI->getOperand(1));
+      Register DstReg = MI->getOperand(0).getReg();
+      LLVM_DEBUG(dbgs() << "Replace " << DstReg << " to " << SrcReg << "\n");
+      // Rewrite all uses.
+      // const TargetRegisterInfo &TRI = ST.getRegisterInfo();
+      const TargetRegisterInfo *TRI =
+          MF.getSubtarget<RISCVSubtarget>().getRegisterInfo();
+      for (MachineBasicBlock &MBB : MF) {
+        for (auto I = MBB.begin(), IE = MBB.end(); I != IE;) {
+          MachineInstr *MI = &*I++;
+          if (auto MO = MI->findRegisterUseOperand(DstReg))
+            MO->substVirtReg(SrcReg, 0, *TRI);
+        }
+      }
       MI->eraseFromParent();
       ++NumRemovedSExtW;
       MadeChange = true;

LGTM without copy prorogation improvement, copy prorogation might cause extra code gen issue, so I think that part could improve later with more investigation.

Oh, my mistake I missed the updated version, LGTM.

kito-cheng accepted this revision.Jan 3 2022, 7:37 PM
This revision is now accepted and ready to land.Jan 3 2022, 7:37 PM
asb accepted this revision.Jan 6 2022, 6:50 AM

This LGTM, thanks!

This revision was landed with ongoing or failed builds.Jan 6 2022, 8:24 AM
This revision was automatically updated to reflect the committed changes.