[RISCV] Add peepholes for Global Address lowering patterns
ClosedPublic

Authored by sabuasal on Apr 17 2018, 4:13 PM.

Details

Summary

Base and offset are always separated when a GlobalAddress node is lowered
(rL332641) as an optimization to reduce instruction count. However, this
optimization is not profitable if the GlobalAddress ends up being used in only
one instruction.

This patch adds peephole optimizations that merge an offset of
an address calculation into the LUI %hi and ADDI %lo of the lowering sequence.

The peephole handles three patterns:

1) ADDI (ADDI (LUI %hi(global)) %lo(global)), offset
     --->
   ADDI (LUI %hi(global + offset)) %lo(global + offset)

This generates:

lui  a0, %hi(global + offset)
addi a0, a0, %lo(global + offset)

Instead of:

lui  a0, %hi(global)
addi a0, a0, %lo(global)
addi a0, a0, offset

This pattern is for cases when the offset is small enough to fit in the
immediate field of ADDI (less than 12 bits).
2) ADD (ADDI (LUI %hi(global)) %lo(global)), (LUI hi_offset)
     --->
   offset = hi_offset << 12
   ADDI (LUI %hi(global + offset)) %lo(global + offset)

Which generates the ASM:

lui  a0, %hi(global + offset)
addi a0, a0, %lo(global + offset)

Instead of:

lui  a0, %hi(global)
addi a0, a0, %lo(global)
lui  a1, hi_offset
add  a0, a0, a1

This pattern is for cases when the offset doesn't fit in the immediate field
of ADDI but its lower 12 bits are all zeros.
3) ADD (ADDI (LUI %hi(global)) %lo(global)), (ADDI (LUI offhi20) offlo12)
     --->
   offset = (offhi20 << 12) + offlo12
   ADDI (LUI %hi(global + offset)) %lo(global + offset)

Which generates the ASM:

lui  a1, %hi(global + offset)
addi a1, a1, %lo(global + offset)

Instead of:

lui  a0, %hi(global)
addi a0, a0, %lo(global)
lui  a1, offhi20
addi a1, a1, offlo12
add  a0, a0, a1

This pattern is for cases when the offset doesn't fit in the immediate field
of ADDI and both the lower 12 bits and the high 20 bits are non-zero.

Diff Detail

sabuasal created this revision.Apr 17 2018, 4:13 PM
asb added a comment.Apr 19 2018, 5:07 AM

Thanks Sameer. This is definitely a win for a range of inputs, but I'm also seeing regressions. An extra instruction is now required for any standalone global access + offset. There are even some cases where the global access is going from 2 instructions to 4 instructions (global with very large offset, requiring lui+addi). Do you have any thoughts on addressing these cases?

The CHECK lines for the new test file are best generated with update_llc_test_checks.py, as that makes it easy to regenerate in the future and normalises whitespace. I'm also seeing test changes for other in-tree tests, which also need regenerating in this patch. I always prefer to have any attributes that aren't necessary to the function of the test removed, so the test file might end up looking like:

; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
; RUN: llc -mtriple=riscv32  < %s | FileCheck  %s

; This test case checks that the base of a global address will be reused between
; different accesses. This allows a shorter instruction sequence
; for each access.
; lui     a0, %hi(s+44)
; addi    a1, zero, 20
; sw      a1, %lo(s+44)(a0)

; lui     a0, %hi(s+8)
; addi    a1, zero, 10
; sw      a1, %lo(s+8)(a0)
; The first function shows the case when the offset is small enough to be
; folded, the second is when the offset is too big.

; TODO: for the big offset, the addi following the lui for the offset
;       can be optimized away by folding the %lo into the immediate part
;       of the sw.

%struct.S = type { i32, i32, %struct.S_nest, i32, i32, %struct.S_nest, i32, i32, %struct.S_nest }
%struct.S_nest = type { i32, i32, i32, i32, i32, i32 }

%struct.S2 = type { i32, [4100 x i32], i32, [4100 x i32], i32 }

@s = global %struct.S zeroinitializer, align 4
@s2 = global %struct.S2 zeroinitializer, align 4

define void @foo() nounwind {
; CHECK-LABEL: foo:
; CHECK:       # %bb.0: # %entry
; CHECK-NEXT:    lui a0, %hi(s)
; CHECK-NEXT:    addi a0, a0, %lo(s)
; CHECK-NEXT:    addi a1, zero, 20
; CHECK-NEXT:    sw a1, 44(a0)
; CHECK-NEXT:    addi a1, zero, 10
; CHECK-NEXT:    sw a1, 8(a0)
; CHECK-NEXT:    addi a1, zero, 30
; CHECK-NEXT:    sw a1, 80(a0)
; CHECK-NEXT:    ret
entry:
  store i32 10, i32* getelementptr inbounds (%struct.S, %struct.S* @s, i32 0, i32 2, i32 0), align 4
  store i32 20, i32* getelementptr inbounds (%struct.S, %struct.S* @s, i32 0, i32 5, i32 1), align 4
  store i32 30, i32* getelementptr inbounds (%struct.S, %struct.S* @s, i32 0, i32 8, i32 2), align 4
  ret void
}

define void @bar() nounwind {
; CHECK-LABEL: bar:
; CHECK:       # %bb.0: # %entry
; CHECK-NEXT:    lui a0, 8
; CHECK-NEXT:    addi a0, a0, 40
; CHECK-NEXT:    lui a1, %hi(s2)
; CHECK-NEXT:    addi a1, a1, %lo(s2)
; CHECK-NEXT:    add a0, a1, a0
; CHECK-NEXT:    addi a2, zero, 50
; CHECK-NEXT:    sw a2, 0(a0)
; CHECK-NEXT:    lui a0, 4
; CHECK-NEXT:    addi a0, a0, 20
; CHECK-NEXT:    add a0, a1, a0
; CHECK-NEXT:    addi a1, zero, 40
; CHECK-NEXT:    sw a1, 0(a0)
; CHECK-NEXT:    ret
entry:
  store i32 40, i32* getelementptr inbounds (%struct.S2, %struct.S2* @s2, i32 0, i32 2), align 4
  store i32 50, i32* getelementptr inbounds (%struct.S2, %struct.S2* @s2, i32 0, i32 4), align 4
  ret void
}
lib/Target/RISCV/RISCVISelLowering.cpp
221 ↗(On Diff #142858)

There are a few clang-tidy issues in this function (extra space after return, lack of space before RISCVII::MO_HI).

232 ↗(On Diff #142858)

It would be good to add a comment here, something along the lines of:

// Lower the base global address and the offset separately, allowing the base address to be reused across different accesses.

test/CodeGen/RISCV/hoist-global-addr-base.ll
1

Please generate the CHECK lines with update_llc_test_checks.py.

asb added a comment.EditedApr 24 2018, 7:45 AM

What are your thoughts on handling regressions such as those demonstrated in the test case below?

@g = global [128 x i8] zeroinitializer, align 1

define i8* @test() {
  ret i8* getelementptr inbounds ([128 x i8], [128 x i8]* @g, i32 0, i32 125)
}

Before:

	lui	a0, %hi(g+125)
	addi	a0, a0, %lo(g+125)

After:

	lui	a0, %hi(g)
	addi	a0, a0, %lo(g)
	addi	a0, a0, 125

It's tempting to have a peephole that converts the final addi to addi a0, a0, %lo(g+125) but that's not actually correct as the %hi relocation could be affected by the top bit of the %lo due to sign extension. A peephole might still be a sensible approach, but it would need to transform the lui as well.

You can also note that in the worst case with a large offset you end up with an additional two instructions

@g = global [1048576 x i8] zeroinitializer, align 1  

define i8* @test() {
  ret i8* getelementptr inbounds ([1048576 x i8], [1048576 x i8]* @g, i32 0, i32 524288)
}

Before:

	lui	a0, %hi(g+524288)
	addi	a0, a0, %lo(g+524288)

After:

	lui	a0, %hi(g)
	addi	a0, a0, %lo(g)
	lui	a1, 128
	add	a0, a0, a1
sabuasal updated this revision to Diff 146050.May 9 2018, 6:51 PM
  • Updated tests and formatting.
  • Added peepholes to handle corner cases; handling these cases with the DAG combiner doesn't work for uses spanning multiple basic blocks (the DAG combiner works at the basic-block level).
asb added a comment.May 15 2018, 9:15 AM

Thanks for the update Sameer. I've been evaluating this patch today. I was curious about the decision to perform a peephole rather than a DAG combine. You mention that the DAG combiner doesn't work for uses spanning multiple basic blocks, but doesn't the SelectionDAG (and hence RISCVISelDAGToDAG) also work on a per-basic-block granularity?

The following IR demonstrates a missed opportunity for rewriting a global to be global+offset and deleting an addi. Specifically you end up with lui+addi+lh, which can be reduced to lui+lh by merging the offset back into the GlobalAddress and using that in the lh, thus allowing the addi to be deleted.

@foo = global [6 x i16] [i16 1, i16 2, i16 3, i16 4, i16 5, i16 0], align 2                                     
                                                                                                                
define i32 @main() nounwind {                                                                                   
entry:                                                                                                          
  %0 = load i16, i16* getelementptr inbounds ([6 x i16], [6 x i16]* @foo, i32 0, i32 4), align 2                
  %cmp = icmp eq i16 %0, 140                                                                                    
  br i1 %cmp, label %if.end, label %if.then                                                                     
                                                                                                                
if.then:                                                                                                        
  tail call void @abort()                                                                                       
  unreachable                                                                                                   
                                                                                                                
if.end:                                                                                                         
  ret i32 0                                                                                                     
}                                                                                                               
                                                                                                                
declare void @abort()

A quick and dirty addition to doPeepholeLoadStoreADDI can resolve this (intended to be demonstrative rather than commitable code):

 // Merge an ADDI into the offset of a load/store instruction where possible.
 // (load (add base, off), 0) -> (load base, off)
 // (store val, (add base, off)) -> (store val, base, off)
@@ -171,19 +375,51 @@ void RISCVDAGToDAGISel::doPeepholeLoadStoreADDI() {
       break;
     }
 
-    // Currently, the load/store offset must be 0 to be considered for this
-    // peephole optimisation.
-    if (!isa<ConstantSDNode>(N->getOperand(OffsetOpIdx)) ||
-        N->getConstantOperandVal(OffsetOpIdx) != 0)
-      continue;
-
     SDValue Base = N->getOperand(BaseOpIdx);
 
     // If the base is an ADDI, we can merge it in to the load/store.
     if (!Base.isMachineOpcode() || Base.getMachineOpcode() != RISCV::ADDI)
       continue;
 
+    if (!isa<ConstantSDNode>(N->getOperand(OffsetOpIdx)))
+     continue;
+   
+    uint64_t Offset = N->getConstantOperandVal(OffsetOpIdx);
     SDValue ImmOperand = Base.getOperand(1);
+    SDValue HBase = Base.getOperand(0);
+
+    // Base is addi, offset is non-zero, addi has a global parameter, addi 
+    // parent is lui, addi+lui have only one use then do the global match.
+    if (Offset != 0 && isa<GlobalAddressSDNode>(ImmOperand) && 
+        HBase.isMachineOpcode() && HBase.getMachineOpcode() == RISCV::LUI && 
+        isa<GlobalAddressSDNode>(HBase.getOperand(0)) && Base.hasOneUse() && 
+          HBase.hasOneUse()) {
+      auto GALo = cast<GlobalAddressSDNode>(ImmOperand);
+      // Change the ImmOperand to be a global address with new offset
+      SDValue GALoNew = CurDAG->getTargetGlobalAddress(
+          GALo->getGlobal(), SDLoc(ImmOperand), ImmOperand.getValueType(),
+          GALo->getOffset() + Offset, GALo->getTargetFlags());
+      // Change the HBase immoperand to be a global address with new offset
+      auto GAHi = cast<GlobalAddressSDNode>(HBase.getOperand(0));
+      SDValue GAHiNew = CurDAG->getTargetGlobalAddress(
+          GAHi->getGlobal(), SDLoc(ImmOperand), ImmOperand.getValueType(),
+          GAHi->getOffset() + Offset, GAHi->getTargetFlags());
+      if (BaseOpIdx == 0) // Load
+        CurDAG->UpdateNodeOperands(N, Base.getOperand(0), GALoNew,
+                                 N->getOperand(2));
+      else // Store
+        CurDAG->UpdateNodeOperands(N, N->getOperand(0), Base.getOperand(0),
+            GALoNew, N->getOperand(3));
+      CurDAG->UpdateNodeOperands(HBase.getNode(), GAHiNew);
+      // Remove dead nodes
+      CurDAG->RemoveDeadNodes();
+      continue;
+    }
+
+    // Currently, the load/store offset must be 0 to be considered for this
+    // peephole optimisation.
+    if (Offset != 0)
+      continue;
 
     if (auto Const = dyn_cast<ConstantSDNode>(ImmOperand)) {
       ImmOperand = CurDAG->getTargetConstant(

But I do see some cases where this introduces an extra instruction due to the basic-block granularity, as the global base is no longer reused across basic blocks. Hopefully this is all much easier to handle with GlobalISel.

asb added a comment.May 15 2018, 1:34 PM

I looked at this in some more detail. I have a few very rough statistics on the effect of the various peepholes on the GCC torture suite. I don't claim this is representative of real code, but of course global accesses are highly codebase dependent anyway. Stats are just on the total static instruction count, i.e. it doesn't take into account if instructions are moved from hot to cold basic blocks or vice versa.

Improvement of your proposed peephole over your patch with the peephole disabled:

  • 26 files have reduced number of instructions
  • 10 have an increased number of instructions
  • 6 have changed output but the same static instruction count

If you then add my proposed peephole on top of that, then you see the following changes (baseline = this patch including your peephole)

  • 23 files have reduced number of instructions
  • 11 have increased number of instructions
  • 2 have changed output but the same static instruction count

If it were doable, the ideal starting point would be the following:

  • If the global base has only a single reference across the whole function, or every reference has the same offset then combine the global with offset
  • If the global base has multiple references with different offsets, then never combine the global with the offset

I think the above two rules would be sensible regardless of whether you are linking statically or e.g. dynamic linking with PIC. Unfortunately we don't have access to that information at the point we'd like to make a decision about combining an offset. A particular lui+addi of a global may have only a single use, but that global may be referenced many times in the function. New TargetGlobalAddress nodes are introduced each time rather than reusing them. In these cases, MachineCSE can remove redundant instructions as long as different offsets aren't folded into the global. Unless we can force the reuse of identical TargetGlobalAddress nodes we can't easily write a peephole pass that respects the two rules proposed above. This input shows the problem well:

@a = internal unnamed_addr global [4 x i32] zeroinitializer, align 4

; Function Attrs: noreturn nounwind
define dso_local i32 @main() local_unnamed_addr #0 {
entry:
  %0 = load i32, i32* getelementptr inbounds ([4 x i32], [4 x i32]* @a, i32 0, i32 0), align 4
  %cmp = icmp eq i32 %0, 0
  br i1 %cmp, label %if.end, label %if.then

if.then:                                          ; preds = %entry
  tail call void @abort() #3
  unreachable

if.end:                                           ; preds = %entry
  %1 = load i32, i32* getelementptr inbounds ([4 x i32], [4 x i32]* @a, i32 0, i32 1), align 4
  %cmp1 = icmp eq i32 %1, 3
  br i1 %cmp1, label %if.end3, label %if.then2

if.then2:                                         ; preds = %if.end
  tail call void @abort() #3
  unreachable

if.end3:                                          ; preds = %if.end
  %2 = load i32, i32* getelementptr inbounds ([4 x i32], [4 x i32]* @a, i32 0, i32 2), align 4
  %cmp4 = icmp eq i32 %2, 2
  br i1 %cmp4, label %if.end6, label %if.then5

if.then5:                                         ; preds = %if.end3
  tail call void @abort() #3
  unreachable

if.end6:                                          ; preds = %if.end3
  %3 = load i32, i32* getelementptr inbounds ([4 x i32], [4 x i32]* @a, i32 0, i32 3), align 4
  %cmp7 = icmp eq i32 %3, 1
  br i1 %cmp7, label %if.end9, label %if.then8

if.then8:                                         ; preds = %if.end6
  tail call void @abort() #3
  unreachable

if.end9:                                          ; preds = %if.end6
  tail call void @exit(i32 0) #3
  unreachable
}

declare void @abort()

declare void @exit(i32)

So, what is the path forwards here? It feels like we should try to do something sensible without adding too much complex logic. Without a whole-function view we're not going to be better in all cases. Having the combines discussed in this thread feels like a reasonable starting point, but it is possible to imagine pathological inputs with uses spread across many basic blocks. Maybe these combines should be enabled only for a static relocation model?

I'd welcome further thoughts.

asb added a comment.May 16 2018, 3:11 AM

I've started a discussion on GlobalAddress lowering strategy on llvm-dev here: http://lists.llvm.org/pipermail/llvm-dev/2018-May/123359.html

Your peephole code seems well written and really solid, but I'd like to discuss the dagcombiner approach some more. The sort of approach taken in rL330630 seems to catch the cases discussed in this thread and has a simpler implementation.

asb added a comment.May 16 2018, 4:27 AM

I think we've all spent enough time looking at this to see that separating the base from offset in lowerGlobalAddress is a better starting point, even though there are a range of examples where it results in more instructions. I think we'll soon arrive at a solution that fixes the most common of those cases, but it's obviously not quite as straightforward as originally hoped. If you'd like to land the change to lowerGlobalAddress quickly, I'd happily now review that as a standalone change. If you like, you could create a new review thread that includes that change as well as test cases that demonstrate where it helps, and cases where it's worse marked with TODO.

sabuasal added a comment.EditedMay 16 2018, 10:26 AM
In D45748#1099689, @asb wrote:

Thanks for the update Sameer. I've been evaluating this patch today. I was curious about the decision to perform a peephole rather than a DAG combine. You mention that the DAG combiner doesn't work for uses spanning multiple basic blocks, but doesn't the SelectionDAG (and hence RISCVISelDAGToDAG) also work on a per-basic-block granularity?

Yes. Here is the problem with the DAG combine approach:

  • In our optimization we lower the GlobalAddress into a TargetGlobalAddress and an ISD::ADD (a generic node, not a target-specific node).
  • We create a DAG combine that registers a callback for ISD::ADD.
  • Right after lowerGlobalAddress is done, you hit the DAG combiner function. At this point other blocks are yet to be processed, and the global addresses there are still not lowered into TargetGlobalAddress nodes (as in the IR test I added called "control_flow"), so the target node we created appears to have only one use. Our combiner check will merge the offset back into the target lowering because it simply doesn't know about the other uses.

    In a peephole, I am only checking the target-specific RISCV::ADD and RISCV::ADDI, so my check only kicks in and merges the offset back into the global address after the block that has the first global address node has been visited.

    This might have been doable with the DAG combiner, but the framework doesn't let you request a callback for a target-specific node.


sabuasal added a comment.EditedMay 16 2018, 10:31 AM
In D45748#1100079, @asb wrote:

I looked at this in some more detail. I have a few very rough statistics on the effect of the various peepholes on the GCC torture suite. I don't claim this is representative of real code, but of course global accesses are highly codebase dependent anyway. Stats are just on the total static instruction count, i.e. it doesn't take into account if instructions are moved from hot to cold basic blocks or vice versa.

Improvement of your proposed peephole over your patch with the peephole disabled:

  • 26 files have reduced number of instructions
  • 10 have an increased number of instructions
  • 6 have changed output but the same static instruction count

    If you then add my proposed peephole on top of that, then you see the following changes (baseline = this patch including your peephole)
  • 23 files have reduced number of instructions
  • 11 have increased number of instructions
  • 2 have changed output but the same static instruction count

So you are saying your addition hurts size, right?

If it were doable, the ideal starting point would be the following:

  • If the global base has only a single reference across the whole function, or every reference has the same offset then combine the global with offset
  • If the global base has multiple references with different offsets, then never combine the global with the offset

This is exactly what I am doing in my peephole:

  • Catch a tail node (RISCV::ADDI or RISCV::ADD).
  • Look at the operands of the node to detect a global address lowering sequence (LUI %hi followed by an ADDI %lo).
  • Look at the other operands of the tail node to reconstruct the offset (it can be just an immediate operand if the tail is an ADDI, or it may have to be reconstructed from an LUI + ADDI pair if the offset is big, or from just an LUI).
  • Merge the offset back into the GlobalAddress if the global address has only one use. We can trust that check at this point because ALL the blocks now contain RISCV target nodes.

I think the above two rules would be sensible regardless of whether you are linking statically or e.g. dynamic linking with PIC. Unfortunately we don't have access to that information at the point we'd like to make a decision about combining an offset. A particular lui+addi of a global may have only a single use, but that global may be referenced many times in the function. New TargetGlobalAddress nodes are introduced each time rather than reusing them. In these cases, MachineCSE can remove redundant instructions as long as different offsets aren't folded into the global. Unless we can force the reuse of identical TargetGlobalAddress nodes we can't easily write a peephole pass that respects the two rules proposed above. This input shows the problem well:

@a = internal unnamed_addr global [4 x i32] zeroinitializer, align 4

; Function Attrs: noreturn nounwind
define dso_local i32 @main() local_unnamed_addr #0 {
entry:
  %0 = load i32, i32* getelementptr inbounds ([4 x i32], [4 x i32]* @a, i32 0, i32 0), align 4
  %cmp = icmp eq i32 %0, 0
  br i1 %cmp, label %if.end, label %if.then

if.then:                                          ; preds = %entry
  tail call void @abort() #3
  unreachable

if.end:                                           ; preds = %entry
  %1 = load i32, i32* getelementptr inbounds ([4 x i32], [4 x i32]* @a, i32 0, i32 1), align 4
  %cmp1 = icmp eq i32 %1, 3
  br i1 %cmp1, label %if.end3, label %if.then2

if.then2:                                         ; preds = %if.end
  tail call void @abort() #3
  unreachable

if.end3:                                          ; preds = %if.end
  %2 = load i32, i32* getelementptr inbounds ([4 x i32], [4 x i32]* @a, i32 0, i32 2), align 4
  %cmp4 = icmp eq i32 %2, 2
  br i1 %cmp4, label %if.end6, label %if.then5

if.then5:                                         ; preds = %if.end3
  tail call void @abort() #3
  unreachable

if.end6:                                          ; preds = %if.end3
  %3 = load i32, i32* getelementptr inbounds ([4 x i32], [4 x i32]* @a, i32 0, i32 3), align 4
  %cmp7 = icmp eq i32 %3, 1
  br i1 %cmp7, label %if.end9, label %if.then8

if.then8:                                         ; preds = %if.end6
  tail call void @abort() #3
  unreachable

if.end9:                                          ; preds = %if.end6
  tail call void @exit(i32 0) #3
  unreachable
}

declare void @abort()

declare void @exit(i32)

Let me take a closer look at this test case. But I think the SelectionDAG builder should be able
to merge the target nodes if they refer to the same address.

Update: I looked at this test case; here is the ASM we are getting with this patch (base/offset separation plus the peepholes I added). It appears that the lowered target node was actually reused (we only have one lui).

main:                                   # @main
# %bb.0:                                # %entry
	addi	sp, sp, -16
	sw	ra, 12(sp)
	lui	a0, %hi(a)
	lw	a1, %lo(a)(a0)     --> %lo folded into the lw by the ADDI load/store peephole
	bnez	a1, .LBB0_5
# %bb.1:                                # %if.end
	addi	a0, a0, %lo(a)       ---> Need to materialize the %lo here because we lost the %lo(a) to your peephole. Still not a loss in instructions.
	lw	a1, 4(a0)              ---> The offset that was separated (4) is folded into the lw.
	addi	a2, zero, 3
	bne	a1, a2, .LBB0_5
# %bb.2:                                # %if.end3
	lw	a1, 8(a0)
	addi	a2, zero, 2
	bne	a1, a2, .LBB0_5
# %bb.3:                                # %if.end6
	lw	a0, 12(a0)
	addi	a1, zero, 1
	bne	a0, a1, .LBB0_5

Without my patch we get two extra instructions:

main:                                   # @main
# %bb.0:                                # %entry
	addi	sp, sp, -16
	sw	ra, 12(sp)
	lui	a0, %hi(a)
	lw	a0, %lo(a)(a0)
	bnez	a0, .LBB0_5
# %bb.1:                                # %if.end
	lui	a0, %hi(a+4)
	lw	a0, %lo(a+4)(a0)
	addi	a1, zero, 3
	bne	a0, a1, .LBB0_5
# %bb.2:                                # %if.end3
	lui	a0, %hi(a+8)
	lw	a0, %lo(a+8)(a0)
	addi	a1, zero, 2
	bne	a0, a1, .LBB0_5
# %bb.3:                                # %if.end6
	lui	a0, %hi(a+12)
	lw	a0, %lo(a+12)(a0)
	addi	a1, zero, 1
	bne	a0, a1, .LBB0_5
# %bb.4:                                # %if.end9
	mv	a0, zero
	call	exit

I am really not sure what you were concerned about; maybe I am missing something? Are you referring to DAG.getNode() returning the same node from the FoldingSet? If that is the case then you are right, I do see different node objects returned.

So, what is the path forwards here? It feels like we should try to do something sensible without adding too much complex logic. Without a whole-function view we're not going to be better in all cases. Having the combines discussed in this thread feels like a reasonable starting point, but it is possible to imagine pathological inputs with uses spread across many basic blocks. Maybe these combines should be enabled only for a static relocation model?

I'd welcome further thoughts.

Without a global view during instruction selection you can't catch 100% of the cases across blocks. Delaying the decision to merge the offset back into the global address, by relying on checks that only match target nodes, lets you make that decision as late as possible.

sabuasal added a comment.EditedMay 16 2018, 10:49 AM
In D45748#1100874, @asb wrote:

I've started a discussion on GlobalAddress lowering strategy on llvm-dev here: http://lists.llvm.org/pipermail/llvm-dev/2018-May/123359.html

Your peephole code seems well written and really solid, but I'd like to discuss the dagcombiner approach some more. The sort of approach taken in rL330630 seems to catch the cases discussed in this thread and has a simpler implementation.

This patch is similar, but it handles a different situation. What rL330630 does is processing that happens before a GlobalAddress node is lowered into a target-specific node: the SelectionDAG comes in with a generic GlobalAddress node that has multiple uses. It looks at all the uses, finds the smallest offset among them (each use is an ADD), and then updates the offset in the GlobalAddress node to `uint64_t Offset = MinOffset + GN->getOffset();`.

So the dag they are handling looks like this (all generic nodes)
GN = GlobalAddressNode (base + offset)

This GN has multiple uses:

  1. ADDR1 = GN + off1
  2. ADDR2 = GN + off2
  3. ADDR3 = GN + off3.

    Now assuming that min (off1, off2, off3) = off1 we get:

    GN = GlobalAddressNode(base + offset + off1)
  1. ADDR1 = GN ----> this can be optimized away
  2. ADDR2 = GN + off2 - off1
  3. ADDR3 = GN + off3 - off1

In our optimization we operate after that point, since lowerGlobalAddress() is what actually creates the ISD::ADD chained to the TargetGlobalAddress.
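For clarity, the min-offset rebasing the quoted patch performs can be modeled in a few lines of Python. This is only a sketch of the idea; the function and variable names are mine, not from the patch:

```python
def rebase_global(node_offset, use_offsets):
    """Model of the min-offset fold: pull the smallest use offset
    into the GlobalAddress node, then re-express every use
    relative to the rebased node."""
    min_off = min(use_offsets)
    new_node_offset = node_offset + min_off
    new_use_offsets = [off - min_off for off in use_offsets]
    return new_node_offset, new_use_offsets

# Three uses at GN+4, GN+8, GN+12 become GN' = GN+4 with uses
# +0, +4, +8; the +0 use collapses into the rebased node itself.
node_off, uses = rebase_global(0, [4, 8, 12])
assert node_off == 4
assert uses == [0, 4, 8]
```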

In D45748#1100940, @asb wrote:

I think we've all spent enough time looking at this to see that separating the base from offset in lowerGlobalAddress is a better starting point, even though there are a range of examples where it results in more instructions. I think we'll soon arrive at a solution that fixes the most common of those cases, but it's obviously not quite as straight-forward as originally hoped. If you'd like to land the change to lowerGlobalAddress quickly, I'd happily now review that as a standalone change. If you like, you could create a new review thread that includes that change as well as test cases that demonstrates where it helps, and cases where it's worse marked with TODO.

Pushed patch here: https://reviews.llvm.org/D46989
Added TODOs for all the corner cases we discussed in test/CodeGen/RISCV/hoist-global-addr-base.ll

asb added a comment.May 17 2018, 7:24 AM
In D45748#1100079, @asb wrote:

I looked at this in some more detail. I have a few very rough statistics on the effect of the various peepholes on the GCC torture suite. I don't claim this is representative of real code, but of course global accesses are highly codebase dependent anyway. Stats are just on the total static instruction count, i.e. it doesn't take into account if instructions are moved from hot to cold basic blocks or vice versa.

Improvement of your proposed peephole over your patch with the peephole disabled:

  • 26 files have reduced number of instructions
  • 10 have an increased number of instructions
  • 6 have changed output but the same static instruction count

    If you then add my proposed peephole on top of that, then you see the following changes (baseline = this patch including your peephole)
  • 23 files have reduced number of instructions
  • 11 have increased number of instructions
  • 2 have changed output but the same static instruction count

So you are saying your addition hurts size, right?

No, the baseline is this patch and its combines, so it reduces instruction count in a range of cases. Hard to say how representative that is, but that's true of all these discussions!

If it were doable, the ideal starting point would be the following:

  • If the global base has only a single reference across the whole function, or every reference has the same offset then combine the global with offset
  • If the global base has multiple references with different offsets, then never combine the global with the offset

This is exactly what I am doing in my peephole:

  • Catch a tail node (RISCV::ADDI or RISCV::ADD)
  • Look at the operands of the node to detect a global address lowering sequence (LUI %hi followed by an ADDI %lo)
  • Look at the other operands of the tail node to reconstruct the offset (it can be a single immediate operand if the tail is an ADDI, or, if the offset is large, it may have to be reconstructed from an LUI + ADDI pair or just an LUI)
  • Merge the offset back into the GlobalAddress if the global address lowering only had one use. We can trust that check at this point because ALL the nodes have already been lowered to RISCV target nodes.
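The offset-reconstruction step (patterns 2 and 3 in the summary) amounts to undoing RV32's LUI/ADDI materialization, where ADDI's 12-bit immediate is sign-extended. A sketch of that arithmetic (function names are mine, for illustration only):

```python
def sext12(imm):
    """Sign-extend a 12-bit immediate, as ADDI does."""
    imm &= 0xFFF
    return imm - 0x1000 if imm & 0x800 else imm

def reconstruct_offset(hi20, lo12):
    """Offset materialized by `lui hi20` + `addi lo12` on RV32."""
    return ((hi20 << 12) + sext12(lo12)) & 0xFFFFFFFF

# Plain case: bit 11 of the low part is clear.
assert reconstruct_offset(0x12, 0x345) == 0x12345
# When bit 11 of the low part is set, the high part is pre-incremented
# to compensate for the sign extension: 0x12FFF -> lui 0x13, addi -1.
assert reconstruct_offset(0x13, 0xFFF) == 0x12FFF
```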

Sorry for the confusion. I was being quite precise in referring to "references to a global base" rather than to uses of a particular GlobalAddress node. As the recent example I posted shows, there are cases where there are multiple references to the same global base across a function, but this isn't reflected in multiple uses of the same node. As I say, your peephole seems thorough and well written; there's just no way to capture this case with it. Maybe if someone really wanted to tackle that issue it would be done as an IR transformation, to ensure a common GlobalAddress reference is reused.

I think the above two rules would be sensible regardless of whether you are linking statically or e.g. dynamic linking with PIC. Unfortunately we don't have access to that information at the point we'd like to make a decision about combining an offset. A particular lui+addi of a global may have only a single use, but that global may be referenced many times in the function. New TargetGlobalAddress nodes are introduced each time rather than reusing them. In these cases, MachineCSE can remove redundant instructions as long as different offsets aren't folded into the global. Unless we can force the reuse of identical TargetGlobalAddress nodes we can't easily write a peephole pass that respects the two rules proposed above. This input shows the problem well:

@a = internal unnamed_addr global [4 x i32] zeroinitializer, align 4

; Function Attrs: noreturn nounwind
define dso_local i32 @main() local_unnamed_addr #0 {
entry:
  %0 = load i32, i32* getelementptr inbounds ([4 x i32], [4 x i32]* @a, i32 0, i32 0), align 4
  %cmp = icmp eq i32 %0, 0
  br i1 %cmp, label %if.end, label %if.then

if.then:                                          ; preds = %entry
  tail call void @abort() #3
  unreachable

if.end:                                           ; preds = %entry
  %1 = load i32, i32* getelementptr inbounds ([4 x i32], [4 x i32]* @a, i32 0, i32 1), align 4
  %cmp1 = icmp eq i32 %1, 3
  br i1 %cmp1, label %if.end3, label %if.then2

if.then2:                                         ; preds = %if.end
  tail call void @abort() #3
  unreachable

if.end3:                                          ; preds = %if.end
  %2 = load i32, i32* getelementptr inbounds ([4 x i32], [4 x i32]* @a, i32 0, i32 2), align 4
  %cmp4 = icmp eq i32 %2, 2
  br i1 %cmp4, label %if.end6, label %if.then5

if.then5:                                         ; preds = %if.end3
  tail call void @abort() #3
  unreachable

if.end6:                                          ; preds = %if.end3
  %3 = load i32, i32* getelementptr inbounds ([4 x i32], [4 x i32]* @a, i32 0, i32 3), align 4
  %cmp7 = icmp eq i32 %3, 1
  br i1 %cmp7, label %if.end9, label %if.then8

if.then8:                                         ; preds = %if.end6
  tail call void @abort() #3
  unreachable

if.end9:                                          ; preds = %if.end6
  tail call void @exit(i32 0) #3
  unreachable
}

declare void @abort()

declare void @exit(i32)

Let me take a closer look at this test case. But I think the SelectionDAG builder should be able
to merge the target nodes if they refer to the same address.

Update: I looked at this test case; here is the ASM we are getting with this patch (the split plus the peepholes I added). It appears that the lowered target node was actually reused (we only have one lui).

I'm sure that's not what I was seeing, let me take another look to see if I was doing something dumb.

sabuasal updated this revision to Diff 147368.May 17 2018, 12:18 PM
sabuasal retitled this revision from [RISCV] Separate base from offset in lowerGlobalAddress to [RISCV] Add peepholes for Global Address lowering patterns.
sabuasal edited the summary of this revision. (Show Details)

Updated the patch to only show the peephole optimizations.

sabuasal edited the summary of this revision. (Show Details)May 17 2018, 12:25 PM
sabuasal edited the summary of this revision. (Show Details)May 17 2018, 12:44 PM
sabuasal edited the summary of this revision. (Show Details)May 17 2018, 12:49 PM
sabuasal edited the summary of this revision. (Show Details)
sabuasal edited the summary of this revision. (Show Details)May 17 2018, 12:51 PM
sabuasal marked an inline comment as done.May 23 2018, 12:06 PM

Ping?

asb accepted this revision.May 23 2018, 11:42 PM

The point I was trying to make is that even with these peepholes, there are trivial cases that aren't caught, such as the example I shared:

@foo = global [6 x i16] [i16 1, i16 2, i16 3, i16 4, i16 5, i16 0], align 2                                     
                                                                                                                
define i32 @main() nounwind {                                                                                   
entry:                                                                                                          
  %0 = load i16, i16* getelementptr inbounds ([6 x i16], [6 x i16]* @foo, i32 0, i32 4), align 2                
  %cmp = icmp eq i16 %0, 140                                                                                    
  br i1 %cmp, label %if.end, label %if.then                                                                     
                                                                                                                
if.then:                                                                                                        
  tail call void @abort()                                                                                       
  unreachable                                                                                                   
                                                                                                                
if.end:                                                                                                         
  ret i32 0                                                                                                     
}                                                                                                               
                                                                                                                
declare void @abort()

This generates 3 instructions for the global access when you'd obviously prefer to use two:

lui     a0, %hi(foo)
addi    a0, a0, %lo(foo)
lhu     a0, 8(a0)

I proposed one way of adding a peephole, but don't see a way of differentiating between the case above and more complex cases where the peephole is counter-productive (such as in the longer IR sample I shared). Possibly the solution would be to have an IR pass perform common subexpression elimination on the getelementptr calculations; such a pass may already exist.

Anyway, is it really important? Probably not going to register in code profiling and probably a tiny impact on code size. I mainly raise it as I know we will all be poring over assembly output to look for missed optimisation opportunities and cases like the above stick out like a sore thumb. It's also the sort of thing compiler users do, and then confidently conclude that the compiler must be dumb.

I'm happy to land this patch as-is. It generates good code in a wide variety of cases (more so than many other backends). But please add the above example to your test file with a note about the missed optimisation opportunity.

This revision is now accepted and ready to land.May 23 2018, 11:42 PM
In D45748#1110644, @asb wrote:

The point I was trying to make is that even with these peepholes, there are trivial cases that aren't caught, such as the example I shared:

@foo = global [6 x i16] [i16 1, i16 2, i16 3, i16 4, i16 5, i16 0], align 2                                     
                                                                                                                
define i32 @main() nounwind {                                                                                   
entry:                                                                                                          
  %0 = load i16, i16* getelementptr inbounds ([6 x i16], [6 x i16]* @foo, i32 0, i32 4), align 2                
  %cmp = icmp eq i16 %0, 140                                                                                    
  br i1 %cmp, label %if.end, label %if.then                                                                     
                                                                                                                
if.then:                                                                                                        
  tail call void @abort()                                                                                       
  unreachable                                                                                                   
                                                                                                                
if.end:                                                                                                         
  ret i32 0                                                                                                     
}                                                                                                               
                                                                                                                
declare void @abort()

This generates 3 instructions for the global access when you'd obviously prefer to use two:

lui     a0, %hi(foo)
addi    a0, a0, %lo(foo)
lhu     a0, 8(a0)

I proposed one way of adding a peephole, but don't see a way of differentiating between the case above and more complex cases where the peephole is counter-productive (such as in the longer IR sample I shared). Possibly the solution would be to have an IR pass perform common subexpression elimination on the getelementptr calculations; such a pass may already exist.

Hi alex,

Actually, I took a second look today. Here is what is happening with this example:

  • Base and offset are split.
  • Your LoadStoreADDI peephole kicks in and it merges the ADDI of the offset in the Immediate field.
  • My peephole will not kick in, simply because there is no tail node ADDI (i.e. a node that adds an offset to the base, as I described above).

This case is a bit tricky because there is only one use of the global node, and your peephole can even optimize the non-split (base + offset) ADDI by folding it into the immediate of the lhu, creating:

lui     a0, %hi(foo +8)
lhu     a0, %lo(foo+8)(a0)

I could possibly update my peephole to detect a new kind of "tail": I could regard a load/store with an immediate field as a tail node, take the offset from there, and merge it back into the target lowering of base + offset.
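Folding the offset into the relocations is sound because RISC-V's %hi and %lo are defined so that their contributions recompose to the full address, including the rounding that compensates for %lo being consumed as a signed immediate. A small model of that arithmetic (the addresses used below are made-up values purely for illustration):

```python
def hi(addr):
    """RISC-V %hi: upper 20 bits, rounded so that
    (hi << 12) + sign_extend(lo) == addr."""
    return ((addr + 0x800) >> 12) & 0xFFFFF

def lo(addr):
    """RISC-V %lo: low 12 bits, consumed as a signed immediate."""
    return addr & 0xFFF

def sext12(imm):
    return imm - 0x1000 if imm & 0x800 else imm

def materialize(addr):
    """Address produced by `lui %hi(addr)` then adding %lo(addr)."""
    return ((hi(addr) << 12) + sext12(lo(addr))) & 0xFFFFFFFF

assert materialize(0x10010004) == 0x10010004  # bit 11 of %lo clear
assert materialize(0x10010808) == 0x10010808  # %lo sign-extends; %hi rounds up
```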

Anyway, is it really important? Probably not going to register in code profiling and probably a tiny impact on code size. I mainly raise it as I know we will all be poring over assembly output to look for missed optimisation opportunities and cases like the above stick out like a sore thumb. It's also the sort of thing compiler users do, and then confidently conclude that the compiler must be dumb.

I agree!

I'm happy to land this patch as-is. It generates good code in a wide variety of cases (more so than many other backends). But please add the above example to your test file with a note about the missed optimisation opportunity.

The example is actually already added to the test (define dso_local i32 @load_half() nounwind).

On the other hand, after some chat with @efriedma I am starting to think that this whole thing could be better handled with a machine function pass that runs after MachineCSE. Your point in your original longer example (and the example I have in the test case called control_flow) shows that we are getting different nodes for the lowered global address (just run llc with --stop-before=machine-cse); hasOneUse() always considers the uses within the basic block you are on, not the whole function (which makes sense). The reason these two examples worked (and by worked I mean NOT having the offset merged back into the global address lowering) is that the ADDI generated for the offset was folded into the load/store and my peephole found no tail to kick in. If we add a MachineFunctionPass we will have:

  1. A global view of the whole function so the uses for an MI will return the "True" number of uses over all the blocks.
  2. Since we are running after CSE we will know the actual benefit of separating base and offset. For the test case we have here, we break even (in terms of instruction count) if the base lowering has at least two uses:
BB0:
         lui     a0, %hi(foo)
         addi    a0, a0, %lo(foo)
BB1:
         lhu     a1, 8(a0)
          ...
BB2:
         lhu     a2, 16(a0)

VS

BB0:
BB1:
         lui     a0, %hi(foo+8)
         lhu     a1, %lo(foo+8)(a0)
          ...
BB2:
         lui     a0, %hi(foo+16)
         lhu     a1, %lo(foo+16)(a0)

And win for cases more than that.
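The break-even point above follows from a simple static count. This is a model only — it counts instructions per global access and ignores block frequency and surrounding code:

```python
def split_count(num_loads):
    """Base split from offset: lui + addi materialize the base once;
    each load folds its small offset into the load immediate."""
    return 2 + num_loads

def merged_count(num_loads):
    """Offset merged: each load needs its own lui %hi(g+off)
    plus a load with %lo(g+off)."""
    return 2 * num_loads

assert merged_count(1) < split_count(1)   # 2 vs 3: merging wins
assert merged_count(2) == split_count(2)  # 4 vs 4: break even
assert merged_count(3) > split_count(3)   # 6 vs 5: splitting wins
```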

I think we can do these checks with virtual registers in an MF pass.
If you think it is worth it, I can give it a shot in an MF pass :)

asb added a comment.May 24 2018, 12:25 PM
In D45748#1110644, @asb wrote:

I'm happy to land this patch as-is. It generates good code in a wide variety of cases (more so than many other backends). But please add the above example to your test file with a note about the missed optimisation opportunity.

The example is actually already added to the test (define dso_local i32 @load_half() nounwind).

Sorry, I was focusing on the diff and had forgotten you already added this in the previous patch. Thanks!

On the other hand, after some chat with @efriedma I am starting to think that this whole thing could be better handled with a machine function pass that runs after MachineCSE. Your point in your original longer example (and the example I have in the test case called control_flow) shows that we are getting different nodes for the lowered global address (just run llc with --stop-before=machine-cse); hasOneUse() always considers the uses within the basic block you are on, not the whole function (which makes sense). The reason these two examples worked (and by worked I mean NOT having the offset merged back into the global address lowering) is that the ADDI generated for the offset was folded into the load/store and my peephole found no tail to kick in. If we add a MachineFunctionPass we will have:

I agree with your analysis. It's conceivable you could have an IR pass that tries to improve codegen, but it would have to operate based on assumption about how global lowering would work. I think as you suggest, a machine function pass is promising.

So would the plan be to always generate the offset separate from the global (as is current upstream behaviour after your recent patch), and to opportunistically merge the offset in the case where the global base only has a single reference, or (optionally, probably less common) every reference using that base has the same offset?

In D45748#1111477, @asb wrote:
In D45748#1110644, @asb wrote:

I'm happy to land this patch as-is. It generates good code in a wide variety of cases (more so than many other backends). But please add the above example to your test file with a note about the missed optimisation opportunity.

The example is actually already added to the test (define dso_local i32 @load_half() nounwind).

Sorry, I was focusing on the diff and had forgotten you already added this in the previous patch. Thanks!

On the other hand, after some chat with @efriedma I am starting to think that this whole thing could be better handled with a machine function pass that runs after MachineCSE. Your point in your original longer example (and the example I have in the test case called control_flow) shows that we are getting different nodes for the lowered global address (just run llc with --stop-before=machine-cse); hasOneUse() always considers the uses within the basic block you are on, not the whole function (which makes sense). The reason these two examples worked (and by worked I mean NOT having the offset merged back into the global address lowering) is that the ADDI generated for the offset was folded into the load/store and my peephole found no tail to kick in. If we add a MachineFunctionPass we will have:

I agree with your analysis. It's conceivable you could have an IR pass that tries to improve codegen, but it would have to operate based on assumption about how global lowering would work. I think as you suggest, a machine function pass is promising.

So would the plan be to always generate the offset separate from the global (as is current upstream behaviour after your recent patch), and to opportunistically merge the offset in the case where the global base only has a single reference, or (optionally, probably less common) every reference using that base has the same offset?

Yes. Are we still OK with committing this patch as is? I plan to add a test case that shows the peephole's inability to handle GlobalAddress uses across multiple blocks, then commit this patch.

After that I'll look into doing all of this in a Machine Function Pass to handle all the corner cases. If it looks good we can revert this patch and keep the pass.

Does that sound like a good plan?

sabuasal updated this revision to Diff 148486.May 24 2018, 2:20 PM
  • Added a test case that shows this patch's inability to deal with global address nodes spanning multiple blocks.
asb added a comment.May 24 2018, 11:21 PM

Yes. Are we still OK with committing this patch as is? I plan to add a test case that shows the peephole's inability to handle GlobalAddress uses across multiple blocks, then commit this patch.

After that I'll look into doing all of this in a Machine Function Pass to handle all the corner cases. If it looks good we can revert this patch and keep the pass.

Does that sound like a good plan?

Yes, I'm happy to land this patch as-is. Probably stick a TODO comment near the peephole implementation indicating that we expect a MachineFunctionPass would be better, and linking to this review thread. Thanks for all your work on this.

sabuasal updated this revision to Diff 148965.May 29 2018, 12:30 PM
  • Added a comment to the peephole mentioning that it might be better as a machine function pass.
  • Added one extra test case to the test file.