This is an archive of the discontinued LLVM Phabricator instance.

[RISCV] add a compress optimization for stack inst.
ClosedPublic

Authored by lcvon007 on Aug 8 2023, 1:34 AM.

Details

Summary

For callee save/restored operation, it mostly uses the
following inst patterns:

sw rs2, offset(x2)
sd rs2, offset(x2)
fsw rs2, offset(x2)
fsd rs2, offset(x2)
lw rd, offset(x2)
ld rd, offset(x2)
flw rd, offset(x2)
fld rd, offset(x2)

and offset decides whether the instructions can be compressed.
now offset 2032 will be set by default if stacksize is larger
than 2^12-1 to save and restore callee saved register, so it
will prevent all the callee saved stack insts be compressed.

Allocate proper offset for stack insts is useful to decrease
the codesize and improve performance and add an option
riscv-compress-stack-inst to control whether to do this
optimization.

Diff Detail

Event Timeline

lcvon007 created this revision.Aug 8 2023, 1:34 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 8 2023, 1:34 AM
lcvon007 requested review of this revision.Aug 8 2023, 1:34 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 8 2023, 1:34 AM

code size before and after optimization about spec CPU2006:
400.perlbench : 1146568, 1145352, decrease 1216
403.gcc: 3556640, 3555280, decrease 1360
445.gobmk: 4437552, 4433280 decrease 4272

  1. h264ref: 586776, 585304 decrease 1472

483.xalancbmk: 8011104, 8008944 decrease 2160
435.gromacs: 870344, 867832, decrease 2512
444.namd: 212224, 207936, decrease 4288
447.dealII: 4741400, 4736792, decrease 4608
453.povray: 1075832, 1073400, decrease 2432

wangpc added a comment.EditedAug 8 2023, 4:32 AM

Have you evaluated impact on performance? It seems we will generate more instructions to adjust stack.

Have you evaluated impact on performance? It seems we will generate more instructions to adjust stack.

It may increase two instructions totally to help building the large immediate in prolog and epilog, and it may see some improvement about 1%-2% in some case and some performance degradation in some cases, and I try to provide some detailed data after more testing.

lcvon007 edited reviewers, added: wangpc; removed: frasercrmck.Aug 8 2023, 8:32 PM
lcvon007 updated this revision to Diff 549026.Aug 10 2023, 6:56 AM

Update the method to adjust the FirstSP mount to avoid
adding extra insts.

hi, I have seen some performance degradattion in my first implementation, and I update the logic to adjust the FirstSP amount conservatively so that it will not increase extra instructions, please help review, thanks very much. @wangpc

hi, I have seen some performance degradattion in my first implementation, and I update the logic to adjust the FirstSP amount conservatively so that it will not increase extra instructions, please help review, thanks very much. @wangpc

Can you show the improvement on code size/performance of your new implementation (though the old one should still be valid)?
LGTM in general and I think it is really nice to have such optimization!

I just found a regression:

--- a/llvm/test/CodeGen/RISCV/stack-realignment.ll
+++ b/llvm/test/CodeGen/RISCV/stack-realignment.ll
@@ -1,7 +1,7 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc -mtriple=riscv32 -verify-machineinstrs < %s \
+; RUN: llc -mtriple=riscv32 -mattr=+c -verify-machineinstrs < %s \
 ; RUN:   | FileCheck %s -check-prefix=RV32I
-; RUN: llc -mtriple=riscv64 -verify-machineinstrs < %s \
+; RUN: llc -mtriple=riscv64 -mattr=+c -verify-machineinstrs < %s \
 ; RUN:   | FileCheck %s -check-prefix=RV64I
 
 declare void @callee(ptr)
@@ -529,56 +529,58 @@ define void @caller_no_realign2048() "no-realign-stack" {
 define void @caller4096() {
 ; RV32I-LABEL: caller4096:
 ; RV32I:       # %bb.0:
-; RV32I-NEXT:    addi sp, sp, -2032
-; RV32I-NEXT:    .cfi_def_cfa_offset 2032
-; RV32I-NEXT:    sw ra, 2028(sp) # 4-byte Folded Spill
-; RV32I-NEXT:    sw s0, 2024(sp) # 4-byte Folded Spill
+; RV32I-NEXT:    addi sp, sp, -256
+; RV32I-NEXT:    .cfi_def_cfa_offset 256
+; RV32I-NEXT:    sw ra, 252(sp) # 4-byte Folded Spill
+; RV32I-NEXT:    sw s0, 248(sp) # 4-byte Folded Spill
 ; RV32I-NEXT:    .cfi_offset ra, -4
 ; RV32I-NEXT:    .cfi_offset s0, -8
-; RV32I-NEXT:    addi s0, sp, 2032
+; RV32I-NEXT:    addi s0, sp, 256
 ; RV32I-NEXT:    .cfi_def_cfa s0, 0
-; RV32I-NEXT:    lui a0, 2
-; RV32I-NEXT:    addi a0, a0, -2032
+; RV32I-NEXT:    li a0, 31
+; RV32I-NEXT:    slli a0, a0, 8
 ; RV32I-NEXT:    sub sp, sp, a0
 ; RV32I-NEXT:    srli a0, sp, 12
 ; RV32I-NEXT:    slli sp, a0, 12
 ; RV32I-NEXT:    lui a0, 1
-; RV32I-NEXT:    add a0, sp, a0
+; RV32I-NEXT:    add a0, a0, sp
 ; RV32I-NEXT:    call callee@plt
 ; RV32I-NEXT:    lui a0, 2
 ; RV32I-NEXT:    sub sp, s0, a0
-; RV32I-NEXT:    addi a0, a0, -2032
+; RV32I-NEXT:    li a0, 31
+; RV32I-NEXT:    slli a0, a0, 8
 ; RV32I-NEXT:    add sp, sp, a0
-; RV32I-NEXT:    lw ra, 2028(sp) # 4-byte Folded Reload
-; RV32I-NEXT:    lw s0, 2024(sp) # 4-byte Folded Reload
-; RV32I-NEXT:    addi sp, sp, 2032
+; RV32I-NEXT:    lw ra, 252(sp) # 4-byte Folded Reload
+; RV32I-NEXT:    lw s0, 248(sp) # 4-byte Folded Reload
+; RV32I-NEXT:    addi sp, sp, 256
 ; RV32I-NEXT:    ret
 ;
 ; RV64I-LABEL: caller4096:
 ; RV64I:       # %bb.0:
-; RV64I-NEXT:    addi sp, sp, -2032
-; RV64I-NEXT:    .cfi_def_cfa_offset 2032
-; RV64I-NEXT:    sd ra, 2024(sp) # 8-byte Folded Spill
-; RV64I-NEXT:    sd s0, 2016(sp) # 8-byte Folded Spill
+; RV64I-NEXT:    addi sp, sp, -512
+; RV64I-NEXT:    .cfi_def_cfa_offset 512
+; RV64I-NEXT:    sd ra, 504(sp) # 8-byte Folded Spill
+; RV64I-NEXT:    sd s0, 496(sp) # 8-byte Folded Spill
 ; RV64I-NEXT:    .cfi_offset ra, -8
 ; RV64I-NEXT:    .cfi_offset s0, -16
-; RV64I-NEXT:    addi s0, sp, 2032
+; RV64I-NEXT:    addi s0, sp, 512
 ; RV64I-NEXT:    .cfi_def_cfa s0, 0
-; RV64I-NEXT:    lui a0, 2
-; RV64I-NEXT:    addiw a0, a0, -2032
+; RV64I-NEXT:    li a0, 15
+; RV64I-NEXT:    slli a0, a0, 9
 ; RV64I-NEXT:    sub sp, sp, a0
 ; RV64I-NEXT:    srli a0, sp, 12
 ; RV64I-NEXT:    slli sp, a0, 12
 ; RV64I-NEXT:    lui a0, 1
-; RV64I-NEXT:    add a0, sp, a0
+; RV64I-NEXT:    add a0, a0, sp
 ; RV64I-NEXT:    call callee@plt
 ; RV64I-NEXT:    lui a0, 2
 ; RV64I-NEXT:    sub sp, s0, a0
-; RV64I-NEXT:    addiw a0, a0, -2032
+; RV64I-NEXT:    li a0, 15
+; RV64I-NEXT:    slli a0, a0, 9
 ; RV64I-NEXT:    add sp, sp, a0
-; RV64I-NEXT:    ld ra, 2024(sp) # 8-byte Folded Reload
-; RV64I-NEXT:    ld s0, 2016(sp) # 8-byte Folded Reload
-; RV64I-NEXT:    addi sp, sp, 2032
+; RV64I-NEXT:    ld ra, 504(sp) # 8-byte Folded Reload
+; RV64I-NEXT:    ld s0, 496(sp) # 8-byte Folded Reload
+; RV64I-NEXT:    addi sp, sp, 512
 ; RV64I-NEXT:    ret
   %1 = alloca i8, align 4096
   call void @callee(ptr %1)

I think that is because we need to do a larger second stack adjustment if we do a small first stack adjustment.
So the impact on performance should be evalated so that we can decide whether this optimization should be enabled under -Os/-Oz only. :-)

lcvon007 added a comment.EditedAug 12 2023, 7:59 AM

It seems that compiler doesn't generate the best code, and it can generate addi a0, a0, -256 like addi a0, a0, -2032, but it generates li a0, 31, slli a0, a0, 8 => 31 << 8 = 2 * 4096 -256, so I think threre is another optimization point here @wangpc

I find that the reason why my optimization will add one extra instruction is that the difference in building big immediate 8192-2032 and 8192 - 512,
lui a0, 2, addiw a0, a0, -2032 is for 8192-2032
li a0, 15, slli a0, a0, 9 is for 8192-256,
and lui a0, 2 will be optimized later, so if build 8192 - 512 use lui a0, 2, addiw a0, a0, -256, the result will be similiar.

I show the codesize data here firstly and will provide the performace data later:

codesize:

craig.topper added inline comments.Aug 12 2023, 10:40 AM
llvm/lib/Target/RISCV/RISCVFrameLowering.cpp
1315

Can we compute CompressLen for both RV32 and RV64 as XLen * 8? And then merge the 2 if statements?

1316

c.ldsp?

1321

return 256

1324

return 512

1326

return 2048 - StackAlign

lcvon007 updated this revision to Diff 549689.Aug 13 2023, 2:23 AM

submit codes for craig.topper's review opinion

craig.topper added inline comments.Aug 13 2023, 11:40 AM
llvm/lib/Target/RISCV/RISCVFrameLowering.cpp
1313

What about Zca?

1322

Use * 8. Let the compiler convert it to a shift.

lcvon007 marked 5 inline comments as done.Aug 13 2023, 7:41 PM

I just found a regression:

--- a/llvm/test/CodeGen/RISCV/stack-realignment.ll
+++ b/llvm/test/CodeGen/RISCV/stack-realignment.ll
@@ -1,7 +1,7 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc -mtriple=riscv32 -verify-machineinstrs < %s \
+; RUN: llc -mtriple=riscv32 -mattr=+c -verify-machineinstrs < %s \
 ; RUN:   | FileCheck %s -check-prefix=RV32I
-; RUN: llc -mtriple=riscv64 -verify-machineinstrs < %s \
+; RUN: llc -mtriple=riscv64 -mattr=+c -verify-machineinstrs < %s \
 ; RUN:   | FileCheck %s -check-prefix=RV64I
 
 declare void @callee(ptr)
@@ -529,56 +529,58 @@ define void @caller_no_realign2048() "no-realign-stack" {
 define void @caller4096() {
 ; RV32I-LABEL: caller4096:
 ; RV32I:       # %bb.0:
-; RV32I-NEXT:    addi sp, sp, -2032
-; RV32I-NEXT:    .cfi_def_cfa_offset 2032
-; RV32I-NEXT:    sw ra, 2028(sp) # 4-byte Folded Spill
-; RV32I-NEXT:    sw s0, 2024(sp) # 4-byte Folded Spill
+; RV32I-NEXT:    addi sp, sp, -256
+; RV32I-NEXT:    .cfi_def_cfa_offset 256
+; RV32I-NEXT:    sw ra, 252(sp) # 4-byte Folded Spill
+; RV32I-NEXT:    sw s0, 248(sp) # 4-byte Folded Spill
 ; RV32I-NEXT:    .cfi_offset ra, -4
 ; RV32I-NEXT:    .cfi_offset s0, -8
-; RV32I-NEXT:    addi s0, sp, 2032
+; RV32I-NEXT:    addi s0, sp, 256
 ; RV32I-NEXT:    .cfi_def_cfa s0, 0
-; RV32I-NEXT:    lui a0, 2
-; RV32I-NEXT:    addi a0, a0, -2032
+; RV32I-NEXT:    li a0, 31
+; RV32I-NEXT:    slli a0, a0, 8
 ; RV32I-NEXT:    sub sp, sp, a0
 ; RV32I-NEXT:    srli a0, sp, 12
 ; RV32I-NEXT:    slli sp, a0, 12
 ; RV32I-NEXT:    lui a0, 1
-; RV32I-NEXT:    add a0, sp, a0
+; RV32I-NEXT:    add a0, a0, sp
 ; RV32I-NEXT:    call callee@plt
 ; RV32I-NEXT:    lui a0, 2
 ; RV32I-NEXT:    sub sp, s0, a0
-; RV32I-NEXT:    addi a0, a0, -2032
+; RV32I-NEXT:    li a0, 31
+; RV32I-NEXT:    slli a0, a0, 8
 ; RV32I-NEXT:    add sp, sp, a0
-; RV32I-NEXT:    lw ra, 2028(sp) # 4-byte Folded Reload
-; RV32I-NEXT:    lw s0, 2024(sp) # 4-byte Folded Reload
-; RV32I-NEXT:    addi sp, sp, 2032
+; RV32I-NEXT:    lw ra, 252(sp) # 4-byte Folded Reload
+; RV32I-NEXT:    lw s0, 248(sp) # 4-byte Folded Reload
+; RV32I-NEXT:    addi sp, sp, 256
 ; RV32I-NEXT:    ret
 ;
 ; RV64I-LABEL: caller4096:
 ; RV64I:       # %bb.0:
-; RV64I-NEXT:    addi sp, sp, -2032
-; RV64I-NEXT:    .cfi_def_cfa_offset 2032
-; RV64I-NEXT:    sd ra, 2024(sp) # 8-byte Folded Spill
-; RV64I-NEXT:    sd s0, 2016(sp) # 8-byte Folded Spill
+; RV64I-NEXT:    addi sp, sp, -512
+; RV64I-NEXT:    .cfi_def_cfa_offset 512
+; RV64I-NEXT:    sd ra, 504(sp) # 8-byte Folded Spill
+; RV64I-NEXT:    sd s0, 496(sp) # 8-byte Folded Spill
 ; RV64I-NEXT:    .cfi_offset ra, -8
 ; RV64I-NEXT:    .cfi_offset s0, -16
-; RV64I-NEXT:    addi s0, sp, 2032
+; RV64I-NEXT:    addi s0, sp, 512
 ; RV64I-NEXT:    .cfi_def_cfa s0, 0
-; RV64I-NEXT:    lui a0, 2
-; RV64I-NEXT:    addiw a0, a0, -2032
+; RV64I-NEXT:    li a0, 15
+; RV64I-NEXT:    slli a0, a0, 9
 ; RV64I-NEXT:    sub sp, sp, a0
 ; RV64I-NEXT:    srli a0, sp, 12
 ; RV64I-NEXT:    slli sp, a0, 12
 ; RV64I-NEXT:    lui a0, 1
-; RV64I-NEXT:    add a0, sp, a0
+; RV64I-NEXT:    add a0, a0, sp
 ; RV64I-NEXT:    call callee@plt
 ; RV64I-NEXT:    lui a0, 2
 ; RV64I-NEXT:    sub sp, s0, a0
-; RV64I-NEXT:    addiw a0, a0, -2032
+; RV64I-NEXT:    li a0, 15
+; RV64I-NEXT:    slli a0, a0, 9
 ; RV64I-NEXT:    add sp, sp, a0
-; RV64I-NEXT:    ld ra, 2024(sp) # 8-byte Folded Reload
-; RV64I-NEXT:    ld s0, 2016(sp) # 8-byte Folded Reload
-; RV64I-NEXT:    addi sp, sp, 2032
+; RV64I-NEXT:    ld ra, 504(sp) # 8-byte Folded Reload
+; RV64I-NEXT:    ld s0, 496(sp) # 8-byte Folded Reload
+; RV64I-NEXT:    addi sp, sp, 512
 ; RV64I-NEXT:    ret
   %1 = alloca i8, align 4096
   call void @callee(ptr %1)

I think that is because we need to do a larger second stack adjustment if we do a small first stack adjustment.
So the impact on performance should be evalated so that we can decide whether this optimization should be enabled under -Os/-Oz only. :-)

It seems that compiler doesn't generate the best code, and it can generate addi a0, a0, -256 like addi a0, a0, -2032, but it generates li a0, 31, slli a0, a0, 8 => 31 << 8 = 2 * 4096 -256, @wangpc

I find that the reason why my optimization will add one extra instruction is that the difference in building large immediate 8192-2032 and 8192 - 512,
lui a0, 2, addiw a0, a0, -2032 is for 8192-2032
li a0, 15, slli a0, a0, 9 is for 8192-256,
and lui a0, 2 will be optimized later, so if build 8192 - 512 use lui a0, 2, addiw a0, a0, -256, the result will be similiar, so this regression may be avoided.

I show the codesize data here firstly and will provide the performace data later:

codesize:

llvm/lib/Target/RISCV/RISCVFrameLowering.cpp
1315

done, thanks for your nice advice

1316

done, add other related instructions case

1326

done

wangpc added inline comments.Aug 14 2023, 3:04 AM
llvm/lib/Target/RISCV/RISCVFrameLowering.cpp
1324

Do we really need this condition (StackSize <= RVCompressLen + 2048 || StackSize > 2048 * 3 - StackAlign)?

lcvon007 updated this revision to Diff 549901.Aug 14 2023, 6:12 AM
lcvon007 marked 3 inline comments as done.

Add Zca condition and change << into normal multiplication.

This comment was removed by lcvon007.
llvm/lib/Target/RISCV/RISCVFrameLowering.cpp
1324

Do we really need this condition (StackSize <= RVCompressLen + 2048 || StackSize > 2048 * 3 - StackAlign)?

As you see in the first version of implementation, it will add extra instructions(because the second SP amount may be too larger to use more instructions to build the immediate) if we don't add this condition, and the performance may regress in some cases, and it's better to remove this condition if we only want to optimize the codesize.

1324

Do we really need this condition (StackSize <= RVCompressLen + 2048 || StackSize > 2048 * 3 - StackAlign)?

lcvon007 added inline comments.Aug 14 2023, 6:29 AM
llvm/lib/Target/RISCV/RISCVFrameLowering.cpp
1322

done

1324

Do we really need this condition (StackSize <= RVCompressLen + 2048 || StackSize > 2048 * 3 - StackAlign)?

As you see in the first version of implementation, it will add extra instructions(because the second SP amount may be too larger to use more instructions to build the immediate) if we don't add this condition, and the performance may regress in some cases, and it's better to remove this condition if we only want to optimize the codesize.

lcvon007 marked an inline comment as done.Aug 14 2023, 6:34 AM
lcvon007 added inline comments.
llvm/lib/Target/RISCV/RISCVFrameLowering.cpp
1313

I agree with you about Zca, and have added it yet.

lcvon007 marked an inline comment as done.Aug 14 2023, 6:42 AM

For the program's performance, it has not got obvious profit or regression(I only test them in O2 and run with one copy about spec cpu2006), I also disassemblers each binary and see that they 're almost same except some cases like the follows(no extra instructions generated.): @wangpc

  1. auipc + jalr => jal(optimized version)
  2. li a7, 7, slli a7, a7, 0xb => lui a7, 0x4, addiw a7, a7, -528(optimized version)
  3. addi a0, a0, -104 => mv a0, a0
wangpc added inline comments.Aug 14 2023, 9:17 AM
llvm/lib/Target/RISCV/RISCVFrameLowering.cpp
1324

If so, then I think we can loose the condition when optimizing for size? What do you think about it?

lcvon007 added inline comments.Aug 14 2023, 4:40 PM
llvm/lib/Target/RISCV/RISCVFrameLowering.cpp
1324

yes, is it ok to add this feature in other commit? I actually have other ideas to decrease the code size too so I can try it together.

wangpc accepted this revision.Aug 14 2023, 8:35 PM

LGTM but please let @craig.topper do the final approval.

llvm/lib/Target/RISCV/RISCVFrameLowering.cpp
1324

Is 2048 * 3 an empirical value? I'm sorry that I can't figure out the reason.
If we want second adjustment to be larger than 2048 , should it be 2048*2-StackAlign?

llvm/test/CodeGen/RISCV/stack-inst-compress.mir
14

Nit: the LLVM IR in a MIR test can be just function stub, the function body can be removed.

This revision is now accepted and ready to land.Aug 14 2023, 8:35 PM
lcvon007 added inline comments.Aug 14 2023, 9:03 PM
llvm/test/CodeGen/RISCV/stack-inst-compress.mir
14

Do you mean I only provide only a decalare here? , like:
declare dso_local void @_Z18caller_small_stackv(), and the compiler will report error "basic block 'entry' is not defined in the function '_Z18caller_small_stackv'", so
does it need other change too or keep the body here as now?

wangpc added inline comments.Aug 14 2023, 9:31 PM
llvm/test/CodeGen/RISCV/stack-inst-compress.mir
14
--- a/llvm/test/CodeGen/RISCV/stack-inst-compress.mir
+++ b/llvm/test/CodeGen/RISCV/stack-inst-compress.mir
@@ -10,23 +10,13 @@
 --- |
   define dso_local void @_Z18caller_small_stackv() {
   entry:
-    %arr = alloca [517 x i32], align 4
-    call void @llvm.memset.p0.i64(ptr align 4 %arr, i8 0, i64 2068, i1 false)
-    %arraydecay = getelementptr inbounds [517 x i32], ptr %arr, i64 0, i64 0
-    call void @_Z6calleePi(ptr noundef %arraydecay)
     ret void
   }
 
-  declare void @llvm.memset.p0.i64(ptr nocapture writeonly, i8, i64, i1 immarg)
-
   declare dso_local void @_Z6calleePi(ptr noundef)
 
   define dso_local void @_Z19caller_larger_stackv() {
   entry:
-    %arr = alloca [1536 x i32], align 4
-    call void @llvm.memset.p0.i64(ptr align 4 %arr, i8 0, i64 6144, i1 false)
-    %arraydecay = getelementptr inbounds [1536 x i32], ptr %arr, i64 0, i64 0
-    call void @_Z6calleePi(ptr noundef %arraydecay)
     ret void
   }
 
@@ -40,7 +30,7 @@ frameInfo:
   hasCalls:        true
   localFrameSize:  2068
 stack:
-  - { id: 0, name: arr, size: 2068, alignment: 4, local-offset: -2068 }
+  - { id: 0, size: 2068, alignment: 4, local-offset: -2068 }
   - { id: 1, type: spill-slot, size: 8, alignment: 8 }
 machineFunctionInfo:
   varArgsFrameIndex: 0
@@ -93,7 +83,7 @@ body:             |
     ADJCALLSTACKDOWN 0, 0, implicit-def dead $x2, implicit $x2
     renamable $x10 = LUI 1
     renamable $x12 = ADDIW killed renamable $x10, -2028
-    renamable $x10 = ADDI %stack.0.arr, 0
+    renamable $x10 = ADDI %stack.0, 0
     SD $x10, %stack.1, 0 :: (store (s64) into %stack.1)
     renamable $x11 = COPY $x0
     PseudoCALL target-flags(riscv-plt) &memset, csr_ilp32_lp64, implicit-def dead $x1, implicit killed $x10, implicit killed $x11, implicit killed $x12, implicit-def $x2, implicit-def $x10
@@ -115,7 +105,7 @@ frameInfo:
   hasCalls:        true
   localFrameSize:  6144
 stack:
-  - { id: 0, name: arr, size: 6144, alignment: 4, local-offset: -6144 }
+  - { id: 0, size: 6144, alignment: 4, local-offset: -6144 }
   - { id: 1, type: spill-slot, size: 8, alignment: 8 }
 machineFunctionInfo:
   varArgsFrameIndex: 0
@@ -184,7 +174,7 @@ body:             |
     ADJCALLSTACKDOWN 0, 0, implicit-def dead $x2, implicit $x2
     renamable $x10 = ADDI $x0, 3
     renamable $x12 = SLLI killed renamable $x10, 11
-    renamable $x10 = ADDI %stack.0.arr, 0
+    renamable $x10 = ADDI %stack.0, 0
     SD $x10, %stack.1, 0 :: (store (s64) into %stack.1)
     renamable $x11 = COPY $x0
     PseudoCALL target-flags(riscv-plt) &memset, csr_ilp32_lp64, implicit-def dead $x1, implicit killed $x10, implicit killed $x11, implicit killed $x12, implicit-def $x2, implicit-def $x10

And if you are using utils/update_mir_test_checks.py to generate CHECKs, don't remove the output lines. Or you should remove the line 1:

# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 2
lcvon007 updated this revision to Diff 550192.Aug 14 2023, 10:15 PM

NFC, update testcase to simplify it.

lcvon007 added inline comments.Aug 14 2023, 10:19 PM
llvm/test/CodeGen/RISCV/stack-inst-compress.mir
14

Done, I use update_mir_test_checks.py and remove some checkers, and I have removed the line 1 as you suggest, thanks very much.

lcvon007 updated this revision to Diff 550252.Aug 15 2023, 4:17 AM

rebase main branch

lcvon007 updated this revision to Diff 550258.Aug 15 2023, 4:38 AM

rebase main

lcvon007 marked an inline comment as done.Aug 15 2023, 7:34 AM
lcvon007 marked 3 inline comments as done.
lcvon007 added inline comments.Aug 15 2023, 8:30 AM
llvm/lib/Target/RISCV/RISCVFrameLowering.cpp
1324

[2048, 2048+512]: before opt: addi + addi , after opt: addi + addi
(2048 + 512, 2048 * 2 - StackAlign] : before opt: addi + addi, after opt: addi + addi + addi : not good
(2048 * 2 -stackAlign, 2048 * 2 + 512] before opt: addi + addi + addi, after opt : addi + addi + addi
(2048 * 2 + 512, 2048 * 3 -StackAlign] : before opt: addi + addi + addi, after opt: addi + lui + addi + addi sp, sp, x?
(2048 * 3 -StackAlign, +inf): before: addi + lui+ addi + addi +addi sp, sp, x? , after opt: addi + lui+ addi + addi sp, sp, x?
it's not good to use '2048*2 -StackAlign' but I may add extra condition like
if (StackSize <= RVCompressLen + 2048 || (StackSize > 2048 * 2 -stackAlign && StackSize <=2048 * 2 + 512) || StackSize > 2048 * 3 -StackAlign )

lcvon007 updated this revision to Diff 550356.Aug 15 2023, 9:07 AM

add extra compression condition.

wangpc added inline comments.Aug 15 2023, 7:58 PM
llvm/lib/Target/RISCV/RISCVFrameLowering.cpp
1324

OK, I get it. Can we add more tests for different stack sizes as you said? And some suggestions on test: 1) MIRs in test can be further reduced (for example , the call to memset is really needed?), 2) rename functions to stack_size_n where n is the stack size.

lcvon007 marked an inline comment as done.Aug 15 2023, 8:55 PM
lcvon007 updated this revision to Diff 550612.Aug 15 2023, 10:21 PM

refine the function name in testcases and add one extra tests for
stacksize that is in (2048 * 2 -StackAlign, 2048 * 2 + 512].

lcvon007 added inline comments.Aug 15 2023, 10:29 PM
llvm/lib/Target/RISCV/RISCVFrameLowering.cpp
1324
  1. I have added the new function in testcases for new added condition; 2. remove the memset and rename function;

I have two questions and ask for your help,

  1. I find the CI has failed and it's ok in local, and do you know the reason why it fails?
  2. I don't have commit access now, and may you help me submit this patch when it's ok?

thanks very much for you.

hi @craig.topper , the review opinions you say have been solved yet, may you help check if this commit is ok now?

craig.topper added inline comments.Aug 16 2023, 11:56 AM
llvm/lib/Target/RISCV/RISCVFrameLowering.cpp
1304

This comment is about the code after the compressed handling. Should it be moved down?

1324

Isn't CSI.size() the number of registers being saved? RVCompressLen is 256 or 512. Won't CSI.size() always be less than that? They aren't the same units.

lcvon007 updated this revision to Diff 550962.Aug 16 2023, 7:27 PM

add more comments in the code and remove the CSI.size check as craig.topper says.

lcvon007 added inline comments.Aug 16 2023, 7:30 PM
llvm/lib/Target/RISCV/RISCVFrameLowering.cpp
1304

I add 'at most' in the comment but I don't move down them, because 2048-StackAlign is the background for the following codes, so I think it's better to leave them here as before, and I add more comments, is it ok?

lcvon007 marked an inline comment as done.Aug 16 2023, 7:32 PM
lcvon007 added inline comments.Aug 16 2023, 7:35 PM
llvm/lib/Target/RISCV/RISCVFrameLowering.cpp
1324

you're right and they're the same units, and CSI.size is always less than RVCompressLen, I have removed this check , thanks for your advice.

craig.topper added inline comments.Aug 16 2023, 11:10 PM
llvm/lib/Target/RISCV/RISCVFrameLowering.cpp
1324

I think 2048 * 2 - StackAlign was not addi, addi with the old code. The FirstSPAdjust would be 2048-StackAlign or 2040. That leaves 2048 left, that can't be represented by an ADDI. So I think 2048 * 2 - StackAlign can use the compressed code?

I didn't recheck the other boundaries yet.

I think 2048 * 2 + 512 is not 3 addis with the new code. After adding 512, we can't do the remaining 4096 in two addis. We can only do a maximum of 4094 with 2 addis.

craig.topper requested changes to this revision.Aug 16 2023, 11:13 PM
This revision now requires changes to proceed.Aug 16 2023, 11:13 PM
lcvon007 updated this revision to Diff 551045.Aug 17 2023, 1:50 AM

update 2048 into 2047 considering the positive offset of addi.

lcvon007 added a comment.EditedAug 17 2023, 1:51 AM

I think 2048 * 2 + 512 is not 3 addis with the new code. After adding 512, we can't do the remaining 4096 in two addis. We can only do a maximum of 4094 with 2 addis.

yes, I did not consider the positive offset is only 2047, I think we can list the condition like the following(Use 512 of RVCompressLen as example):
the instuctions include prologue and epilogue together.

  1. [2048, 2047+512]: before opt: (addi + addi) x 2 , after opt: (addi + addi) x2: good
  2. 2048 + 512: before opt: (addi + addi) x 2, after opt: (addi + addi) + addi + addi + addi : not good
  3. (2048 + 512, 2048 - StackAlign + 2047] : before opt: (addi + addi) x 2, after opt: (addi + addi + addi) x 2 : not good
  4. 2048 -StackAlign + 2048 : before opt: (addi + addi) + (addi + addi + addi), after opt: (addi + addi + addi) x2: not good
  5. (2048 -stackAlign + 2048, 2047 * 2 + 512] before opt: (addi + addi + addi) x 2, after opt : (addi + addi + addi) x 2: good
  6. (2047 * 2 + 512, 2048 * 2 + 512]: before opt: (addi + addi + addi) x 2, after opt: (addi + addi + addi) + (addi + LUI + addi + add) not good
  7. (2048 * 2 + 512, 2048 -StackAlign + 2047 * 2] : before opt: (addi + addi + addi) x 2, after opt: (addi + lui + addi + add) x 2: not good
  8. 2048 -StackAlign + 2048 + 2047: before opt: (addi + addi + addi) + addi + lui + addi + add, after opt: (addi + lui + addi + add) x 2: not good
  9. 2048 - StackAlign + 2048 + 2048: before opt: (addi + addi + addi) + addi + lui + addi + add, after opt: (addi + lui + addi + add) x 2: not good

10: (2048 * 3 -StackAlign, +inf): before: (addi + lui+ addi + addi +add) * 2, after opt: (addi + lui+ addi + add) x 2 : good
so the new conditions need to be:
if (StackSize <= 2047+ RVCompressLen || (StackSize > 2048 * 2- StackAlign && StackSize <= 2047 * 2 + RVCompressLen) || StackSize > 2048 * 3 -StackAlign), and I have adjusted it, please help review(thanks very much).

This revision is now accepted and ready to land.Aug 17 2023, 8:03 AM
lcvon007 marked 2 inline comments as done.Aug 17 2023, 6:18 PM
lcvon007 marked an inline comment as done.Aug 17 2023, 6:21 PM
lcvon007 added inline comments.
llvm/lib/Target/RISCV/RISCVFrameLowering.cpp
1324

exclude 2048 * 2 -StackAlign yet

lcvon007 marked 2 inline comments as done.Aug 17 2023, 6:21 PM
This revision was automatically updated to reflect the committed changes.