Page MenuHomePhabricator

[RISCV] Set CostPerUse for floating point registers
Needs ReviewPublic

Authored by pcwang-thead on Jan 24 2022, 2:00 AM.

Details

Summary

Set CostPerUse to 1 for floating point registers when RVC is
enabled so that more compressed instructions will be generated.

Code size and performance have some improvements.

SPEC FP 2006 (On Allwinner's D1 chip, with XuanTie C906):

                 Code size     Performance
453.povray        -1.145%        +7.926%
433.milc             -           +1.399%
450.soplex        -0.905%        +1.177%
470.lbm              -           +0.188%
444.namd          -1.882%        +0.124%
447.dealII        -0.440%        +0.053%
482.sphinx3          -           -1.569%

For CSiBE, we reduced 1%-5% code size under -Oz.

Diff Detail

Event Timeline

pcwang-thead created this revision.Jan 24 2022, 2:00 AM
pcwang-thead requested review of this revision.Jan 24 2022, 2:00 AM
Herald added a project: Restricted Project. · View Herald TranscriptJan 24 2022, 2:00 AM
asb added a comment.Jan 24 2022, 4:45 AM

This is causing multiple compiler-time failures (assertions) for me on the GCC torture suite. e.g. 930608-1.c for rv32imafdc with the ilp32 ABI at O{1,2,3,s}. pr44942.c fails similarly for rv32imafdc with the ilp32d ABI at O{1,2,3,s}. It seems likely it's unmasking a bug elsewhere, though I haven't done any more delving.

clang: /home/asb/llvm-project/llvm/include/llvm/CodeGen/LiveInterval.h:378: llvm::SlotIndex llvm::LiveRange::beginIndex() const: Assertion `!empty() && "Call to beginIndex() on empty range."' failed.
PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace, preprocessed source, and associated run script.
Stack dump:
0.	Program arguments: /home/asb/llvm-project/build/default/bin/clang -cc1 -triple riscv32-unknown-linux-gnu -S -save-temps=obj -disable-free -clear-ast-before-backend -main-file-name 930608-1.c -mrelocation-model static -mframe-pointer=none -fmath-errno -ffp-contract=on -fno-rounding-math -mconstructor-aliases -target-feature +m -target-feature +a -target-feature +f -target-feature +d -target-feature +c -target-feature +relax -target-feature -save-restore -target-abi ilp32 -msmall-data-limit 8 -mllvm -treat-scalable-fixed-error-as-warning -debugger-tuning=gdb -fcoverage-compilation-dir=/home/asb/torture -resource-dir /home/asb/llvm-project/build/release/lib/clang/14.0.0 -O1 -fdebug-compilation-dir=/home/asb/torture -ferror-limit 19 -fno-signed-char -fgnuc-version=4.2.1 -Qn -faddrsig -o ./output_rv32imafdc_ilp32_O1/930608-1.s 930608-1.c
1.	<eof> parser at end of file
2.	Code generation
3.	Running pass 'Function Pass Manager' on module '930608-1.c'.
4.	Running pass 'Greedy Register Allocator' on function '@f'
 #0 0x00007f8398acc207 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) /home/asb/llvm-project/llvm/lib/Support/Unix/Signals.inc:565:11
 #1 0x00007f8398acc41b PrintStackTraceSignalHandler(void*) /home/asb/llvm-project/llvm/lib/Support/Unix/Signals.inc:632:1
 #2 0x00007f8398aca7d7 llvm::sys::RunSignalHandlers() /home/asb/llvm-project/llvm/lib/Support/Signals.cpp:97:5
 #3 0x00007f8398accc31 SignalHandler(int) /home/asb/llvm-project/llvm/lib/Support/Unix/Signals.inc:0:3
 #4 0x00007f83a1c01870 __restore_rt sigaction.c:0:0
 #5 0x00007f839815fd22 raise (/usr/lib/libc.so.6+0x3cd22)
 #6 0x00007f8398149862 abort (/usr/lib/libc.so.6+0x26862)
 #7 0x00007f8398149747 _nl_load_domain.cold loadmsgcat.c:0:0
 #8 0x00007f8398158616 (/usr/lib/libc.so.6+0x35616)
 #9 0x00007f839f29bbf4 llvm::LiveRange::beginIndex() const /home/asb/llvm-project/llvm/include/llvm/CodeGen/LiveInterval.h:0:7
#10 0x00007f839f2aa9ce llvm::LiveIntervals::intervalIsInOneMBB(llvm::LiveInterval const&) const /home/asb/llvm-project/llvm/lib/CodeGen/LiveIntervals.cpp:837:24
#11 0x00007f839f702144 llvm::DefaultEvictionAdvisor::canEvictInterferenceBasedOnCost(llvm::LiveInterval&, llvm::MCRegister, bool, llvm::EvictionCost&, llvm::SmallSet<llvm::Register, 16u, std::less<llvm::Register> > const&) const /home/asb/llvm-project/llvm/lib/CodeGen/RegAllocGreedy.cpp:500:23
#12 0x00007f839f703368 llvm::DefaultEvictionAdvisor::tryFindEvictionCandidate(llvm::LiveInterval&, llvm::AllocationOrder const&, unsigned char, llvm::SmallSet<llvm::Register, 16u, std::less<llvm::Register> > const&) const /home/asb/llvm-project/llvm/lib/CodeGen/RegAllocGreedy.cpp:783:9
#13 0x00007f839f7019a8 llvm::RAGreedy::tryEvict(llvm::LiveInterval&, llvm::AllocationOrder&, llvm::SmallVectorImpl<llvm::Register>&, unsigned char, llvm::SmallSet<llvm::Register, 16u, std::less<llvm::Register> > const&) /home/asb/llvm-project/llvm/lib/CodeGen/RegAllocGreedy.cpp:809:39
#14 0x00007f839f7012c7 llvm::RAGreedy::tryAssign(llvm::LiveInterval&, llvm::AllocationOrder&, llvm::SmallVectorImpl<llvm::Register>&, llvm::SmallSet<llvm::Register, 16u, std::less<llvm::Register> > const&) /home/asb/llvm-project/llvm/lib/CodeGen/RegAllocGreedy.cpp:408:25
#15 0x00007f839f70bd31 llvm::RAGreedy::selectOrSplitImpl(llvm::LiveInterval&, llvm::SmallVectorImpl<llvm::Register>&, llvm::SmallSet<llvm::Register, 16u, std::less<llvm::Register> >&, unsigned int) /home/asb/llvm-project/llvm/lib/CodeGen/RegAllocGreedy.cpp:2629:11
#16 0x00007f839f70c79d llvm::RAGreedy::selectOrSplit(llvm::LiveInterval&, llvm::SmallVectorImpl<llvm::Register>&) /home/asb/llvm-project/llvm/lib/CodeGen/RegAllocGreedy.cpp:2361:20
#17 0x00007f839f70c94c non-virtual thunk to llvm::RAGreedy::selectOrSplit(llvm::LiveInterval&, llvm::SmallVectorImpl<llvm::Register>&) /home/asb/llvm-project/llvm/lib/CodeGen/RegAllocGreedy.cpp:0:0
#18 0x00007f839f6e1ffd llvm::RegAllocBase::allocatePhysRegs() /home/asb/llvm-project/llvm/lib/CodeGen/RegAllocBase.cpp:112:35
#19 0x00007f839f70f7cb llvm::RAGreedy::runOnMachineFunction(llvm::MachineFunction&) /home/asb/llvm-project/llvm/lib/CodeGen/RegAllocGreedy.cpp:2942:3
#20 0x00007f839f42c4cc llvm::MachineFunctionPass::runOnFunction(llvm::Function&) /home/asb/llvm-project/llvm/lib/CodeGen/MachineFunctionPass.cpp:72:8
#21 0x00007f8399c7a6fd llvm::FPPassManager::runOnFunction(llvm::Function&) /home/asb/llvm-project/llvm/lib/IR/LegacyPassManager.cpp:1435:23
#22 0x00007f8399c7fcdf llvm::FPPassManager::runOnModule(llvm::Module&) /home/asb/llvm-project/llvm/lib/IR/LegacyPassManager.cpp:1481:16
#23 0x00007f8399c7b0aa (anonymous namespace)::MPPassManager::runOnModule(llvm::Module&) /home/asb/llvm-project/llvm/lib/IR/LegacyPassManager.cpp:1550:23
#24 0x00007f8399c7abbd llvm::legacy::PassManagerImpl::run(llvm::Module&) /home/asb/llvm-project/llvm/lib/IR/LegacyPassManager.cpp:540:16
#25 0x00007f8399c7ffe1 llvm::legacy::PassManager::run(llvm::Module&) /home/asb/llvm-project/llvm/lib/IR/LegacyPassManager.cpp:1677:3
#26 0x00007f83a0076ef5 (anonymous namespace)::EmitAssemblyHelper::RunCodegenPipeline(clang::BackendAction, std::unique_ptr<llvm::raw_pwrite_stream, std::default_delete<llvm::raw_pwrite_stream> >&, std::unique_ptr<llvm::ToolOutputFile, std::default_delete<llvm::ToolOutputFile> >&) /home/asb/llvm-project/clang/lib/CodeGen/BackendUtil.cpp:1535:3
#27 0x00007f83a006ecb4 (anonymous namespace)::EmitAssemblyHelper::EmitAssembly(clang::BackendAction, std::unique_ptr<llvm::raw_pwrite_stream, std::default_delete<llvm::raw_pwrite_stream> >) /home/asb/llvm-project/clang/lib/CodeGen/BackendUtil.cpp:1566:7
#28 0x00007f83a006d3f5 clang::EmitBackendOutput(clang::DiagnosticsEngine&, clang::HeaderSearchOptions const&, clang::CodeGenOptions const&, clang::TargetOptions const&, clang::LangOptions const&, llvm::StringRef, llvm::Module*, clang::BackendAction, std::unique_ptr<llvm::raw_pwrite_stream, std::default_delete<llvm::raw_pwrite_stream> >) /home/asb/llvm-project/clang/lib/CodeGen/BackendUtil.cpp:1727:5
#29 0x00007f83a083b1ef clang::BackendConsumer::HandleTranslationUnit(clang::ASTContext&) /home/asb/llvm-project/clang/lib/CodeGen/CodeGenAction.cpp:370:7
#30 0x00007f8392b908ef clang::ParseAST(clang::Sema&, bool, bool) /home/asb/llvm-project/clang/lib/Parse/ParseAST.cpp:178:12
#31 0x00007f839d4b1a2c clang::ASTFrontendAction::ExecuteAction() /home/asb/llvm-project/clang/lib/Frontend/FrontendAction.cpp:1076:1
#32 0x00007f83a08342eb clang::CodeGenAction::ExecuteAction() /home/asb/llvm-project/clang/lib/CodeGen/CodeGenAction.cpp:1107:5
#33 0x00007f839d4b1359 clang::FrontendAction::Execute() /home/asb/llvm-project/clang/lib/Frontend/FrontendAction.cpp:971:7
#34 0x00007f839d3b669f clang::CompilerInstance::ExecuteAction(clang::FrontendAction&) /home/asb/llvm-project/clang/lib/Frontend/CompilerInstance.cpp:1030:23
#35 0x00007f83a1bd1c44 clang::ExecuteCompilerInvocation(clang::CompilerInstance*) /home/asb/llvm-project/clang/lib/FrontendTool/ExecuteCompilerInvocation.cpp:261:8
#36 0x00005607d246ddcc cc1_main(llvm::ArrayRef<char const*>, char const*, void*) /home/asb/llvm-project/clang/tools/driver/cc1_main.cpp:246:13
#37 0x00005607d246018a ExecuteCC1Tool(llvm::SmallVectorImpl<char const*>&) /home/asb/llvm-project/clang/tools/driver/driver.cpp:317:5
#38 0x00005607d245f1d9 main /home/asb/llvm-project/clang/tools/driver/driver.cpp:388:5
#39 0x00007f839814ab25 __libc_start_main (/usr/lib/libc.so.6+0x27b25)
#40 0x00005607d245e9be _start (/home/asb/llvm-project/build/default/bin/clang+0x3d9be)
Aborted (core dumped)

This is causing multiple compiler-time failures (assertions) for me on the GCC torture suite. e.g. 930608-1.c for rv32imafdc with the ilp32 ABI at O{1,2,3,s}. pr44942.c fails similarly for rv32imafdc with the ilp32d ABI at O{1,2,3,s}. It seems likely it's unmasking a bug elsewhere, though I haven't done any more delving.

That's weird, I will have a look.

Thanks.

This is causing multiple compiler-time failures (assertions) for me on the GCC torture suite. e.g. 930608-1.c for rv32imafdc with the ilp32 ABI at O{1,2,3,s}. pr44942.c fails similarly for rv32imafdc with the ilp32d ABI at O{1,2,3,s}. It seems likely it's unmasking a bug elsewhere, though I haven't done any more delving.

That's weird, I will have a look.

Thanks.

I may have fixed this bug (these two GCC torture tests are OK at least ) in D118124.

But I have the same question as MatzeB said, why will it generate such code?

For example, f in 930608-1.c only contains one return instruction with GCC for RISCV or clang for ARM/AArch64 target:

double f (double a) {}
f:
	ret

And below is what we have now:

f: 
	addi	sp, sp, -16
	fsd	fs1, 8(sp)
	lw	a0, 8(sp)
	lw	a1, 12(sp)
	addi	sp, sp, 16
	ret

In the MIR level, we can eliminate IMPLICIT_DEFs after processimpdefs pass:

bb.0.entry:
  %4:fpr64 = IMPLICIT_DEF                                             # Eliminated
  FSD killed %4:fpr64, %stack.0, 0 :: (store (s64) into %stack.0)     # -> FSD killed undef %4:fpr64, %stack.0, 0 :: (store (s64) into %stack.0)
  %2:gpr = LW %stack.0, 0 :: (load (s32) from %stack.0, align 8)
  %3:gpr = LW %stack.0, 4 :: (load (s32) from %stack.0 + 4, basealign 8)
  $x10 = COPY %2:gpr
  $x11 = COPY %3:gpr
  PseudoRET implicit $x10, implicit $x11

But we can't eliminate these dead stores.

I guess there may have some problems in handling BuildPairF64 and SplitF64.

This is causing multiple compiler-time failures (assertions) for me on the GCC torture suite. e.g. 930608-1.c for rv32imafdc with the ilp32 ABI at O{1,2,3,s}. pr44942.c fails similarly for rv32imafdc with the ilp32d ABI at O{1,2,3,s}. It seems likely it's unmasking a bug elsewhere, though I haven't done any more delving.

That's weird, I will have a look.

Thanks.

I may have fixed this bug (these two GCC torture tests are OK at least ) in D118124.

But I have the same question as MatzeB said, why will it generate such code?

For example, f in 930608-1.c only contains one return instruction with GCC for RISCV or clang for ARM/AArch64 target:

double f (double a) {}
f:
	ret

And below is what we have now:

f: 
	addi	sp, sp, -16
	fsd	fs1, 8(sp)
	lw	a0, 8(sp)
	lw	a1, 12(sp)
	addi	sp, sp, 16
	ret

In the MIR level, we can eliminate IMPLICIT_DEFs after processimpdefs pass:

bb.0.entry:
  %4:fpr64 = IMPLICIT_DEF                                             # Eliminated
  FSD killed %4:fpr64, %stack.0, 0 :: (store (s64) into %stack.0)     # -> FSD killed undef %4:fpr64, %stack.0, 0 :: (store (s64) into %stack.0)
  %2:gpr = LW %stack.0, 0 :: (load (s32) from %stack.0, align 8)
  %3:gpr = LW %stack.0, 4 :: (load (s32) from %stack.0 + 4, basealign 8)
  $x10 = COPY %2:gpr
  $x11 = COPY %3:gpr
  PseudoRET implicit $x10, implicit $x11

But we can't eliminate these dead stores.

I guess there may have some problems in handling BuildPairF64 and SplitF64.

We probably need a DAG combine to turn SplitF64 undef into a pair of integer undefs.

asb added a comment.Feb 17 2022, 7:55 AM

I'm surprised this resulted in performance increases. I might have guessed that with so few FP instructions being compressible, the further constraint on register selection might be more likely to result in a (slight) decrease in performance. Shows the value of running the benchmarks!

I've put this patch on the agenda for the RISC-V LLVM call today, but based on the data so far this seems to make sense.

I'm surprised this resulted in performance increases. I might have guessed that with so few FP instructions being compressible, the further constraint on register selection might be more likely to result in a (slight) decrease in performance. Shows the value of running the benchmarks!

I've put this patch on the agenda for the RISC-V LLVM call today, but based on the data so far this seems to make sense.

I am surprised too.

IMO, there is a possible reason that may explain the performance increases:

  • When register number is in [8, 15], instructions can be compressed.
  • For the first 16 integer registers, registers x0-x4(and sometimes x5) are reserved for special usage, and the register allocation orders are like below:
def GPR : RegisterClass<"RISCV", [XLenVT], 32, (add
    (sequence "X%u", 10, 17),
    (sequence "X%u", 5, 7),
    (sequence "X%u", 28, 31),
    (sequence "X%u", 8, 9),
    (sequence "X%u", 18, 27),
    (sequence "X%u", 0, 4)
  )> {
  let RegInfos = XLenRI;
}

which means we will allocates most RVC integer registers first.
So, for most programs, there is minimal difference whether we set CostPerUse to 0 or 1.

  • For the first 16 float registers, there is no reserved register, and the register allocation orders are like below:
def FPR32 : RegisterClass<"RISCV", [f32], 32, (add
    (sequence "F%u_F", 0, 7),
    (sequence "F%u_F", 10, 17),
    (sequence "F%u_F", 28, 31),
    (sequence "F%u_F", 8, 9),
    (sequence "F%u_F", 18, 27)
)>;

which means we will allocates temporary float registers first and most float instructions can't be compressed.
So when we set CostPerUse to 1, a lot of float instructions can be compressed, which results in improvements on icache misses.

I'm surprised this resulted in performance increases. I might have guessed that with so few FP instructions being compressible, the further constraint on register selection might be more likely to result in a (slight) decrease in performance. Shows the value of running the benchmarks!

I've put this patch on the agenda for the RISC-V LLVM call today, but based on the data so far this seems to make sense.

I am surprised too.

IMO, there is a possible reason that may explain the performance increases:

  • When register number is in [8, 15], instructions can be compressed.
  • For the first 16 integer registers, registers x0-x4(and sometimes x5) are reserved for special usage, and the register allocation orders are like below:
def GPR : RegisterClass<"RISCV", [XLenVT], 32, (add
    (sequence "X%u", 10, 17),
    (sequence "X%u", 5, 7),
    (sequence "X%u", 28, 31),
    (sequence "X%u", 8, 9),
    (sequence "X%u", 18, 27),
    (sequence "X%u", 0, 4)
  )> {
  let RegInfos = XLenRI;
}

which means we will allocates most RVC integer registers first.
So, for most programs, there is minimal difference whether we set CostPerUse to 0 or 1.

  • For the first 16 float registers, there is no reserved register, and the register allocation orders are like below:
def FPR32 : RegisterClass<"RISCV", [f32], 32, (add
    (sequence "F%u_F", 0, 7),
    (sequence "F%u_F", 10, 17),
    (sequence "F%u_F", 28, 31),
    (sequence "F%u_F", 8, 9),
    (sequence "F%u_F", 18, 27)
)>;

which means we will allocates temporary float registers first and most float instructions can't be compressed.
So when we set CostPerUse to 1, a lot of float instructions can be compressed, which results in improvements on icache misses.

Our confusion is that there are only 8 float opcodes that have compressed forms. They are all loads/stores and 4 of them are limited to RV32.

I ran 453.povray from SPEC2006 on a SiFive Unmatched board with this patch applied to our downstream compiler. My result was a 1% decrease in performance.

Should we look at what happens if we change the allocation order to use compressible argument registers first without changing the cost?

Our confusion is that there are only 8 float opcodes that have compressed forms. They are all loads/stores and 4 of them are limited to RV32.

I ran 453.povray from SPEC2006 on a SiFive Unmatched board with this patch applied to our downstream compiler. My result was a 1% decrease in performance.

I ran the whole SPEC2006 FP for several times and got the same performance increases(about 1% under geometric mean).
The reason why we got different results may be that there are some differences between micro-archs and downstreams(not for sure). I will run it again with pure upstream later.

Should we look at what happens if we change the allocation order to use compressible argument registers first without changing the cost?

Yes I think it makes sense. I will have a try.
@craig.topper

Results of running SPEC2006 FP with upstream Clang/LLVM (based on 8ad6d5e465bba198c883e699c28690b0ea79400d) and options are -march=rv64imafdc -mabi=lp64d without micro-arch specific scheduling.
All benchmarks ran 10 times and we compared the arithmetic average run times.

  1. Set CostPerUse to 1:
BenchmarkPerformance
447.dealII+2.087%
453.povray+1.354%
450.soplex+0.8854%
482.sphinx3+0.8112%
433.milc+0.743%
470.lbm+0.1381%
444.namd-0.714%

Geometric mean: +0.7544%

  1. Allocate argument floating-point register first:
BenchmarkPerformance
453.povray+1.866%
450.soplex+0.6809%
447.dealII+0.1095%
433.milc-0.01403%
444.namd-0.1019%
470.lbm-0.4855%
482.sphinx3-0.6987%

Geometric mean: +0.1906%

  1. Both 1 and 2:
BenchmarkPerformance
482.sphinx3+1.006%
433.milc+0.9639%
450.soplex+0.88%
453.povray+0.6788%
447.dealII+0.04175%
470.lbm-0.118%
444.namd-0.4662%

Geometric mean: +0.425%

All of them have code size reductions.

@craig.topper @asb

Results of running SPEC2006 FP with upstream Clang/LLVM (based on 8ad6d5e465bba198c883e699c28690b0ea79400d) and options are -march=rv64imafdc -mabi=lp64d without micro-arch specific scheduling.
All benchmarks ran 10 times and we compared the arithmetic average run times.

  1. Set CostPerUse to 1:
BenchmarkPerformance
447.dealII+2.087%
453.povray+1.354%
450.soplex+0.8854%
482.sphinx3+0.8112%
433.milc+0.743%
470.lbm+0.1381%
444.namd-0.714%

Geometric mean: +0.7544%

  1. Allocate argument floating-point register first:
BenchmarkPerformance
453.povray+1.866%
450.soplex+0.6809%
447.dealII+0.1095%
433.milc-0.01403%
444.namd-0.1019%
470.lbm-0.4855%
482.sphinx3-0.6987%

Geometric mean: +0.1906%

  1. Both 1 and 2:
BenchmarkPerformance
482.sphinx3+1.006%
433.milc+0.9639%
450.soplex+0.88%
453.povray+0.6788%
447.dealII+0.04175%
470.lbm-0.118%
444.namd-0.4662%

Geometric mean: +0.425%

All of them have code size reductions.

@craig.topper @asb

What optimization level was this? Was fast-math enabled?

Herald added a project: Restricted Project. · View Herald TranscriptMar 4 2022, 6:42 PM

What optimization level was this? Was fast-math enabled?

O2 and -ffast-math -fno-unroll-loops.

We disabled unroll-loops because it may remain some problems under cover.

asb added a comment.Mar 17 2022, 9:23 AM

@pcwang-thread: could you please post the version that just changes the preferred allocation order for review? We had a brief discussion in the RISC-V LLVM call and think that if that change is a positive improvement for you it may be simpler to just land that, initially at least. Did you have code size measurements for that too?

@pcwang-thread: could you please post the version that just changes the preferred allocation order for review? We had a brief discussion in the RISC-V LLVM call and think that if that change is a positive improvement for you it may be simpler to just land that, initially at least. Did you have code size measurements for that too?

Thanks! I have uploaded the patch (sees D122209).

Code size has mild improvements under O2 and Oz :
O2:

BenchmarkCode size
450.soplex-0.064%
444.namd-0.093%
447.dealII-0.159%
453.povray-0.206%

Oz:

BenchmarkCode size
483.xalancbmk-0.033%
464.h264ref-0.085%
447.dealII-0.119%
453.povray-0.14%