
[PowerPC] Materialize special ConstantFP using instructions instead of load from TOC
AbandonedPublic

Authored by tingwang on Jan 26 2022, 10:24 PM.

Details

Reviewers
nemanjai
jsji
shchenz
qiucf
Group Reviewers
Restricted Project
Summary

A ConstantFP in the range [-16, 15] (excluding 0.0, which is handled by XXLXOR) that converts exactly to an integer can be materialized into a register using VSPLTISW and XVCVSXWDP. This reduces TOC usage, eliminates a memory reference, and may save some TOC pointer calculations.

A ConstantFP is normally expanded into a load from the TOC during ExpandNode. This patch identifies the opportunity before expansion, and performs the manual instruction selection later. The LIT tests show this patch exposes some other issues; please see my comments on the test scripts.

Since Power10 uses prefixed instructions to materialize a ConstantFP, this change should perhaps target Power9.

Diff Detail

Unit Tests: Failed

Time / Test
60,090 ms: x64 debian > AddressSanitizer-x86_64-linux.TestCases::scariness_score_test.cpp
Script: -- : 'RUN: at line 4'; /var/lib/buildkite-agent/builds/llvm-project/build/./bin/clang --driver-mode=g++ -fsanitize=address -mno-omit-leaf-frame-pointer -fno-omit-frame-pointer -fno-optimize-sibling-calls -gline-tables-only -m64 -O0 /var/lib/buildkite-agent/builds/llvm-project/compiler-rt/test/asan/TestCases/scariness_score_test.cpp -o /var/lib/buildkite-agent/builds/llvm-project/build/projects/compiler-rt/test/asan/X86_64LinuxConfig/TestCases/Output/scariness_score_test.cpp.tmp
60,170 ms: x64 debian > Clang.CodeGen/RISCV/rvv-intrinsics::vluxseg.c
Script: -- : 'RUN: at line 3'; /var/lib/buildkite-agent/builds/llvm-project/build/bin/clang -cc1 -internal-isystem /var/lib/buildkite-agent/builds/llvm-project/build/lib/clang/15.0.0/include -nostdsysteminc -triple riscv64 -target-feature +f -target-feature +d -target-feature +zfh -target-feature +v -disable-O0-optnone -emit-llvm /var/lib/buildkite-agent/builds/llvm-project/clang/test/CodeGen/RISCV/rvv-intrinsics/vluxseg.c -o - | /var/lib/buildkite-agent/builds/llvm-project/build/bin/opt -S -mem2reg | /var/lib/buildkite-agent/builds/llvm-project/build/bin/FileCheck --check-prefix=CHECK-RV64 /var/lib/buildkite-agent/builds/llvm-project/clang/test/CodeGen/RISCV/rvv-intrinsics/vluxseg.c

Event Timeline

tingwang created this revision.Jan 26 2022, 10:24 PM
tingwang requested review of this revision.Jan 26 2022, 10:24 PM
tingwang added inline comments.Jan 26 2022, 10:26 PM
llvm/test/CodeGen/PowerPC/handle-f16-storage-type.ll
1255

The "beqlr 0" was converted into a "beq 0, .LBB0_2; .LBB0_2: blr" sequence, introducing a dummy jump. This may be one issue.

llvm/test/CodeGen/PowerPC/scalar_cmp.ll
908

Given that the operand is known and isFPImmLegal(<negative>) passes, the fsub is converted into an fadd:
Combining: t15: f64 = fsub t2, ConstantFP:f64<1.000000e+00>
Creating fp constant: t17: f64 = ConstantFP<-1.000000e+00>
Creating new node: t18: f64 = fadd t2, ConstantFP:f64<-1.000000e+00>
... into: t18: f64 = fadd t2, ConstantFP:f64<-1.000000e+00>

932

This may be another issue.
Given the DAG below:

  t16: i1 = setcc t2, ConstantFP:f64<1.000000e+00>, setlt:ch                                               
  t18: i1 = setcc t2, ConstantFP:f64<1.000000e+00>, setuo:ch                                                          
t19: i1 = or t16, t18

Some combine logic optimized the second setcc:
Combining: t18: i1 = setcc t2, ConstantFP:f64<1.000000e+00>, setuo:ch
Creating new node: t23: i1 = setcc t2, t2, setuo:ch

This results in two compare statements which could have been executed as a single fcmpu had there been no combine:

  t16: i1 = setcc t2, ConstantFP:f64<1.000000e+00>, setlt:ch
  t23: i1 = setcc t2, t2, setuo:ch
t19: i1 = or t16, t23
tingwang updated this revision to Diff 406305.Feb 6 2022, 5:31 PM

Used clang-format to tidy the code style

tingwang marked 2 inline comments as not done.Feb 6 2022, 5:33 PM

I am wondering if you have done any performance comparison or even just latency/throughput "back-of-the-envelope" computation for this. It doesn't seem obvious to me that this is better than a CP load. The conversion is a fairly expensive instruction. Another less compelling thing to consider is that in high register pressure situations (for VMX registers), the VSPLTISW may cause a spill.

One thing to note is that this necessarily replaces the cracked lfs instruction which is also expensive, so this transformation may be worthwhile. However, I wonder how this transformation would compare to stopping the conversion from double precision to single precision for constants on Power9 (i.e. return false from ShouldShrinkFPConstant() on Power9). This would of course grow the constant pool in situations where a lot of compact constants are loaded, but presumably this is a somewhat rare situation.


Thank you for looking into this. I'm planning to do a performance comparison, and will also compare against stopping the conversion from double to single precision as you mentioned (LFS vs. LFD). I agree that VSPLTISW possibly causing a vector spill is a concern, and that cost is higher than spilling the fixed-point register used for the CP load, although this depends on register usage.

A simple calculation shows that from an instruction latency perspective (P9InstrResources.td), VSPLTISW + XVCVSXWDP is 4 + 7 = 11, and the CP load ADDI + ADDIS + LFS is 2 + 2 + 7 = 11, which is the same. Although the CP load has a memory access, its ADDIS may be saved across consecutive CP loads. This looks like a draw, so I do need to check the real performance.


Hi Nemanja, I measured no performance change on SPEC base fprate, so I think I should abandon this one. Sorry for the bother. By the way, flipping ShouldShrinkFPConstant() improved single-copy base fprate by about 0.2%, though that is a single-run result.

tingwang abandoned this revision.Feb 18 2022, 12:58 AM

No perf gain according to SPEC testing.