Page MenuHomePhabricator

[RISCV] LUI used for address computation should not isAsCheapAsAMove
ClosedPublic

Authored by Luhaocong on Jan 25 2022, 11:46 PM.

Details

Summary

A LUI instruction with flag RISCVII::MO_HI is usually used in conjunction with ADDI, and jointly
complete address computation. To bind the cost evaluation of address computation, the LUI
should not be regarded as a cheap move separately, which is consistent with ADDI.

In this test case, it improves the unroll-loop code that the rematerialization of array's base address
miss MachineCSE with Heuristics #1 at isProfitableToCSE.

Diff Detail

Unit TestsFailed

TimeTest
60,040 msx64 debian > ThreadSanitizer-x86_64.ThreadSanitizer-x86_64::restore_stack.cpp
Script: -- : 'RUN: at line 1'; /var/lib/buildkite-agent/builds/llvm-project/build/./bin/clang --driver-mode=g++ -fsanitize=thread -Wall -m64 -msse4.2 -gline-tables-only -I/var/lib/buildkite-agent/builds/llvm-project/compiler-rt/test/tsan/../ -std=c++11 -I/var/lib/buildkite-agent/builds/llvm-project/compiler-rt/test/tsan/../ -O1 /var/lib/buildkite-agent/builds/llvm-project/compiler-rt/test/tsan/restore_stack.cpp -o /var/lib/buildkite-agent/builds/llvm-project/build/projects/compiler-rt/test/tsan/X86_64Config/Output/restore_stack.cpp.tmp && not /var/lib/buildkite-agent/builds/llvm-project/build/projects/compiler-rt/test/tsan/X86_64Config/Output/restore_stack.cpp.tmp 2>&1 | FileCheck /var/lib/buildkite-agent/builds/llvm-project/compiler-rt/test/tsan/restore_stack.cpp
60,080 msx64 debian > libFuzzer.libFuzzer::large.test
Script: -- : 'RUN: at line 3'; /var/lib/buildkite-agent/builds/llvm-project/build/./bin/clang --driver-mode=g++ -O2 -gline-tables-only -fsanitize=address,fuzzer -I/var/lib/buildkite-agent/builds/llvm-project/compiler-rt/lib/fuzzer -m64 /var/lib/buildkite-agent/builds/llvm-project/compiler-rt/test/fuzzer/LargeTest.cpp -o /var/lib/buildkite-agent/builds/llvm-project/build/projects/compiler-rt/test/fuzzer/X86_64DefaultLinuxConfig/Output/large.test.tmp-LargeTest

Event Timeline

Luhaocong created this revision.Jan 25 2022, 11:46 PM
Luhaocong requested review of this revision.Jan 25 2022, 11:46 PM
Luhaocong retitled this revision from [RISCV][NFC] eliminate rematerialization ofarray's base address to [RISCV][NFC] eliminate rematerialization of array's base address.Jan 26 2022, 12:00 AM
Luhaocong edited the summary of this revision. (Show Details)

Please remove NFC from the title. You changed the behavior of a test. That's clearly not "No Functional Change"

Luhaocong retitled this revision from [RISCV][NFC] eliminate rematerialization of array's base address to [RISCV] eliminate rematerialization of array's base address.Jan 26 2022, 5:21 PM
asb accepted this revision.Jan 28 2022, 6:32 AM

I'd appreciate more people kicking the tires on this, but it looks like an improvement to me. Doing the usual GCC torture suite recompile (obviously it's not representative of any real-world workload), this seems to be in a win in almost every single case. Thanks for the patch.

This revision is now accepted and ready to land.Jan 28 2022, 6:32 AM
Luhaocong updated this revision to Diff 406308.Feb 6 2022, 5:56 PM

update test case.

The only issue I have is that the description is specifically talking about arrays' base addresses but the change has wider-reaching consequences than just LICM: CSE, register allocation, sinking, etc. It's not even a loop-unroll specific change, right? This would apply to non-unrolled loops too, presumably.

So for me it'd be best for:
a) the commit title to reflect the code change itself
b) the description to describe why "MO_HI" is not as cheap as a move in principle
c) use array base address rematerialization as one motivating example.

jrtc27 added a comment.Feb 8 2022, 9:54 AM

This seems conceptually wrong to me. isAsCheapAsAMove is meant to model how expensive an instruction is to execute, but how you compute the immediate it uses has zero bearing on that. lui rd, 0x123 and lui rd, %hi(sym) if %hi(sym) happens to be 0x123 are completely indistinguishable from the perspective of the processor. To me this screams of something elsewhere getting cost modelling wrong.

This seems conceptually wrong to me. isAsCheapAsAMove is meant to model how expensive an instruction is to execute, but how you compute the immediate it uses has zero bearing on that. lui rd, 0x123 and lui rd, %hi(sym) if %hi(sym) happens to be 0x123 are completely indistinguishable from the perspective of the processor. To me this screams of something elsewhere getting cost modelling wrong.

Maybe the heuristic #1 should ignore the isCheapAsMove if the instruction isTriviallyRematerializable?

This seems conceptually wrong to me. isAsCheapAsAMove is meant to model how expensive an instruction is to execute, but how you compute the immediate it uses has zero bearing on that. lui rd, 0x123 and lui rd, %hi(sym) if %hi(sym) happens to be 0x123 are completely indistinguishable from the perspective of the processor. To me this screams of something elsewhere getting cost modelling wrong.

Maybe the heuristic #1 should ignore the isCheapAsMove if the instruction isTriviallyRematerializable?

So the problem is that LUI _is_ current marked as cheap as a move, since we set isAsCheapAsAMove to 1 on the TableGen record, and the default case falls back on that value. What this patch does is override that and make LUIs that are part of address computations _not_ as cheap as a move.

Luhaocong updated this revision to Diff 407102.Feb 9 2022, 2:41 AM
Luhaocong retitled this revision from [RISCV] eliminate rematerialization of array's base address to [RISCV] LUI used for address computation should not isAsCheapAsAMove.
Luhaocong edited the summary of this revision. (Show Details)

modify description

The only issue I have is that the description is specifically talking about arrays' base addresses but the change has wider-reaching consequences than just LICM: CSE, register allocation, sinking, etc. It's not even a loop-unroll specific change, right? This would apply to non-unrolled loops too, presumably.

So for me it'd be best for:
a) the commit title to reflect the code change itself
b) the description to describe why "MO_HI" is not as cheap as a move in principle
c) use array base address rematerialization as one motivating example.

Thank you very much for your useful advice and help.

This seems conceptually wrong to me. isAsCheapAsAMove is meant to model how expensive an instruction is to execute, but how you compute the immediate it uses has zero bearing on that. lui rd, 0x123 and lui rd, %hi(sym) if %hi(sym) happens to be 0x123 are completely indistinguishable from the perspective of the processor. To me this screams of something elsewhere getting cost modelling wrong.

Maybe the heuristic #1 should ignore the isCheapAsMove if the instruction isTriviallyRematerializable?

So the problem is that LUI _is_ current marked as cheap as a move, since we set isAsCheapAsAMove to 1 on the TableGen record, and the default case falls back on that value. What this patch does is override that and make LUIs that are part of address computations _not_ as cheap as a move.

I agree with you. This patch is only a feasible but not perfect scheme for binding the cost evaluation of address computation.
The more difficult is how to establish a cost model, which can analyze multiple related instructions at the same time.

This revision was landed with ongoing or failed builds.Feb 11 2022, 11:21 PM
This revision was automatically updated to reflect the committed changes.

I objected and still believe this patch is fundamentally wrong. The problem needs solving elsewhere, not like this. Please revert.

I objected and still believe this patch is fundamentally wrong. The problem needs solving elsewhere, not like this. Please revert.

Although this solution is far from perfect, it does improve code quality. Could you please show a test case that gets wrong or worse assembly by this patch?
I suggest we can keep it and go on searching better solutions.

I objected and still believe this patch is fundamentally wrong. The problem needs solving elsewhere, not like this. Please revert.

Although this solution is far from perfect, it does improve code quality. Could you please show a test case that gets wrong or worse assembly by this patch?
I suggest we can keep it and go on searching better solutions.

I don't know, but I don't really care, it is blatantly wrong to say LUI rd, <imm> is not as cheap as a move, especially so to say LUI rd, %hi(x) isn't but LUI rd, x is, it's complete nonsense. There are lots of things you can commit that would improve codegen quality but are totally wrong and would get backed out immediately.

I objected and still believe this patch is fundamentally wrong. The problem needs solving elsewhere, not like this. Please revert.

Although this solution is far from perfect, it does improve code quality. Could you please show a test case that gets wrong or worse assembly by this patch?
I suggest we can keep it and go on searching better solutions.

I don't know, but I don't really care, it is blatantly wrong to say LUI rd, <imm> is not as cheap as a move, especially so to say LUI rd, %hi(x) isn't but LUI rd, x is, it's complete nonsense. There are lots of things you can commit that would improve codegen quality but are totally wrong and would get backed out immediately.

Here is a case to show the "wrong" you said, it does exist. 's1' spill unexpected due to the inaccurate cost evaluation of lui s1, %hi(g)
Although this patch achieved greater codegen in most cases, it is really important to accurately describe the cost of instructions.
Thanks for your review, I will revert it and look for more reasonable way.

void func1(void);
void func2(void);

int g;

int foo(int a) {
  int ret = 0;
  ret += g;
  if (a > 0)
    func1();
  else
    func2();
  ret += g;
  return ret;
}
foo:
        addi    sp, sp, -32
        sd      ra, 24(sp)
        sd      s0, 16(sp)
        sd      s1, 8(sp)
        lui     s1, %hi(g)
        lw      s0, %lo(g)(s1)
        blez    a0, .LBB0_2
        call    func1
        j       .LBB0_3
.LBB0_2:
        call    func2
.LBB0_3:
        lw      a0, %lo(g)(s1)
        addw    a0, a0, s0
        ld      ra, 24(sp)
        ld      s0, 16(sp)
        ld      s1, 8(sp)
        addi    sp, sp, 32
        ret