This is an archive of the discontinued LLVM Phabricator instance.

Allow constraining virtual register's class within reason
Needs ReviewPublic

Authored by alexey.zhikhar on Jul 30 2018, 10:06 AM.

Details

Summary

At the very end of instruction selection, in InstrEmitter, handle
overlapping register classes to eliminate redundant copy instructions.

Also, update various tests.

Given the following code:

a = def
c = CopyToReg a

The current implementation of InstrEmitter checks whether `a` and `c`
belong to the same register class and, if so, coalesces the CopyToReg away:

a = def
c = CopyToReg a
    =>
c = def

In pseudocode, the algorithm can be expressed as

if RegClass(c) == RegClass(a):
    make it "c = def"

However, in a case where the register classes are not exactly equal
but overlap, the CopyToReg is not eliminated. This patch checks
whether the two register classes overlap and whether the number of
registers in the overlap is at least MinRCSize. In pseudocode:

if |RegClass(c) ∩ RegClass(a)| ≥ MinRCSize:
    make it "c = def"
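
For reference, a minimal sketch of what the relaxed check could look like in the
user scan of InstrEmitter::CreateVirtualRegisters
(lib/CodeGen/SelectionDAG/InstrEmitter.cpp) is shown below. This is not the
actual diff; in particular, expressing the overlap test through
MachineRegisterInfo::constrainRegClass and the file-local MinRCSize constant is
an assumption about how the intersection is computed.

// Sketch only: when a node result feeds a CopyToReg whose destination is a
// virtual register, try to emit the result directly into that register.
// (The result-number check and the IsClone/IsCloned guards of the real code
// are omitted for brevity.)
for (SDNode *User : Node->uses()) {
  if (User->getOpcode() == ISD::CopyToReg &&
      User->getOperand(2).getNode() == Node) {
    unsigned Reg = cast<RegisterSDNode>(User->getOperand(1))->getReg();
    if (TargetRegisterInfo::isVirtualRegister(Reg)) {
      // Previously: reuse Reg only on an exact class match:
      //   if (MRI->getRegClass(Reg) == RC) { ... }
      // Relaxed: constrainRegClass succeeds only if the intersection of the
      // two classes still contains at least MinRCSize allocatable registers.
      if (MRI->constrainRegClass(Reg, RC, MinRCSize)) {
        VRBase = Reg;                          // the CopyToReg becomes redundant
        MIB.addReg(VRBase, RegState::Define);
        break;
      }
    }
  }
}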

Corresponding discussion on llvm-dev:

http://lists.llvm.org/pipermail/llvm-dev/2018-May/123663.html
https://groups.google.com/forum/#!topic/llvm-dev/BHFhRkYY2ng

Patch by Ulrich Weigand.

Diff Detail

Event Timeline

Probably needs tests. (Not that I know how to write one here)

alexey.zhikhar edited the summary of this revision. Jul 30 2018, 10:13 AM
alexey.zhikhar edited the summary of this revision.

Probably needs tests. (Not that I know how to write one here)

Unfortunately, our backend is out-of-tree, so we can't upstream test cases for this patch. Jonas, you previously mentioned that this patch helps Z, do you happen to have a test case? @jonpa

jonpa added a comment. Jul 31 2018, 1:48 AM

Probably needs tests. (Not that I know how to write one here)

Unfortunately, our backend is out-of-tree, so we can't upstream test cases for this patch. Jonas, you previously mentioned that this patch helps Z, do you happen to have a test case? @jonpa

IIRC, we tried this patch in an experimental setting (when evaluating a certain aspect of the SystemZ backend), so the test case we had then will not work on trunk :-/

I could derive a new test case quite easily, but I am personally not that sure exactly what to test for. I mean, after the coalescer, register allocator, etc., with all those complex interactions of multiple optimizations, what type of test should we have that expresses the thing we are trying to fix? Is there really a clear case that should be improved? Or is it more that some cases improve while others end up getting worse?

I made a test evaluation of this patch a while ago when I posted it, and it seemed then that the total number of COPYs marginally improved (decreased), while there was a very slight increase in spilling.

I could make a new test case that simply improves by e.g. having one COPY less, but I would like to know that this is really handling something specific and not just something that randomly ends up getting better...

Is this making sense, and what are your thoughts? What happens if you run your out-of-tree test cases with the SystemZ (-mcpu=z13) backend? Do you see any improvement? In that case, it seems like we have a test of real worth...

Is this making sense, and what are your thoughts? What happens if you run your out-of-tree test cases with the SystemZ (-mcpu=z13) backend? Do you see any improvement? In that case, it seems like we have a test of real worth...

The test cases that we have heavily rely on our backend's intrinsics, so they cannot be compiled for SystemZ.

Providing a reliable test case for this patch does not seem like an easy (or even worthwhile) task, but I'm open to suggestions.

This patch causes 14 test failures; however, it seems that the problem is not in the patch but in brittle test cases. I took a quick look at CodeGen/PowerPC/vsx.ll: the test case fails because it no longer finds copy instructions that were redundant and are now removed by this patch, as expected.

Failing Tests (14): 
    LLVM :: CodeGen/AArch64/and-sink.ll
    LLVM :: CodeGen/AArch64/arm64-atomic.ll
    LLVM :: CodeGen/AArch64/arm64-fast-isel-conversion-fallback.ll
    LLVM :: CodeGen/AArch64/combine-comparisons-by-cse.ll
    LLVM :: CodeGen/AArch64/optimize-cond-branch.ll
    LLVM :: CodeGen/AArch64/redundant-copy-elim-empty-mbb.ll
    LLVM :: CodeGen/AMDGPU/early-if-convert.ll
    LLVM :: CodeGen/ARM/2011-04-11-MachineLICMBug.ll
    LLVM :: CodeGen/ARM/2011-08-25-ldmia_ret.ll
    LLVM :: CodeGen/ARM/atomic-64bit.ll
    LLVM :: CodeGen/ARM/atomic-cmp.ll
    LLVM :: CodeGen/ARM/atomic-ops-v8.ll
    LLVM :: CodeGen/PowerPC/vsx.ll
    LLVM :: CodeGen/SystemZ/cond-move-03.ll
jonpa added a comment. Jul 31 2018, 6:42 AM

This is the test case we were working with. Compile with

llc -mcpu=z13 ./tc_2spill.ll

Looking at it now, it seems that with the patch there are fewer COPYs directly after instruction selection, but in the output there is the same number of register moves, with the undesirable difference that with the patch one of them is no longer hoisted out of the loop. So in this case the net result seems to be one less hoisted register move... :-/

This patch causes 14 test failures; however, it seems that the problem is not in the patch but in brittle test cases. I took a quick look at CodeGen/PowerPC/vsx.ll: the test case fails because it no longer finds copy instructions that were redundant and are now removed by this patch, as expected.

I'm currently working on fixing the failures.

@jonpa Jonas, I see different assembly in one of the SystemZ unit tests; my memory of Z is pretty fuzzy, so could you please take a look at the change and see whether it is reasonable?

Test name is CodeGen/SystemZ/cond-move-03.ll, test case f2().

Before the change:

        #APP
        dummy %r0 
        #NO_APP
        #APP
        stepa %r1 
        #NO_APP
        #APP
        stepb %r0 
        #NO_APP
        clijhe  %r2, 42, .LBB1_2
# %bb.1:
        risblg  %r0, %r1, 0, 159, 32
.LBB1_2:
        risbhg  %r1, %r0, 0, 159, 32
        #APP
        stepc %r1 
        #NO_APP
        #APP
        dummy %r0 
        #NO_APP
        br      %r14

After the change:

        #APP
        dummy %r0
        #NO_APP
        #APP
        stepa %r1
        #NO_APP
        #APP
        stepb %r0
        #NO_APP
        clfi    %r2, 42
        risbhg  %r2, %r0, 0, 159, 32
        locfhrl %r2, %r1
        #APP
        stepc %r2
        #NO_APP
        #APP
        dummy %r0
        #NO_APP
        br      %r14

The new code for f2 in cond-move-03.ll is in fact better, since it now actually uses the conditional move instruction instead of a branch ...

The new code for f2 in cond-move-03.ll is in fact better, since it now actually uses the conditional move instruction instead of a branch ...

Thanks, Ulrich. Does it fix "FIXME: We should commute the LOCRMux to save one move." or is it unrelated?

The new code for f2 in cond-move-03.ll is in fact better, since it now actually uses the conditional move instruction instead of a branch ...

Thanks, Ulrich. Does it fix "FIXME: We should commute the LOCRMux to save one move." or is it unrelated?

It doesn't so much fix it as make it no longer applicable. Given that we use the hi->hi conditional move, we must have a lo->hi move (the risbhg) anyway in this configuration.

So in short, yes, you can remove the FIXME now :-)

All the test failures are mismatches between the expected and the produced assembly. The one exception is CodeGen/AMDGPU/early-if-convert.ll, where the backend fails with an assertion:

llc: lib/Target/AMDGPU/SIInstrInfo.cpp:1794: virtual bool llvm::SIInstrInfo::canInsertSelect(const llvm::MachineBasicBlock&, llvm::ArrayRef<llvm::MachineOperand>, unsigned int, unsigned int, int&, int&, int&) const: Assertion `MRI.getRegClass(FalseReg) == RC' failed.

I will prioritize this higher.

alexey.zhikhar edited the summary of this revision.

The differential is updated with:

  • Fixed unit tests for SystemZ and PowerPC.
  • Bug fix for AMDGPU: the AMDGPU implementation of TargetInstrInfo::canInsertSelect() had an assertion that was too restrictive; missing switch cases were also fixed (see the sketch after this list).
  • Fixed unit tests for atomic operations on ARM: note that the number of mov instructions did not increase after applying the patch.
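
For illustration only, here is a hedged sketch of the kind of relaxation meant by the AMDGPU item above; the actual change is not reproduced here, and both the bail-out and the use of getCommonSubClass are assumptions. The assertion quoted earlier insists that both select inputs have exactly the same register class, which is too strict once overlapping classes can survive instruction selection:

// Hypothetical sketch inside SIInstrInfo::canInsertSelect()
// (lib/Target/AMDGPU/SIInstrInfo.cpp).
const MachineRegisterInfo &MRI = MBB.getParent()->getRegInfo();
const TargetRegisterClass *RC = MRI.getRegClass(TrueReg);

// Previously: assert(MRI.getRegClass(FalseReg) == RC);
// Sketch: refuse to form the conditional select instead of asserting when
// the two classes cannot be reconciled to a common subclass.
if (!RI.getCommonSubClass(RC, MRI.getRegClass(FalseReg)))
  return false;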

@t.p.northover @asl @rengolin

For some of the failing ARM/AArch64 tests, I see additional mov instructions being generated; for example, in CodeGen/AArch64/and-sink.ll: if you take a look at the assembly after applying the patch, you will see an additional mov on the path where %c (w1) equals zero. I'm not sure how important this is, so I would appreciate some feedback from ARM/AArch64 backend people. Please note that for performance-critical atomic compare-and-swap operations, performance is unchanged.

; RUN: llc -mtriple=aarch64-linux-gnu -verify-machineinstrs < %s | FileCheck %s

@A = global i32 zeroinitializer

; Test that and is sunk into cmp block to form tbz.
define i32 @and_sink1(i32 %a, i1 %c) {
  %and = and i32 %a, 4
  br i1 %c, label %bb0, label %bb2
bb0:
  %cmp = icmp eq i32 %and, 0
  store i32 0, i32* @A
  br i1 %cmp, label %bb1, label %bb2
bb1:
  ret i32 1
bb2:
  ret i32 0
}

Original assembly:

and_sink1:                              // @and_sink1
        .cfi_startproc
// %bb.0:
        tbz     w1, #0, .LBB0_3
// %bb.1:                               // %bb0
        adrp    x8, A
        str     wzr, [x8, :lo12:A]
        tbnz    w0, #2, .LBB0_3
// %bb.2:
        orr     w0, wzr, #0x1
        ret
.LBB0_3:                                // %bb2
        mov     w0, wzr
        ret
.Lfunc_end0:
        .size   and_sink1, .Lfunc_end0-and_sink1
        .cfi_endproc

Assembly after applying the patch:

and_sink1:                              // @and_sink1
        .cfi_startproc
// %bb.0:
        tbz     w1, #0, .LBB0_2
// %bb.1:                               // %bb0
        adrp    x8, A
        str     wzr, [x8, :lo12:A]
        orr     w8, wzr, #0x1
        tbz     w0, #2, .LBB0_3
.LBB0_2:                                // %bb2
        mov     w8, wzr
.LBB0_3:                                // %bb1
        mov     w0, w8
        ret
.Lfunc_end0:
        .size   and_sink1, .Lfunc_end0-and_sink1
        .cfi_endproc

Here's a list of ARM/AArch64 tests that are suspicious due to additional mov operations:

LLVM :: CodeGen/AArch64/redundant-copy-elim-empty-mbb.ll
LLVM :: CodeGen/ARM/2011-04-11-MachineLICMBug.ll
LLVM :: CodeGen/AArch64/and-sink.ll
LLVM :: CodeGen/AArch64/arm64-fast-isel-conversion-fallback.ll
LLVM :: CodeGen/AArch64/optimize-cond-branch.ll

Also, CodeGen/ARM/2011-08-25-ldmia_ret.ll spills one additional register.

I'd like an explanation for why the generated code is changing for AArch64... generating extra copies clearly seems like a downside. And there isn't any obvious reason for this change to impact register allocation: on AArch64, all i32 register classes contain exactly the same set of allocatable registers.

I'd like an explanation for why the generated code is changing for AArch64... generating extra copies clearly seems like a downside. And there isn't any obvious reason for this change to impact register allocation: on AArch64, all i32 register classes contain exactly the same set of allocatable registers.

Agreed. ARM people might have some thoughts on what the root cause might be, and I'd love to investigate the leads.

nhaehnle removed a subscriber: nhaehnle. Aug 8 2018, 5:09 AM