This is an archive of the discontinued LLVM Phabricator instance.

[VirtRegRewriter] Insert missing killed flags when tracking subregister liveness
ClosedPublic

Authored by Conanap on Dec 1 2020, 10:18 AM.

Details

Summary

When a register operand is used for the last time and all of its lanes are defined, VirtRegRewriter may fail to add the killed flag when subregister liveness tracking is enabled. On PPC, this can cause unnecessary code to be generated.
For example, for the following code:

0B	bb.0 (%ir-block.0):
	  liveins: $v2, $v3, $v4, $v5
16B	  undef %14.sub_vsx1:vsrprc_with_sub_64_in_vfrc = COPY $v5
32B	  %14.sub_vsx0:vsrprc_with_sub_64_in_vfrc = COPY $v4
48B	  undef %8.sub_vsx1:vsrprc_with_sub_64_in_vfrc = COPY $v3
64B	  %8.sub_vsx0:vsrprc_with_sub_64_in_vfrc = COPY $v2
80B	  %4:g8rc_and_g8rc_nox0 = LD 0, %fixed-stack.0 :: (load 8 from %fixed-stack.0, align 16)
160B	  %8:vsrprc_with_sub_64_in_vfrc = KILL_PAIR %8:vsrprc_with_sub_64_in_vfrc(tied-def 0)
176B	  undef %15.sub_pair0:uaccrc = COPY %8:vsrprc_with_sub_64_in_vfrc
256B	  %14:vsrprc_with_sub_64_in_vfrc = KILL_PAIR %14:vsrprc_with_sub_64_in_vfrc(tied-def 0)
288B	  %15.sub_pair1:uaccrc = COPY %14:vsrprc_with_sub_64_in_vfrc
304B	  %28:accrc = BUILD_UACC %15:uaccrc
336B	  %28:accrc = XXMTACC %28:accrc(tied-def 0)
340B	  SPILL_ACC %28:accrc, 0, %stack.0 :: (store 64 into %stack.0, align 16)
352B	  ADJCALLSTACKDOWN 32, 0, implicit-def dead $r1, implicit $r1
368B	  BL8_NOTOC @foo, <regmask $cr2 $cr3 $cr4 $f14 $f15 $f16 $f17 $f18 $f19 $f20 $f21 $f22 $f23 $f24 $f25 $f26 $f27 $f28 $f29 $f30 $f31 $r14 $r15 $r16 $r17 $r18 $r19 $r20 $r21 $r22 $r23 $r24 $r25 and 60 more...>, implicit-def dead $lr8, implicit $rm, implicit-def $r1
384B	  ADJCALLSTACKUP 32, 0, implicit-def dead $r1, implicit $r1
408B	  %25:accrc = RESTORE_ACC 0, %stack.0 :: (load 64 from %stack.0, align 16)
416B	  %25:accrc = XXMFACC %25:accrc(tied-def 0)
448B	  STXV %25.sub_vsx0:accrc, 48, %4:g8rc_and_g8rc_nox0 :: (store 16 into %ir.2 + 48)
480B	  STXV %25.sub_vsx1:accrc, 32, %4:g8rc_and_g8rc_nox0 :: (store 16 into %ir.2 + 32, align 32)
528B	  STXV %25.sub_pair1_then_sub_vsx0:accrc, 16, %4:g8rc_and_g8rc_nox0 :: (store 16 into %ir.2 + 16)
536B	  undef %26.sub_pair1_then_sub_vsx1:accrc = COPY %25.sub_pair1_then_sub_vsx1:accrc
560B	  STXVX %26.sub_pair1_then_sub_vsx1:accrc, $zero8, %4:g8rc_and_g8rc_nox0 :: (store 16 into %ir.2, align 64)
576B	  BLR8 implicit $lr8, implicit $rm

LiveIntervals::addKillFlags should add a killed flag to the %28 operand of the SPILL_ACC instruction (340B).
The live interval for %28 is:

%28 [304r,336r:0)[336r,340r:1)  0@304r 1@336r L0000000000000080 [304r,336r:0)[336r,340r:1)  0@304r 1@336r L0000000000000040 [304r,336r:0)[336r,340r:1)  0@304r 1@336r L0000000000000002 [304r,336r:0)[336r,340r:1)  0@304r 1@336r L0000000000000100 [304r,336r:0)[336r,340r:1)  0@304r 1@336r weight:INF
RegMasks: 368r

We see that lanes L0000000000000080, L0000000000000040, L0000000000000002 and L0000000000000100 are all live until 340B.
The full lane mask of the accumulator register class (accrc) is L00000000000001C2, meaning that all the lanes used by the SPILL_ACC are defined and the flag should be added.

This patch fixes two things:

  1. The way the mask of defined lanes is computed. We currently compute the mask L0000000000000000 instead of L00000000000001C2. To fix that, we walk all the segments ending at or before the instruction that uses the register operand, to determine which lanes are defined and killed at this instruction.
  2. The way the mask of used lanes is computed. When the used register is not a subregister, we currently fall back to the conservative default mask LFFFFFFFFFFFFFFFF for the used lanes. However, this mask is not necessarily accurate: as shown above, on PPC the mask should be L00000000000001C2 for accumulators. We therefore retrieve this mask from the register class instead of using the default.

I also need someone familiar with the AMDGPU backend to check whether the test cases are still correct after these changes.

Diff Detail

Event Timeline

bsaleil created this revision.Dec 1 2020, 10:18 AM
bsaleil requested review of this revision.Dec 1 2020, 10:18 AM
arsenm added a subscriber: arsenm.Dec 14 2020, 6:34 AM
arsenm added inline comments.
llvm/lib/CodeGen/LiveIntervals.cpp
705–706

What was the point of copying out the subranges here before?

728–750

This looks like a separate NFC change

794–795

getRegClass should be sufficient. There can't be a vreg without a class at this point

llvm/test/CodeGen/PowerPC/subreg-killed.mir
31–101

Should be able to drop most if not all of this

110–132

Can you simplify this at all? -run-pass=none after deleting the register section will also compact these vreg numbers

lkail added a subscriber: lkail.Dec 17 2020, 1:32 AM
lkail added inline comments.
llvm/lib/CodeGen/LiveIntervals.cpp
794–795

I think it can be simplified as

unsigned SubReg = MO.getSubReg();
LaneBitmask UseMask = SubReg ? TRI->getSubRegIndexLaneMask(SubReg)
                             : MRI->getMaxLaneMaskForVReg(Reg);
bsaleil updated this revision to Diff 314792.Jan 5 2021, 8:56 PM

Remove unrelated NFC change, simplify lane mask computation and MIR code

bsaleil marked 4 inline comments as done.Jan 5 2021, 9:01 PM
bsaleil added inline comments.
llvm/lib/CodeGen/LiveIntervals.cpp
705–706

To be honest, I don't understand why we copied the subranges here. @MatzeB, do you have any idea?

728–750

You're right, I removed that.

794–795

Thanks, good catch @lkail, I simplified that.

bsaleil marked 2 inline comments as done.Jan 12 2021, 10:09 AM

Gentle ping :)

arsenm added inline comments.Jan 20 2021, 10:24 AM
llvm/test/CodeGen/PowerPC/subreg-killed.mir
19–31

Don't need the IR section

34–78

Can drop most of this (I think you just need tracksRegLiveness: true)

bsaleil updated this revision to Diff 319198.Jan 25 2021, 8:32 PM

Rebase and reduce PPC test case.

bsaleil marked 2 inline comments as done.Jan 25 2021, 8:32 PM
foad added a subscriber: foad.Jan 26 2021, 2:45 AM
foad added inline comments.
llvm/test/CodeGen/AMDGPU/spill-scavenge-offset.ll
16–17

Why do you need this change?

bsaleil added inline comments.Feb 8 2021, 7:06 AM
llvm/test/CodeGen/AMDGPU/spill-scavenge-offset.ll
16–17

@foad, if I understand the test case correctly, the generated code is divided into two parts: spills and reloads.
The first two check lines (s_mov_b32 and scratch_store_dwordx4) are supposed to match a spill, and the other two check lines (s_movk_i32 and scratch_load_dwordx4) are supposed to match a reload.
Currently, the first two check lines are correctly matched in the spills part.
But s_movk_i32 is actually also matched in the spills section. It turns out that there is a scratch_load_dwordx4 for that same register in the reload part, so the test succeeds.
This patch slightly changes the register allocation for this test case, and now there is no scratch_load_dwordx4 for that register.
By adding the 1 here, we force the s_movk_i32 to be matched in the reload section, because no s_movk_i32 uses 0x1 in the spills section.
Then, the scratch_load_dwordx4 for that register is correctly matched.

foad added inline comments.Feb 8 2021, 9:41 AM
llvm/test/CodeGen/AMDGPU/spill-scavenge-offset.ll
16–17

Makes sense, thanks.

Conanap commandeered this revision.Feb 18 2021, 8:32 AM
Conanap added a reviewer: bsaleil.
stefanp accepted this revision as: stefanp.Mar 1 2021, 12:10 PM
stefanp added a subscriber: stefanp.

LGTM.

This revision is now accepted and ready to land.Mar 1 2021, 12:10 PM
Conanap updated this revision to Diff 327569.Mar 2 2021, 1:47 PM

A new test case started failing in the process of upstreaming this patch; updated that test case.

This revision was landed with ongoing or failed builds.Mar 3 2021, 9:02 AM
This revision was automatically updated to reflect the committed changes.