This is an archive of the discontinued LLVM Phabricator instance.

[TwoAddressInstructionPass] Improve the SrcRegMap and DstRegMap computation
ClosedPublic

Authored by Carrot on Aug 25 2021, 2:40 PM.

Details

Summary

This patch contains following enhancements to SrcRegMap and DstRegMap:

1 In findOnlyInterestingUse not only check if the Reg is two address usage, but also check after commutation can it be two address usage.

2 If a physical register is clobbered, remove SrcRegMap entries that are mapped to it.

3 In processTiedPairs, when create a new COPY instruction, add a SrcRegMap entry only when the COPY instruction is coalescable. (The COPY src is killed)

With these enhancements isProfitableToCommute can do better commute decision, and finally more register copies are removed.

Diff Detail

Event Timeline

Carrot created this revision.Aug 25 2021, 2:40 PM
Carrot requested review of this revision.Aug 25 2021, 2:40 PM
Herald added a project: Restricted Project. · View Herald TranscriptAug 25 2021, 2:40 PM
lkail added a subscriber: lkail.Aug 25 2021, 5:06 PM

Description makes it sound there are several changes here - do they stand on their own, or must they all happen all at once?

Description makes it sound there are several changes here - do they stand on their own, or must they all happen all at once?

In theory it can be divided into 3 individual changes. But they have the same purpose and impact. And two of them (findOnlyInterestingUse and processTiedPairs) are very small. So I sent them as a single patch. The total changes is 78 lines, still not a big one. A lot of test cases are impacted, so it caused the patch looks very large.

xbolva00 added inline comments.Sep 2 2021, 11:23 AM
llvm/test/CodeGen/X86/vec_smulo.ll
118
Carrot added inline comments.Sep 3 2021, 6:30 PM
llvm/test/CodeGen/X86/vec_smulo.ll
118

With this patch, TwoAddressInstructionPass generates

liveins: $xmm0, $xmm1, $rdi
%2:gr64 = COPY killed $rdi
%1:vr128 = COPY killed $xmm1
%0:vr128 = COPY killed $xmm0
%3:vr128 = PSHUFDri %1:vr128, -11
%4:vr128 = PSHUFDri %0:vr128, -11
%5:vr128 = COPY killed %4:vr128
%5:vr128 = PMULDQrr %5:vr128(tied-def 0), killed %3:vr128
%6:vr128 = COPY %0:vr128
%6:vr128 = PMULDQrr %6:vr128(tied-def 0), %1:vr128
%7:vr128 = PSHUFDri killed %6:vr128, -11
%8:vr128 = COPY killed %7:vr128
%8:vr128 = PBLENDWrri %8:vr128(tied-def 0), killed %5:vr128, -52
%9:vr128 = COPY killed %0:vr128
%9:vr128 = PMULLDrr %9:vr128(tied-def 0), killed %1:vr128
%10:vr128 = COPY %9:vr128
%10:vr128 = PSRADri %10:vr128(tied-def 0), 31
%11:vr128 = COPY killed %10:vr128
%11:vr128 = PCMPEQDrr %11:vr128(tied-def 0), killed %8:vr128
%12:vr128 = V_SETALLONES
%13:vr128 = COPY killed %12:vr128
%13:vr128 = PXORrr %13:vr128(tied-def 0), killed %11:vr128
MOVPQI2QImr killed %2:gr64, 1, $noreg, 0, $noreg, killed %9:vr128 :: (store (s64) into %ir.p2)
$xmm0 = COPY killed %13:vr128
RET 0, killed $xmm0

Without this patch, TwoAddressInstructionPass generates:

liveins: $xmm0, $xmm1, $rdi 
%2:gr64 = COPY killed $rdi 
%1:vr128 = COPY killed $xmm1
%0:vr128 = COPY killed $xmm0
%3:vr128 = PSHUFDri %1:vr128, -11
%4:vr128 = PSHUFDri %0:vr128, -11
%5:vr128 = COPY killed %4:vr128
%5:vr128 = PMULDQrr %5:vr128(tied-def 0), killed %3:vr128
%6:vr128 = COPY %0:vr128
%6:vr128 = PMULDQrr %6:vr128(tied-def 0), %1:vr128
%7:vr128 = PSHUFDri killed %6:vr128, -11
%8:vr128 = COPY killed %7:vr128
%8:vr128 = PBLENDWrri %8:vr128(tied-def 0), killed %5:vr128, -52
%9:vr128 = COPY killed %0:vr128
%9:vr128 = PMULLDrr %9:vr128(tied-def 0), killed %1:vr128
%10:vr128 = COPY %9:vr128
%10:vr128 = PSRADri %10:vr128(tied-def 0), 31
%11:vr128 = COPY killed %10:vr128
%11:vr128 = PCMPEQDrr %11:vr128(tied-def 0), killed %8:vr128
%12:vr128 = V_SETALLONES
%13:vr128 = COPY killed %11:vr128
%13:vr128 = PXORrr %13:vr128(tied-def 0), killed %12:vr128
MOVPQI2QImr killed %2:gr64, 1, $noreg, 0, $noreg, killed %9:vr128 :: (store (s64) into %ir.p2)
$xmm0 = COPY killed %13:vr128
RET 0, killed $xmm0

The only difference is the PXOR instruction and related COPY. The operands order(commuting decision) of PXOR is actually impacted the mapping of SrcRegMap[%10] = %9. In this instruction sequence, the old result is worse. Here we have SrcRegMap[%9] = xmm0, it lives until the memory store, so %10 must be assigned to a different physical register, and the COPY is a real one. And later %10 must be copied back to xmm0. In the new result, the %9 -> %10 is also a real copy, but the last %13 -> xmm0 COPY can be removed because %13 can be assigned to xmm0.

What makes the old result generate better final instructions? The answer is instruction scheduling. The memory store is moved before the %9 -> %10 copy, so in the COPY %9 is the last use, can be coalesced with %10 and assigned to xmm0, then both COPY instructions are removed. So the better old result is just lucky.

It implies a pass order problem here, different operands are killed in different instruction sequences, it impacts the optimal commuting decisions.

craig.topper added inline comments.Sep 17 2021, 8:46 PM
llvm/lib/CodeGen/TwoAddressInstructionPass.cpp
401

Use TargetInstrInfo::CommuteAnyOperandIndex so this doesn't look like a member access?

437

Why not check isReg() inside the loop and call getReg() in the loop. That wouldn't be that expensive would it?

675

Is this assert no longer valid?

pengfei added inline comments.Sep 17 2021, 11:32 PM
llvm/lib/CodeGen/TwoAddressInstructionPass.cpp
402

I'm confused about the code here. findRegisterUseOperandIdx return -1 when the function fails. But -1 is equal to CommuteAnyOperandIndex, which means we allow any operand to be commutable?

craig.topper added inline comments.Sep 17 2021, 11:57 PM
llvm/lib/CodeGen/TwoAddressInstructionPass.cpp
402

I think it should never fail, UseMI is already known to use Reg.

craig.topper added inline comments.Sep 18 2021, 12:05 AM
llvm/lib/CodeGen/TwoAddressInstructionPass.cpp
402

Maybe we should use use_nodbg_begin instead of use_instr_nodbg_begin earlier to get the MachineOperand, then use MachineOperand::getParent() to get UseMI. Then we can use the MachineOperand and MachineInstr::getOperandNo() here?

Carrot updated this revision to Diff 373666.Sep 20 2021, 11:29 AM
Carrot marked 3 inline comments as done.
Carrot added inline comments.
llvm/lib/CodeGen/TwoAddressInstructionPass.cpp
675

Yes.
Suppose we have instruction

%102 = ADD killed %100, killed %101

When we process Reg=%100, findOnlyInterestingUse returns NewReg=%102, so a map entry

SrcRegMap[%102] = %100

is added.
Later when we process Reg=%101, since now we consider commuting operands, findOnlyInterestingUse also returns NewReg=%102, so %102 is mapped to a different register.

SrcRegMap[%102] = %101

craig.topper added inline comments.Sep 21 2021, 9:03 AM
llvm/test/CodeGen/X86/addsub-constant-folding.ll
54–55

What happened here?

llvm/test/CodeGen/X86/avx512-regcall-NoMask.ll
1080

Is this test using a lot more leas now?

llvm/test/CodeGen/X86/hhvm-cc.ll
106 ↗(On Diff #373666)

This looks worse

Carrot added inline comments.Sep 21 2021, 2:33 PM
llvm/test/CodeGen/X86/hhvm-cc.ll
106 ↗(On Diff #373666)

The code before TwoAddress pass is:

bb.0.entry:

liveins: $rbx, $r12, $rbp
%2:gr64 = COPY killed $rbp
%1:gr64 = COPY killed $r12
%0:gr64 = COPY killed $rbx
ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
%3:gr64 = MOV32ri64 42
$rbx = COPY killed %0:gr64
$r12 = COPY %1:gr64
$rbp = COPY killed %2:gr64
$r15 = COPY killed %3:gr64
CALL64pcrel32 target-flags(x86-plt) @php_short, <regmask $r12 $r12b $r12bh $r12d $r12w $r12wh>, implicit $rsp, implicit $ssp, implicit killed $rbx, implicit killed $r12, implicit killed $rbp, implicit killed $r15, implicit-def $rsp, implicit-def $ssp, implicit-def dead $rbx, implicit-def $rbp
ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp 
%5:gr64 = COPY killed $rbp 
%6:gr64 = ADD64rr killed %5:gr64(tied-def 0), killed %1:gr64, implicit-def dead $eflags
$rbx = COPY killed %6:gr64
RET 0, killed $rbx

In function call to php_short, $r12 is clobbered, with my patch, the map SrcRegMap[%1] = $r12 is cleared, it causes the following ADD instruction commuted. And later different registers are passed to isProfitableToConv3Addr, so different result is returned, and different lea/add instruction is generated.

I'm also surprised by the RA result, the live range of %1 crosses the function call, $r12 is killed and clobbered by the function call, why does $r12 still allocated to %1? The direct usage of $r12 after the function call looks wrong to me.

craig.topper added inline comments.Sep 22 2021, 9:37 AM
llvm/test/CodeGen/X86/hhvm-cc.ll
106 ↗(On Diff #373666)

The regmask for the function says r12 is preserved. All other registers are clobbered.

Carrot updated this revision to Diff 374410.Sep 22 2021, 5:51 PM
Carrot added inline comments.
llvm/test/CodeGen/X86/hhvm-cc.ll
106 ↗(On Diff #373666)

Thanks for the explanation! Now I understand the problem is at

$r12 = COPY %1:gr64

Before this instruction %1 is already mapped to $r12, copy it back to $r12 should not invalidate the map entry SrcRegMap[%1] = $r12. I should make the check in function removeClobberedSrcRegMap.

Carrot added inline comments.Sep 23 2021, 5:17 PM
llvm/test/CodeGen/X86/addsub-constant-folding.ll
54–55

The code before TwoAddress pass is:

%0:gr32 = COPY killed $edi 
 %1:gr32 = ADD32ri8 %0:gr32(tied-def 0), 8, implicit-def dead $eflags
 ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp 
 $edi = COPY killed %1:gr32
 CALL64pcrel32 target-flags(x86-plt) @use, <regmask $bh $bl $bp $bph $bpl $bx $ebp $ebx $hbp $hbx $rbp $rbx $r12 $r13 $r14 $r15 $r12b $r13b $r14b $r15b $r12bh $r13bh $r14bh $r15bh $r12d $r13d $r14d $r15d $r12w $r13w $r14w $r15w $r12wh and 3 more...>, implicit $rsp, implicit $ssp, implicit killed $edi, implicit-def $rsp, implicit-def $ssp
 ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp 
 %2:gr32 = ADD32ri8 killed %0:gr32(tied-def 0), 10, implicit-def dead $eflags
 $eax = COPY killed %2:gr32
 RET 0, killed $eax

Without my patch LLVM generates LEA instructions because function isProfitableToConv3Addr returns true, with my patch function isProfitableToConv3Addr returns false. Although the original code generates good result, the reasoning process is wrong. Function isProfitableToConv3Addr returns true only when the src/dst registers mapped from/to different physical registers. In the ADD instruction we can see %2 is mapped to $eax, %0 is mapped from $edi, this is wrong because of the following instruction prevents coalescing of %0 and $edi.

$edi = COPY killed %1:gr32

With this patch LLVM correctly removes the mapping from %0 to $edi, but it causes worse result because this pass doesn't consider other RA constraints, live range interference in this case. In the ADD instruction now %0 doesn't map from any physical register, so this pass hopes it can be allocated to the same physical register as %2($eax), and keeps the two address ADD instruction. Unfortunately $eax is clobbered by the CALL instruction, so it is interfere with %0, $eax can't be allocated to %0, so the last COPY instruction becomes MOV.

Carrot added inline comments.Sep 27 2021, 7:32 PM
llvm/test/CodeGen/X86/avx512-regcall-NoMask.ll
1080

In this case there are many commutable instructions, with my code in findOnlyInterestingUse, more virtual registers can be mapped to %eax now, so more instructions can satisfy the last condition in isProfitableToConv3Addr

return (ToRegA && !regsAreCompatible(FromRegB, ToRegA, TRI));

The difficult part in this test case is there are so many commutable instructions, and 11 physical registers are copied from, each commutable instruction generates two map from possibilities, the total number of possible map from relations is a huge number. Also almost every virtual register can map to the result %eax. But in our implementation of SrcRegMap and DstRegMap, for each virtual register there is only one mapped from/to register. Also there is no sophisticated algorithm to choose a good mapping, we simply process each instruction one by one and update the mapping. So it is difficult to find the best mapping to guide instruction commute and 3-address instruction conversion.

foad added a subscriber: foad.Sep 28 2021, 7:07 AM
Carrot updated this revision to Diff 376074.Sep 29 2021, 5:09 PM

rebase.
Any other comments?

I can see you analyzed several regressions and thought the root causes are not the patch. I still have concern about the overall performance. Do you have any results on spec or benchmark that proves we are getting more gain with this patch?

I ran spec2006 on a skylake desktop, the result is 38.2 vs 38.3, so no difference.
I also checked the overall impact on impacted test case

carrot@carrot:~$ grep "\+\;" /tmp/patch | wc

2787   14813  114855

carrot@carrot:~$ grep "\-\;" /tmp/patch | wc

3138   16706  128560

About 351 instructions are deleted by this patch.

This revision is now accepted and ready to land.Oct 8 2021, 10:02 AM
This revision was landed with ongoing or failed builds.Oct 11 2021, 3:32 PM
This revision was automatically updated to reflect the committed changes.