This is an archive of the discontinued LLVM Phabricator instance.

[MachineSink][AArch64] Sink instruction copies when they can replace copy into hard register or folded into addressing mode
ClosedPublic

Authored by chill on Jun 13 2023, 9:36 AM.

Details

Summary

This patch adds a new code transformation to the MachineSink pass,
that tries to sink copies of an instruction, when the copies can be folded
into the addressing modes of load/store instructions, or
replace another instruction (currently, copies into a hard register).

The criteria for performing the transformation is that:

  • the register pressure at the sink destination block must not exceed the register pressure limits
  • the latency and throughput of the load/store or the copy must not deteriorate
  • the original instruction must be deleted

Diff Detail

Event Timeline

chill created this revision.Jun 13 2023, 9:36 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 13 2023, 9:36 AM
chill requested review of this revision.Jun 13 2023, 9:36 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 13 2023, 9:36 AM
chill planned changes to this revision.Jun 15 2023, 5:43 AM
chill updated this revision to Diff 532723.Jun 19 2023, 11:39 AM

Bug fixes.

chill planned changes to this revision.Jun 20 2023, 1:40 AM
chill updated this revision to Diff 532977.Jun 20 2023, 10:09 AM
chill updated this revision to Diff 538152.Jul 7 2023, 8:18 AM
chill updated this revision to Diff 538174.Jul 7 2023, 9:32 AM

Update:

  • added a command line option --aarch64-enable-sink-fold=[true|false]
  • fixed a test

A few nitpicks.

llvm/include/llvm/CodeGen/TargetInstrInfo.h
1449

typo

1463

typo

llvm/lib/CodeGen/MachineSink.cpp
483

pressure

llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
2767

indentation

3095

indentation

High level comment - I like having the ability to do this, is there a reason it is a sink, as opposed to a peephole? I think a lot of the time it may just always be profitable to combine into address operands or add+lsl. Is it due to the register pressure?

chill updated this revision to Diff 547261.Aug 4 2023, 9:43 AM

Update:

  • added folding of 32-bit zero-/sign-extends into load/store addressing mode
  • removed an arbitrary restriction of not folding instructions in the same basic block
  • implemented a slightly more efficient cleanup of dead COPYs
  • removed references to nonexistent virtual registers from debug instructions
chill updated this revision to Diff 547264.Aug 4 2023, 9:46 AM

.. and now with the correct patch ... :x

chill updated this revision to Diff 547269.Aug 4 2023, 10:15 AM
chill marked 5 inline comments as done.Aug 4 2023, 10:16 AM
chill added a comment.Aug 4 2023, 10:21 AM

High level comment - I like having the ability to do this, is there a reason it is a sink, as opposed to a peephole? I think a lot of the time it may just always be profitable to combine into address operands or add+lsl. Is it due to the register pressure?

Yes, I think we should take register pressure into account, as we can replace use of one register with a use of two registers (e.g. ldr Xd, [Xa] -> ldr Xd, [Xn, Xm]).

chill updated this revision to Diff 548667.Aug 9 2023, 10:21 AM

Fixed a miscompilation.

There is a lot of code here, but it looks like a good idea. I was looking at folding shifts into and/or recently too, and it looked like there were cases where it is better to be careful about where it is combined based on the uses. The same is likely true for add/sub and addressing modes too.

llvm/lib/CodeGen/MachineSink.cpp
511

Doesn't need a ; to end the inside of the LLVM_DEBUG.

535

-> addressing mode memory instruction?

llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
2792

addressing

2939

I think it's worth clarifying how these should work with LSLFast and addressing operands. I had an explanation of how I thought LSLFast should work in https://reviews.llvm.org/D155470#4527270. Let me know if you think doesn't sound right. There is a patch to split LSLFast into multiple parts as a first step in https://reviews.llvm.org/D157982.

3340

Would constrainRegClass work too, to avoid the COPY?

3359

instruction

9217

shift -> Shift

llvm/test/CodeGen/AArch64/arm64-xaluo.ll
105 ↗(On Diff #548667)

Can you regenerate these tests in the parent, to remove the unrelated changes from here?

llvm/test/CodeGen/AArch64/sink-and-fold.ll
305

Are these tests new? Can this be removed in the parent?

chill added inline comments.Aug 17 2023, 2:09 AM
llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
3340

We come here with virtual register in AM.BaseReg in the GPR64 register class, but the load/store instruction needs a register in the GPR64sp, which is not a subclass of GPR64 (it's less constrained, a superclass).

IIUC, constrainRegClass would work in the opposite direction of what we need, from GPR64sp to GPR64.

Alternatively, that might be fixed in the verifier, to not complain if an operand is not in the exact register class, but is nevertheless in a more constrained class (a subclass).

chill added inline comments.Aug 17 2023, 2:14 AM
llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
3340

All right, having said that, I just tried using constrainRegClass and it worked. I'll look some more into it.

chill added inline comments.Aug 17 2023, 2:25 AM
llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
3340

I see what happens, neither one of GPR64 or GPR64sp is a superclass/subclass of the other as I thought (one has XZR where the other has SP). So constrainRegClass yields GPR64common and the machine verifier already works exactly suggested above.

chill added inline comments.Aug 17 2023, 3:25 AM
llvm/test/CodeGen/AArch64/sink-and-fold.ll
305

These tests are supposed to be in the new patch as well.

chill updated this revision to Diff 551097.Aug 17 2023, 5:21 AM
chill marked 7 inline comments as done.Aug 17 2023, 5:23 AM
chill marked an inline comment as done.

I've been wondering how to stage this, whilst trying to move towards https://reviews.llvm.org/D155470#4527270. I think this implements something closer to "Ext23Fast" from the 4 options there, but checks for LSLFast.

It might be best to go with this, and then we can adjust the Target features and clean up the uses in another patch. Otherwise we are trying to do too many things at once, and there is already quite a bit of code here. I have some questions inline about the folding of add's, but otherwise from what I can tell this looks good.

llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
2936

I believe it is shifts of 1 or 4 that would be more expensive for OoO cores, but the other shift types are also cheap. AddrLSLFast means any addressing mode with a LSL with shift <= 3 are cheap. ALULSLFast means adds/subs with LSL<=4 are fast.

I think the logic should be similar to that in DAGCombine (ignoring register pressure for a moment). If we are optimizing for size or there are no other uses the fold should be beneficial. Otherwise we treat it as cheap if we have AddrLSLFast and the shift is <= 3. An ADDXrs could take 2 cycles anyway so could be more aggressive?

Does this take into account the number of uses, and should it? Should it fold more under Optsize?

chill updated this revision to Diff 556265.Sep 8 2023, 8:16 AM
chill marked 2 inline comments as done.Sep 8 2023, 8:22 AM
chill added inline comments.
llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
2936

Yeah, that makes sense. We already perform the transformation only if the original instruction is removed (i.e. can be folded into all of its users), if we are also optimizing for size, then we can ignore here the (minor) increase on the cycle count, if that would increase the chances for the transformation happening.

2939

I guess we can be more precise by distinguishing between cores where shifts 0, 1, 2, and 3 are fast
and cores where just shifts 0, 2, and 3 are fast. Not sure if it's worth adding an extra feature. Lacking such a feature, for now the code considers shift 1 to be slow by default, but it could just as well consider shift 1 to be fast by default.

I have been wondering what to do with this, whether to nitpick the costmodel in ways that might not be useful, but it is probably better to get it in and work out any adjustments to the target features as needed. I think this is closer to what we are aiming for in terms of heuristics.

Is it worth setting EnableSinkAndFold to false, we can commit this then have another patch (doesn't need review) to enable it? Just in case there are problems it can help ease the commit-revert cycles.

Otherwise I was looking through the sinking code and had an extra question.

llvm/lib/CodeGen/MachineSink.cpp
467–468

Is this assuming that the UseInst is a copy or that canFoldIntoAddrMode instructions always have an Operand(1) which is a reg? Is UseInst.getOperand(1) always MO in for AArch64?

llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
2939

Arm OoO cores can treat any extend as cheap, so long as the shift is 2 or 3. For in order cores it appears to be any shift. My understanding of LSLFast was the the UXTW/SXTW would still not be considered cheap. What you have is probably good though, and matches my understanding of what many cores implement.

chill marked 2 inline comments as done.Sep 18 2023, 2:54 AM
chill added inline comments.
llvm/lib/CodeGen/MachineSink.cpp
467–468

The comment is obsolete and the expression is incorrect.

We have here the original instruction

DefReg = opc UsedRegA, UsedRegB

and a chain of copies (maybe empty)

DefReg ->  ... -> Reg

and the instruction we're trying to fold into

... = opc Op0, Op1, ..., Reg, ...

We are replacing Reg with some expression, which involves UsedRegA and UsedRegB , so potentially where we had one register now we would have two, so register pressure may increase.
However, if Reg is the in the same register class as UsedRegA or UsedRegB, then register pressure for that register class does not increase, as we are simply replacing one virtual reg with another virtual reg.

The expression should be just const TargetRegisterClass *RCS = MRI->getRegClass(Reg);

chill updated this revision to Diff 557043.Sep 19 2023, 7:16 AM
chill marked 2 inline comments as done.
dmgreen accepted this revision.Sep 19 2023, 11:56 PM

LGTM, thanks

This revision is now accepted and ready to land.Sep 19 2023, 11:56 PM
This revision was landed with ongoing or failed builds.Sep 25 2023, 2:50 AM
This revision was automatically updated to reflect the committed changes.