This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Combine a load into GPR followed by a copy to FPR to a load into FPR directly through MIPeepholeOpt
Abandoned · Public

Authored by mingmingl on Jul 19 2022, 9:59 AM.

Details

Reviewers
dmgreen
fhahn
kazu
Summary

Example pattern before:

%1:gpr64 = LDRXui ...
%2:fpr64 = COPY %1:gpr64

Pattern after:
%2:fpr64 = LDRDui ...

This pattern shows up when elements are read from arrays and moved into floating-point
registers for calculation.
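
For illustration, here is a minimal hypothetical C++ example (the function name and the use of the pmull64 intrinsic are my own choices, picked because pmull is the motivating case mentioned later in this review) in which an array element is loaded into a GPR but consumed in an FPR:

#include <arm_neon.h>
#include <stdint.h>

// Build for a crypto-capable target, e.g. -march=armv8-a+aes.
// a[1] is loaded with LDRXui into a GPR; vmull_p64 consumes its operands
// in FPRs, so ISel emits a GPR->FPR COPY that this patch would fold into
// an FPR load.
uint64_t mul_elem(const uint64_t *a, poly64_t b) {
  poly128_t p = vmull_p64((poly64_t)a[1], b);
  return vgetq_lane_u64(vreinterpretq_u64_p128(p), 0);
}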

Diff Detail

Event Timeline

mingmingl created this revision.Jul 19 2022, 9:59 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 19 2022, 9:59 AM
mingmingl requested review of this revision.Jul 19 2022, 9:59 AM
mingmingl retitled this revision from [aarch64-mi-peepholeopt] Combine a gpr->fp copy of load gpr into load gp directly. to [AArch64] Combine a load into GPR followed by a copy to FPR to a load into FPR directly through MIPeepholeOpt.
mingmingl edited the summary of this revision. (Show Details)Jul 21 2022, 12:33 AM
mingmingl edited the summary of this revision. (Show Details)Jul 21 2022, 12:37 AM
mingmingl added reviewers: dmgreen, fhahn.

Implementation-wise, use MachineRegisterInfo::replaceRegWith to replace all uses of Reg; in this way, debug uses of the register will be updated as well.

Besides, add a dedicated MIR test for this optimization, so that machine instruction updates (including MachineInstr::memoperands, etc.) are visible.
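
For reference, a minimal sketch of what that rewrite step could look like (the names LoadMI, CopyMI, and NewOpCode are assumptions for illustration, not the exact patch):

// Hypothetical sketch: LoadMI is the LDRXui, CopyMI the gpr->fpr COPY,
// and NewOpCode the equivalent FPR load opcode (e.g. AArch64::LDRDui).
Register OldDst = LoadMI.getOperand(0).getReg();
Register NewDst = CopyMI.getOperand(0).getReg();

// Retarget the load to the FPR opcode and the FPR destination vreg.
LoadMI.setDesc(TII->get(NewOpCode));
LoadMI.getOperand(0).setReg(NewDst);

// replaceRegWith rewrites all remaining uses of the old vreg, including
// debug uses, before the now-redundant COPY is erased.
MRI->replaceRegWith(OldDst, NewDst);
CopyMI.eraseFromParent();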

kazu added inline comments.Jul 22 2022, 10:14 PM
llvm/lib/Target/AArch64/AArch64MIPeepholeOpt.cpp
108

Declare this with const. Alternatively, turn the definition into a function in the anonymous namespace outside the class.

109

Declare this with const. Alternatively, turn the definition into a function in the anonymous namespace outside the class.

110

Declare this with const. Alternatively, turn the definition into a function in the anonymous namespace outside the class.

222

`Transformation before` sounds a bit strange. I'd suggest `Before transformation` or simply `Before`.

227

Likewise.

238

Reduce indentation with an early return like so:

if (!SrcOperand.isReg() || !MRI->hasOneNonDBGUse(SrcOperand.getReg()))
  return false;
251–252

Reduce indentation with an early return like so:

if (!isFPRRegister(DstRegClass) || !isGPRegister(SrcRegClass) ||
    !hasSameNumberOfBits(DstRegClass, SrcRegClass))
  return false;
286

Remove this as you have an identical line about 50 lines above.

302–328

Do these two blocks of code trigger at all? I tried ninja check-llvm with assert(0) placed in these two blocks, but neither one triggered. It may be tedious to have a test case for every case in the first switch statement, but I'd suggest at least one test case for each block in the second switch statement -- partly for correctness and partly for coverage.

332

Reduce indentation with an early return like so:

if (NewOpCode == -1)
  return false;

You might consider placing this check between the two switch statements because the second switch statement doesn't change NewOpCode.

335

Remove `this is` to fix the comma splice.

Do you have details on the motivating case for this? I was under the impression that it was usually handled by ISel, but the tests certainly suggest it isn't always the case. There will be cases that cross basic blocks, which DAG ISel won't be able to match. A number of the other test changes look like artifacts of how the tablegen patterns are defined.

The AArch64MIPeepholeOptimizer seems to be a rich source of bugs. Every time we make a change in it, it takes several attempts to get it correct.

Allen added a subscriber: Allen.Jul 23 2022, 8:30 AM
mingmingl added a comment.EditedJul 24 2022, 2:36 PM

Do you have details on the motivating case for this? I was under the impression that it was usually handled by ISel, but the tests certainly suggest it isn't always the case. There will be cases that cross basic blocks, which DAG ISel won't be able to match. A number of the other test changes look like artifacts of how the tablegen patterns are defined.

This is a good point. I understand that DAG ISel is probably a better place to handle this combine, to avoid fiddling with various kinds of load/store operations.

As shown in the .mir test case, the machine instructions are in the same basic block (not across basic blocks, so code sinking won't help).

The existing UIntToFPROLoadPat and VecROLoadPat patterns in AArch64InstrInfo.td won't catch this; in particular, the test case uses a load with an unsigned offset (not a register offset), so it is not caught by VecROLoadPat.

The debug log of instruction selection shows it's very likely due to:

  1. a lack of type legalization (from i64 to v1i64 or something similar);
  2. the lack of a pattern (named, say, VecUILoadPat) similar to what VecROLoadPat does for loads with a register offset.

I'll look into the details and see what's missing.

By the way, for simplicity the patch only handles loads, but I believe the same optimization opportunity exists for store operations.

The AArch64MIPeepholeOptimizer seems to be a rich source of bugs. Every time we make a change in it, it takes several attempts to get it correct.

Thanks for pointing out a different place to solve the same problem. I share the concern that general memory operations are hard to rewrite correctly, and even when the rewrites are correct, it's better to make them happen earlier in the pipeline.

mingmingl added inline comments.Jul 24 2022, 2:40 PM
llvm/lib/Target/AArch64/AArch64MIPeepholeOpt.cpp
302–328

From a cursory look, I didn't see them triggered by the affected test cases (i.e., most of the affected load operations come without indices).

It definitely makes sense to add the test coverage, even if the fix eventually lands in another place.

mingmingl planned changes to this revision.Jul 24 2022, 2:41 PM

An update:

  1. For the motivating use case (around the pmull intrinsic and instruction), I sent out D130548.
  2. The general issue (the missed {load + copy} -> {load} combine) still exists.

For 2), peephole-opt should have folded the load (code), given that AArch64InstrInfo.cpp considers this pattern (mentioned in comment). I'm going to look into why it doesn't kick in and report back.

mingmingl marked 6 inline comments as done and an inline comment as not done.EditedAug 1 2022, 11:12 AM

Do you have details on the motivating case for this? I was under the impression that it was usually handled by ISel, but the tests certainly suggest it isn't always the case. There will be cases that cross basic blocks, which DAG ISel won't be able to match. A number of the other test changes look like artifacts of how the tablegen patterns are defined.

The AArch64MIPeepholeOptimizer seems to be a rich source of bugs. Every time we make a change in it, it takes several attempts to get it correct.

It's true that most of the cross-register-bank copies are introduced in the (multi-step) ISel pass. May I get some feedback on whether the general idea of combining {ldr + copy} into {ldr} is good to go, given that 1) fixing ISel patterns one by one could be laborious, and 2) peephole-opt does more cleanup (say, when the {ldr + copy} -> {ldr} opportunity is only exposed after ISel, as a result of multiple passes running)?

Also updated the implementation to bail out on illegal conditions.

llvm/lib/Target/AArch64/AArch64MIPeepholeOpt.cpp
302–328

Leaving this open until the go/no-go discussion on the peephole approach converges.

mingmingl updated this revision to Diff 449080.Aug 1 2022, 11:14 AM
mingmingl marked an inline comment as not done.

Update the implementation to bail out on illegal conditions; the major ones include (a sketch of these checks follows the list):

  1. the ldr is not safe to move;
  2. load-fold barriers exist between the ldr and the copy;
  3. the ldr or the copy is part of an MI bundle.
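
A minimal sketch of those checks, assuming the ldr and the copy live in the same basic block (the helper name and structure are hypothetical, not the exact patch):

// Hypothetical helper illustrating the three bail-out conditions above.
static bool isLegalToFold(MachineInstr &Load, MachineInstr &Copy,
                          AAResults *AA) {
  // 3. Bail out if either instruction is part of an MI bundle.
  if (Load.isBundled() || Copy.isBundled())
    return false;

  // 1. Bail out if the load is not safe to move (volatile, ordered, ...).
  bool SawStore = false;
  if (!Load.isSafeToMove(AA, SawStore))
    return false;

  // 2. Bail out if a load-fold barrier (a store, a call, or an
  //    instruction with unmodeled side effects) sits between the two.
  for (auto It = std::next(Load.getIterator()), E = Copy.getIterator();
       It != E; ++It)
    if (It->isLoadFoldBarrier())
      return false;

  return true;
}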

I worry about the potential for bugs in any patch in the AArch64MIPeepholeOptimizer. And loading makes that more complex. The altered tests look like they are either intrinsics like the pmull64 that was recently changed or strange types (<32 x i1>, <16 x i12>) that don't come up a lot in practice. The global-isel change in dp1.ll is a missed fold that is probably better handled in global-isel. The irregular type tests also show it only handling straight loads, not instructions like ld1 { v1.s }[1], [x12]. The current model is to try and handle almost all combines during ISel so that they can all work together in one place.

If this comes up in more places it might be worth it. From the examples in the tests I'm not so sure though. Do you have any other examples where this helps?

mingmingl abandoned this revision.Aug 3 2022, 1:53 PM

I worry about the potential for bugs in any patch in the AArch64MIPeepholeOptimizer. And loading makes that more complex. The altered tests look like they are either intrinsics like the pmull64 that was recently changed or strange types (<32 x i1>, <16 x i12>) that don't come up a lot in practice. The global-isel change in dp1.ll is a missed fold that is probably better handled in global-isel. The irregular type tests also show it only handling straight loads, not instructions like ld1 { v1.s }[1], [x12]. The current model is to try and handle almost all combines during ISel so that they can all work together in one place.

I totally agree that loading/storing would make things more complex here, and that global-isel should be able to handle dp1.ll; the strange types (<16 x i2>, <32 x i1>) look more interesting, but I'm not sure whether they are the result of multiple issues.

If this comes up in more places it might be worth it. From the examples in the tests I'm not so sure though. Do you have any other examples where this helps?

I'm convinced it's better to handle the affected test cases in ISel (I still made the changes, hoping to get them closer to correct).

I haven't observed other examples where this helps; the problem hunting started from the pmull test case (llvm/test/CodeGen/AArch64/aarch64-pmull2.ll in D131047).

I'll abandon this patch for now; thanks for your timely feedback on the series of changes!