This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Merge LDRSWpre-LD[U]RSW pair into LDPSWpre.
ClosedPublic

Authored by chaosdefinition on Jun 7 2023, 4:18 PM.

Details

Summary

This patch optimizes a pair of LDRSWpre and LDRSWui (or LDURSWi) instructions into a single LDPSWpre instruction. This is a missing case in D99272.

MIR test cases in D152564 are updated to verify the optimization.

Diff Detail

Event Timeline

chaosdefinition created this revision.Jun 7 2023, 4:18 PM
chaosdefinition requested review of this revision.Jun 7 2023, 4:18 PM
Herald added a project: Restricted Project. · View Herald TranscriptJun 7 2023, 4:18 PM
fhahn added inline comments.Jun 8 2023, 4:02 AM
llvm/test/CodeGen/AArch64/ldrpre-ldr-merge.mir
608

I think we should also have a test case with LDURSWi and variants where LDRSWpre is the second merge candidate. Preferably the tests will be submitted separately first.

chaosdefinition added inline comments.Jun 8 2023, 4:36 PM
llvm/test/CodeGen/AArch64/ldrpre-ldr-merge.mir
608

Thanks for reviewing! By submitting the tests first, I suppose merge cases should be no merge to avoid test failures and then updated to merge here, correct?

chaosdefinition retitled this revision from [AArch64] Merge LDRSWpre-LDRSWui pair into LDPSWpre. to [AArch64] Merge LDRSWpre-LD[U]RSW pair into LDPSWpre..
chaosdefinition edited the summary of this revision. (Show Details)

Addressing comments by @fhahn:

  • Add a test case of merging LDRSWpre and LDURSWi
  • Add test cases with LDRSWpre as the second merge candidate
  • Split test cases into a separate patch (see D152564)

Can you add a test where getMatchingNonSExtOpcode would be true, and we could turn into a ldp+sext? I'm not sure if it will actually happen at the moment with pre nodes. LDRSWpre + LDRWui and maybe LDRWpre + LDRSWui

Update the patch to be based on the updated D152564, no functionality change.

Can you add a test where getMatchingNonSExtOpcode would be true, and we could turn into a ldp+sext? I'm not sure if it will actually happen at the moment with pre nodes. LDRSWpre + LDRWui and maybe LDRWpre + LDRSWui

Addressed in D152564. Indeed these are currently not merged; perhaps fix them in the future?

dmgreen accepted this revision.Jul 18 2023, 12:31 AM

Sorry for the delay. This fell off my radar. The changes LGTM, as far as I can see.

This revision is now accepted and ready to land.Jul 18 2023, 12:31 AM

Heads-up - this caused multiple segmentation faults in our code base. We are working on a reproducer.

scw added a subscriber: scw.EditedJul 25 2023, 3:56 PM

We see a miscompilation after this patch in protobuf under MSan. This is the line where the miscompilation happens, and I extracted it to this godbolt with compilation option -fsanitize=memory -O.

Before this patch, the AArch64 assembly is (registers are renamed a bit, because this is disassembled from a binary I built internally)

1bfbbc64: b8808d49      ldrsw   x9, [x10, #8]!
1bfbbc68: ca0d014e      eor     x14, x10, x13
1bfbbc6c: b8404d4c      ldr     w12, [x10, #4]!
1bfbbc70: ca0d014d      eor     x13, x10, x13

After, it became

1bfbb51c: 29c13149      ldp     w9, w12, [x10, #8]!
1bfbb520: 93407d4a      sxtw    x10, w10
1bfbb524: ca0d014e      eor     x14, x10, x13
1bfbb528: ca0d014d      eor     x13, x10, x13

There are two problems:

  1. The after sequence touches x10, before calculating x13 and x14, which are supposed to be the addresses of the shadow memory. This caused the crash because our x10 was pointing to stack where the 32-th bit is 1. I believe that line should have been sxtw x9, w9 instead, to perform the extension done in ldrsw in the before sequence.
  2. This also miscalculates x13, where in the before sequence, after execution, it holds old(x13) xor (x10 + 12), yet after the "after" sequence, it holds old(x13) xor (x10 + 8), the same as x14.

Heads-up - this caused multiple segmentation faults in our code base. We are working on a reproducer.

We see a miscompilation after this patch in protobuf under MSan. This is the line where the miscompilation happens, and I extracted it to this godbolt with compilation option -fsanitize=memory -O.

Before this patch, the AArch64 assembly is (registers are renamed a bit, because this is disassembled from a binary I built internally)

1bfbbc64: b8808d49      ldrsw   x9, [x10, #8]!
1bfbbc68: ca0d014e      eor     x14, x10, x13
1bfbbc6c: b8404d4c      ldr     w12, [x10, #4]!
1bfbbc70: ca0d014d      eor     x13, x10, x13

After, it became

1bfbb51c: 29c13149      ldp     w9, w12, [x10, #8]!
1bfbb520: 93407d4a      sxtw    x10, w10
1bfbb524: ca0d014e      eor     x14, x10, x13
1bfbb528: ca0d014d      eor     x13, x10, x13

There are two problems:

  1. The after sequence touches x10, before calculating x13 and x14, which are supposed to be the addresses of the shadow memory. This caused the crash because our x10 was pointing to stack where the 32-th bit is 1. I believe that line should have been sxtw x9, w9 instead, to perform the extension done in ldrsw in the before sequence.
  2. This also miscalculates x13, where in the before sequence, after execution, it holds old(x13) xor (x10 + 12), yet after the "after" sequence, it holds old(x13) xor (x10 + 8), the same as x14.

Thanks for reporting and reproducing the miscompilation! It looks like two pre-indexed loads LDRSWpre and LDRWpre got merged into a LDPWpre and SBFMXri, which is an unintended result by this patch. I think we can either rollback this patch or apply a quick fix that disallows merging two pre-indexed loads like the following. Any thoughts?

diff --git a/llvm/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp b/llvm/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
index 419b471db3a3..e64b1d0658f0 100644
--- a/llvm/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
+++ b/llvm/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
@@ -1317,20 +1317,24 @@ static bool areCandidatesToMergeOrPair(MachineInstr &FirstMI, MachineInstr &MI,
                                   MI.getFlag(MachineInstr::FrameDestroy)))
     return false;

   unsigned OpcA = FirstMI.getOpcode();
   unsigned OpcB = MI.getOpcode();

   // Opcodes match: If the opcodes are pre ld/st there is nothing more to check.
   if (OpcA == OpcB)
     return !AArch64InstrInfo::isPreLdSt(FirstMI);

+  // Two pre ld/st of different opcodes cannot be merged either
+  if (AArch64InstrInfo::isPreLdSt(FirstMI) && AArch64InstrInfo::isPreLdSt(MI))
+    return false;
+
   // Try to match a sign-extended load/store with a zero-extended load/store.
   bool IsValidLdStrOpc, PairIsValidLdStrOpc;
   unsigned NonSExtOpc = getMatchingNonSExtOpcode(OpcA, &IsValidLdStrOpc);
   assert(IsValidLdStrOpc &&
          "Given Opc should be a Load or Store with an immediate");
   // OpcA will be the first instruction in the pair.
   if (NonSExtOpc == getMatchingNonSExtOpcode(OpcB, &PairIsValidLdStrOpc)) {
     Flags.setSExtIdx(NonSExtOpc == (unsigned)OpcA ? 1 : 0);
     return true;
   }

Thanks for reporting and reproducing the miscompilation! It looks like two pre-indexed loads LDRSWpre and LDRWpre got merged into a LDPWpre and SBFMXri, which is an unintended result by this patch. I think we can either rollback this patch or apply a quick fix that disallows merging two pre-indexed loads like the following. Any thoughts?

Hi there! I'm not an expert here, but speaking with @scw and he thinks the fix doesn't address the "sxtw on wrong register bug". He'll probably be able to comment with details later but his vote is to rollback for now. Can you please do ?

Thanks

alexfh added a subscriber: alexfh.Jul 26 2023, 6:42 AM

Thanks for reporting and reproducing the miscompilation! It looks like two pre-indexed loads LDRSWpre and LDRWpre got merged into a LDPWpre and SBFMXri, which is an unintended result by this patch. I think we can either rollback this patch or apply a quick fix that disallows merging two pre-indexed loads like the following. Any thoughts?

Hi there! I'm not an expert here, but speaking with @scw and he thinks the fix doesn't address the "sxtw on wrong register bug". He'll probably be able to comment with details later but his vote is to rollback for now. Can you please do ?

Thanks

Sent a revert: https://reviews.llvm.org/D156328 (will commit after the precommit checks finish).

chaosdefinition reopened this revision.Jul 26 2023, 2:25 PM
chaosdefinition added inline comments.
llvm/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
1338

Hi there! I'm not an expert here, but speaking with @scw and he thinks the fix doesn't address the "sxtw on wrong register bug". He'll probably be able to comment with details later but his vote is to rollback for now. Can you please do ?

Thanks

The wrong register in SBFMXri was resulted from SExtIdx being set to 0 here. However, to reach here:

  • OpcA and OpcB must be different;
  • OpcA and OpcB must have the same NonSExtOpcode.

So this code path works only for pairs of LDRSWui-LDRWui and LDURSWi-LDURWi before.

Now with this patch, the only added case to reach here is LDRSWpre-LDRWpre pairs, as was caught by the miscompilation. So my quick fix should prevent such pairs from generating SBFMXri, by bailing out early on two pre-indexed loads.

I guess there might be a better way to fix this. Any thoughts while I work on an updated patch?

Thanks!

This revision is now accepted and ready to land.Jul 26 2023, 2:25 PM

Reverting in the sort term does sounds like a good idea to make sure LLVM 17 is in a good shape. We can then work on getting this recommitted with the fix.

Hi, I'm reopening this with the following updates:

  • Apply the quick fix described in D152407#4534022 that disallows merging two pre-indexed loads, assuming there's no better approach.
  • Add MIR regression tests modified from @scw's reproduction .

Thanks!

scw added a comment.Aug 29 2023, 10:47 AM

Hi, I'm reopening this with the following updates:

  • Apply the quick fix described in D152407#4534022 that disallows merging two pre-indexed loads, assuming there's no better approach.
  • Add MIR regression tests modified from @scw's reproduction .

Thanks!

Test case LGTM, I don't have enough knowledge for the code to know if the proposed fix is the best way though.

Also, should this change reuse the same D number or should we create a new one?

Test case LGTM, I don't have enough knowledge for the code to know if the proposed fix is the best way though.

Also, should this change reuse the same D number or should we create a new one?

I don't know what the standard way for this is. I'm inclined to reuse the D number so to keep the discussion in this thread, but opening up a new thread is fine as well.

dmgreen accepted this revision.Sep 10 2023, 2:53 AM

The same revision should be OK, if the change is just a fix to the existing patch.

I had a look through the LoadStoreOptimizer. If you don't think there are any other cases this can go wrong them it LGTM.

The same revision should be OK, if the change is just a fix to the existing patch.

I had a look through the LoadStoreOptimizer. If you don't think there are any other cases this can go wrong them it LGTM.

Thanks. I don't see any other cases that can go wrong with this. Will commit shortly.