This is an archive of the discontinued LLVM Phabricator instance.

[X86FixupLEAs] Transform the sequence LEA/SUB to SUB/SUB
ClosedPublic

Authored by Carrot on May 5 2021, 7:43 PM.

Details

Summary

This patch transforms the sequence

    lea (reg1, reg2), reg3
    sub reg3, reg4

to two sub instructions

    sub reg1, reg4
    sub reg2, reg4

Similar optimization can also be applied to LEA/ADD sequence.
The modifications to TwoAddressInstructionPass is to ensure the operands of ADD instruction has expected order (the dest register of LEA should be src register of ADD).

Diff Detail

Unit TestsFailed

Event Timeline

Carrot created this revision.May 5 2021, 7:43 PM
Carrot requested review of this revision.May 5 2021, 7:43 PM
Herald added a project: Restricted Project. · View Herald TranscriptMay 5 2021, 7:43 PM
RKSimon added inline comments.May 6 2021, 6:11 AM
llvm/include/llvm/CodeGen/TargetInstrInfo.h
462

Add doxygen description

llvm/lib/Target/X86/X86FixupLEAs.cpp
454
for (unsigned I = 0, E = CurInst->getNumOperands(); I != E; ++I)
llvm/lib/Target/X86/X86InstrInfo.cpp
2708

Reduce scope:

if (MachineOperand *Op = MRI.getOneDef(Reg1))
2718

Reduce scope:

if (MachineOperand *Op = MRI.getOneDef(Reg2))
llvm/test/CodeGen/X86/lea-opt2.ll
2

I think its ok to pre-commit this (with a suitable extra FIXME/TODO explanation comment) and then rebase the patch to show the diffs.

145

are the nsw relevant?

Carrot added inline comments.May 6 2021, 9:59 AM
llvm/test/CodeGen/X86/lea-opt2.ll
2

Will do.

145

It should not be relevant.

I think it is good when the patch's description not just what the patch does, but also why it does that.

Carrot added a comment.May 7 2021, 9:27 AM

I think it is good when the patch's description not just what the patch does, but also why it does that.

Same as other LEA -> ALU optimizations in optTwoAddrLEA. On old architectures such as HSW or SKL, there are less issue ports containing LEA function unit than ALU. So LEA instructions may be delayed due to issue port competition. On newer architectures such as ICL or TGL, all four ports can issue LEA instructions, there is no such problem, but it doesn't hurt, because all optimizations in optTwoAddrLEA don't generate extra instructions.

Carrot updated this revision to Diff 344288.May 10 2021, 11:13 PM
Carrot marked 3 inline comments as done.

Can you try this patch with http://llvm-compile-time-tracker.com/ ?

I studied the page for a while, but couldn't find the function to test the compile time of a patch, I can only find the comparison between two commits.

RKSimon added a subscriber: nikic.

https://llvm-compile-time-tracker.com/about.php

Assuming you have a public llvm fork on your github, @nikic can set you up - you'd need to create a branch with 'perf/' and it would get picked up automatically.

@xbolva00 what in particular are you concerned about? Avoiding LEA can be useful but I doubt there will be much effect here.

Whether we want to run this transformation on newer archs - as you said, it brings us nothing, so atleast we should check it is free in terms of compile time in the backend.

RKSimon added inline comments.May 12 2021, 2:36 AM
llvm/lib/Target/X86/X86FixupLEAs.cpp
420

Please can you fix all these case style warnings:

MachineOperand &Opnd = CurInst->getOperand(I);
Carrot updated this revision to Diff 344925.May 12 2021, 1:06 PM
Carrot marked an inline comment as done.
RKSimon added inline comments.May 14 2021, 7:18 AM
llvm/lib/Target/X86/X86FixupLEAs.cpp
403
const int InstrDistanceThreshold = 5;
Carrot updated this revision to Diff 345584.May 14 2021, 4:21 PM
Carrot marked an inline comment as done.
RKSimon accepted this revision.May 15 2021, 4:16 AM

LGTM cheers

This revision is now accepted and ready to land.May 15 2021, 4:16 AM
This revision was automatically updated to reflect the committed changes.

Just a heads up, my auto-bisecting multi-stage cron job has identified this change as the source of a second stage regression in a bunch of clang unit tests. I'm about to verify by hand, but it'll take a while.

Just a heads up, my auto-bisecting multi-stage cron job has identified this change as the source of a second stage regression in a bunch of clang unit tests. I'm about to verify by hand, but it'll take a while.

perf or correctness regression?

Lots of crashes like this:

FAIL: Clang-Unit :: Format/./FormatTests/FormatTest.FormatsCompactNamespaces (23265 of 76779)
******************** TEST 'Clang-Unit :: Format/./FormatTests/FormatTest.FormatsCompactNamespaces' FAILED ********************
Note: Google Test filter = FormatTest.FormatsCompactNamespaces
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from FormatTest
[ RUN      ] FormatTest.FormatsCompactNamespaces
FormatTests: /home/dave/ro_s/lp/clang/lib/Basic/SourceManager.cpp:865: clang::FileID clang::SourceManager::getFileIDLoaded(unsigned int) const: Assertion `0 && "Invalid SLocOffset or bad function choice"' failed.
 #0 0x0000000000604313 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (/tmp/_update_lc/t/tools/clang/unittests/Format/./FormatTests+0x604313)
 #1 0x0000000000602962 llvm::sys::RunSignalHandlers() (/tmp/_update_lc/t/tools/clang/unittests/Format/./FormatTests+0x602962)
 #2 0x0000000000604d8a SignalHandler(int) Signals.cpp:0:0
 #3 0x00007ffff7fa4a20 (/lib64/libpthread.so.0+0x13a20)
 #4 0x00007ffff7b392a2 raise (/lib64/libc.so.6+0x3d2a2)
 #5 0x00007ffff7b228a4 abort (/lib64/libc.so.6+0x268a4)
 #6 0x00007ffff7b22789 (/lib64/libc.so.6+0x26789)
 #7 0x00007ffff7b31a16 (/lib64/libc.so.6+0x35a16)
 #8 0x000000000064922c clang::SourceManager::getFileIDLoaded(unsigned int) const (/tmp/_update_lc/t/tools/clang/unittests/Format/./FormatTests+0x64922c)
 #9 0x00000000003e734c clang::SourceManager::getDecomposedLoc(clang::SourceLocation) const CleanupTest.cpp:0:0
#10 0x000000000064b231 clang::SourceManager::isBeforeInTranslationUnit(clang::SourceLocation, clang::SourceLocation) const (/tmp/_update_lc/t/tools/clang/unittests/Format/./FormatTests+0x64b231)
#11 0x000000000066c05e clang::format::AffectedRangeManager::computeAffectedLines(llvm::SmallVectorImpl<clang::format::AnnotatedLine*>&) (/tmp/_update_lc/t/tools/clang/unittests/Format/./FormatTests+0x66c05e)
#12 0x00000000006ae393 clang::format::UsingDeclarationsSorter::analyze(clang::format::TokenAnnotator&, llvm::SmallVectorImpl<clang::format::AnnotatedLine*>&, clang::format::FormatTokenLexer&) (/tmp/_update_lc/t/tools/clang/unittests/Format/./FormatTests+0x6ae393)
#13 0x0000000000690d1a clang::format::TokenAnalyzer::process() (/tmp/_update_lc/t/tools/clang/unittests/Format/./FormatTests+0x690d1a)
#14 0x0000000000665643 std::__1::__function::__func<clang::format::internal::reformat(clang::format::FormatStyle const&, llvm::StringRef, llvm::ArrayRef<clang::tooling::Range>, unsigned int, unsigned int, unsigned int, llvm::StringRef, clang::format::FormattingAttemptStatus*)::$_4, std::__1::allocator<clang::format::internal::reformat(clang::format::FormatStyle const&, llvm::StringRef, llvm::ArrayRef<clang::tooling::Range>, unsigned int, unsigned int, unsigned int, llvm::StringRef, clang::format::FormattingAttemptStatus*)::$_4>, std::__1::pair<clang::tooling::Replacements, unsigned int> (clang::format::Environment const&)>::operator()(clang::format::Environment const&) Format.cpp:0:0
#15 0x0000000000655d6e clang::format::internal::reformat(clang::format::FormatStyle const&, llvm::StringRef, llvm::ArrayRef<clang::tooling::Range>, unsigned int, unsigned int, unsigned int, llvm::StringRef, clang::format::FormattingAttemptStatus*) (/tmp/_update_lc/t/tools/clang/unittests/Format/./FormatTests+0x655d6e)
#16 0x000000000065640f clang::format::reformat(clang::format::FormatStyle const&, llvm::StringRef, llvm::ArrayRef<clang::tooling::Range>, llvm::StringRef, clang::format::FormattingAttemptStatus*) (/tmp/_update_lc/t/tools/clang/unittests/Format/./FormatTests+0x65640f)
#17 0x00000000003ef27f clang::format::(anonymous namespace)::FormatTest::format(llvm::StringRef, clang::format::FormatStyle const&, clang::format::(anonymous namespace)::FormatTest::StatusCheck) FormatTest.cpp:0:0
#18 0x0000000000407eed clang::format::(anonymous namespace)::FormatTest_FormatsCompactNamespaces_Test::TestBody() FormatTest.cpp:0:0
#19 0x000000000060cc53 testing::Test::Run() (/tmp/_update_lc/t/tools/clang/unittests/Format/./FormatTests+0x60cc53)
#20 0x000000000060dc16 testing::TestInfo::Run() (/tmp/_update_lc/t/tools/clang/unittests/Format/./FormatTests+0x60dc16)
#21 0x000000000060e470 testing::TestSuite::Run() (/tmp/_update_lc/t/tools/clang/unittests/Format/./FormatTests+0x60e470)
#22 0x000000000061bb23 testing::internal::UnitTestImpl::RunAllTests() (/tmp/_update_lc/t/tools/clang/unittests/Format/./FormatTests+0x61bb23)
#23 0x000000000061b59d testing::UnitTest::Run() (/tmp/_update_lc/t/tools/clang/unittests/Format/./FormatTests+0x61b59d)
#24 0x000000000060555b main (/tmp/_update_lc/t/tools/clang/unittests/Format/./FormatTests+0x60555b)
#25 0x00007ffff7b23b75 __libc_start_main (/lib64/libc.so.6+0x27b75)
#26 0x00000000003ddd8e _start (/tmp/_update_lc/t/tools/clang/unittests/Format/./FormatTests+0x3ddd8e)

(IOW the stage 2 is getting miscompiled)

Verified. Second stage miscompile. If it matters (and it sometimes does), my first stage is built without asserts.

This revision is now accepted and ready to land.May 19 2021, 7:07 AM
RKSimon requested changes to this revision.May 19 2021, 7:08 AM
This revision now requires changes to proceed.May 19 2021, 7:08 AM

Thanks! Also, I feel fairly confident about this. My auto-bisecting cron job is really paranoid. For example, it always does a multi-stage test to verify that the commit *before* the commit blamed by git-bisect actually works. That's what I was waiting for. Also, I then verified by trying to move one commit ahead in the history to this commit to get an A/B test and I then confirmed that this is the regression. Thanks again for reverting this.

Carrot updated this revision to Diff 347480.May 24 2021, 1:23 PM

Thanks for the report!

The problem is in

-       lea    (%rax,%rcx,1),%ebp
        mov    %esi,%edi
-       sub    %ebp,%edi
-       jbe    <_ZNK5clang7tooling12Replacements5mergeERKS1_+0x279>
+       sub    %eax,%edi
+       sub    %ecx,%edi
+       jbe    <_ZNK5clang7tooling12Replacements5mergeERKS1_+0x278>

Previously I thought X - (Y + Z) generates same flags as X - Y - Z, unfortunately it is not true when X - Y overflows.
In one execution of the code snippet, I got
rsi 0x3 3
rax 0x5 5
rcx 0x0 0
Before the transformation the last sub generates CF=1.
After the transformation the last sub generates CF=0.
So it causes wrong behavior for the following branch.

This patch should fix the problem.

@RKSimon, could you please take another look.

@davezarzycki, could you help to test this patch with your configuration?

thanks!

I can check it, but I'm a bit concerned about the comment and what it implies. The problem is NOT overflow, but that comparison operations need to be redistributed during this optimization (or dependent comparisons need to disable the optimization). So in pseudocode, the original code that miscompiled looks like this:

if (x - (y + z) <= 0) goto something; // where <= is a unsigned comparison but it really doesn't matter.

So redistributing the arithmetic requires redistributing the comparison.

auto t1 = x - y
if (t1 <= 0) goto something;
if (t1 - z <= 0) goto something;

Does the new patch redistribute comparisons?

Actually, it is more complicated, and you're right that it also can involve overflow. I'm not sure if this is a trivial optimization when dependent comparisons are involved.

Actually, it is more complicated, and you're right that it also can involve overflow. I'm not sure if this is a trivial optimization when dependent comparisons are involved.

The new changes in function searchALUInst exactly does this check.

// X - (Y + Z) may generate different flags than (X - Y) - Z when there
// is overflow. So we can't change the alu instruction if the flags
// register is live.
MachineBasicBlock::iterator NextI = std::next(CurInst);
if (MBB.computeRegisterLiveness(TRI, X86::EFLAGS, NextI) !=
    MachineBasicBlock::LQR_Dead)
  return MachineBasicBlock::iterator();

If the flags of SUB/ADD is live, it will be used by following comparison or conditional branch or other instructions, we will give up the transformation.

In other words, only when the flags of the ALU instruction will not be used by following instructions (comparison or conditional branch), then we can do the transformation.

I've verified that this patch no longer causes my multi-stage cron job to regress. Thanks!

craig.topper added inline comments.May 26 2021, 10:18 AM
llvm/lib/Target/X86/X86FixupLEAs.cpp
437

Why can't we just check the Dead flag on the EFLAGS def of the ALU op?

Carrot updated this revision to Diff 348085.May 26 2021, 1:59 PM

Check the dead EFLAGS def directly instead of calling computeRegisterLiveness.
Add implicit dead def EFLAGS to new alu instructions.

llvm/lib/Target/X86/X86FixupLEAs.cpp
437

Good catch!
I copied the code from optTwoAddrLEA.

craig.topper added inline comments.May 26 2021, 5:04 PM
llvm/lib/Target/X86/X86FixupLEAs.cpp
437

Can we use MachineInstr::registerDefIsDead

515

This creates a second EFLAGS implicit def. The BuildMI would have already created one based on the MCInstrDesc. I checked the print-after-all output and saw this on the lea-opt2.ll test1.

$ecx = SUB32rr $ecx(tied-def 0), $edx, implicit-def $eflags, implicit-def dead $eflags                                                                                                                 
$ecx = SUB32rr $ecx(tied-def 0), $eax, implicit-def $eflags, implicit-def dead $eflags

I added -verify-machineinstrs, but it doesn't detect this as an issue. That probably doesn't work well with findRegisterDefOperandIdx.

I think what you want to do is this creating the instruction.

NewMI1->addRegisterDead(X86::EFLAGS, TRI);

That will scan the operands and mark the existing EFLAGS operand as dead.

Carrot updated this revision to Diff 348161.May 26 2021, 10:06 PM
Carrot marked 2 inline comments as done.
craig.topper added inline comments.May 27 2021, 11:04 AM
llvm/lib/Target/X86/X86InstrInfo.cpp
2707

I think you might be able to use getUniqueVRegDef which will return the MachineInstr directly

Carrot updated this revision to Diff 348354.May 27 2021, 12:24 PM
Carrot marked an inline comment as done.
RKSimon accepted this revision.May 28 2021, 3:16 AM

LGTM to unblock

This revision is now accepted and ready to land.May 28 2021, 3:16 AM
This revision was automatically updated to reflect the committed changes.

We are still observing a miscompile caused by the newer version of this patch. I'll try to provide you a repro.

Filed https://bugs.llvm.org/show_bug.cgi?id=50615. The original repro is in Java, but I hope the logs there are enough to figure out what's going on. The log shows that the described transform did apply.

I suggest to revert and then investigate.

We are still observing a miscompile caused by the newer version of this patch. I'll try to provide you a repro.

Surely you can provide the .ll after middle-end optimizations?

mkazantsev added a comment.EditedJun 8 2021, 3:20 AM

We are still observing a miscompile caused by the newer version of this patch. I'll try to provide you a repro.

Surely you can provide the .ll after middle-end optimizations?

Attached repro runnable with upstream llc to bug. Seems that it reproduces the same effect.

Carrot added a comment.Jun 8 2021, 8:47 AM

Thanks for the report!
I guess function searchALUInst should also check overlapped register usage of DestReg.

I'm a bit surprised that it wasn't reverted. The default policy is revert whatever is broken and then fix it. We have numerous internal failures & broken CI cycles because of this, and the fix is not merged yet (and not clear when it will be merged). @Carrot could you please revert this and return back along with fix?

fhahn added a subscriber: fhahn.Jun 11 2021, 6:50 AM

I also suspect this patch is causing the following failure (with -mllvm -verify-machineinstrs ) on GreenDragon (http://green.lab.llvm.org/green/job/test-suite-verify-machineinstrs-x86_64-O3/9579/)

/Users/buildslave/jenkins/workspace/test-suite-verify-machineinstrs-x86_64-O3/test-suite-build/tools/timeit --summary MultiSource/Applications/JM/ldecod/CMakeFiles/ldecod.dir/block.c.o.time /Users/buildslave/jenkins/workspace/test-suite-verify-machineinstrs-x86_64-O3/compiler/bin/clang -DNDEBUG  -B /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin    -Wno-unused-command-line-argument -mllvm -verify-machineinstrs -O3 -arch x86_64 -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.15.sdk   -w -Werror=date-time -fcommon -D__USE_LARGEFILE64 -D_FILE_OFFSET_BITS=64 -MD -MT MultiSource/Applications/JM/ldecod/CMakeFiles/ldecod.dir/block.c.o -MF MultiSource/Applications/JM/ldecod/CMakeFiles/ldecod.dir/block.c.o.d -o MultiSource/Applications/JM/ldecod/CMakeFiles/ldecod.dir/block.c.o   -c /Users/buildslave/jenkins/workspace/test-suite-verify-machineinstrs-x86_64-O3/test-suite/MultiSource/Applications/JM/ldecod/block.c
Fatal Error: error in backend: Found 2 machine code errors.
*** Bad machine code: Using an undefined physical register ***
- function:    itrans_sp
- basic block: %bb.0 entry (0x7fe93e957ac8)
- instruction: $ecx = SUB32rr $ecx(tied-def 0), $esi, implicit-def dead $eflags
- operand 2:   $esi
fatal error: error in backend: Found 2 machine code errors.
PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace, preprocessed source, and associated run script.
Stack dump:
Error: clang frontend command failed with exit code 70 (use -v to see invocation)
25 clang         0x000000011115a63d clang::driver::Compilation::ExecuteCommand(clang::driver::Command const&, clang::driver::Command const*&) const + 221
26 clang         0x000000011115ab9d clang::driver::Compilation::ExecuteJobs(clang::driver::JobList const&, llvm::SmallVectorImpl<std::__1::pair<int, clang::driver::Command const*> >&) const + 125
27 clang         0x00000001111715ec clang::driver::Driver::ExecuteCompilation(clang::driver::Compilation&, llvm::SmallVectorImpl<std::__1::pair<int, clang::driver::Command const*> >&) + 204
28 clang         0x000000010e89a145 main + 10309
29 libdyld.dylib 0x00007fff67a6fcc9 start + 1
clang-13: error: clang frontend command failed with exit code 70 (use -v to see invocation)
clang version 13.0.0 (https://github.com/llvm/llvm-project.git f0a68bbc967ab851e9b678feaf9015a2bfadb12e)
Target: x86_64-apple-darwin19.5.0

It would be great if you could take a look and revert the patch if the investigation will take longer.

fhahn added a comment.Jun 12 2021, 3:45 AM

The fix has been committed as f35bcea1d4748889b8240defdf00cb7a71cbe070.

Unfortunately this still creates invalid Machine IR for the llvm-test-suite on X86, as I mentioned above. E.g. see http://green.lab.llvm.org/green/job/test-suite-verify-machineinstrs-x86_64-O3/9585/

I reverted the fix and the patch in 5cd66420ccb1, 1b748faf2bae to get the public bots back to green.

To reproduce, llc -verify-machineinstrs on the IR below on X86:

target datalayout = "e-m:o-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"

%struct.widget = type { i32, i32, i32, i32, i32*, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, [16 x [16 x i16]], [6 x [32 x i32]], [16 x [16 x i32]], [4 x [12 x [4 x [4 x i32]]]], [16 x i32], i8**, i32*, i32***, i32**, i32, i32, i32, i32, %struct.baz*, %struct.wobble.1*, i32, i32, i32, i32, i32, i32, %struct.quux.2*, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, [3 x i32], i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32***, i32***, i32****, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, [3 x [2 x i32]], [3 x [2 x i32]], i32, i32, i64, i64, %struct.zot.3, %struct.zot.3, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 }
%struct.baz = type { i32, i32, i32, i32, i32, i32, i32, i32, i32, %struct.snork*, %struct.wombat.0*, %struct.wobble*, i32, i32*, i32*, i32*, i32, i32*, i32*, i32*, i32 (%struct.widget*, %struct.eggs*)*, i32, i32, i32, i32 }
%struct.snork = type { %struct.spam*, %struct.zot, i32 (%struct.wombat*, %struct.widget*, %struct.snork*)* }
%struct.spam = type { i32, i32, i32, i32, i8*, i32 }
%struct.zot = type { i32, i32, i32, i32, i32, i8*, i32* }
%struct.wombat = type { i32, i32, i32, i32, i32, i32, i32, i32, void (i32, i32, i32*, i32*)*, void (%struct.wombat*, %struct.widget*, %struct.zot*)* }
%struct.wombat.0 = type { [4 x [11 x %struct.quux]], [2 x [9 x %struct.quux]], [2 x [10 x %struct.quux]], [2 x [6 x %struct.quux]], [4 x %struct.quux], [4 x %struct.quux], [3 x %struct.quux] }
%struct.quux = type { i16, i8 }
%struct.wobble = type { [2 x %struct.quux], [4 x %struct.quux], [3 x [4 x %struct.quux]], [10 x [4 x %struct.quux]], [10 x [15 x %struct.quux]], [10 x [15 x %struct.quux]], [10 x [5 x %struct.quux]], [10 x [5 x %struct.quux]], [10 x [15 x %struct.quux]], [10 x [15 x %struct.quux]] }
%struct.eggs = type { [1000 x i8], [1000 x i8], [1000 x i8], i32, i32, i32, i32, i32, i32, i32, i32 }
%struct.wobble.1 = type { i32, [2 x i32], i32, i32, %struct.wobble.1*, %struct.wobble.1*, i32, [2 x [4 x [4 x [2 x i32]]]], i32, i64, i64, i32, i32, [4 x i8], [4 x i8], i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 }
%struct.quux.2 = type { i32, i32, i32, i32, i32, %struct.quux.2* }
%struct.zot.3 = type { i64, i16, i16, i16 }

define void @blam(%struct.widget* %arg, i32 %arg1) local_unnamed_addr {
bb:
  %tmp = load i32, i32* undef, align 4
  %tmp2 = sdiv i32 %tmp, 6
  %tmp3 = sdiv i32 undef, 6
  %tmp4 = load i32, i32* undef, align 4
  %tmp5 = icmp eq i32 %tmp4, 4
  %tmp6 = select i1 %tmp5, i32 %tmp3, i32 %tmp2
  %tmp7 = getelementptr inbounds [4 x [4 x i32]], [4 x [4 x i32]]* undef, i64 0, i64 0, i64 0
  %tmp8 = zext i16 undef to i32
  %tmp9 = zext i16 undef to i32
  %tmp10 = load i16, i16* undef, align 2
  %tmp11 = zext i16 %tmp10 to i32
  %tmp12 = zext i16 undef to i32
  %tmp13 = zext i16 undef to i32
  %tmp14 = zext i16 undef to i32
  %tmp15 = load i16, i16* undef, align 2
  %tmp16 = zext i16 %tmp15 to i32
  %tmp17 = zext i16 undef to i32
  %tmp18 = sub nsw i32 %tmp8, %tmp9
  %tmp19 = shl nsw i32 undef, 1
  %tmp20 = add nsw i32 %tmp19, %tmp18
  %tmp21 = sub nsw i32 %tmp11, %tmp12
  %tmp22 = shl nsw i32 undef, 1
  %tmp23 = add nsw i32 %tmp22, %tmp21
  %tmp24 = sub nsw i32 %tmp13, %tmp14
  %tmp25 = shl nsw i32 undef, 1
  %tmp26 = add nsw i32 %tmp25, %tmp24
  %tmp27 = sub nsw i32 %tmp16, %tmp17
  %tmp28 = shl nsw i32 undef, 1
  %tmp29 = add nsw i32 %tmp28, %tmp27
  %tmp30 = sub nsw i32 %tmp20, %tmp29
  %tmp31 = sub nsw i32 %tmp23, %tmp26
  %tmp32 = shl nsw i32 %tmp30, 1
  %tmp33 = add nsw i32 %tmp32, %tmp31
  store i32 %tmp33, i32* undef, align 4
  %tmp34 = mul nsw i32 %tmp31, -2
  %tmp35 = add nsw i32 %tmp34, %tmp30
  store i32 %tmp35, i32* undef, align 4
  %tmp36 = select i1 %tmp5, i32 undef, i32 undef
  br label %bb37

bb37:                                             ; preds = %bb
  %tmp38 = load i32, i32* undef, align 4
  %tmp39 = ashr i32 %tmp38, %tmp6
  %tmp40 = load i32, i32* undef, align 4
  %tmp41 = sdiv i32 %tmp39, %tmp40
  store i32 %tmp41, i32* undef, align 4
  ret void
}

Thanks for the test case!
The problem is:

-- Before transformation
  renamable $esi = LEA64_32r renamable $rcx, 1, renamable $rcx, 0, $noreg
  renamable $edi = LEA64_32r renamable $rcx, 2, renamable $rcx, 0, $noreg
  $eax = MOV32rr $ecx, implicit killed $rcx                                                               // $rcx is killed at here 
  renamable $eax = SUB32rr killed renamable $eax(tied-def 0), killed renamable $esi, implicit-def dead $eflags

-- After transformation
  renamable $edi = LEA64_32r renamable $rcx, 2, renamable $rcx, 0, $noreg
  $eax = MOV32rr $ecx, implicit killed $rcx                                                               // $rcx should not be killed at here
  $eax = SUB32rr $eax(tied-def 0), $ecx, implicit-def dead $eflags
  $eax = SUB32rr $eax(tied-def 0), $ecx, implicit-def dead $eflags                         // $rcx should be killed at here

The transformation may extend the life range of original BaseReg and IndexReg like $rcx in this test case, so the original kill flag should be cleared, and new kill flag should be added in new instruction.