This is an archive of the discontinued LLVM Phabricator instance.

[RegisterCoalescer] Delay live interval update work until the rematerialization for all the uses from the same def is done
ClosedPublic

Authored by wmi on Jul 18 2018, 4:30 PM.

Details

Summary

We run into a compile time problem with flex generated code combined with -fno-jump-tables. The cause is that machineLICM hoists a lot of invariants outside of a big loop, and drastically increases the compile time in global register splitting and copy coalescing. https://reviews.llvm.org/D49353 relieves the problem in global splitting. This patch is to handle the problem in copy coalescing.

About the situation where the problem in copy coalescing happens. After machineLICM, we have several defs outside of a big loop with hundreds or thousands of uses inside the loop. Rematerialization in copy coalescing happens for each use and everytime rematerialization is done, shrinkToUses will be called to update the huge live interval. Because we have 'n' uses for a def, and each live interval update will have at least 'n' complexity, the total update work is n^2.

To fix the problem, we try to do the live interval update work in a collective way. If a def has many copylike uses larger than a threshold, each time rematerialization is done for one of those uses, we won't do the live interval update in time but delay that work until rematerialization for all those uses are completed, so we only have to do the live interval update work once.

Delaying the live interval update could potentially change the copy coalescing result, so we hope to limit that change to those defs with many (like above a hundred) copylike uses.

I am running internal performance testing at the same time. I also run stress testing using clang bootstrap by setting the threshold to 0, i.e., delay all the live interval update work after rematerialization, and it passes.

Diff Detail

Repository
rL LLVM

Event Timeline

wmi created this revision.Jul 18 2018, 4:30 PM

Nice approach!

I'd love to know if the old pattern has any real advantages. It would seem much simpler to just always defer the interval update until all the copies have been coalesced. Maybe its worth collecting perf data there too? If we can get away with the simpler (non-heuristic) model, it seems better.

Having invalid intermediate live ranges seems scary, and I'm not naturally convinced this is fine. How much have you tested this change yet?

wmi added a comment.Jul 19 2018, 8:40 AM

Having invalid intermediate live ranges seems scary, and I'm not naturally convinced this is fine. How much have you tested this change yet?

The invalid live ranges are postponing to shrink some live ranges, so it is a conservative approach and shouldn't affect correctness, right? About the test, I set LateRematUpdateThreshold by default to 0 and bootstrap clang, then run through internal performance test. No correctness regression found.

Although there is performance perturbation in performance testing, that is within our expectation because failing to shrink live range in time could potentially hurt performance. That is why I choose to apply the change with a threshold.

wmi added a comment.Jul 20 2018, 2:08 PM

Did two large server testing by setting LateRematUpdateThreshold to 0. No correctness issues found.

I think I hit the compile time problem mentioned here too. Some files in our project can take 5-10 minutes to compile (depending on the machine compiling) and 80-85% of the time is spent in shrinkToUses in the registercoalescer.

A part that I'm interested into is if you know how changing this affects for example "resolveConflicts()" and "mapValue()". I don't have much experience with the RegisterCoalescer code, but from what I see those functions use LiveInterval information to check if values overlap (for example). I was wondering if we are sure that it is not affecting any of that from a correctness standpoint. From what I understand the delay is applied only if the rematerialization happens on a value used by a lot of coalesceable copies.
Have you tried running all your tests with the delay threshold set to the minimum , so that it is always delayed no matter what even if it used by a low number of copies?

Oh, I didn’t see the message where you said you tried. In that case disregard my question about testing with a lower threshold

MatzeB accepted this revision.Aug 2 2018, 4:54 PM

I think we can go ahead and commit to see how it fares in tree.

This revision is now accepted and ready to land.Aug 2 2018, 4:54 PM
This revision was automatically updated to reflect the committed changes.

Hello @wmi, in our local fuzzing testing we've got the following assert:
src/include/llvm/ADT/IntervalMap.h:630: unsigned int llvm::IntervalMapImpl::LeafNode<KeyT, ValT, N, Traits>::insertFrom(unsigned int&, unsigned int, KeyT, KeyT, ValT) [with KeyT = llvm::SlotIndex; ValT = unsigned int; unsigned int N = 9u; Traits = llvm::IntervalMapInfo<llvm::SlotIndex>]: Assertion `!Traits::stopLess(b, a) && "Invalid interval"' failed.

Triage points to this patch. Revert of this patch eliminates an assert. Also if I specify -late-remat-update-threshold=100000 test also passes.
Unfortunately at this moment I cannot create a LLVM reproducer for this bug. Also This bug is not reproducible on the current trunk however I have a suspicion that it is just hided not fixed.
I continue investigation.

However if you have some insides or suggestion what it could be or what other commits could fix that bug, could you please share it, it would probably save me a lot of time to narrowing it down.

The stack trace of the crash looks like:
#0 0x00007ffff700c5d7 in raise () from /lib64/libc.so.6
#1 0x00007ffff700dcc8 in abort () from /lib64/libc.so.6
#2 0x00007ffff7005546 in assert_fail_base () from /lib64/libc.so.6
#3 0x00007ffff70055f2 in
assert_fail () from /lib64/libc.so.6
#4 0x00007ffff13b334f in llvm::IntervalMapImpl::LeafNode<llvm::SlotIndex, unsigned int, 9u, llvm::IntervalMapInfo<llvm::SlotIndex> >::insertFrom (this=this@entry=0x7fff4811c0b8, Pos=@0x7fff5d8d4920: 2,

Size=2, a=..., a@entry=..., b=..., y=<optimized out>) at include/llvm/ADT/IntervalMap.h:630

#5 0x00007ffff13bd56d in llvm::IntervalMap<llvm::SlotIndex, unsigned int, 9u, llvm::IntervalMapInfo<llvm::SlotIndex> >::insert (this=this@entry=0x7fff4811c0b8, a=a@entry=..., b=..., b@entry=...,

y=<optimized out>) at include/llvm/ADT/IntervalMap.h:1092

#6 0x00007ffff13bd8ef in llvm::SplitEditor::useIntv (this=this@entry=0x7fff4811bff0, Start=Start@entry=..., End=...)

at lib/CodeGen/SplitKit.cpp:754

#7 0x00007ffff13beeee in llvm::SplitEditor::splitSingleBlock (this=0x7fff4811bff0, BI=...) at lib/CodeGen/SplitKit.cpp:1580
#8 0x00007ffff13297c7 in splitAroundRegion (UsedCands=..., LREdit=..., this=0x7fff482fd010)

at lib/CodeGen/RegAllocGreedy.cpp:1685

#9 (anonymous namespace)::RAGreedy::doRegionSplit (this=0x7fff482fd010, VirtReg=..., BestCand=6, HasCompact=<optimized out>, NewVRegs=...)

at lib/CodeGen/RegAllocGreedy.cpp:1968

#10 0x00007ffff1335b49 in tryRegionSplit (NewVRegs=..., Order=..., VirtReg=..., this=<optimized out>)

at lib/CodeGen/RegAllocGreedy.cpp:1832

#11 trySplit (NewVRegs=..., Order=..., VirtReg=..., this=<optimized out>) at lib/CodeGen/RegAllocGreedy.cpp:2453
#12 (anonymous namespace)::RAGreedy::selectOrSplitImpl (this=0x7fff482fd010, VirtReg=..., NewVRegs=..., FixedRegisters=..., Depth=1, Depth@entry=0)

at lib/CodeGen/RegAllocGreedy.cpp:3052

#13 0x00007ffff1335f77 in (anonymous namespace)::RAGreedy::selectOrSplit (this=0x7fff482fd010, VirtReg=..., NewVRegs=...)

at lib/CodeGen/RegAllocGreedy.cpp:2732

#14 0x00007ffff131905e in llvm::RegAllocBase::allocatePhysRegs (this=0x7fff482fd078, this@entry=0x7ffff3e19006 <llvm::MachineModuleInfo::ID>)

at lib/CodeGen/RegAllocBase.cpp:113

#15 0x00007ffff132f7b1 in (anonymous namespace)::RAGreedy::runOnMachineFunction (this=0x7fff482fd010, mf=...)

at lib/CodeGen/RegAllocGreedy.cpp:3207

#16 0x00007ffff123b6a5 in llvm::MachineFunctionPass::runOnFunction (this=0x7fff482fd010, F=...)

at lib/CodeGen/MachineFunctionPass.cpp:61

#17 0x00007ffff1061957 in llvm::FPPassManager::runOnFunction (this=0x7fff480bd530, F=...) at lib/IR/LegacyPassManager.cpp:1586

I was able to create a pure LLVM reproducer for this issue and filed a bug https://bugs.llvm.org/show_bug.cgi?id=40061.
Could you please take a look into it?