This is an archive of the discontinued LLVM Phabricator instance.

[JumpThreading] Reverse the order of basic block iteration.
Abandoned · Public

Authored by mnadeem on Oct 3 2022, 9:08 PM.

Details

Summary

This patch is a fix for a compile-time issue I was seeing in
SPEC2017/cam4, reducing the monstrous compile time for
one file from over 80 minutes to under 10 seconds.

The test changes seem reasonable, although I am not too sure
about the change in the LazyValueAnalysis dump as I am not familiar with it.

Essentially, the file I was compiling had many lines of Fortran code of this
form: arr(:ncol,:) = 0. The 2D array dimensions in this case are statically
known.

When compiling with flang-new this is converted into a nested loop
with a store, and just before JumpThreading is run we have tens of
thousands of branches containing GEPs, memsets, and memcpys, and each
threadable chain is very long as well.
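For illustration, the lowered form is roughly the following loop nest (a hand-written C++-style sketch of the shape of the code, with made-up names and layout; it is not flang-new's actual output):

// Illustrative only: roughly what "arr(:ncol,:) = 0" lowers to when arr has
// static extents d1 x d2. Each inner run is contiguous, so the middle end can
// turn it into a GEP + memset guarded by the trip-count branch that
// JumpThreading later tries to thread.
void zeroPrefixColumns(double *arr, long d1, long d2, long ncol) {
  for (long j = 0; j < d2; ++j)       // outer dimension
    for (long i = 0; i < ncol; ++i)   // partial inner dimension
      arr[j * d1 + i] = 0.0;
}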

With the current top-down approach, 90% of the time was spent
renaming non-local uses of instructions in updateSSA(). Profiling
showed that about half of that time was spent in the
SSAUpdater::FindAvailableVals() --> FindExistingPHI() call. This is
because we were accumulating new PHIs as we kept on threading the
successor BBs.

I was not able to reduce a test case that reproduces the high compile time,
but the file is cldwat2m_macro.f90 and I compiled with -O3.

Diff Detail

Event Timeline

mnadeem created this revision. Oct 3 2022, 9:08 PM
Herald added a project: Restricted Project. Oct 3 2022, 9:08 PM
Herald added a subscriber: hiraditya.
mnadeem requested review of this revision. Oct 3 2022, 9:08 PM
mnadeem added a subscriber: nikic.

@nikic I don't have the infrastructure to measure more general compile-time impact. Could you please help get this tested on https://llvm-compile-time-tracker.com/?

I'd prefer to try to prevent compile-time explosions in a more reliable way; messing with iteration order doesn't seem like it will work in all cases.

Even if we do want to mess with the iteration order, just reversing the basic block list isn't really reliable. See llvm/ADT/PostOrderIterator.h

I'd prefer to try to prevent compile-time explosions in a more reliable way; messing with iteration order doesn't seem like it will work in all cases.

I thought about alternate ways to reduce the compile time but could not come up with anything.

Any ideas are appreciated! Here is a snippet of the relevant IR: https://gist.github.com/UsmanNadeem/5543f14020b654f6c1f1d9dc5dc6ab50

When JT is called on ._crit_edge2292, %uglygep.13072 is replicated in the threaded block and a PHI node (let's say %phi.1) is inserted into the successor, i.e. ._crit_edge2292.1. In the next iteration JT is called on ._crit_edge2292.1 and we now need to build two PHI nodes when rewriting the SSA: one for the new value %uglygep.23073 and one for the %phi.1 that we created earlier. The number of new PHI nodes we need to build keeps adding up as we go down the chain.

If we iterate from the bottom up then we do not have this issue, by design.
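For reference, the change in this patch essentially amounts to visiting the function's block list back-to-front, along these lines (a simplified sketch of the idea, not the exact diff; the Process callback stands in for the per-block threading logic):

#include "llvm/ADT/STLExtras.h"
#include "llvm/IR/Function.h"
using namespace llvm;

// Simplified sketch: walk the block list in reverse so that a threadable
// chain is processed from its tail, before earlier threading has had a
// chance to pile new PHI nodes into the blocks we still have to clone.
static void visitBlocksBottomUp(Function &F,
                                function_ref<void(BasicBlock &)> Process) {
  for (BasicBlock &BB : reverse(F))
    Process(BB);
}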

messing with iteration order doesn't seem like it will work in all cases.

I agree, but it would be helpful if we could get compile-time data for other benchmarks so that we have the whole picture.
On another note, I ran a performance experiment and did not see any regressions, and saw a 3% performance gain in SPEC2017/xz. Looking at the test output changes, I do see some additional threading.

Even if we do want to mess with the iteration order, just reversing the basic block list isn't really reliable. See llvm/ADT/PostOrderIterator.h

A lot of tests are hitting an "Iterator index out of bound" error with the post-order iterator. I think that the JT pass modifies the IR in a way that is messing with the PO iterator. I could break out of the loop on every change to get around this, but it would not be efficient, so I am not sure whether just reversing the list could work here...?

Please let me know your thoughts.

nikic added a comment. Oct 6 2022, 3:09 AM

CTMark results: http://llvm-compile-time-tracker.com/compare.php?from=8d569e638b9e960b4ab635d780e60e8fa0b049f7&to=5b6c4c8d92da85384942a0dafb232bbc29c82c01&stat=instructions

Worth noting that LVI result quality is unfortunately dependent on query order (when querying without block-local facts, it is preferred to query the loop header before the loop latch -- though probably when querying with block-local results the reverse is true).

When JT is called on ._crit_edge2292, %uglygep.13072 is replicated in the threaded block and a PHI node (let's say %phi.1) is inserted into the successor, i.e. ._crit_edge2292.1. In the next iteration JT is called on ._crit_edge2292.1 and we now need to build two PHI nodes when rewriting the SSA: one for the new value %uglygep.23073 and one for the %phi.1 that we created earlier. The number of new PHI nodes we need to build keeps adding up as we go down the chain.

I'm not sure I understand the growth aspect here; %uglygep.13072 isn't defined or used in any of the blocks that are getting replicated? Or is the issue that the SSA algorithm spends time trying to rewrite things that don't need to be rewritten? Would using SSAUpdaterBulk in JumpThreadingPass::updateSSA help?
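For reference, the SSAUpdaterBulk flow I have in mind looks roughly like this (a sketch only: the helper name, the ValueMapping parameter, and the use filtering are placeholders, not JumpThreadingPass::updateSSA itself):

#include "llvm/ADT/DenseMap.h"
#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/Dominators.h"
#include "llvm/IR/Instructions.h"
#include "llvm/Transforms/Utils/SSAUpdaterBulk.h"
using namespace llvm;

// Hypothetical helper: register every value cloned from BB into NewBB, then
// let SSAUpdaterBulk place all the needed PHIs in one go, instead of running
// a separate SSAUpdater (and FindExistingPHI scan) per renamed value.
static void rewriteNonLocalUses(BasicBlock *BB, BasicBlock *NewBB,
                                DenseMap<Value *, Value *> &ValueMapping,
                                DominatorTree &DT) {
  SSAUpdaterBulk Updater;
  for (Instruction &I : *BB) {
    unsigned Var = Updater.AddVariable(I.getName(), I.getType());
    Updater.AddAvailableValue(Var, BB, &I);
    Updater.AddAvailableValue(Var, NewBB, ValueMapping[&I]);
    for (Use &U : I.uses())
      if (cast<Instruction>(U.getUser())->getParent() != BB) // non-local use
        Updater.AddUse(Var, &U);
  }
  Updater.RewriteAllUses(&DT); // inserts PHIs for all variables at once
}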

If we iterate from the bottom up then we do not have this issue, by design.

Not this exact issue, no, but trying to fix a local issue by making a global change tends to show other algorithmic weaknesses.

A lot of tests are hitting an "Iterator index out of bound" error with the post-order iterator.

I think the post-order iterator caches some data; if JumpThreading is modifying blocks while you iterate, that probably won't work. Maybe you could use the iterator to construct a worklist or something like that.
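Something along these lines, for instance (a sketch of the worklist idea only; it still needs a way to skip blocks that get deleted while the list is being processed):

#include "llvm/ADT/PostOrderIterator.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/CFG.h"
#include "llvm/IR/Function.h"
using namespace llvm;

// Sketch: snapshot the post-order walk into a worklist up front, so the
// traversal itself is not invalidated when the pass later splits, clones,
// or deletes blocks.
static SmallVector<BasicBlock *, 64> collectPostOrderWorklist(Function &F) {
  SmallVector<BasicBlock *, 64> Worklist;
  for (BasicBlock *BB : post_order(&F))
    Worklist.push_back(BB);
  return Worklist;
}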

On another note, I ran a performance experiment and did not see any regressions, and saw a 3% performance gain in SPEC2017/xz. Looking at the test output changes, I do see some additional threading.

That might be an argument for changing the iteration order, if we can explain why it's happening.

When JT is called on ._crit_edge2292, %uglygep.13072 is replicated in the threaded block and a PHI node (let's say %phi.1) is inserted into the successor, i.e. ._crit_edge2292.1. In the next iteration JT is called on ._crit_edge2292.1 and we now need to build two PHI nodes when rewriting the SSA: one for the new value %uglygep.23073 and one for the %phi.1 that we created earlier. The number of new PHI nodes we need to build keeps adding up as we go down the chain.

I'm not sure I understand the growth aspect here; %uglygep.13072 isn't defined or used in any of the blocks that are getting replicated? Or is the issue that the SSA algorithm spends time trying to rewrite things that don't need to be rewritten? Would using SSAUpdaterBulk in JumpThreadingPass::updateSSA help?

The SSA algorithm spends time trying to rewrite things that actually need to be rewritten, but we have *more of these things* compared to when we do bottom-up threading. I will try to see if SSAUpdaterBulk helps.

I am not sure if I am doing a good job of explaining it; try running opt -jump-threading -debug-only=jump-threading on this test file and you will quickly see the issue from the debug dump.
Full .ll file: https://gist.github.com/UsmanNadeem/bfa78d0390d103275792afdd21069d2c

%uglygep.13072 is defined in ._crit_edge2292 and used in .lr.ph2291.1 and .lr.ph2669.1. ._crit_edge2292 gets replicated and a PHI node is put in its successor, ._crit_edge2292.1.
Next, ._crit_edge2292.1 gets replicated with 1 PHI node and one GEP; after one more iteration ._crit_edge2292.2 gets replicated with 2 PHI nodes and one GEP. This goes on until we reach ._crit_edge2292.24, at which point 24 PHI nodes need to be replicated, so across the whole chain we end up building roughly 1 + 2 + ... + 24 new PHI nodes:

  Threading edge from '._crit_edge2292.23.thread' to '._crit_edge2292.25', across block:

._crit_edge2292.24:                               ; preds = %._crit_edge2292.23.thread, %.lr.ph2291.24
  %uglygep.243095600 = phi ptr [ %uglygep.243095576, %._crit_edge2292.23.thread ], [ %uglygep.243095, %.lr.ph2291.24 ]
  %uglygep.223093506530599 = phi ptr [ %uglygep.223093484, %._crit_edge2292.23.thread ], [ %uglygep.223093, %.lr.ph2291.24 ]
  %uglygep.203091420442505531598 = phi ptr [ %uglygep.203091400, %._crit_edge2292.23.thread ], [ %uglygep.203091, %.lr.ph2291.24 ]
  %uglygep.183089342362419443504532597 = phi ptr [ %uglygep.183089324, %._crit_edge2292.23.thread ], [ %uglygep.183089, %.lr.ph2291.24 ]
  %uglygep.163087272290341363418444503533596 = phi ptr [ %uglygep.163087256, %._crit_edge2292.23.thread ], [ %uglygep.163087, %.lr.ph2291.24 ]
  %uglygep.143085210226271291340364417445502534595 = phi ptr [ %uglygep.143085196, %._crit_edge2292.23.thread ], [ %uglygep.143085, %.lr.ph2291.24 ]
  %uglygep.123083156170209227270292339365416446501535594 = phi ptr [ %uglygep.123083144, %._crit_edge2292.23.thread ], [ %uglygep.123083, %.lr.ph2291.24 ]
  %uglygep.103081110122155171208228269293338366415447500536593 = phi ptr [ %uglygep.103081100, %._crit_edge2292.23.thread ], [ %uglygep.103081, %.lr.ph2291.24 ]
  %uglygep.830797282109123154172207229268294337367414448499537592 = phi ptr [ %uglygep.8307964, %._crit_edge2292.23.thread ], [ %uglygep.83079, %.lr.ph2291.24 ]
  %uglygep.6307742507183108124153173206230267295336368413449498538591 = phi ptr [ %uglygep.6307736, %._crit_edge2292.23.thread ], [ %uglygep.63077, %.lr.ph2291.24 ]
  %uglygep.43075202641517084107125152174205231266296335369412450497539590 = phi ptr [ %uglygep.4307516, %._crit_edge2292.23.thread ], [ %uglygep.43075, %.lr.ph2291.24 ]
  %uglygep.23073610192740526985106126151175204232265297334370411451496540589 = phi ptr [ %uglygep.230734, %._crit_edge2292.23.thread ], [ %uglygep.23073, %.lr.ph2291.24 ]
  %uglygep.130722511182839536886105127150176203233264298333371410452495541588 = phi ptr [ %uglygep.130721, %._crit_edge2292.23.thread ], [ %uglygep.13072, %.lr.ph2291.24 ]
  %uglygep.3307412172938546787104128149177202234263299332372409453494542587 = phi ptr [ %uglygep.330749, %._crit_edge2292.23.thread ], [ %uglygep.33074, %.lr.ph2291.24 ]
  %uglygep.530763037556688103129148178201235262300331373408454493543586 = phi ptr [ %uglygep.5307625, %._crit_edge2292.23.thread ], [ %uglygep.53076, %.lr.ph2291.24 ]
  %uglygep.73078566589102130147179200236261301330374407455492544585 = phi ptr [ %uglygep.7307849, %._crit_edge2292.23.thread ], [ %uglygep.73078, %.lr.ph2291.24 ]
  %uglygep.9308090101131146180199237260302329375406456491545584 = phi ptr [ %uglygep.9308081, %._crit_edge2292.23.thread ], [ %uglygep.93080, %.lr.ph2291.24 ]
  %uglygep.113082132145181198238259303328376405457490546583 = phi ptr [ %uglygep.113082121, %._crit_edge2292.23.thread ], [ %uglygep.113082, %.lr.ph2291.24 ]
  %uglygep.133084182197239258304327377404458489547582 = phi ptr [ %uglygep.133084169, %._crit_edge2292.23.thread ], [ %uglygep.133084, %.lr.ph2291.24 ]
  %uglygep.153086240257305326378403459488548581 = phi ptr [ %uglygep.153086225, %._crit_edge2292.23.thread ], [ %uglygep.153086, %.lr.ph2291.24 ]
  %uglygep.173088306325379402460487549580 = phi ptr [ %uglygep.173088289, %._crit_edge2292.23.thread ], [ %uglygep.173088, %.lr.ph2291.24 ]
  %uglygep.193090380401461486550579 = phi ptr [ %uglygep.193090361, %._crit_edge2292.23.thread ], [ %uglygep.193090, %.lr.ph2291.24 ]
  %uglygep.213092462485551578 = phi ptr [ %uglygep.213092441, %._crit_edge2292.23.thread ], [ %uglygep.213092, %.lr.ph2291.24 ]
  %uglygep.233094552577 = phi ptr [ %uglygep.233094529, %._crit_edge2292.23.thread ], [ %uglygep.233094, %.lr.ph2291.24 ]
  %uglygep.253096 = getelementptr i8, ptr %40, i64 800, !dbg !127
  br i1 %.not2713, label %._crit_edge2292.25, label %.lr.ph2291.25, !dbg !127


JT: Renaming non-local uses of:   %uglygep.243095600 = phi ptr [ %uglygep.243095, %.lr.ph2291.24 ]
JT: Renaming non-local uses of:   %uglygep.223093506530599 = phi ptr [ %uglygep.223093, %.lr.ph2291.24 ]
JT: Renaming non-local uses of:   %uglygep.203091420442505531598 = phi ptr [ %uglygep.203091, %.lr.ph2291.24 ]
JT: Renaming non-local uses of:   %uglygep.183089342362419443504532597 = phi ptr [ %uglygep.183089, %.lr.ph2291.24 ]
JT: Renaming non-local uses of:   %uglygep.163087272290341363418444503533596 = phi ptr [ %uglygep.163087, %.lr.ph2291.24 ]
JT: Renaming non-local uses of:   %uglygep.143085210226271291340364417445502534595 = phi ptr [ %uglygep.143085, %.lr.ph2291.24 ]
JT: Renaming non-local uses of:   %uglygep.123083156170209227270292339365416446501535594 = phi ptr [ %uglygep.123083, %.lr.ph2291.24 ]
JT: Renaming non-local uses of:   %uglygep.103081110122155171208228269293338366415447500536593 = phi ptr [ %uglygep.103081, %.lr.ph2291.24 ]
JT: Renaming non-local uses of:   %uglygep.830797282109123154172207229268294337367414448499537592 = phi ptr [ %uglygep.83079, %.lr.ph2291.24 ]
JT: Renaming non-local uses of:   %uglygep.6307742507183108124153173206230267295336368413449498538591 = phi ptr [ %uglygep.63077, %.lr.ph2291.24 ]
JT: Renaming non-local uses of:   %uglygep.43075202641517084107125152174205231266296335369412450497539590 = phi ptr [ %uglygep.43075, %.lr.ph2291.24 ]
JT: Renaming non-local uses of:   %uglygep.23073610192740526985106126151175204232265297334370411451496540589 = phi ptr [ %uglygep.23073, %.lr.ph2291.24 ]
JT: Renaming non-local uses of:   %uglygep.130722511182839536886105127150176203233264298333371410452495541588 = phi ptr [ %uglygep.13072, %.lr.ph2291.24 ]
JT: Renaming non-local uses of:   %uglygep.3307412172938546787104128149177202234263299332372409453494542587 = phi ptr [ %uglygep.33074, %.lr.ph2291.24 ]
JT: Renaming non-local uses of:   %uglygep.530763037556688103129148178201235262300331373408454493543586 = phi ptr [ %uglygep.53076, %.lr.ph2291.24 ]
JT: Renaming non-local uses of:   %uglygep.73078566589102130147179200236261301330374407455492544585 = phi ptr [ %uglygep.73078, %.lr.ph2291.24 ]
JT: Renaming non-local uses of:   %uglygep.9308090101131146180199237260302329375406456491545584 = phi ptr [ %uglygep.93080, %.lr.ph2291.24 ]
JT: Renaming non-local uses of:   %uglygep.113082132145181198238259303328376405457490546583 = phi ptr [ %uglygep.113082, %.lr.ph2291.24 ]
JT: Renaming non-local uses of:   %uglygep.133084182197239258304327377404458489547582 = phi ptr [ %uglygep.133084, %.lr.ph2291.24 ]
JT: Renaming non-local uses of:   %uglygep.153086240257305326378403459488548581 = phi ptr [ %uglygep.153086, %.lr.ph2291.24 ]
JT: Renaming non-local uses of:   %uglygep.173088306325379402460487549580 = phi ptr [ %uglygep.173088, %.lr.ph2291.24 ]
JT: Renaming non-local uses of:   %uglygep.193090380401461486550579 = phi ptr [ %uglygep.193090, %.lr.ph2291.24 ]
JT: Renaming non-local uses of:   %uglygep.213092462485551578 = phi ptr [ %uglygep.213092, %.lr.ph2291.24 ]
JT: Renaming non-local uses of:   %uglygep.233094552577 = phi ptr [ %uglygep.233094, %.lr.ph2291.24 ]
JT: Renaming non-local uses of:   %uglygep.253096 = getelementptr i8, ptr %40, i64 800, !dbg !127

Oh, I see, the block we choose to thread across continues to get bigger, but the cost model is fine with it because the PHI nodes are "free".

I guess we could just tell the JumpThreading cost model to refuse to clone a block if there are, say, 20 or more PHI nodes in that block. Not ideal, but reduces the chance of runaway growth. (I mean, the growth is still sort of runaway; SSA construction is linear in the size of the function's CFG, so repeatedly doing it is fundamentally quadratic, but we can prevent it from getting worse than that.)
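For illustration, the kind of guard being suggested would look roughly like this in the duplication cost check (the helper name and the threshold are made up for illustration only):

#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/Instructions.h"
using namespace llvm;

// Illustrative only: refuse to duplicate a block once it has accumulated
// "too many" PHI nodes, so repeated threading cannot keep growing the
// cloned blocks unchecked.
static bool tooManyPHIsToDuplicate(const BasicBlock &BB,
                                   unsigned PHINodeThreshold = 20) {
  unsigned NumPHIs = 0;
  for (const PHINode &PN : BB.phis()) {
    (void)PN;
    if (++NumPHIs >= PHINodeThreshold)
      return true;
  }
  return false;
}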

mnadeem abandoned this revision. Oct 25 2022, 3:45 PM

Oh, I see, the block we choose to thread across continues to get bigger, but the cost model is fine with it because the PHI nodes are "free".

I guess we could just tell the JumpThreading cost model to refuse to clone a block if there are, say, 20 or more PHI nodes in that block. Not ideal, but reduces the chance of runaway growth. (I mean, the growth is still sort of runaway; SSA construction is linear in the size of the function's CFG, so repeatedly doing it is fundamentally quadratic, but we can prevent it from getting worse than that.)

Abandoned in favor of D136716, which puts a limit on the number of PHI nodes.