Page MenuHomePhabricator

[MemDepAnalysis] Cut-off threshold reshuffling
Needs ReviewPublic

Authored by lebedev.ri on Sun, Jul 26, 3:31 PM.

Details

Summary

As we have seen/established in D84108, the cut-off thresholds for Memory Dependence Analysis may need tuning.
In particular, currently they are:

  • scan up to 100 instructions per block. IMO this may be a bit low. (rL179713, changed from 500)
  • Scan up to 1000 (*sic*) blocks. That seems to be somewhat high.. (D16123)
  • There is no cumulative limit on scanned instructions, so one query can end up scanning up to 100 instructions/block * 1k blocks = 100k instructions. Just as @nikic noted in https://reviews.llvm.org/D84108#2174113, that is indeed kinda just insane.

I've collected some numbers (with limits originally unset, on test-suite + RawSpeed + darktable),
what is the maximal value of a statistic for a single query per-TU:

Number of instructions scanned in block

  50%     75%     80%     90%     91%     92%     93%     94%     95%     96%     97%     98%     99%   99.5%   99.9%  99.95%    100%
30.00   65.00  114.00  176.30  209.00  246.24  293.21  358.18  388.60  430.00  480.00  593.00  920.00 2077.23 6087.00 6222.00 6222.00

Proposed action: no change, keep it at 100, which is ~80'th percentile - the cut-off is actively helpful.

Number of blocks scanned

 50%    75%    80%    90%    91%    92%    93%    94%    95%    96%    97%    98%    99%  99.5%  99.9% 99.95%   100%
10.0   39.0   55.0   61.0   67.0   77.0   88.0   94.0  104.0  125.0  145.0  179.0  278.0  355.5  670.5 1045.4 1908.0

The current limit is 1000, which is 99.95'th percentile,
i.e. we aren't really enforcing anything with it.
I propose to lower it to 99th percentile (250).

Instructions scanned total in a single query

 50%     75%     80%     90%     91%     92%     93%     94%     95%     96%     97%     98%     99%   99.5%   99.9%  99.95%    100%
66.0   318.0   336.0   463.0   498.0   518.0   571.0   683.0   791.0   986.0  1321.0  2241.0  3872.0  4847.5  8744.3  8828.0 16250.0

I propose to go with 96'th-97'th percentile, and establish a new limit of 1000,
which is a reduction of 100 times from the current implicit limit,
and 25 times less than the new implied limit.

So IOW, we will now be willing to process 4x less blocks per query,
and 100x less instructions per query total.

This does help somewhat with the D84108's lencod context_ini compile-time regression.

That being said, these are rather arbitrary cut-off's, that happen to seem sane
on the code bases viewed, so inevitably there's some code somewhere that doesn't
fit within them, and it may therefore be penalized by lower thresholds.
But compile-time budget is finite, so we have to make some decisions, sometimes.

llvm-compile-time-tracker is unsurprisingly happy about this
http://llvm-compile-time-tracker.com/compare.php?from=b1210c059d1ef084ecd275ed0ffb8343ac3cdfad&to=e3c7a0b96f8800ab21e2d5a8be414e5f83fa1aed&stat=instructions
Geomeans: -O3 -0.17%, ReleaseLTO -0.5%..-1.0%,

Diff Detail

Unit TestsFailed

TimeTest
40 mslinux > LLVM.Transforms/DeadStoreElimination::inst-limits.ll
Script: -- : 'RUN: at line 1'; /mnt/disks/ssd0/agent/llvm-project/build/bin/opt -S -dse < /mnt/disks/ssd0/agent/llvm-project/llvm/test/Transforms/DeadStoreElimination/inst-limits.ll | /mnt/disks/ssd0/agent/llvm-project/build/bin/FileCheck /mnt/disks/ssd0/agent/llvm-project/llvm/test/Transforms/DeadStoreElimination/inst-limits.ll
190 mslinux > Polly.ScopInfo::memcpy-raw-source.ll
Script: -- : 'RUN: at line 1'; opt -polly-process-unprofitable -polly-remarks-minimal -polly-use-llvm-names -polly-import-jscop-dir=/mnt/disks/ssd0/agent/llvm-project/polly/test/ScopInfo -polly-codegen-verify -basic-aa -scoped-noalias -tbaa -polly-scops -analyze < /mnt/disks/ssd0/agent/llvm-project/polly/test/ScopInfo/memcpy-raw-source.ll
80 mswindows > LLVM.Transforms/DeadStoreElimination::inst-limits.ll
Script: -- : 'RUN: at line 1'; c:\ws\w16-1\llvm-project\premerge-checks\build\bin\opt.exe -S -dse < C:\ws\w16-1\llvm-project\premerge-checks\llvm\test\Transforms\DeadStoreElimination\inst-limits.ll | c:\ws\w16-1\llvm-project\premerge-checks\build\bin\filecheck.exe C:\ws\w16-1\llvm-project\premerge-checks\llvm\test\Transforms\DeadStoreElimination\inst-limits.ll

Event Timeline

lebedev.ri created this revision.Sun, Jul 26, 3:31 PM
lebedev.ri edited the summary of this revision. (Show Details)Sun, Jul 26, 3:33 PM
lebedev.ri edited the summary of this revision. (Show Details)Mon, Jul 27, 12:20 AM
lebedev.ri edited the summary of this revision. (Show Details)
ebrevnov added a comment.EditedTue, Jul 28, 1:16 AM

I would suggest to separate an introduction of additional limit for a total number of scanned instructions and reduction of the current limit for number of scanned blocks to a different change sets. Also besides compile time numbers we should evaluate overall impact on performance before making such changes.

Proposed action: no change, keep it at 100, which is ~80'th percentile - the cut-off is actively helpful.

80'th percentile doesn't look great to me. Once we introduce a total limit for scanned instructions I would suggest increasing it to at least 90't percentile. What do you think?

llvm/include/llvm/Analysis/MemoryDependenceAnalysis.h
457 ↗(On Diff #280757)

I think we better avoid adding one more limit here since it complicates the API . It is enough to provide one limit to achieve the desired behavior. In particular a caller should specify a minimum of local and global limits.

llvm/lib/Analysis/MemoryDependenceAnalysis.cpp
217–218

I would suggest changing order of decrement and the check (as done in getSimplePointerDependencyFrom) to get more consistent behavior in case of zero default limit.

390

I guess this should be retrieving total instruction limit.

396

Minor note. It's recommended to avoid default capture modes. Would be better if you capture LocalLimit explicitly. Feel free to keep as is though.

bmahjour removed a subscriber: bmahjour.Tue, Jul 28, 6:44 AM

Some general comments:

  • The MemDepAnalysis has been known to be problematic for compile-time, so reducing the 100k implicit threshold seems reasonable.
  • It's natural the compiler tracker will be happy, but can we consider runtime implications due to potential missed optimizations?
  • Could we evaluate which optimizations are missed, in which passes using MemDepAnalysis along with their run-time impact? (in particular for the benchmarks where we see the large compile-time benefits)

AFAIK, the main passes using MemDepAnalysis are DSE, MemCpyOpt and GVN and there is active work in porting these to MemorySSA. The same analysis of compile-time vs run-time benefits is needed for that switch, so having data from reducing thresholds here will be very valuable for the short term (current patch) on deciding how much to reduce it to, and for the long-term switch to MemorySSA.

lebedev.ri marked an inline comment as done.Tue, Jul 28, 1:00 PM
lebedev.ri added inline comments.
llvm/include/llvm/Analysis/MemoryDependenceAnalysis.h
457 ↗(On Diff #280757)

So you're suggesting that the caller should also take care of updating the global limit by how much the local limit got consumed?

Some general comments:

  • The MemDepAnalysis has been known to be problematic for compile-time, so reducing the 100k implicit threshold seems reasonable.
  • It's natural the compiler tracker will be happy, but can we consider runtime implications due to potential missed optimizations?
  • Could we evaluate which optimizations are missed, in which passes using MemDepAnalysis along with their run-time impact? (in particular for the benchmarks where we see the large compile-time benefits)

AFAIK, the main passes using MemDepAnalysis are DSE, MemCpyOpt and GVN and there is active work in porting these to MemorySSA. The same analysis of compile-time vs run-time benefits is needed for that switch, so having data from reducing thresholds here will be very valuable for the short term (current patch) on deciding how much to reduce it to, and for the long-term switch to MemorySSA.

Background: the motivation for this patch is the fact that D84108 significantly
regresses compile-time of lencod, especially context_ini.c (+79.97%).
It has been traced to GVN spending most of the time in GVN::processNonLocalLoad(),
in MemoryDependenceResults::getNonLocalPointerDependency().

Since a compile-, and run- time performance assessment will be needed both here,
and in MemorySSA switch, would it be more productive to directly proceed to the latter?
Without having much(any) prior expirience with MemSSA, should it be too complicated ?
How long-term is the switch?

Some general comments:

  • The MemDepAnalysis has been known to be problematic for compile-time, so reducing the 100k implicit threshold seems reasonable.
  • It's natural the compiler tracker will be happy, but can we consider runtime implications due to potential missed optimizations?
  • Could we evaluate which optimizations are missed, in which passes using MemDepAnalysis along with their run-time impact? (in particular for the benchmarks where we see the large compile-time benefits)

AFAIK, the main passes using MemDepAnalysis are DSE, MemCpyOpt and GVN and there is active work in porting these to MemorySSA. The same analysis of compile-time vs run-time benefits is needed for that switch, so having data from reducing thresholds here will be very valuable for the short term (current patch) on deciding how much to reduce it to, and for the long-term switch to MemorySSA.

Background: the motivation for this patch is the fact that D84108 significantly
regresses compile-time of lencod, especially context_ini.c (+79.97%).
It has been traced to GVN spending most of the time in GVN::processNonLocalLoad(),
in MemoryDependenceResults::getNonLocalPointerDependency().

Since a compile-, and run- time performance assessment will be needed both here,
and in MemorySSA switch, would it be more productive to directly proceed to the latter?

I think the two are orthogonal. The current GVN will need replacing with NewGVN which has been bitrotting for a few years now. It's a pass taking a different work-flow, so I don't know if there will be a call-path matching the GVN::processNonLocalLoad() call for example.
It will be very important for the subsequent switch to have the data point of "GVN performs this processing which has relevant run-time impact, regardless of the compile-time spent, hence NewGVN needs to match this" vs "GVN spends a lot of time compiling this without much or any run-time benefit, hence NewGVN need not match it".
I understand it's time-consuming to have this analysis now, but the problem is more complex than the lowering of a cap, and it seems important for future progress to distinguish between such cases, and document the results accordingly (in MemDepAnalysis, but more importantly in GVN).

Without having much(any) prior expirience with MemSSA, should it be too complicated ?
How long-term is the switch?

I don't have a set timeline for this unfortunately, due to other priorities; order of months at this point.

fhahn added a comment.Tue, Jul 28, 2:51 PM

Some general comments:

  • The MemDepAnalysis has been known to be problematic for compile-time, so reducing the 100k implicit threshold seems reasonable.
  • It's natural the compiler tracker will be happy, but can we consider runtime implications due to potential missed optimizations?
  • Could we evaluate which optimizations are missed, in which passes using MemDepAnalysis along with their run-time impact? (in particular for the benchmarks where we see the large compile-time benefits)

AFAIK, the main passes using MemDepAnalysis are DSE, MemCpyOpt and GVN and there is active work in porting these to MemorySSA. The same analysis of compile-time vs run-time benefits is needed for that switch, so having data from reducing thresholds here will be very valuable for the short term (current patch) on deciding how much to reduce it to, and for the long-term switch to MemorySSA.

Background: the motivation for this patch is the fact that D84108 significantly
regresses compile-time of lencod, especially context_ini.c (+79.97%).
It has been traced to GVN spending most of the time in GVN::processNonLocalLoad(),
in MemoryDependenceResults::getNonLocalPointerDependency().

Since a compile-, and run- time performance assessment will be needed both here,
and in MemorySSA switch, would it be more productive to directly proceed to the latter?

I think the two are orthogonal. The current GVN will need replacing with NewGVN which has been bitrotting for a few years now. It's a pass taking a different work-flow, so I don't know if there will be a call-path matching the GVN::processNonLocalLoad() call for example.
It will be very important for the subsequent switch to have the data point of "GVN performs this processing which has relevant run-time impact, regardless of the compile-time spent, hence NewGVN needs to match this" vs "GVN spends a lot of time compiling this without much or any run-time benefit, hence NewGVN need not match it".
I understand it's time-consuming to have this analysis now, but the problem is more complex than the lowering of a cap, and it seems important for future progress to distinguish between such cases, and document the results accordingly (in MemDepAnalysis, but more importantly in GVN).

FWIW I think current GVN does too many things not directly related to value numbering, which makes the comparison with NewGVN a bit harder. I think it might make sense to de-couple some of the memory optimizations that do not really interact with the value number from GVN.

Some general comments:

  • The MemDepAnalysis has been known to be problematic for compile-time, so reducing the 100k implicit threshold seems reasonable.
  • It's natural the compiler tracker will be happy, but can we consider runtime implications due to potential missed optimizations?
  • Could we evaluate which optimizations are missed, in which passes using MemDepAnalysis along with their run-time impact? (in particular for the benchmarks where we see the large compile-time benefits)

AFAIK, the main passes using MemDepAnalysis are DSE, MemCpyOpt and GVN and there is active work in porting these to MemorySSA. The same analysis of compile-time vs run-time benefits is needed for that switch, so having data from reducing thresholds here will be very valuable for the short term (current patch) on deciding how much to reduce it to, and for the long-term switch to MemorySSA.

Background: the motivation for this patch is the fact that D84108 significantly
regresses compile-time of lencod, especially context_ini.c (+79.97%).
It has been traced to GVN spending most of the time in GVN::processNonLocalLoad(),
in MemoryDependenceResults::getNonLocalPointerDependency().

Since a compile-, and run- time performance assessment will be needed both here,
and in MemorySSA switch, would it be more productive to directly proceed to the latter?

I think the two are orthogonal. The current GVN will need replacing with NewGVN which has been bitrotting for a few years now. It's a pass taking a different work-flow, so I don't know if there will be a call-path matching the GVN::processNonLocalLoad() call for example.
It will be very important for the subsequent switch to have the data point of "GVN performs this processing which has relevant run-time impact, regardless of the compile-time spent, hence NewGVN needs to match this" vs "GVN spends a lot of time compiling this without much or any run-time benefit, hence NewGVN need not match it".
I understand it's time-consuming to have this analysis now, but the problem is more complex than the lowering of a cap, and it seems important for future progress to distinguish between such cases, and document the results accordingly (in MemDepAnalysis, but more importantly in GVN).

FWIW I think current GVN does too many things not directly related to value numbering, which makes the comparison with NewGVN a bit harder. I think it might make sense to de-couple some of the memory optimizations that do not really interact with the value number from GVN.

I wasn't really asking about NewGVN story, i know it's stagnant somewhat.
I was only asking, would it be better to instead look into porting
GVN::processNonLocalLoad() to be MemorySSA-driven.

I wasn't really asking about NewGVN story, i know it's stagnant somewhat.
I was only asking, would it be better to instead look into porting
GVN::processNonLocalLoad() to be MemorySSA-driven.

If we're building two analyses (MSSA & MemDepAnalysis) instead of one, I expect we'll see a spike in compile-time. Additionally, MemDepAnalysis is a piece of technical debt that would be nice to replace altogether.
IMO it makes sense to make the transition for the whole pass, or for the pipeline of 3 passes in one go.

ebrevnov added inline comments.Tue, Jul 28, 9:58 PM
llvm/include/llvm/Analysis/MemoryDependenceAnalysis.h
457 ↗(On Diff #280757)

That's right.

One more general comment. If we introduce budget for total number of scanned instructions do we still need to limit number of visited blocks?

nikic added a comment.Wed, Jul 29, 9:51 AM

I wasn't really asking about NewGVN story, i know it's stagnant somewhat.
I was only asking, would it be better to instead look into porting
GVN::processNonLocalLoad() to be MemorySSA-driven.

If we're building two analyses (MSSA & MemDepAnalysis) instead of one, I expect we'll see a spike in compile-time. Additionally, MemDepAnalysis is a piece of technical debt that would be nice to replace altogether.
IMO it makes sense to make the transition for the whole pass, or for the pipeline of 3 passes in one go.

I recently ran some numbers for GVN compile-time impact: http://llvm-compile-time-tracker.com/index.php?branch=nikic/perf/new-gvn-3 From the bottom to the top, the first commit disables LoadPRE, the second disables the entirety of processNonLocalLoad and the last one enables NewGVN (with the corresponding MemorySSA run).

I think the key takeaways here is that the non-local load analysis in GVN is really, really expensive. Per-block MemDep analysis is fairly cheap, but the non-local one is not, and I think GVN is the only MemDep-based pass that uses it.

So, if it is feasible to replace the non-local load analysis (and load PRE) in GVN with something MemorySSA based, I think it's plausible that it will be a compile-time improvement despite the need to construct MemorySSA. And it would become free for a following MemCpyOpt/DSE :)

Of course, this is on the assumption that MemorySSA can actually perform a similar degree of optimization with better compile-time. As these optimizations require walking past MemoryPhis, it's not going to be free, but I would still expect it to be cheaper and have better cutoff control.

I wasn't really asking about NewGVN story, i know it's stagnant somewhat.
I was only asking, would it be better to instead look into porting
GVN::processNonLocalLoad() to be MemorySSA-driven.

If we're building two analyses (MSSA & MemDepAnalysis) instead of one, I expect we'll see a spike in compile-time. Additionally, MemDepAnalysis is a piece of technical debt that would be nice to replace altogether.
IMO it makes sense to make the transition for the whole pass, or for the pipeline of 3 passes in one go.

I recently ran some numbers for GVN compile-time impact: http://llvm-compile-time-tracker.com/index.php?branch=nikic/perf/new-gvn-3 From the bottom to the top, the first commit disables LoadPRE, the second disables the entirety of processNonLocalLoad and the last one enables NewGVN (with the corresponding MemorySSA run).

I think the key takeaways here is that the non-local load analysis in GVN is really, really expensive. Per-block MemDep analysis is fairly cheap, but the non-local one is not, and I think GVN is the only MemDep-based pass that uses it.

So, if it is feasible to replace the non-local load analysis (and load PRE) in GVN with something MemorySSA based, I think it's plausible that it will be a compile-time improvement despite the need to construct MemorySSA. And it would become free for a following MemCpyOpt/DSE :)

Of course, this is on the assumption that MemorySSA can actually perform a similar degree of optimization with better compile-time. As these optimizations require walking past MemoryPhis, it's not going to be free, but I would still expect it to be cheaper and have better cutoff control.

Thank you for this analysis, this is excellent information!
I'll need to look in detail into what processNonLocalLoad does to see if MemorySSA can do better there; noted as the next priority.