This is an archive of the discontinued LLVM Phabricator instance.

[Loads] Skip non load/store instructions when finding available load
Needs ReviewPublic

Authored by tejohnson on Apr 29 2021, 10:32 AM.

Details

Reviewers
nikic
lebedev.ri
Summary

Detect and skip non load and store instructions, up to a limit,
when finding available loads for a pointer. The scan limit is only 6
instructions which is fairly small, so for example type test and assume
intrinsics inserted for whole program devirtualization would prevent
analysis, sometimes causing profile matching issues. On the other hand,
debug and pseudo instructions were skipped indefinitely.

We now skip all of the unrelated instruction types when scanning for an
available load or intervening memory write, up to a default limit of 32
instructions. This supercedes the unlimited skipping of debug and pseudo
instructions that previously existed.

This exposed an issue where a caller from jump threading could end up
infinitely looping due to an unreachable loop with a single predecessor
(the backedge). It was not exposed before because the branch instruction
was counted towards the scan limit of 6 instructions, so iteration would
terminate. Fix this and added a test case.

Diff Detail

Event Timeline

tejohnson created this revision.Apr 29 2021, 10:32 AM
tejohnson requested review of this revision.Apr 29 2021, 10:32 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 29 2021, 10:32 AM
nikic added a comment.Apr 29 2021, 1:04 PM

Not a big fan of this -- this is effectively whitelisting a specific pattern, and doesn't generalize.

For this code, there are two limits of interest: The total number of instructions we look at, and the number of alias-analysis queries we perform. Currently we limit both through a small number of total instructions. However, the only expensive part, and what needs to be aggressively limited, is the number of alias analysis queries.

I think what we can do here is to have a relatively liberal upper limit on the number of scanned instructions (say 32), while having a tight limit on the number of AA queries or AA query candidates (say 4). This should make the code more resilient against additional bitcast/assume instructions in between, while still limiting compile-time impact.

Do you think something along those lines would address your motivation here?

Not a big fan of this -- this is effectively whitelisting a specific pattern, and doesn't generalize.

For this code, there are two limits of interest: The total number of instructions we look at, and the number of alias-analysis queries we perform. Currently we limit both through a small number of total instructions. However, the only expensive part, and what needs to be aggressively limited, is the number of alias analysis queries.

There's sort of 3 categories of instructions from what I can tell, in increasing order of expense:

  1. Insts that are not loads or stores and don't modify memory - these can be skipped immediately
  2. Loads and stores which are compared to the given pointer by comparing the pointer values but not with AA (in getAvailableLoadStore)
  3. (Subset of category 2) Instructions that are compared using AA analysis (insts that may write memory)

I think what we can do here is to have a relatively liberal upper limit on the number of scanned instructions (say 32), while having a tight limit on the number of AA queries or AA query candidates (say 4). This should make the code more resilient against additional bitcast/assume instructions in between, while still limiting compile-time impact.

Do you think something along those lines would address your motivation here?

It would help, but we could still end up with corner cases where with -fwhole-program-vtables enabled for the optimization build we might end up with IR profile matching issues due to different simplification optimizations prior. Although with a larger window the chances of that would presumably be reduced.

One thing I am wondering is whether for the first category above (non load/store/may write memory), if there needs to be any limit since they can be skipped immediately (e.g. instead of just dbg or pseudo insts as it does now). I am not sure how expensive it would be to allow quickly skipping through all of them in the block as we scan for memory accesses. I did some measurements for a large binary build with ~20k input files. With that change, essentially to convert:

if (isa<DbgInfoIntrinsic>(Inst))
  continue;

to

if (!isa<LoadInst>(&Inst) && !isa<StoreInst>(&Inst) && !Inst.mayWriteToMemory())
  continue;

across all the input files there are about 40% more iterations of the loop in findAvailablePtrLoadStore and about 47% more iterations in FindAvailableLoadedValue. I left the existing default scan limit of 6 for the loads/stores/maywrite insts, so these were largely just skipping through without any analysis. I didn't compare the compile times of all of the individual files in detail, but the slowest compiling files did not increase in compile time nor did the overall build.

If that isn't acceptable, perhaps a larger limit such as 100? Note this could either subsume the existing unlimited skipping of dbg and pseudo instructions, or allow those to be skipped unconditionally as now. Right now with the current skipping of all dbg and pseudo instructions, the max iteration of this loop is ~1K, so if we include those instructions in the limit it will have a potentially negative effect on optimization over the status quo.

Prazek added inline comments.May 2 2021, 6:22 AM
llvm/lib/Analysis/Loads.cpp
649–655

nit: This comment (which is very useful) is a big duplication in code.
What do you think of wrapping this chunk of code to a function, so that either logic, or the comment will not get out of sync?
Something like:

if (isUnrelatedToLoadOrStore(Inst)) // Probably could be named better.
  continue;
tejohnson added inline comments.May 4 2021, 11:47 AM
llvm/lib/Analysis/Loads.cpp
649–655

I ended up simplifying this code which meant it no longer required a big comment about these particular instruction types.

tejohnson retitled this revision from [Loads] Ignore type test assume sequences inserted for devirtualization to [Loads] Skip non load/store instructions when finding available load.May 4 2021, 11:48 AM
tejohnson edited the summary of this revision. (Show Details)
tejohnson updated this revision to Diff 342812.May 4 2021, 11:48 AM

Update patch as discussed, to skip a larger number of non ld/st instructions

Not a big fan of this -- this is effectively whitelisting a specific pattern, and doesn't generalize.

For this code, there are two limits of interest: The total number of instructions we look at, and the number of alias-analysis queries we perform. Currently we limit both through a small number of total instructions. However, the only expensive part, and what needs to be aggressively limited, is the number of alias analysis queries.

I think what we can do here is to have a relatively liberal upper limit on the number of scanned instructions (say 32), while having a tight limit on the number of AA queries or AA query candidates (say 4). This should make the code more resilient against additional bitcast/assume instructions in between, while still limiting compile-time impact.

Do you think something along those lines would address your motivation here?

@nikic ping, I believe I implemented your suggested alternative, ptal

lebedev.ri resigned from this revision.Jan 12 2023, 5:30 PM

This review may be stuck/dead, consider abandoning if no longer relevant.
Removing myself as reviewer in attempt to clean dashboard.

Herald added a project: Restricted Project. · View Herald TranscriptJan 12 2023, 5:30 PM
Herald added a subscriber: StephenFan. · View Herald Transcript