This is an archive of the discontinued LLVM Phabricator instance.

[PM] Avoid duplicates in the Used/Preserved/Required sets
ClosedPublic

Authored by bjope on Jan 11 2021, 7:15 AM.

Details

Summary

The pass analysis uses "sets" implemented using a SmallVector type
to keep track of Used, Preserved, Required and RequiredTransitive
passes. When having nested analyses we could end up with duplicates
in those sets, as there was no checks to see if a pass already
existed in the "set" before pushing to the vectors. This idea with
this patch is to avoid such duplicates by avoiding pushing elements
that already is contained when adding elements to those sets.

To align with the above PMDataManager::collectRequiredAndUsedAnalyses
is changed to skip adding both the Required and RequiredTransitive
passes to its result vectors (since RequiredTransitive always is
a subset of Required we ended up with duplicates when traversing
both sets).

Main goal with this is to avoid spending time verifying the same
analysis mulitple times in PMDataManager::verifyPreservedAnalysis
when iterating over the Preserved "set". It is assumed that removing
duplicates from a "set" shouldn't have any other negative impact
(I have not seen any problems so far). If this ends up causing
problems one could do some uniqueness filtering of the vector being
traversed in verifyPreservedAnalysis instead.

Diff Detail

Event Timeline

bjope created this revision.Jan 11 2021, 7:15 AM
bjope requested review of this revision.Jan 11 2021, 7:15 AM
Herald added a project: Restricted Project. · View Herald TranscriptJan 11 2021, 7:15 AM
bjope added a reviewer: foad.Jan 12 2021, 4:57 AM
foad added a comment.Jan 12 2021, 9:02 AM

Does this make any difference in practice? E.g. does the output of opt -O1 -debug-pass=Executions change, or can you measure any timing difference?

bjope added a comment.Jan 12 2021, 9:36 AM

Does this make any difference in practice? E.g. does the output of opt -O1 -debug-pass=Executions change, or can you measure any timing difference?

Currently -debug-pass isn't showing when verifyAnalysis is called. But if we add a debug printout in PMDataManager::verifyPreservedAnalysis that prints the analysis passes that are verified (either for -debug-pass=Executions or -debug-pass=Details) then it would be possible to create a test case that show the difference in number of verifyAnalysis calls that we get.

There might be some overhead when setting up the pipeline (due to the llvm::is_contained calls), while we also same some work as we no longer need to deal with duplicates when iterating over the different sets. To see a speedup due to less calls to verifyAnalysis one would need to run lots of test cases with the verifiers enabled (and maybe also with EXPENSIVE_CHECKS=ON).
I haven't done any timing measurements. I've just assumed that it won't be worse if doing less duplicated work.

bjope added a comment.Jan 12 2021, 9:43 AM

I created this patch as my earlier patch (https://reviews.llvm.org/D94138) resulted in even more duplicates in the preserved sets. So the idea was to compensate a bit for a potential speed regression, when using verifiers, by simply getting rid of all duplicates in the sets.

But maybe it isn't worth the churn to proceed with this considering that legacy PM is supposed to be phased out?

foad added a comment.Jan 12 2021, 9:51 AM

I think improving the legacy pass manager is fine (after all it is taking a very long time for the new pass manager to supersede it). But I think any patch that claims to do less work overall, or speed something up, needs *some* kind of evidence that it actually has the desired effect.

bjope added a comment.Jan 13 2021, 1:01 PM

I've now tried to do some performance comparisons using perf stat -r 100 opt -O3 -o /dev/null --verify-dom-info -verify-assumption-cache -verify-loop-info.

Tried to find a test case in-tree that has several basic blocks (to actually spend some time in the verifiers). Here is the result ("opt-old" is without this patch and "opt-new" is with this patch):

Performance counter stats for 'opt-old -O3 -o /dev/null --verify-dom-info -verify-assumption-cache -verify-loop-info test/Analysis/LoopNestAnalysis/imperfectnest.ll' (100 runs):

            76.38 msec task-clock:u              #    0.973 CPUs utilized            ( +-  0.63% )
                0      context-switches:u        #    0.000 K/sec                  
                0      cpu-migrations:u          #    0.000 K/sec                  
            4,141      page-faults:u             #    0.054 M/sec                    ( +-  0.00% )
      217,856,171      cycles:u                  #    2.852 GHz                      ( +-  0.34% )
      425,842,775      instructions:u            #    1.95  insn per cycle           ( +-  0.05% )
       87,203,931      branches:u                # 1141.767 M/sec                    ( +-  0.05% )
        1,675,703      branch-misses:u           #    1.92% of all branches          ( +-  0.24% )

         0.078464 +- 0.000485 seconds time elapsed  ( +-  0.62% )


Performance counter stats for 'opt-new -O3 -o /dev/null --verify-dom-info -verify-assumption-cache -verify-loop-info test/Analysis/LoopNestAnalysis/imperfectnest.ll' (100 runs):

            63.83 msec task-clock:u              #    0.963 CPUs utilized            ( +-  0.97% )
                0      context-switches:u        #    0.000 K/sec                  
                0      cpu-migrations:u          #    0.000 K/sec                  
            4,124      page-faults:u             #    0.065 M/sec                    ( +-  0.00% )
      180,555,724      cycles:u                  #    2.828 GHz                      ( +-  1.01% )
      323,290,930      instructions:u            #    1.79  insn per cycle           ( +-  0.04% )
       66,711,896      branches:u                # 1045.074 M/sec                    ( +-  0.04% )
        1,532,334      branch-misses:u           #    2.30% of all branches          ( +-  0.35% )

         0.066310 +- 0.000631 seconds time elapsed  ( +-  0.95% )

So for that particular test this patch looks like a win.

I've also compared the result with an empty input and without verifiers. This is to see if there is an overhead when not using verifiers:

Performance counter stats for 'opt-old -O3 -o /dev/null /dev/null' (100 runs):

            12.71 msec task-clock:u              #    0.885 CPUs utilized            ( +-  0.96% )
                0      context-switches:u        #    0.000 K/sec                  
                0      cpu-migrations:u          #    0.000 K/sec                  
            2,774      page-faults:u             #    0.218 M/sec                    ( +-  0.00% )
       22,026,642      cycles:u                  #    1.733 GHz                      ( +-  1.02% )
       25,342,570      instructions:u            #    1.15  insn per cycle           ( +-  0.00% )
        6,191,303      branches:u                #  487.180 M/sec                    ( +-  0.00% )
          139,025      branch-misses:u           #    2.25% of all branches          ( +-  0.29% )

         0.014362 +- 0.000132 seconds time elapsed  ( +-  0.92% )

Performance counter stats for 'opt-new -O3 -o /dev/null /dev/null' (100 runs):

            12.80 msec task-clock:u              #    0.883 CPUs utilized            ( +-  1.02% )
                0      context-switches:u        #    0.000 K/sec                  
                0      cpu-migrations:u          #    0.000 K/sec                  
            2,760      page-faults:u             #    0.216 M/sec                    ( +-  0.00% )
       21,916,544      cycles:u                  #    1.712 GHz                      ( +-  1.19% )
       24,035,752      instructions:u            #    1.10  insn per cycle           ( +-  0.00% )
        5,788,238      branches:u                #  452.062 M/sec                    ( +-  0.00% )
          139,649      branch-misses:u           #    2.41% of all branches          ( +-  1.14% )

         0.014502 +- 0.000147 seconds time elapsed  ( +-  1.01% )

So even without anything to verify etc, the number of instructions/branches are smaller with the patch (looking at cycles and task-clock is more inconclusive since it varies a bit on my test server, but I'd say that opt-old and opt-new performs equally good here).

foad accepted this revision.Jan 14 2021, 2:47 AM

Thanks for doing the timings. The patch seems reasonable to me. I'll approve it but it would be great if you could find at least one other reviewer to take a look too.

An alternative implementation would be to use SetVector instead of SmallVector for each of the Required, RequiredTransitive, Preserved and Used sets.

Incidentally I tried a similar kind of optimisation here: https://reviews.llvm.org/differential/diff/316602/
It's still work in progress because (a) I still don't fully understand setLastUser and (b) I haven't done any performance measurements on it.

This revision is now accepted and ready to land.Jan 14 2021, 2:47 AM
This revision was landed with ongoing or failed builds.Jan 20 2021, 4:56 AM
This revision was automatically updated to reflect the committed changes.