
[MemDepAnalysis] Allow caller to pass in an OrderedBasicBlock.
Closed, Public

Authored by fhahn on Mar 25 2019, 11:18 AM.

Details

Summary

If the caller can preserve the OrderedBasicBlock (OBB), we can avoid
recomputing the instruction order for each getDependency call.
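As a rough illustration of the idea in the summary (this is not the LLVM API itself; `OrderedBlock`, `dependsBefore`, and the string "instructions" are all hypothetical stand-ins), a caller-owned order cache turns repeated position queries from a linear scan each into a single numbering pass:

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for llvm::OrderedBasicBlock: lazily numbers the
// "instructions" of a block (here just strings) and caches the numbering,
// so repeated order queries are O(1) after the first scan.
class OrderedBlock {
  const std::vector<std::string> &Insts;
  std::unordered_map<std::string, unsigned> Order;
  bool Numbered = false;

  void number() {
    for (unsigned I = 0; I < Insts.size(); ++I)
      Order.emplace(Insts[I], I);
    Numbered = true;
  }

public:
  explicit OrderedBlock(const std::vector<std::string> &I) : Insts(I) {}

  // Returns true if A comes before B in the block.
  bool comesBefore(const std::string &A, const std::string &B) {
    if (!Numbered)
      number(); // computed once, reused for every later query
    return Order.at(A) < Order.at(B);
  }
};

// A query that optionally takes a caller-owned OrderedBlock. If the caller
// passes one, the numbering survives across calls; otherwise we build a
// throwaway one per call, which is the repeated cost the patch avoids.
bool dependsBefore(const std::vector<std::string> &Block,
                   const std::string &A, const std::string &B,
                   OrderedBlock *OBB = nullptr) {
  OrderedBlock Local(Block);
  OrderedBlock &Ord = OBB ? *OBB : Local;
  return Ord.comesBefore(A, B);
}
```

With a caller-owned `OrderedBlock`, only the first of many queries pays for the numbering scan.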

Event Timeline

fhahn created this revision. · Mar 25 2019, 11:18 AM
Herald added a project: Restricted Project. · Mar 25 2019, 11:18 AM
rnk added a comment. · Mar 25 2019, 11:44 AM

I was working on restarting the in-instruction order tracking discussion, but the pathological test case I came up with didn't benefit from it:

const std::vector<unsigned> &lookup(std::string k) {
  static std::map<std::string, std::vector<unsigned>> the_map = {
    {"s", {1, 2, 3}},
    {"s", {1, 2, 3}},
    // ... repeat 200x, or arbitrarily long
  };
  return the_map[k];
}

This causes clang to generate a very large BB that DSE spends a lot of time on. However, invalidating the order numbers only on removal wasn't enough to drastically improve performance, so I lost the motivation to drive it through. The ordering helped significantly with my less simplified test case (std::map<unsigned, std::vector<StringRef>>, see rC345329 which removed it), but I wasn't happy with it.
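The "invalidate only on removal" scheme described above can be sketched as follows (hypothetical names, not LLVM code): the cached numbering survives arbitrarily many queries and is dropped only when an instruction is erased, since positions after the erasure point shift.

```cpp
#include <cassert>
#include <list>
#include <string>
#include <unordered_map>

// Hypothetical sketch: a block keeps a cached instruction -> position map
// and drops the whole cache when an instruction is erased.
class NumberedBlock {
  std::list<std::string> Insts;
  std::unordered_map<std::string, unsigned> Order;
  bool Valid = false;

  void renumber() {
    Order.clear();
    unsigned I = 0;
    for (const auto &Inst : Insts)
      Order.emplace(Inst, I++);
    Valid = true;
  }

public:
  explicit NumberedBlock(std::list<std::string> I) : Insts(std::move(I)) {}

  bool comesBefore(const std::string &A, const std::string &B) {
    if (!Valid)
      renumber(); // O(block size), amortized across many queries
    return Order.at(A) < Order.at(B);
  }

  void remove(const std::string &Inst) {
    Insts.remove(Inst);
    Valid = false; // the only event that invalidates the numbering
  }
};
```

If removals are frequent relative to queries, the renumbering cost dominates again, which may be why this alone was not enough to help the pathological case.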

llvm/include/llvm/Analysis/MemoryDependenceAnalysis.h
456–457

This function only has one real call site, which is in getPointerDependencyFrom. Can you make these parameters non-optional?

llvm/lib/Analysis/MemoryDependenceAnalysis.cpp
445–448

I've seen this pattern across LLVM, and I've never particularly liked it. I don't think it's as easy to optimize as the author probably thought it was. I think this would be simpler:

unsigned DefaultLimit = BlockScanLimit;
if (!Limit)
  Limit = &DefaultLimit;
...
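The suggested shape, fleshed out as a runnable sketch (hypothetical function names; `DefaultLimit = 100` stands in for `BlockScanLimit`): defaulting the pointer once up front lets the body use `*Limit` unconditionally instead of branching on every use.

```cpp
#include <cassert>

// The pattern being objected to: check the optional pointer at every use.
unsigned scanCheckingEachUse(unsigned N, unsigned *Limit) {
  unsigned DefaultLimit = 100; // stand-in for BlockScanLimit
  unsigned Steps = 0;
  for (unsigned I = 0; I < N; ++I) {
    unsigned &L = Limit ? *Limit : DefaultLimit; // branch on every iteration
    if (L == 0)
      break;
    --L;
    ++Steps;
  }
  return Steps;
}

// The suggested simplification: default the pointer once up front, then
// dereference it unconditionally.
unsigned scanDefaultingOnce(unsigned N, unsigned *Limit) {
  unsigned DefaultLimit = 100;
  if (!Limit)
    Limit = &DefaultLimit;
  unsigned Steps = 0;
  for (unsigned I = 0; I < N; ++I) {
    if (*Limit == 0)
      break;
    --*Limit;
    ++Steps;
  }
  return Steps;
}
```

Both versions behave identically; the second is simply easier to read and to optimize, since the null check happens exactly once.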
fhahn updated this revision to Diff 192322. · Mar 26 2019, 12:23 PM
fhahn marked 2 inline comments as done.

Addressed comments, thanks!

In D59788#1441992, @rnk wrote:

> I was working on restarting the in-instruction order tracking discussion, but the pathological test case I came up with didn't benefit from it:
>
> const std::vector<unsigned> &lookup(std::string k) {
>   static std::map<std::string, std::vector<unsigned>> the_map = {
>     {"s", {1, 2, 3}},
>     {"s", {1, 2, 3}},
>     // ... repeat 200x, or arbitrarily long
>   };
>   return the_map[k];
> }
>
> This causes clang to generate a very large BB that DSE spends a lot of time on. However, invalidating the order numbers only on removal wasn't enough to drastically improve performance, so I lost the motivation to drive it through. The ordering helped significantly with my less simplified test case (std::map<unsigned, std::vector<StringRef>>, see rC345329 which removed it), but I wasn't happy with it.

Yes, it looks like there is another compile-time problem in DSE, which is independent of the order query issue. I have an input file with a huge number of stores, and the majority of the time is spent getting dependency info (but not in OrderedBB). It is probably the same issue you are seeing. Did you create a PR for it already?

rnk added a comment. · Mar 26 2019, 12:49 PM

> Yes, it looks like there is another compile-time problem in DSE, which is independent of the order query issue. I have an input file with a huge number of stores, and the majority of the time is spent getting dependency info (but not in OrderedBB). It is probably the same issue you are seeing. Did you create a PR for it already?

Nope, go for it.

fhahn added a comment. · Mar 26 2019, 3:05 PM
In D59788#1443490, @rnk wrote:

> Nope, go for it.

I'll do it tomorrow. Does the patch look ok to you now?

In D59788#1441992, @rnk wrote:

> I was working on restarting the in-instruction order tracking discussion, but the pathological test case I came up with didn't benefit from it:
>
> const std::vector<unsigned> &lookup(std::string k) {
>   static std::map<std::string, std::vector<unsigned>> the_map = {
>     {"s", {1, 2, 3}},
>     {"s", {1, 2, 3}},
>     // ... repeat 200x, or arbitrarily long
>   };
>   return the_map[k];
> }
>
> This causes clang to generate a very large BB that DSE spends a lot of time on. However, invalidating the order numbers only on removal wasn't enough to drastically improve performance, so I lost the motivation to drive it through. The ordering helped significantly with my less simplified test case (std::map<unsigned, std::vector<StringRef>>, see rC345329 which removed it), but I wasn't happy with it.

@rnk, is there any chance you still have the full reproducer? I did not manage to reproduce the long time spent in DSE with the snippet above, and unfortunately I cannot share the reproducer I have :(

rnk added a comment. · Mar 27 2019, 11:17 AM

> @rnk, is there any chance you still have the full reproducer? I did not manage to reproduce the long time spent in DSE with the snippet above, and unfortunately I cannot share the reproducer I have :(

Try this bitcode I attached in the tracker:
https://bugs.llvm.org/attachment.cgi?id=21689

This is the time report I see for it:

$ opt -time-passes -O2 longbbinit.bc -o NUL
===-------------------------------------------------------------------------===
                      ... Pass execution timing report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 16.4531 seconds (16.4770 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   9.8281 ( 64.8%)   1.2188 ( 95.1%)  11.0469 ( 67.1%)  11.0376 ( 67.0%)  Dead Store Elimination
   1.7344 ( 11.4%)   0.0000 (  0.0%)   1.7344 ( 10.5%)   1.6974 ( 10.3%)  Global Value Numbering
   0.6563 (  4.3%)   0.0156 (  1.2%)   0.6719 (  4.1%)   0.6795 (  4.1%)  Function Integration/Inlining
   0.3750 (  2.5%)   0.0000 (  0.0%)   0.3750 (  2.3%)   0.3846 (  2.3%)  Value Propagation
   0.3750 (  2.5%)   0.0000 (  0.0%)   0.3750 (  2.3%)   0.3743 (  2.3%)  Value Propagation #2
   0.2344 (  1.5%)   0.0156 (  1.2%)   0.2500 (  1.5%)   0.2515 (  1.5%)  Bitcode Writer

I'm not sure it's the exact issue you're looking at here, though.

fhahn added a comment. · Mar 27 2019, 2:28 PM
In D59788#1444919, @rnk wrote:

> @rnk, is there any chance you still have the full reproducer? I did not manage to reproduce the long time spent in DSE with the snippet above, and unfortunately I cannot share the reproducer I have :(
>
> Try this bitcode I attached in the tracker:
> https://bugs.llvm.org/attachment.cgi?id=21689
>
> This is the time report I see for it:
>
> $ opt -time-passes -O2 longbbinit.bc -o NUL
> ===-------------------------------------------------------------------------===
>                       ... Pass execution timing report ...
> ===-------------------------------------------------------------------------===
>   Total Execution Time: 16.4531 seconds (16.4770 wall clock)
>
>    ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
>    9.8281 ( 64.8%)   1.2188 ( 95.1%)  11.0469 ( 67.1%)  11.0376 ( 67.0%)  Dead Store Elimination
>    1.7344 ( 11.4%)   0.0000 (  0.0%)   1.7344 ( 10.5%)   1.6974 ( 10.3%)  Global Value Numbering
>    0.6563 (  4.3%)   0.0156 (  1.2%)   0.6719 (  4.1%)   0.6795 (  4.1%)  Function Integration/Inlining
>    0.3750 (  2.5%)   0.0000 (  0.0%)   0.3750 (  2.3%)   0.3846 (  2.3%)  Value Propagation
>    0.3750 (  2.5%)   0.0000 (  0.0%)   0.3750 (  2.3%)   0.3743 (  2.3%)  Value Propagation #2
>    0.2344 (  1.5%)   0.0156 (  1.2%)   0.2500 (  1.5%)   0.2515 (  1.5%)  Bitcode Writer
>
> I'm not sure it's the exact issue you're looking at here, though.

Thanks, that looks promising!

Are you happy with the update to this patch?

rnk accepted this revision. · Mar 28 2019, 11:41 AM

Yep, lgtm!

This revision is now accepted and ready to land. · Mar 28 2019, 11:41 AM
This revision was automatically updated to reflect the committed changes.