This is an archive of the discontinued LLVM Phabricator instance.

[BypassSlowDivision] Do not bypass division of hash-like values
ClosedPublic

Authored by n.bozhenov on Dec 31 2016, 7:06 AM.

Details

Summary

Disable bypassing if one of the operands looks like a hash value. Slow
division often occurs in hashtable implementations, and the fast path is
never taken there because a hash value is extremely unlikely to have
enough of its upper bits set to zero.

A value is considered hash-like if it is produced by:

  1. an XOR operation,
  2. a multiplication by a constant wider than the shorter type, or
  3. a PHI instruction with at least one XOR or MUL+CONST operand.
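
For illustration, here is the kind of code this targets (a C++ sketch, not code from the patch; the constants are the published 64-bit FNV-1a parameters):

#include <cstddef>
#include <cstdint>

// FNV-1a: every round XORs in a byte and multiplies by a 64-bit prime, so
// the result almost always has bits set above the low 32.
uint64_t fnv1a(const char *Data, size_t Len) {
  uint64_t Hash = 14695981039346656037ULL; // FNV offset basis
  for (size_t I = 0; I != Len; ++I) {
    Hash ^= static_cast<unsigned char>(Data[I]); // criterion 1: XOR
    Hash *= 1099511628211ULL;                    // criterion 2: wide constant
  }
  return Hash; // in the IR, the loop makes Hash a PHI (criterion 3)
}

size_t bucketFor(const char *Key, size_t Len, size_t NumBuckets) {
  // This 64-bit remainder is the slow division in question: bypassing it
  // with a 32-bit fast path is wasted work, since the hash practically
  // never fits into 32 bits.
  return fnv1a(Key, Len) % NumBuckets;
}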

Diff Detail

Repository
rL LLVM

Event Timeline

n.bozhenov updated this revision to Diff 82767.Dec 31 2016, 7:06 AM
n.bozhenov retitled this revision from to [BypassSlowDivision] Do not bypass division of hash-like values.
n.bozhenov updated this object.
jlebar added inline comments.Dec 31 2016, 10:20 AM
lib/Transforms/Utils/BypassSlowDivision.cpp
78 ↗(On Diff #82767)

s/the value/a value/

88 ↗(On Diff #82767)

Suggest s/quite// and s/generally// -- we weaken it with "generally" then strengthen it with "quite", and then what do we mean exactly? :)

88 ↗(On Diff #82767)

unlikely that such values will fit into the shorter type

(If you want to use "for", you could use "uncommon for such values to fit into the shorter type", but I think that's less good here.)

105 ↗(On Diff #82767)

s/At this point//

106 ↗(On Diff #82767)

No comma needed after "So" here.

106 ↗(On Diff #82767)

I don't actually understand what you mean by the first sentence. Can you try to rephrase wrt what the bitcast has to do with constant hoisting?

107 ↗(On Diff #82767)

actually a constant

125 ↗(On Diff #82767)

We don't need this case?

157 ↗(On Diff #82767)

I'd just get rid of the last sentence entirely -- it's clear what's going on.

159 ↗(On Diff #82767)

We only want to do this check for the dividend, right? I mean, I guess it doesn't really hurt to do it for the divisor too, but that isn't motivated by hashtables.

test/CodeGen/X86/bypass-slow-division-64.ll
138 ↗(On Diff #82767)

Can we test with a constant that *does* fit into 32 bits?

sanjoy added a subscriber: sanjoy.Dec 31 2016, 10:37 AM

Some minor drop-by comments inline

lib/Transforms/Utils/BypassSlowDivision.cpp
93 ↗(On Diff #82767)

LLVM style is Width, Depth.

118 ↗(On Diff #82767)

Why not:

bool Found = llvm::any_of(P->incoming_values(), [&](Value *V) {
  return isHashLikeValue(V, Width, Depth);
});

if (Found)
  return true;
n.bozhenov updated this revision to Diff 89518.Feb 23 2017, 9:00 AM

The patch has been rebased onto D29897.

n.bozhenov marked 8 inline comments as done.Feb 23 2017, 9:01 AM
n.bozhenov added inline comments.
lib/Transforms/Utils/BypassSlowDivision.cpp
118 ↗(On Diff #82767)

That's cool! But do you think such code would really be easier to read? :)

125 ↗(On Diff #82767)

My idea was to add an explicit do-nothing default branch because we are not going to cover all possible enumeration values. But that doesn't work as intended, because getOpcode() returns a plain unsigned int rather than an enumeration, so the compiler cannot warn about unhandled cases either way. I have deleted the default branch.
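
For context, the switch in question has roughly this shape (a sketch; the helper names are hypothetical, not from the patch):

switch (I->getOpcode()) {
case Instruction::Xor:
  return true;
case Instruction::Mul:
  return isMulByWideConstant(I); // hypothetical helper
case Instruction::PHI:
  return isHashLikePHI(cast<PHINode>(I)); // hypothetical helper
}
return false; // any other opcode is not hash-like; no warning is lost,
              // since the switch is over a plain unsigned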

159 ↗(On Diff #82767)

Basically, yes. This patch was motivated by hashtables and I expect this check to fire mostly for dividends. However, I tried to make the patch as generic as possible. And if we find a divisor that looks like a hash value, disabling bypassing for such a division operation will be a good idea.

test/CodeGen/X86/bypass-slow-division-64.ll
138 ↗(On Diff #82767)

Isn't the next test what you're talking about?

jlebar added inline comments.Feb 23 2017, 2:05 PM
lib/Transforms/Utils/BypassSlowDivision.cpp
228 ↗(On Diff #89518)

Perhaps we can write this as

return C && C->getValue()...

And then below, we can write

return any_of(P->incoming_values(), [&](Value *V) {
  return isHashLikeValue(V, Width, Depth);
});

We could even move the cast into the any_of call.

return any_of(cast<PHINode>(I)->incoming_values(), [&](Value *V) {
  return isHashLikeValue(V, Width, Depth);
});

This is, I think, in the spirit of the LLVM style guide's section about "reducing nesting" and having early returns / breaks.

265 ↗(On Diff #89518)

Maybe s/is supposed to detect/tries to detect/

118 ↗(On Diff #82767)

I do, because it makes the high-level structure of the program explicit. That is, rather than saying "do a loop that does some stuff", we say "check if any elements have property X".
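
For contrast, the loop form presumably being replaced reads something like this (a sketch; only the any_of version appears in the review):

// Explicit loop: the reader has to simulate it to recover the intent.
for (Value *V : P->incoming_values())
  if (isHashLikeValue(V, Width, Depth))
    return true;
return false;

// any_of: the question -- does any incoming value look hash-like? -- is
// stated directly.
return any_of(P->incoming_values(), [&](Value *V) {
  return isHashLikeValue(V, Width, Depth);
});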

test/CodeGen/X86/bypass-slow-division-fnv.ll
3 ↗(On Diff #89518)

Like in the earlier discussion, can we just test this in LLVM IR?

Instead of testing loop unrolling, could we just run your testcase through opt -O3 -unroll-count=4 (maybe we only need unroll-count=2) and check that in? Then here we'd just need to run codegenprepare and check the output of that.

We prefer for unit tests to test small units of code, instead of running the whole optimization pipeline. Whole-pipeline tests are better suited for the test-suite.

10 ↗(On Diff #89518)

again s/is supposed to verify/verifies

12 ↗(On Diff #89518)

Starting with "reorganized", I don't understand what this is saying, or how it meshes with the comment at the top of the file. Can we harmonize these comments and maybe just have one comment instead of two?

n.bozhenov updated this revision to Diff 90229.Mar 1 2017, 2:04 PM
n.bozhenov marked an inline comment as done.

Moved most of the tests into CodeGenPrepare/NVPTX and changed the RUN line in bypass-slow-division-fnv.ll.

n.bozhenov marked 3 inline comments as done.Mar 1 2017, 2:11 PM
n.bozhenov added inline comments.
test/CodeGen/X86/bypass-slow-division-fnv.ll
3 ↗(On Diff #89518)

We prefer for unit tests to test small units of code, instead of running the whole optimization pipeline. Whole-pipeline tests are better suited for the test-suite.

I believe the main difference is that the test-suite is for execution tests while lit tests are for unit testing, isn't it?

Instead of testing loop unrolling, could we just run your testcase through opt -O3 -unroll-count=4 (maybe we only need unroll-count=2) and check that in? Then here we'd just need to run codegenprepare and check the output of that.

That wouldn't make much sense. The recognized patterns are already checked with simpler tests I have put into bypass-slow-div-special-cases.ll. The purpose of this particular file is to make sure that the patterns we recognize are indeed the patterns produced by the middle-end at the maximal optimization level (I have removed the explicit unroll-count option).

The problem this test tries to solve is that the PHI heuristic in isHashLikeValue is quite fragile. For example, it could be broken if some loop optimization (unrolling, vectorization, whatever) produced slightly different code with an additional instruction between the XOR and the REM. This test verifies that there are no such additional instructions and that the code is correctly recognized as hash-like.

And since the set of middle-end optimizations and their parameters is target-specific, I would like to keep this test X86-specific. It is not an automatically generated test, it has only one CHECK statement, and therefore it is not a problem to have it as an llc test.

jlebar edited edge metadata.EditedMar 2 2017, 3:42 PM

That wouldn't make much sense. The recognized patterns are already checked with simpler tests I have put into bypass-slow-div-special-cases.ll. The purpose of this particular file is to make sure that the patterns we recognize are indeed the patterns produced by the middle-end at the maximal optimization level (I have removed the explicit unroll-count option).

It sounds like there's an implicit "on this particular testcase" at the end of this sentence. "We are checking that we recognize the patterns produced by the middle-end at -O3 *on this particular testcase*."

The problem this test tries to solve is that the PHI heuristic in isHashLikeValue is quite fragile. For example, it could be broken if some loop optimization (unrolling, vectorization, whatever) produced slightly different code with an additional instruction between the XOR and the REM. This test verifies that there are no such additional instructions and that the code is correctly recognized as hash-like.

It seems to me that if you think the code is fragile as-is, we should try to make it less fragile, so that a test targeting one specific testcase isn't necessary. For instance, we could increase the search depth some? This would obviate the need for an uber-specific, full-pipeline testcase. It would also keep us honest, in that we're not optimizing for one particular chunk of code.

And since the set of middle-end optimizations and their parameters is target-specific, I would like to keep this test X86-specific. It is not an automatically generated test, it has only one CHECK statement, and therefore it is not a problem to have it as an llc test.

If we really wanted to run the full pipeline, I don't see why we couldn't simply run opt -O3 | opt -codegenprepare. Can you think of a specific bug in this pass that the llc test would catch but this would not?

That wouldn't make much sense. The recognized patterns are already checked with simpler tests I have put into bypass-slow-div-special-cases.ll. The purpose of this particular file is to make sure that the patterns we recognize are indeed the patterns produced by the middle-end at the maximal optimization level (I have removed the explicit unroll-count option).

It sounds like there's an implicit "on this particular testcase" at the end of this sentence. "We are checking that we recognize the patterns produced by the middle-end at -O3 *on this particular testcase*."

Of course, on this particular testcase. But it is not some arbitrary testcase. While working on this patch I wanted to come up with a heuristic that recognizes the most important hashing algorithms, and the FNV algorithms are definitely among them: for example, both libstdc++ and libc++ use them to hash strings.

Unfortunately, the implemented heuristic is still incapable of disabling division bypassing in std::unordered_map<std::string, whatever>. The problem is that the calculated hash value is stored to memory before the division, so the def-use traversal never reaches the XOR and MUL that produced it. So, while I'm happy with the XOR and MUL heuristics, I'm not quite satisfied with the PHI heuristic.

The problem this test tries to solve is that the PHI heuristic in isHashLikeValue is quite fragile. For example, it could be broken if some loop optimization (unrolling, vectorization, whatever) produced slightly different code with an additional instruction between the XOR and the REM. This test verifies that there are no such additional instructions and that the code is correctly recognized as hash-like.

It seems to me that if you think the code is fragile as-is, we should try to make it less fragile, so that a test targeting one specific testcase isn't necessary. For instance, we could increase the search depth some? This would obviate the need for an uber-specific, full-pipeline testcase. It would also keep us honest, in that we're not optimizing for one particular chunk of code.

Just increasing the search depth doesn't sound right to me, because we would risk many false positives.

After thinking it over once again, my idea now is to change the PHI heuristic. I believe a better approach would be to require all incoming values to be long, instead of using any_of, with a check like

return all_of(P->incoming_values(), [&](Value *V) {
  return getValueRange(V) == VALRNG_LONG;
});

This way it would also be safe (no false positives) to increase the search depth and probably make it unlimited.

I will try to prepare a new patch and upload it by tomorrow.

n.bozhenov updated this revision to Diff 90608.Mar 5 2017, 6:43 AM

Changed the PHI heuristic as planned.

And since the set of middle-end optimizations and their parameters is target-specific, I would like to keep this test X86-specific. It is not an automatically generated test, it has only one CHECK statement, and therefore it is not a problem to have it as an llc test.

If we really wanted to run the full pipeline, I don't see why we couldn't simply run opt -O3 | opt -codegenprepare. Can you think of a specific bug in this pass that the llc test would catch but this would not?

The reason is that opt -codegenprepare doesn't run BypassSlowDivision for X86. So, I have to run the whole X86 backend to check that the division is not bypassed.

I believe that the new version of the heuristic is much more robust than the previous one. However, a naive implementation didn't work initially. It turned out that hash values weren't recognized because of PHI nodes with undef incoming values generated by loop-unroll. I had to amend the patch to work around the problem.
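
A sketch of what that workaround might look like (assuming the all_of form proposed earlier; the actual patch may differ):

return all_of(P->incoming_values(), [&](Value *V) {
  // Undef incoming values are an artifact of loop unrolling and carry no
  // range information, so treat them as neutral rather than as short.
  return isa<UndefValue>(V) || getValueRange(V) == VALRNG_LONG;
});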

Earlier I had problems with bitcast instructions: only after running the whole-pipeline test did I find that long constants are sometimes hidden behind bitcast instructions.

Obviously, a manually created test would never contain such unexpected things.

Previously you suggested manually running the testcase through opt -O3 ... and checking that in. This wouldn't work either: some future change in the middle-end could break the pattern recognition in codegenprepare, and we would never find out because our codegen input would stay the same.

So, I believe that it is worth having a full pipeline test to catch anything unexpected coming from the middle-end and breaking the pattern recognition.

jlebar requested changes to this revision.Mar 5 2017, 11:01 AM
jlebar added a subscriber: hfinkel.

The reason is that opt -codegenprepare doesn't run BypassSlowDivision for X86. So, I have to run the whole X86 backend to check that the division is not bypassed.

We can't make it an NVPTX test like we did the last time we had this problem?

I believe that the new version of the heuristic is much more robust than the previous one. However, a naive implementation didn't work initially. It turned out that hash values weren't recognized because of PHI nodes with undef incoming values generated by loop-unroll. I had to amend the patch to work around the problem.

It doesn't look like you added any "unit" tests to cover this changed behavior. Was that intentional?

Previously you suggested manually running the testcase through opt -O3 ... and checking that in. This wouldn't work either: some future change in the middle-end could break the pattern recognition in codegenprepare, and we would never find out because our codegen input would stay the same.

Sure. It's also true that clang might change how it emits this code, or that you might have a front-end that isn't clang and also emits this code differently.

Ultimately I'm not comfortable LGTM'ing this patch so long as it contains this -O3 test. This is very different than how I understood we write tests. But that might just be my ignorance speaking; what you're trying to do may in fact be perfectly fine; I really don't know enough to say for sure. Let's get a second opinion? If @hfinkel or someone thinks this is the right approach, I'm happy.

lib/Transforms/Utils/BypassSlowDivision.cpp
85 ↗(On Diff #90608)

I would prefer VisitedSetTy or something like that.

85 ↗(On Diff #90608)

SmallSet<T*> is a typedef for SmallPtrSet. There are 3x as many mentions of SmallPtrSet as SmallSet in LLVM (that's counting all SmallSets, not just SmallSets of pointers), so I think we should call this SmallPtrSet (and change the #include).

229 ↗(On Diff #90608)

Can we add a comment explaining why we return true here?

230 ↗(On Diff #90608)

Sounds like we should call this VisitedPhis?

231 ↗(On Diff #90608)

Why the change from any_of to all_of? You said earlier:

This way it would also be safe (no false positives) to increase the search depth and probably make it unlimited.

I am not sure what you mean by this. Are you making two claims about this change:

a. It's "safe" in that it reduces our false positives, and
b. It lets us safely increase the search depth to infinity?

I don't see how the switch to all_of changes anything with respect to the cost of the recursion. I agree that pretty much any sane code is not going to have deep recursion here. But that's not what I'm worried about. I'm worried about some crazy code that causes us to have quadratic behavior. *That's* what the depth limit is there to protect against.

With respect to (a) I see "safety" opposite of how you've described it. If we declare that something is hash-like, we won't optimize it, and not optimizing is the "safe" thing to do. Therefore the safe/conservative choice is to declare something as "maybe hash-like". Therefore any_of is safer / more conservative than all_of.

I do have a major safety concern here, but it's about how isHashLikeValue is being used in getValueRange. Regardless of whether we use any_of or all_of here, it is *not* safe to return VALRNG_LONG if isHashLikeValue(V). If we return VALRNG_LONG, that is a *promise* that the value is long, and we take advantage of this explicit promise. isHashLikeValue is *not* a promise that the value is long; it's just a heuristic. So that whole thing needs to change. Please also add tests to catch this tricky bug.

(Perhaps you were accidentally working around this bug in getValueRange by using all_of here and that's the source of this confusion.)

This revision now requires changes to proceed.Mar 5 2017, 11:01 AM
n.bozhenov updated this revision to Diff 93007.Mar 24 2017, 2:07 PM
n.bozhenov edited edge metadata.

I believe that the new version of the heuristic is much more robust than the previous one. However, a naive implementation didn't work initially. It turned out that hash values weren't recognized because of PHI nodes with undef incoming values generated by loop-unroll. I had to amend the patch to work around the problem.

It doesn't look like you added any "unit" tests to cover this changed behavior. Was that intentional?

I slightly updated a few tests for the newer heuristic, but generally the tests stay the same. It was intentional, yes.

Ultimately I'm not comfortable LGTM'ing this patch so long as it contains this -O3 test. This is very different than how I understood we write tests. But that might just be my ignorance speaking; what you're trying to do may in fact be perfectly fine; I really don't know enough to say for sure. Let's get a second opinion? If @hfinkel or someone thinks this is the right approach, I'm happy.

Ok. I have removed the test in question. Probably I kind of abused the lit testing infrastructure.

n.bozhenov marked 2 inline comments as done.Mar 24 2017, 2:23 PM
n.bozhenov added inline comments.
lib/Transforms/Utils/BypassSlowDivision.cpp
230 ↗(On Diff #90608)

In the current implementation only PHI nodes are visited, but in the future it could be extended to cover SELECT instructions as well. I'm not sure we should put Phi into the name.

231 ↗(On Diff #90608)

Why the change from any_of to all_of? You said earlier:

This way it would also be safe (no false positives) to increase the search depth and probably make it unlimited.

I am not sure what you mean by this. Are you making two claims about this change:

a. It's "safe" in that it reduces our false positives, and
b. It lets us safely increase the search depth to infinity?

Yes, I'm making both of these claims. Previously I considered a PHINode to be hash-like if any of its inputs was hash-like. If I made the search depth unlimited, a multiplication by a long constant would disable bypassing for all divisions reachable from it via a chain of PHINodes, even if at runtime another input value were fed into those divisions.

But in the current implementation I require all incoming values of a PHINode to be hash-like, which is a much stricter restriction. Because of that, it is safe (conservative) to increase the search depth: a single multiplication cannot poison some distant division.
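
A hypothetical C++ illustration of the difference (standing in for the IR; the names are invented):

uint64_t pick(uint64_t A, uint64_t B, uint64_t D, bool Sel) {
  uint64_t H = A * 1099511628211ULL; // hash-like: multiply by wide constant
  uint64_t V = Sel ? H : B;          // becomes a PHI node in the IR
  // With any_of, the single hash-like input H marks the PHI hash-like and
  // this division is never bypassed, even when Sel is false at runtime and
  // a short B flows through. With all_of, the unknown input B keeps the
  // PHI unclassified and the division can still be bypassed.
  return V / D;
}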

I don't see how the switch to all_of changes anything with respect to the cost of the recursion. I agree that pretty much any sane code is not going to have deep recursion here. But that's not what I'm worried about. I'm worried about some crazy code that causes us to have quadratic behavior. *That's* what the depth limit is there to protect against.

As we stop traversal of the IR after finding the first value that is neither a PHINode nor a MUL/XOR, the number of traversed instructions is small for any non-malicious IR. And the recursion depth is limited by the length of the longest chain of consecutive PHINodes.

With respect to (a) I see "safety" opposite of how you've described it. If we declare that something is hash-like, we won't optimize it, and not optimizing is the "safe" thing to do. Therefore the safe/conservative choice is to declare something as "maybe hash-like". Therefore any_of is safer / more conservative than all_of.

My point is that without this patch all long divisions are bypassed, and generally that is a very useful optimization. But in some rare cases the optimization is useless (not harmful) and we would like to disable it. Such disabling is quite risky, because not bypassing the wrong division may cause a significant performance degradation. So, here "safety" means "bypass when in doubt".

I do have a major safety concern here, but it's about how isHashLikeValue is being used in getValueRange. Regardless of whether we use any_of or all_of here, it is *not* safe to return VALRNG_LONG if isHashLikeValue(V). If we return VALRNG_LONG, that is a *promise* that the value is long, and we take advantage of this explicit promise. isHashLikeValue is *not* a promise that the value is long; it's just a heuristic. So that whole thing needs to change. Please also add tests to catch this tricky bug.

(Perhaps you were accidentally working around this bug in getValueRange by using all_of here and that's the source of this confusion.)

Actually, the comment at the VALRNG_LONG declaration says that it is not a promise. VALRNG_LONG means that a value is *unlikely* to fit into the shorter type. If we needed a promise, we would use ValueTracking only and would never have introduced an additional analysis in isHashLikeValue. But for this particular optimization it doesn't matter whether a value is *guaranteed* to be long or just *likely* to be long.

jlebar requested changes to this revision.Mar 25 2017, 1:14 PM

Actually, the comment at the VALRNG_LONG declaration says that it is not a promise.

Ah, you're right. I thought that the code that did "no division is needed at all: The quotient is 0 and the remainder is equal to Dividend." checked VALRNG_LONG, but it doesn't.

Regardless of what the comments say, I think this reveals that it is very confusing that we have VALRNG_SHORT, which is a promise, and VALRNG_LONG, which is not.

Maybe we should rename them to VALRNG_KNOWN_SHORT and VALRNG_LIKELY_LONG?
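
Under that proposal the enum would read roughly like this (a sketch; the VALRNG_UNKNOWN member is an assumption about the middle case, not confirmed by the review):

enum ValueRange {
  VALRNG_KNOWN_SHORT, // a promise: the operand provably fits the short type
  VALRNG_UNKNOWN,     // nothing known; emit the runtime check
  VALRNG_LIKELY_LONG  // a heuristic: the operand probably does not fit
};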

I slightly updated a few tests for the newer heuristic, but generally the tests stay the same. It was intentional, yes.

I can't approve a patch if its tests don't cover its behavior, sorry. If the behavior was worth changing between versions of the patch, it needs test coverage.

As we stop traversal of the IR after finding the first value that is neither a PHINode nor a MUL/XOR, the number of traversed instructions is small for any non-malicious IR. And the recursion depth is limited by the length of the longest chain of consecutive PHINodes.

It's a design goal of LLVM to handle even pathological IR. I think this applies here. I cannot approve this patch if it blows up on IR with many consecutive phi nodes, sorry.

lib/Transforms/Utils/BypassSlowDivision.cpp
230 ↗(On Diff #93007)

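// insert() returns an (iterator, bool) pair whose bool is false when I is
// already in the set, so this records and tests membership in one call: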
if (!Visited.insert(I).second)

This revision now requires changes to proceed.Mar 25 2017, 1:14 PM
n.bozhenov updated this revision to Diff 93086.Mar 26 2017, 3:31 PM
n.bozhenov edited edge metadata.

Actually, the comment at the VALRNG_LONG declaration says that it is not a promise.

Ah, you're right. I thought that the code that did "no division is needed at all: The quotient is 0 and the remainder is equal to Dividend." checked VALRNG_LONG, but it doesn't.

Regardless of what the comments say, I think this reveals that it is very confusing that we have VALRNG_SHORT, which is a promise, and VALRNG_LONG, which is not.

Maybe we should rename them to VALRNG_KNOWN_SHORT and VALRNG_LIKELY_LONG?

Done.

I slightly updated a few tests for the newer heuristic, but generally the tests stay the same. It was intentional, yes.

I can't approve a patch if its tests don't cover its behavior, sorry. If the behavior was worth changing between versions of the patch, it needs test coverage.

I have added one test case which is handled differently by the new and old versions of the patch.

As we stop traversal of the IR after finding the first value that is neither a PHINode nor a MUL/XOR, the number of traversed instructions is small for any non-malicious IR. And the recursion depth is limited by the length of the longest chain of consecutive PHINodes.

It's a design goal of LLVM to handle even pathological IR. I think this applies here. I cannot approve this patch if it blows up on IR with many consecutive phi nodes, sorry.

I have limited the number of visited PHINodes for a single division. Now no more than 16 PHINodes are traversed before giving up.
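
A sketch of that cap (the limit of 16 as stated above; the set name, signature, and surrounding code are schematic, not the committed source):

case Instruction::PHI: {
  auto *P = cast<PHINode>(I);
  // Give up on pathological IR: traverse at most 16 PHI nodes per division,
  // so long chains of PHIs cannot blow up the search.
  if (Visited.size() >= 16)
    return false;
  if (!Visited.insert(P).second)
    return false; // already seen: break cycles of PHIs
  return all_of(P->incoming_values(), [&](Value *V) {
    return isa<UndefValue>(V) || isHashLikeValue(V, Visited);
  });
}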

This revision is now accepted and ready to land.Mar 27 2017, 9:15 AM
This revision was automatically updated to reflect the committed changes.