
[SimplifyCFG] Use lookup tables when they are more space efficient or a huge speed win.
Needs Revision · Public

Authored by shawnl on Apr 22 2019, 2:55 PM.



I am trying to move towards much more space-efficient switch statements, using popcnt, as described in PR39013. This is patch 6 towards that goal, and a continuation of D60673.

There are quite a few cases where the lookup table was smaller and yet it was still not used. The calculation also did not consider how large the cases were. These numbers will change in a later patch (when sparse maps make switch much more compact), so we shouldn't argue too much about this cut-off.
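The popcnt-based sparse map idea referenced here (and in PR39013) can be sketched roughly as follows. This is a hedged illustration, not code from this patch; `sparseLookup` and its parameters are hypothetical names:

```cpp
#include <cstdint>

// Sketch of the popcnt idea: a bitmap records which case values exist, and
// the popcount of the bits *below* the queried value gives the slot in a
// dense result table, so sparse switches need no near-empty lookup table.
int sparseLookup(uint64_t Bitmap, const int *Dense, unsigned Idx, int Default) {
  uint64_t Bit = 1ULL << Idx;
  if (!(Bitmap & Bit))
    return Default;                    // value has no case: take the default
  uint64_t Below = Bitmap & (Bit - 1); // bits for smaller case values
  return Dense[__builtin_popcountll(Below)];
}
```

With cases {1, 3, 5} the dense table holds just three entries, yet any value in 0..63 can be dispatched with one bitmap test plus a popcount.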

When only an i8 is being switched over, a complete table is not large, and avoiding the branch of a regular lookup table is a significant speed win (as can be seen in the tests).
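The "covered table" point can be illustrated with a small sketch (hypothetical helpers, not the patch's actual code): when the condition is an i8 and the table has one entry per possible value, the usual bounds check before the load can be dropped, because every uint8_t value is a valid index.

```cpp
#include <array>
#include <cstdint>

// Covered table: 256 entries, one per possible i8 value, so the lookup is
// branch-free.
int lookupCovered(uint8_t C, const std::array<int, 256> &Table) {
  return Table[C]; // no range check needed: the table covers all of i8
}

// A regular (partial) lookup table still needs the guard branch.
int lookupPartial(uint64_t C, const int *Table, uint64_t TableSize,
                  int Default) {
  return C < TableSize ? Table[C] : Default;
}
```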


Event Timeline

shawnl created this revision.Apr 22 2019, 2:55 PM
shawnl updated this revision to Diff 196141.Apr 22 2019, 3:13 PM
shawnl edited the summary of this revision. (Show Details)

add more test results

shawnl updated this revision to Diff 196440.Apr 24 2019, 6:38 AM

also remove Sub (turns out Sub does not get normalized to Add like I thought it would).

lebedev.ri set the repository for this revision to rL LLVM.
lebedev.ri removed a subscriber: lebedev.ri.
shawnl updated this revision to Diff 196637.Apr 25 2019, 8:03 AM

fix i8 use of covered table to work when the return is not i8, do not use -O2 in tests. Update tests.

jmolloy requested changes to this revision.Apr 25 2019, 9:00 AM

Please reupload with full context as described in the developer's guide.

These numbers will change in a later patch (when sparse maps make switch much more compact), so we shouldn't argue too much about this cut-off.

Are you referring to your changes in builtins/? If so, that's surely part of compiler-rt, and you can't expect targets to have that available at runtime.


( 3 * 8 )

... but why? This isn't obvious from the code so needs comments. Note that the original code randomly dividing by 10 was just as bad, but you've clearly got a good reason for this change so should document it ;)



Bold claim :)


Nit: please use the term "Largest". They're synonymous and I know this is nitpicking, but it's the more generally used term.


This is only used in one place; please fold into its use.


Please describe this heuristic in more detail. Why is it always good?

Take a look at the line 5162 in the diffbase for a good example of a heuristic description. It captures what the criterion is, and a rationale for it.


These magic numbers have no place in SimplifyCFG. If you need to be this accurate, add a TTI hook.


But why? You've replaced 40% with 33% and you mention 64-bit integers but the following heuristic doesn't use 64-bit integers anywhere.


Please upload the code for review as it will be committed.


This assumes whatever constant add/sub/xor was planted still exists. There's no guarantee of that; for example if V is a constant, IRBuilder would have constant folded immediately.

In general, unstitching code like this is inherently problematic. Note that this isn't the same as planting an inverting code sequence - that's fine. We can use the optimizer to clean the duplicated stuff up and everyone is happy. That is much preferable to ham-handedly unstitching stuff and looking for patterns of code you've just emitted. It also adds an implicit contract between different parts of SimplifyCFG that I guarantee someone will miss when they update this code :)


This comment doesn't parse for me.


This code structure is a little hard to understand. Instead of trying to tack onto the end of the if statement to reuse the heuristic, can you extract the heuristic into a bool and reuse it explicitly here? Then the reader doesn't need to think about context.


It's not clear to me why this matters from this context. Perhaps if you wrote something like:

// Call ReduceSwitchRange *after* SwitchToLookupTable as SwitchToLookupTable calls this internally.

This revision now requires changes to proceed.Apr 25 2019, 9:00 AM
shawnl marked 2 inline comments as done.Apr 25 2019, 11:51 AM

Thanks for the review!

I've split the other patch into 5 distinct patches, which I will submit once I run the tests.

There is one part of this review that I need some clarification on before I can rev this patch. (see below comment)


The Xor wasn't added by this code. The problem is that this pass gets run multiple times, sometimes without the table generation (because it can make the code analyzable). There is nowhere else this optimization can go, because it is specific to switch statements, where the operands can be re-ordered arbitrarily. Also, there is no assumption that these instructions are there: it simply checks whether they are there and, if so, removes them and returns true so that the other optimizations run before we continue.

So I feel this is a planted inverting code sequence: this is not "unstitching" (and the code does not generate xors but does remove them): it is an optimization specific to covered tables.

jmolloy added inline comments.Apr 25 2019, 12:05 PM

My worry about this code is that it doesn't assert the preconditions or postconditions. AFAICS, it detects any Add, Sub or Xor and does some transform on them. Why is this correct in all cases? What if one of these instructions were missed? What if there is a sequence of these instructions, is it correct to just adjust a subset? What if they were reordered or constant folded?

The code doesn't explain exactly what it's checking for, who generates it and why it's correct to remove (and what happens if you have a false negative or false positive).

Note I'm not saying it's wrong, or perhaps it's the only way to do this, but with the lack of comment it remains a code smell at the moment and it's difficult for me to suggest an alternative.

nikic added inline comments.Apr 25 2019, 12:17 PM

I think the idea here is that given two bijective (total) functions f(x) and lookup(y), then lookup(f(x)) is also a bijective function that can be implemented as a new lookup table lookup_f(x). In this case add, sub and xor with a constant are the f(x)s.

It seems like this could be an independent and more general optimization though, that works on any table lookup.
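The composition nikic describes can be sketched concretely (a hypothetical helper, not LLVM code): for an invertible f such as f(x) = x ^ K, lookup(f(x)) can be folded into a single pre-permuted table LookupF[x] = Lookup[f(x)], so no arithmetic survives at runtime.

```cpp
#include <array>
#include <cstdint>

// Fold f(x) = x ^ K into the table itself: LookupF[x] = Lookup[x ^ K].
// The same construction works for add/sub with a constant, since those are
// also bijections on the i8 domain.
std::array<int, 256> composeXor(const std::array<int, 256> &Lookup, uint8_t K) {
  std::array<int, 256> LookupF{};
  for (unsigned X = 0; X < 256; ++X)
    LookupF[X] = Lookup[static_cast<uint8_t>(X ^ K)];
  return LookupF;
}
```

After this rewrite, `Lookup[C ^ K]` becomes a plain `LookupF[C]`, which is what makes removing the xor safe: the permuted table is equivalent by construction.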

hans added a comment.Apr 26 2019, 1:58 AM

I'm trying to follow along here, but there's so much churn I'm not sure what I'm supposed to be reviewing?

Are you asking for review on this patch? Is it ready for review? Or should I look at the other five patches you alluded to, and if so should this be marked abandoned?

shawnl planned changes to this revision.Apr 26 2019, 12:12 PM

This patch got a pretty comprehensive review and I will submit a new version for review. The other patches are also live.

shawnl updated this revision to Diff 197604.May 1 2019, 11:36 AM
shawnl edited the summary of this revision. (Show Details)
shawnl marked an inline comment as done.
shawnl added inline comments.

I am working on doing this as a general optimization of GEPs.

shawnl edited the summary of this revision. (Show Details)May 1 2019, 11:37 AM
shawnl updated this revision to Diff 197625.May 1 2019, 1:17 PM


jmolloy accepted this revision.May 2 2019, 1:11 AM

This seems reasonable to me.

This revision is now accepted and ready to land.May 2 2019, 1:11 AM
jmolloy requested changes to this revision.May 2 2019, 1:12 AM

Undo approval; was looking at an incorrect diff.

This revision now requires changes to proceed.May 2 2019, 1:12 AM
shawnl requested review of this revision.May 2 2019, 5:04 PM
jmolloy requested changes to this revision.May 3 2019, 7:47 AM

Looking much better. I think the TTI hook could be described better.


What is the TableSize, and what is the CaseSize? In particular why would TableSize ever be different from NumCases?

This revision now requires changes to proceed.May 3 2019, 7:47 AM
jmolloy requested changes to this revision.May 9 2019, 6:11 AM
jmolloy added inline comments.

To be honest, I'd really just express this as:

// Return true if the given SwitchInst should be converted into a lookup table. The size of the lookup table is \c TableSize and the number of covered cases is \c NumCases (meaning the number of table entries that hit the default case is \c TableSize - \c NumCases).
bool shouldBuildLookupTable(SwitchInst *SI, uint64_t TableSize, unsigned NumCases);

Sorry for the churn here, I know I didn't mention this in your previous review. But passing the SwitchInst itself and letting the target fish out OptSize/SI->getType()->getIntegerBitWidth() seems cleaner.

Also, never use size_t or uint32_t here. TableSize's maximum extent isn't host dependent, it's target dependent. Use uint64_t to guarantee 64-bit coverage on all hosts. Similarly NumCases isn't limited to 32-bits, so use unsigned because that's what we use everywhere.

This revision now requires changes to proceed.May 9 2019, 6:11 AM
shawnl marked an inline comment as done.May 9 2019, 9:36 AM
shawnl added inline comments.

good catch on the size_t!

Similarly NumCases isn't limited to 32-bits, so use unsigned because that's what we use everywhere.

I was going to limit it to 32-bits in the next patch. Is that a problem?

jmolloy added inline comments.May 10 2019, 5:52 AM

I don't see the rationale for limiting it to 32-bits, so let's argue that in your next patch :)

In the meantime let's keep it to uint64_t or unsigned in this patch.

shawnl marked an inline comment as done.Sun, May 19, 5:15 AM
shawnl added inline comments.

Along with the comment on the multiplication overflow above: this is basic grade-school multiplication. 1/3 ≈ 33%; 64 / 8 == 8.

The heuristic has threatened to derail this patch set with bikeshedding, and has made me really frustrated.

Communication is a key part of doing this work, but when you worry that multiplication will not be understood by the reviewer it really throws a wrench in the process.

jmolloy added inline comments.Sun, May 19, 7:10 AM

Hi Shawn,

I apologise that my review has made you frustrated. If you wish to find a different reviewer, that's fine by me.

I have approximate knowledge of many things. Luckily multiplication is one of them. Clairvoyance is not, and the point of a code review is to ensure that submitted code can be comprehended by anyone, not just its author.

Communication is important as you say, and it's possible that I've been poor in my own communication. My intent with the "But why?" comments wasn't to say "I cannot understand this at all", but rather "It took me longer than it should have to understand this." The latter does not mean the code is wrong; it means it requires either a comment or refactoring to make the intended logic (not the eventual mechanics) clear.

My view on code in heuristics is that it should be crystal clear what the intent is without any mental gymnastics.

If we take the code as you've rewritten it:

// The table density should be at least 1/3rd (33%) for 64-bit integers.
// This is a guess. FIXME: Find the best cut-off.
return (uint64_t)NumCases * 3 * (64 / 8) >= TableSize * CaseSize;

There are a few things the reader needs to do here. They must realise that the comment describes density with a number (33%), but the code calculates the inverse (denominator on the LHS, numerator on the RHS). It's still not clear to me what the "for 64-bit integers" means in the comment or precisely how it impacts the heuristic function.

Heuristics are always arbitrary, often wrong. They are the trickiest pieces of code to comprehend the intent of because of this, so I pay them more attention.

Also, you changed from 40% to 33% alongside this. Was there a reasoning behind this? People who notice the effects of this change may look back at this review to find a rationale.

Again, apologies if you find this pedantic. I always aim to be anything but pedantic in my reviews.
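A self-describing form of the kind jmolloy asks for might look like the sketch below. This is purely illustrative: the 33% figure is the patch's placeholder guess, and the function name is hypothetical, not a proposal for the final code.

```cpp
#include <cstdint>

// Density check written so the percentage reads directly off the code,
// with the overflow guard kept next to the multiplication it protects.
bool denseEnough(uint64_t NumCases, uint64_t TableSize) {
  // NumCases <= TableSize, so this also guards NumCases * 100 below.
  if (TableSize == 0 || TableSize > UINT64_MAX / 100)
    return false;
  uint64_t DensityPercent = NumCases * 100 / TableSize;
  return DensityPercent >= 33; // FIXME: placeholder cut-off
}
```

Here the comparison direction matches the comment (density on the left, threshold on the right), so the reader never has to mentally invert a cross-multiplied inequality.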

hans added inline comments.Wed, May 22, 1:42 AM

To give some background, the 40% density was originally chosen to match the density used to form switch jump tables in SelectionDAGBuilder. That density is still used for -Os builds, see OptsizeJumpTableDensity in TargetLoweringBase.cpp.

Changing the threshold here may well be a good idea, especially since it has dropped to 10% for non-optsize jump tables (JumpTableDensity in TargetLoweringBase.cpp), but such a change needs at least a comment with the motivation.

Perhaps it could be synced with the jump table density? Or perhaps it could be left alone for now -- there's already a lot going on in this patch.

Currently you're returning false for optsize functions. If a switch is >40% dense it means we will instead build a jump table, which is likely to be larger so this would be both a size and performance regression.

Regarding the "randomly dividing by 10" before, that's my fault too.

The code currently looks like:

// The table density should be at least 40%. This is the same criterion as for
// jump tables, see SelectionDAGBuilder::handleJTSwitchCase.
// FIXME: Find the best cut-off.
return SI->getNumCases() * 10 >= TableSize * 4;

If I were to write that today I might have spelled it out with *100 and *40 instead to make it really clear that 40 is a percentage, but I optimized prematurely.

The "TableSize >= UINT64_MAX / 10" check above is to protect against overflow, as the comment says, but as the code evolved and that line moved further away from the density computation, it became harder to make the connection.
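Putting the guard and the density computation back side by side, the original 40% criterion might read as follows (a sketch with a hypothetical name, using only the quantities from the quoted code):

```cpp
#include <cstdint>

// Original 40% density criterion with the overflow guard co-located.
bool shouldBuildTable40(uint64_t NumCases, uint64_t TableSize) {
  // NumCases <= TableSize, so this guards both products below.
  if (TableSize >= UINT64_MAX / 10)
    return false;
  // Density >= 40%: NumCases / TableSize >= 4/10, cross-multiplied.
  return NumCases * 10 >= TableSize * 4;
}
```

Keeping the guard adjacent makes the connection Hans mentions explicit, so later refactors are less likely to separate the two again.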

I mentioned before that I have trouble following along with this patch series. This one has a title of "Use lookup tables when they are more space efficient or a huge speed win" which is pretty vague, and the inline description says things like "trying to move towards much more space-efficient switch statements, using popcnt" and "These numbers will change in a later patch, (when sparse maps make switch much more compact) so we shouldn't argue too much about this cut-off".

This makes the patch very hard to review. At least for me, it's not clear what you're trying to do.

Making the switch-to-lookup table transformation better is excellent, but it needs to be done by well-argued changes in clear and focused patches.

Thanks for the context, Hans.

This review has been fairly critical, so I wanted to suggest ways to make this patch land easier.

  1. Split it up into the NFC moving of ReduceSwitchRange to reduce code churn in this patch and make it more accessible to review. This NFC patch would land almost instantly without contention.
  2. A general rule of thumb is to split up refactoring, new functionality, and heuristic changes into different patches. In particular the former two are easier to review and approve than the latter, so if the latter is totally isolated it's much less likely to derail your other changes.
  3. I've previously explained how to make the heuristic self describing. As Hans mentioned, premature optimizations like folding constant divisions/multiplications in the code can obscure the intended calculation.

This way, the refactor and new functionality could land without much contention and the heuristic can be (a) picked over easier, (b) isolated for benchmarking and (c) isolated for rollback.