This is an archive of the discontinued LLVM Phabricator instance.

[JumpTables][PowerPC] Let targets decide which switch instructions are suitable for jump tables
AbandonedPublic

Authored by nemanjai on Dec 8 2017, 12:27 PM.

Details

Summary

As currently implemented, the decision whether a particular switch instruction is suitable for a jump table is made in the base class of TargetLowering. So the only control a target has is to allow or disallow emission of jump tables.
However, there is at least one target (PowerPC) and possibly others for which that decision depends on a bit more context than just the range and density.

This patch allows the targets to override these functions as well as provides the PowerPC overrides.

The PPC specific portion:

  • No jump tables for small switches - there's a lot of overhead to set up a jump table due to use of the CTR and high-latency sign-extending load
  • Avoid chaining moves to the CTR and branches on the CTR (when the target of a case has an indirect call, a CTR loop or another possible jump table)
  • Avoid multiple CTR branches close together to avoid the possibility of branch aliasing (when the block terminated by the switch has an indirect call or explicit use of the CTR)

Diff Detail

Repository
rL LLVM

Event Timeline

nemanjai created this revision.Dec 8 2017, 12:27 PM
nemanjai updated this revision to Diff 126194.Dec 8 2017, 12:29 PM

Early exit if the target independent version determines the switch isn't suitable for a jump table anyway.

efriedma added inline comments.
lib/Target/PowerPC/PPCISelLowering.cpp
1183

Not thread-safe. (We support using multiple LLVMContexts at the same time in a process.)

1209

This is worst-case quadratic (N switches times M instructions in the successor); you might need to cache the mightUseCTR computation.

nemanjai added inline comments.Dec 8 2017, 1:45 PM
lib/Target/PowerPC/PPCISelLowering.cpp
1183

Ah, OK. I'll make it a data member of the class.

1209

Well, this is linear in the sum of instructions in successors of all the cases in the switch. This is since no block is visited twice. If a CTR use is found in a block, we return so definitely won't re-vist it. If no CTR use is found in a block, it is not visited again because we skip it in the if block above.

I don't think I can make it any faster than that since I have to check each instruction for using the CTR.

Or did you have something else in mind? Perhaps you mean across multiple invocations of isSuitableForJumpTable() on the same SwitchInst?

efriedma added inline comments.Dec 8 2017, 2:27 PM
lib/Target/PowerPC/PPCISelLowering.cpp
1156

These intrinsics come out of PPCCTRLoops, right? Would it make sense to avoid loop transform rather than disabling the jump table transform?

1209

I mean across multiple invocations of isSuitableForJumpTable for different switches (assuming we don't break the critical edge, which I don't think we do unconditionally).

Eli, thanks so much for your comments. Please keep them coming. I'm not trying to be difficult about any of this - I just want to make sure the final implementation does the right thing both for compile time and run time.
I probably should have noted in the description that this provides significant run time improvements on some important workloads.

lib/Target/PowerPC/PPCISelLowering.cpp
1156

That is correct, that should be the only way to end up with these intrinsics (other than my artifical test case below). We currently won't emit them if the loop contains a switch that might be lowered to a jump table. However, we don't check for whether the loop is within a switch.

I think we want to favour loops that use the CTR because they are quite efficient. There's a single move to the CTR and the condition for the back edge is usually a [generally] well predicted branch if the CTR is non-zero. Furthermore, we only transform loops that either have an unknown trip count or are known to have a somewhat large trip count. So we should rarely be in a situation where we've thwarted a jump table for a loop that will be less efficient as a CTR loop.

1209

Ah, now I see what you mean. Sure, computing this once and caching it the way I plan to cache KnownNoJTSwitches would certainly be an improvement there. I'll do that if there are no objections.

nemanjai updated this revision to Diff 126271.Dec 9 2017, 1:29 AM

Address comments from Eli. Thanks Eli!

  • Make caching of SwitchInsts already known not to be suitable for jump tables thread-safe
  • Add similar caching mechanism for results of mightUseCTR()
  • Fix a test case I initially forgot to fix
hans edited edge metadata.Dec 11 2017, 2:08 PM

Sorry for being slow to respond.

I can't comment on the PPC specifics, but I support allowing the target to decide whether to emit a jump table in this way.

lib/Target/PowerPC/PPCISelLowering.cpp
1146

Should the comment be with the declaration in the .h file instead?

efriedma edited edge metadata.Dec 11 2017, 2:23 PM

Generally looks fine, but probably needs a pass from a PPC reviewer.

lib/Target/PowerPC/PPCISelLowering.h
545

We generally prefer SmallPtrSet for sets of pointers. And for long-lived pointers to LLVM values, it's generally a good idea to use AssertingVH (so something like SmallPtrSet<AssertingVH<SwitchInst>>).

We share the same PPCTargetLowering for multiple functions, so you'll build up a set for the entire module in most cases. But I guess that isn't really a problem.

echristo accepted this revision.Dec 11 2017, 4:30 PM

One inline comment then LGTM.

-eric

lib/Target/PowerPC/PPCISelLowering.h
599–601

This seems randomly chosen. How'd we pick 7? Can we describe it in a comment?

This revision is now accepted and ready to land.Dec 11 2017, 4:30 PM
hfinkel added inline comments.Dec 12 2017, 10:05 PM
lib/Target/PowerPC/PPCISelLowering.cpp
1167

I feel like any time this would be true in a block terminated by a switch is a bug, and not using a jump table at all just because some case happens to jump to a loop pre-header seems very severe. Maybe we just want to raise the number of entries in the table in that case?

lib/Target/PowerPC/PPCISelLowering.h
545

Please use a SmallPtrSet here. In LLVM, we only use std::set if we need some property it offers (e.g, ordered iteration or stable addressing). Otherwise, we don't. That actually makes the semantics clearer, but only if we're consistent.

nemanjai added inline comments.Dec 13 2017, 2:31 AM
lib/Target/PowerPC/PPCISelLowering.cpp
1167

I certainly agree with your first comment. We should never be in a situation where these intrinsics are used in a loop terminated by a switch (as I've mentioned in my response to Eli above). However, I didn't find it pertinent to add a special case here to only check this if we're looking at a case successor. How about I just add an assert to that end and remove the test case that artificially adds this?

As far as the second comment, I think that the sequence of a case that jumps to a loop pre-header is certainly something we want to avoid if the loop trip count is too small to hide the latency of mtctr -> bctr -> mtctr. However, I think that requires more analysis than we want to do here. Perhaps the right approach would be to not emit a jump table if for example more than N% of the cases jump to a loop pre-header.
However, if it's OK with you, I'd prefer to keep the logic simple on this initial patch. So I'd like to opt for either the current behaviour or just removing the intrinsic checks altogether (perhaps with a FIXME in a comment). What do you think?

lib/Target/PowerPC/PPCISelLowering.h
545

Yes. I'll certainly change it to SmallPtrSet. Also, thanks for pointing out AssertingVH Eli. I was not aware of its existence.

599–601

How about if I were to add something like:

// The lwax -> mtctr sequence takes 11 cycles whereas each
// cmpli takes 2 cycles and can dispatch 4-wide. Assuming the
// worst case - last branch is taken, we can (theoretically) complete
// 6 cmpli's in 3 cycles and then the 6 branches in 8 cycles with a
// single cycle overlap for a total of 10 cycles.

Although I find that a little too detailed.

hfinkel added inline comments.Dec 13 2017, 8:02 AM
lib/Target/PowerPC/PPCISelLowering.cpp
1167

I certainly agree with your first comment. We should never be in a situation where these intrinsics are used in a loop terminated by a switch (as I've mentioned in my response to Eli above). However, I didn't find it pertinent to add a special case here to only check this if we're looking at a case successor. How about I just add an assert to that end and remove the test case that artificially adds this?

That's fine.

... a FIXME in a comment). What do you think?

Hrmm. The issue is that I think we always want to do this (i.e., for all checks here). Moreover, I think that not doing this could cause a significant regression for someone. Is doing this just a matter of also comparing against NumCases in isSuitableForJumpTable against some threshold?

I'm assuming that, based on your logic above, and if I assume that the ctctr -> bctr -> mtctr has an apx. 5 cycle penalty, then during those 5 cycles I could have done around 5 branches. If I assume, not a comparison to a linear sequence of checks, but to a binary search tree (which I believe we generate), then the break-even point would be around 2^5 = 32 cases. Thus, a threshold of ~64 cases would make sense. Does this reasoning make sense to you?

lib/Target/PowerPC/PPCISelLowering.h
545

I'm not actually sure that an AssertingVH will work here if it ends up used via TTI (where the IR is not invariant). You may need to use a value handle with a callback that removes itself from the sets upon delete (e.g., the way that the AssumptionCache / ValueMap works).

599–601

I don't. This is good. Please also say that this number was estimated for the P8 (or P9, as the case may be).

Given that the plan to make the functions in TargetLowering virtual is generally accepted, could you commit that part? This would solve a problem we're facing in some downstream code.

Given that the plan to make the functions in TargetLowering virtual is generally accepted, could you commit that part? This would solve a problem we're facing in some downstream code.

I'll split this out and get it committed.

I've finally gotten around to benchmarking this. I'll try to attach the full summary (warning: it is a large PDF file) and here is a somewhat concise summary of what I've done to benchmark this:

  1. Varied the following parameters
    • Number of cases
    • Density of each cluster (this is just the relationship between the range of values and the number of cases - i.e. 20 cases with density 10% will have values ranging from 0-200 with cases 0, 10, 20, 30, etc.)
    • Number of clusters (how many dense sections the number of cases is split into)
    • Complexity (this is just a measure of what percentage of cases are "complex" vs. "simple" - a simple case is just a return of an integer whereas a complex case is function calls)
    • The "Combination" field is composed of these parameters in order (i.e. 30.10.2.100 is 30 cases, density 10%, broken up into 2 clusters with 100% complexity)
  2. For complex cases, only direct and indirect calls were used. It might be interesting to also perform this exercise with "complex" cases being loops (perhaps CTR-eligible loops).
  3. Ran each binary built 3 times on a Power9 system and collected the number of cycles with the value being switched on being modulo the number of cases. Each run was bound to the same physical CPU that the kernel threads generally don't run on in order to reduce noise. The number of cycles was then averaged and compared to produce a "winner" (i.e. the binary that runs for fewer cycles).
  4. Everything was dumped into a CSV file

Findings
Jump tables tend to win for very specific cases whereas turning them off is a win in general.

  • Switch statements with a very low complexity tend to work better with jump tables
  • Small switch statements generally work better without jump tables
  • When jump tables win, the gap tends to be small except when the switch statement has a very low complexity

Note:
Understandably, some combinations of the parameters don't make a lot of sense. For example, with a very low number of cases, the rest of the parameters don't make much difference.

I'm ok with this, might want to wait on efriedma and hfinkel to take a look.

Could you go through the review comments and make sure you've addressed them?

Could you go through the review comments and make sure you've addressed them?

Yes, I'm sorry I am certainly planning to address those comments and post another review. I just keep getting interrupted with other things that I need to do. Sorry that I neglected to mention in my last comment that there's at least another revision coming to address the comments from Hal, Hans and yourself.

No rush. :) Just wanted to make sure we're on the same page.

nemanjai updated this revision to Diff 132158.Jan 31 2018, 6:05 AM

Benchmarking has shown that we should be concerned with any type of call in the case statements equally (i.e. indirect calls are no worse than direct). Furthermore, large switch statements can be lowered to jump tables without much penalty even if there are calls.

As a result, the algorithm is now:

  • Always lower large switches to jump tables (i.e. switches over 64 cases)
  • If there are any calls in a smaller switch, don't lower it to a jump table

The comments should be addressed now.

nemanjai added inline comments.Jan 31 2018, 6:07 AM
lib/Target/PowerPC/PPCISelLowering.h
20

This isn't used currently but I was asked to use it. In any case, either this or #include <set> will be removed depending on which one I end up using.

566

I wasn't able to make SmallPtrSet work for my wrapper to CallbackVH. I think I would need to specialize pointer type traits in order to do this - if someone can point me to information about this, I'd appreciate it.

efriedma added inline comments.Jan 31 2018, 11:18 AM
lib/Target/PowerPC/PPCISelLowering.h
566

Your CallbackVH isn't a pointer, so SmallPtrSet won't work. You could probably use DenseSet, though. (See http://llvm.org/docs/ProgrammersManual.html#set-like-containers-std-set-smallset-setvector-etc )

Although, I'm sort of surprised you're running into issues with values getting deleted; we should only be calling this during isel.

nemanjai added inline comments.Feb 2 2018, 6:10 AM
lib/Target/PowerPC/PPCISelLowering.h
566

OK, thanks. I'll try to switch to DenseSet.

This is called from TTI during InlineCost calculations.

nemanjai updated this revision to Diff 134494.Feb 15 2018, 12:36 PM

Updated to use llvm::SmallSet instead of std::set.

@hfinkel Hi Hal, are you OK with this patch now?

nemanjai abandoned this revision.Jun 7 2022, 7:26 AM

This was never committed because it turned out not to be beneficial. Abandoning.

Herald added a project: Restricted Project. · View Herald TranscriptJun 7 2022, 7:26 AM