This is an archive of the discontinued LLVM Phabricator instance.

Differential D41029

[JumpTables][PowerPC] Let targets decide which switch instructions are suitable for jump tables
Abandoned, Public

Authored by nemanjai on Dec 8 2017, 12:27 PM.

Details

Reviewers

echristo
hfinkel
bogner
kbarton
bkramer
hans
efriedma

Summary

As currently implemented, the decision whether a particular switch instruction is suitable for a jump table is made in the base class of TargetLowering. So the only control a target has is to allow or disallow emission of jump tables.
However, there is at least one target (PowerPC) and possibly others for which that decision depends on a bit more context than just the range and density.

This patch allows targets to override these functions and provides the PowerPC overrides.

The PPC specific portion:

No jump tables for small switches - there's a lot of overhead to set up a jump table due to use of the CTR and high-latency sign-extending load
Avoid chaining moves to the CTR and branches on the CTR (when the target of a case has an indirect call, a CTR loop or another possible jump table)
Avoid multiple CTR branches close together to avoid the possibility of branch aliasing (when the block terminated by the switch has an indirect call or explicit use of the CTR)
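
The hook described in the summary can be pictured with a small standalone model. This is only a sketch under stated assumptions: `BaseLowering`, `PPCLikeLowering`, the 10% density figure and the 5-case minimum are illustrative stand-ins, not the LLVM classes or the exact values used by the patch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the target-independent lowering class.
struct BaseLowering {
  virtual ~BaseLowering() = default;
  // Generic check: only the range and density of the cases are considered.
  virtual bool isSuitableForJumpTable(uint64_t NumCases, uint64_t Range) const {
    const uint64_t MinDensityPercent = 10; // assumed value, for illustration
    return NumCases * 100 >= Range * MinDensityPercent;
  }
};

// Hypothetical target that layers its own context on top of the generic check.
struct PPCLikeLowering : BaseLowering {
  bool isSuitableForJumpTable(uint64_t NumCases, uint64_t Range) const override {
    // Early exit when the generic check already rejects the switch.
    if (!BaseLowering::isSuitableForJumpTable(NumCases, Range))
      return false;
    // Target-specific veto: small switches do not repay the set-up cost.
    const uint64_t MinCases = 5; // assumed threshold, for illustration
    return NumCases >= MinCases;
  }
};
```

Making the base function virtual is what lets the derived class veto a switch that the generic density test would accept, while still deferring to that test first.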

Diff Detail

Repository: rL LLVM

Event Timeline

nemanjai created this revision. Dec 8 2017, 12:27 PM

Early exit if the target independent version determines the switch isn't suitable for a jump table anyway.

efriedma added a subscriber: efriedma. Dec 8 2017, 12:55 PM

efriedma added inline comments.

lib/Target/PowerPC/PPCISelLowering.cpp
1183	Not thread-safe. (We support using multiple LLVMContexts at the same time in a process.)
1209	This is worst-case quadratic (N switches times M instructions in the successor); you might need to cache the mightUseCTR computation.

nemanjai added inline comments. Dec 8 2017, 1:45 PM

lib/Target/PowerPC/PPCISelLowering.cpp
1183	Ah, OK. I'll make it a data member of the class.
1209	Well, this is linear in the sum of instructions in successors of all the cases in the switch, because no block is visited twice. If a CTR use is found in a block, we return, so we definitely won't revisit it. If no CTR use is found in a block, it is not visited again because we skip it in the `if` block above. I don't think I can make it any faster than that since I have to check each instruction for using the CTR. Or did you have something else in mind? Perhaps you mean across multiple invocations of `isSuitableForJumpTable()` on the same `SwitchInst`?

efriedma added inline comments. Dec 8 2017, 2:27 PM

lib/Target/PowerPC/PPCISelLowering.cpp
1156	These intrinsics come out of PPCCTRLoops, right? Would it make sense to avoid loop transform rather than disabling the jump table transform?
1209	I mean across multiple invocations of isSuitableForJumpTable for different switches (assuming we don't break the critical edge, which I don't think we do unconditionally).

Eli, thanks so much for your comments. Please keep them coming. I'm not trying to be difficult about any of this - I just want to make sure the final implementation does the right thing both for compile time and run time.
I probably should have noted in the description that this provides significant run time improvements on some important workloads.

lib/Target/PowerPC/PPCISelLowering.cpp
1156	That is correct, that should be the only way to end up with these intrinsics (other than my artificial test case below). We currently won't emit them if the loop contains a switch that might be lowered to a jump table. However, we don't check for whether the loop is within a switch. I think we want to favour loops that use the CTR because they are quite efficient. There's a single move to the CTR, and the condition for the back edge is generally a well-predicted branch taken if the CTR is non-zero. Furthermore, we only transform loops that either have an unknown trip count or are known to have a somewhat large trip count. So we should rarely be in a situation where we've thwarted a jump table for a loop that will be less efficient as a CTR loop.
1209	Ah, now I see what you mean. Sure, computing this once and caching it the way I plan to cache `KnownNoJTSwitches` would certainly be an improvement there. I'll do that if there are no objections.
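
The caching discussed for line 1209 can be sketched outside LLVM. The model below is an illustration only: `Block`, `CTRUseCache` and the per-instruction call flag are hypothetical stand-ins for `BasicBlock` and the patch's sets, and the sketch does not handle invalidation when a block is deleted or rewritten.

```cpp
#include <cassert>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a basic block: one flag per instruction,
// true when that instruction is a call.
struct Block {
  std::vector<bool> IsCall;
};

class CTRUseCache {
  std::unordered_map<const Block *, bool> Known; // memoized answers per block
  unsigned Scans = 0;                            // number of full block scans

public:
  bool mightUseCTR(const Block *B) {
    auto It = Known.find(B);
    if (It != Known.end())
      return It->second; // answered from the cache, no rescan
    ++Scans;
    bool Uses = false;
    for (bool Call : B->IsCall) {
      if (Call) {
        Uses = true;
        break;
      }
    }
    Known[B] = Uses;
    return Uses;
  }

  unsigned scans() const { return Scans; }
};
```

With the cache shared across queries, a block that is a successor of several switches is scanned once, which addresses the N switches times M instructions concern.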

nemanjai added a reviewer: efriedma. Dec 8 2017, 11:38 PM

Address comments from Eli. Thanks Eli!

Make caching of SwitchInsts already known not to be suitable for jump tables thread-safe
Add similar caching mechanism for results of mightUseCTR()
Fix a test case I initially forgot to fix

Sorry for being slow to respond.

I can't comment on the PPC specifics, but I support allowing the target to decide whether to emit a jump table in this way.

lib/Target/PowerPC/PPCISelLowering.cpp
1146	Should the comment be with the declaration in the .h file instead?

Generally looks fine, but probably needs a pass from a PPC reviewer.

lib/Target/PowerPC/PPCISelLowering.h
545	We generally prefer SmallPtrSet for sets of pointers. And for long-lived pointers to LLVM values, it's generally a good idea to use AssertingVH (so something like `SmallPtrSet<AssertingVH<SwitchInst>>`). We share the same PPCTargetLowering for multiple functions, so you'll build up a set for the entire module in most cases. But I guess that isn't really a problem.

One inline comment then LGTM.

-eric

lib/Target/PowerPC/PPCISelLowering.h
599–601	This seems randomly chosen. How'd we pick 7? Can we describe it in a comment?

This revision is now accepted and ready to land. Dec 11 2017, 4:30 PM

hfinkel added inline comments. Dec 12 2017, 10:05 PM

lib/Target/PowerPC/PPCISelLowering.cpp
1167	I feel like any time this would be true in a block terminated by a switch is a bug, and not using a jump table at all just because some case happens to jump to a loop pre-header seems very severe. Maybe we just want to raise the number of entries in the table in that case?
lib/Target/PowerPC/PPCISelLowering.h
545	Please use a SmallPtrSet here. In LLVM, we only use std::set if we need some property it offers (e.g, ordered iteration or stable addressing). Otherwise, we don't. That actually makes the semantics clearer, but only if we're consistent.

nemanjai added inline comments. Dec 13 2017, 2:31 AM

lib/Target/PowerPC/PPCISelLowering.cpp
1167	I certainly agree with your first comment. We should never be in a situation where these intrinsics are used in a loop terminated by a switch (as I've mentioned in my response to Eli above). However, I didn't find it pertinent to add a special case here to only check this if we're looking at a case successor. How about I just add an assert to that end and remove the test case that artificially adds this? As far as the second comment, I think that the sequence of a case that jumps to a loop pre-header is certainly something we want to avoid if the loop trip count is too small to hide the latency of `mtctr -> bctr -> mtctr`. However, I think that requires more analysis than we want to do here. Perhaps the right approach would be to not emit a jump table if for example more than N% of the cases jump to a loop pre-header. However, if it's OK with you, I'd prefer to keep the logic simple on this initial patch. So I'd like to opt for either the current behaviour or just removing the intrinsic checks altogether (perhaps with a FIXME in a comment). What do you think?
lib/Target/PowerPC/PPCISelLowering.h
545	Yes. I'll certainly change it to `SmallPtrSet`. Also, thanks for pointing out `AssertingVH` Eli. I was not aware of its existence.
599–601	How about if I were to add something like:
// The lwax -> mtctr sequence takes 11 cycles whereas each
// cmpli takes 2 cycles and can dispatch 4-wide. Assuming the
// worst case - last branch is taken, we can (theoretically) complete
// 6 cmpli's in 3 cycles and then the 6 branches in 8 cycles with a
// single cycle overlap for a total of 10 cycles.
Although I find that a little too detailed.

hfinkel added inline comments. Dec 13 2017, 8:02 AM

lib/Target/PowerPC/PPCISelLowering.cpp
1167	> I certainly agree with your first comment. We should never be in a situation where these intrinsics are used in a loop terminated by a switch (as I've mentioned in my response to Eli above). However, I didn't find it pertinent to add a special case here to only check this if we're looking at a case successor. How about I just add an assert to that end and remove the test case that artificially adds this?
That's fine.
> ... a FIXME in a comment). What do you think?
Hrmm. The issue is that I think we always want to do this (i.e., for all checks here). Moreover, I think that not doing this could cause a significant regression for someone. Is doing this just a matter of also comparing NumCases in isSuitableForJumpTable against some threshold? I'm assuming that, based on your logic above, and if I assume that the `mtctr -> bctr -> mtctr` sequence has an approx. 5 cycle penalty, then during those 5 cycles I could have done around 5 branches. If I assume, not a comparison to a linear sequence of checks, but to a binary search tree (which I believe we generate), then the break-even point would be around 2^5 = 32 cases. Thus, a threshold of ~64 cases would make sense. Does this reasoning make sense to you?
lib/Target/PowerPC/PPCISelLowering.h
545	I'm not actually sure that an AssertingVH will work here if it ends up used via TTI (where the IR is not invariant). You may need to use a value handle with a callback that removes itself from the sets upon delete (e.g., the way that the AssumptionCache / ValueMap works).
599–601	I don't. This is good. Please also say that this number was estimated for the `P8` (or `P9`, as the case may be).
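
The break-even estimate in the comment on line 1167 can be written out as a short calculation. This sketch assumes, as the comment does, roughly one branch per cycle and a balanced binary search over the case values; the function names are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Number of cases a balanced binary search can tell apart using `Depth`
// compare-and-branch steps.
constexpr uint64_t casesResolvedAtDepth(unsigned Depth) {
  return uint64_t{1} << Depth;
}

// Break-even switch size: the number of cases a branch tree can resolve in
// the cycles the CTR sequence would otherwise cost, at one branch per cycle.
constexpr uint64_t breakEvenCases(unsigned PenaltyCycles) {
  return casesResolvedAtDepth(PenaltyCycles);
}

static_assert(breakEvenCases(5) == 32, "5-cycle penalty gives 2^5 cases");
```

Doubling the 32-case break-even point gives the ~64 case threshold suggested in the comment.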

Given that the plan to make the functions in TargetLowering virtual is generally accepted, could you commit that part? This would solve a problem we're facing in some downstream code.

In D41029#960780, @kparzysz wrote:

Given that the plan to make the functions in TargetLowering virtual is generally accepted, could you commit that part? This would solve a problem we're facing in some downstream code.

I'll split this out and get it committed.

I've finally gotten around to benchmarking this. I'll try to attach the full summary (warning: it is a large PDF file) and here is a somewhat concise summary of what I've done to benchmark this:

Varied the following parameters
- Number of cases
- Density of each cluster (this is just the relationship between the range of values and the number of cases - i.e. 20 cases with density 10% will have values ranging from 0-200 with cases 0, 10, 20, 30, etc.)
- Number of clusters (how many dense sections the number of cases is split into)
- Complexity (this is just a measure of what percentage of cases are "complex" vs. "simple" - a simple case is just a return of an integer whereas a complex case is function calls)
- The "Combination" field is composed of these parameters in order (i.e. 30.10.2.100 is 30 cases, density 10%, broken up into 2 clusters with 100% complexity)
For complex cases, only direct and indirect calls were used. It might be interesting to also perform this exercise with "complex" cases being loops (perhaps CTR-eligible loops).
Ran each built binary 3 times on a Power9 system and collected the number of cycles, with the value being switched on taken modulo the number of cases. Each run was bound to the same physical CPU, one that kernel threads generally don't run on, in order to reduce noise. The cycle counts were then averaged and compared to produce a "winner" (i.e. the binary that runs for fewer cycles).
Everything was dumped into a CSV file
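
The parameter encoding above can be illustrated with a small sketch. It is not the benchmark generator itself: `Combination`, `parseCombination` and `caseValues` are hypothetical names, and only a single cluster is modelled because the spacing between clusters is not described.

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#include <vector>

// The four fields of a "Combination" string such as "30.10.2.100".
struct Combination {
  unsigned NumCases, DensityPercent, NumClusters, ComplexityPercent;
};

// Parses "cases.density.clusters.complexity"; returns false on malformed input.
bool parseCombination(const std::string &S, Combination &C) {
  char Extra; // catches trailing characters after the fourth field
  return std::sscanf(S.c_str(), "%u.%u.%u.%u%c", &C.NumCases,
                     &C.DensityPercent, &C.NumClusters, &C.ComplexityPercent,
                     &Extra) == 4;
}

// Case values for one cluster: a density of 10% means every 10th value.
std::vector<unsigned> caseValues(unsigned NumCases, unsigned DensityPercent) {
  std::vector<unsigned> Values;
  if (DensityPercent == 0 || DensityPercent > 100)
    return Values; // no sensible spacing for this density
  unsigned Step = 100 / DensityPercent;
  for (unsigned I = 0; I < NumCases; ++I)
    Values.push_back(I * Step);
  return Values;
}
```

For 20 cases at 10% density this yields 0, 10, 20, and so on, matching the example given in the density bullet.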

Findings
Jump tables tend to win for very specific cases whereas turning them off is a win in general.

Switch statements with a very low complexity tend to work better with jump tables
Small switch statements generally work better without jump tables
When jump tables win, the gap tends to be small except when the switch statement has a very low complexity

Note:
Understandably, some combinations of the parameters don't make a lot of sense. For example, with a very low number of cases, the rest of the parameters don't make much difference.

JumpTablesBenchmark.pdf (2 MB)

I'm ok with this, might want to wait on efriedma and hfinkel to take a look.

Could you go through the review comments and make sure you've addressed them?

In D41029#988493, @efriedma wrote:

Could you go through the review comments and make sure you've addressed them?

Yes, I'm sorry. I am certainly planning to address those comments and post another review; I just keep getting interrupted with other things that I need to do. Sorry that I neglected to mention in my last comment that there's at least another revision coming to address the comments from Hal, Hans and yourself.

No rush. :) Just wanted to make sure we're on the same page.

Benchmarking has shown that we should be concerned with any type of call in the case statements equally (i.e. indirect calls are no worse than direct). Furthermore, large switch statements can be lowered to jump tables without much penalty even if there are calls.

As a result, the algorithm is now:

Always lower large switches to jump tables (i.e. switches over 64 cases)
If there are any calls in a smaller switch, don't lower it to a jump table
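
The two rules can be condensed into a standalone predicate. This is a sketch under assumptions: `wantJumpTable` is a hypothetical name, the generic range and density check is taken to have passed already, and the thresholds mirror the values in the latest diff (64 cases compared with `>=`, and a minimum of 5 entries).

```cpp
#include <cassert>
#include <cstdint>

// Thresholds as they appear in the latest diff of this review.
constexpr uint64_t LargeSwitchCases = 64;   // large switches always qualify
constexpr uint64_t MinJumpTableEntries = 5; // below this, never use a table

// Assumes the target-independent range/density check has already passed.
bool wantJumpTable(uint64_t NumCases, bool AnyCaseHasCall) {
  if (NumCases < MinJumpTableEntries)
    return false; // too small to repay the jump table set-up cost
  if (NumCases >= LargeSwitchCases)
    return true; // large switches tolerate calls in their cases
  return !AnyCaseHasCall; // smaller switches: any call vetoes the table
}
```

The order of the tests matters: the large-switch rule is checked before the call veto so that calls only disqualify the smaller switches.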

The comments should be addressed now.

nemanjai added inline comments. Jan 31 2018, 6:07 AM

lib/Target/PowerPC/PPCISelLowering.h
20	This isn't used currently but I was asked to use it. In any case, either this or `#include <set>` will be removed depending on which one I end up using.
566	I wasn't able to make `SmallPtrSet` work for my wrapper to `CallbackVH`. I think I would need to specialize pointer type traits in order to do this - if someone can point me to information about this, I'd appreciate it.

efriedma added inline comments. Jan 31 2018, 11:18 AM

lib/Target/PowerPC/PPCISelLowering.h
566	Your CallbackVH isn't a pointer, so SmallPtrSet won't work. You could probably use DenseSet, though. (See http://llvm.org/docs/ProgrammersManual.html#set-like-containers-std-set-smallset-setvector-etc ) Although, I'm sort of surprised you're running into issues with values getting deleted; we should only be calling this during isel.

nemanjai added inline comments. Feb 2 2018, 6:10 AM

lib/Target/PowerPC/PPCISelLowering.h
566	OK, thanks. I'll try to switch to DenseSet. This is called from TTI during `InlineCost` calculations.

Updated to use llvm::SmallSet instead of std::set.

@hfinkel Hi Hal, are you OK with this patch now?

This was never committed because it turned out not to be beneficial. Abandoning.

Herald added a project: Restricted Project. Jun 7 2022, 7:26 AM

Herald added subscribers: steven.zhang, shchenz.

Revision Contents

Path	Size
lib/Target/PowerPC/PPCISelLowering.h	52 lines
lib/Target/PowerPC/PPCISelLowering.cpp	93 lines
test/CodeGen/PowerPC/NoJTsDueToCTRUse.ll	258 lines
test/CodeGen/PowerPC/mcm-5.ll	4 lines

Diff 134494

lib/Target/PowerPC/PPCISelLowering.h

[11 lines not shown]
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#ifndef LLVM_LIB_TARGET_POWERPC_PPCISELLOWERING_H		#ifndef LLVM_LIB_TARGET_POWERPC_PPCISELLOWERING_H
#define LLVM_LIB_TARGET_POWERPC_PPCISELLOWERING_H		#define LLVM_LIB_TARGET_POWERPC_PPCISELLOWERING_H

#include "PPC.h"		#include "PPC.h"
#include "PPCInstrInfo.h"		#include "PPCInstrInfo.h"
		#include "llvm/ADT/SmallSet.h"
#include "llvm/CodeGen/CallingConvLower.h"		#include "llvm/CodeGen/CallingConvLower.h"
#include "llvm/CodeGen/MachineFunction.h"		#include "llvm/CodeGen/MachineFunction.h"
#include "llvm/CodeGen/MachineMemOperand.h"		#include "llvm/CodeGen/MachineMemOperand.h"
#include "llvm/CodeGen/MachineValueType.h"		#include "llvm/CodeGen/MachineValueType.h"
#include "llvm/CodeGen/SelectionDAG.h"		#include "llvm/CodeGen/SelectionDAG.h"
#include "llvm/CodeGen/SelectionDAGNodes.h"		#include "llvm/CodeGen/SelectionDAGNodes.h"
#include "llvm/CodeGen/TargetLowering.h"		#include "llvm/CodeGen/TargetLowering.h"
#include "llvm/CodeGen/ValueTypes.h"		#include "llvm/CodeGen/ValueTypes.h"
[507 lines not shown]	namespace PPC {
/// amount, otherwise return -1.		/// amount, otherwise return -1.
int isQVALIGNIShuffleMask(SDNode *N);		int isQVALIGNIShuffleMask(SDNode *N);

} // end namespace PPC		} // end namespace PPC

class PPCTargetLowering : public TargetLowering {		class PPCTargetLowering : public TargetLowering {
const PPCSubtarget &Subtarget;		const PPCSubtarget &Subtarget;

		struct JTCallBack final : public CallbackVH {
		JTCallBack(const Value *V, const PPCTargetLowering *TL = nullptr)
		: CallbackVH(const_cast<Value*>(V)), PPCTLI(TL) { }
		// If we RAUW, we don't know anything about the new value, so just
		// erase it from all the sets.
		void allUsesReplacedWith(Value *V) override {
		deleted();
		}
		void deleted() override {
		if (PPCTLI) {
		PPCTLI->KnownNoJTSwitches.erase(*this);
		PPCTLI->BlocksKnownToUseCTR.erase(*this);
		PPCTLI->BlocksKnownNotToUseCTR.erase(*this);
		}
		}
		private:
		const PPCTargetLowering *PPCTLI;
		};
		friend struct JTCallBack;

		mutable SmallSet<JTCallBack, 16> KnownNoJTSwitches;
		mutable SmallSet<JTCallBack, 16> BlocksKnownToUseCTR;
		mutable SmallSet<JTCallBack, 16> BlocksKnownNotToUseCTR;
		// The threshold for the number of cases a switch should have for jump
		// tables to be profitable regardless of whether any cases have calls.
		static unsigned NumCasesForJTWithCalls;
public:		public:
explicit PPCTargetLowering(const PPCTargetMachine &TM,		explicit PPCTargetLowering(const PPCTargetMachine &TM,
const PPCSubtarget &STI);		const PPCSubtarget &STI);

/// getTargetNodeName() - This method returns the name of a target specific		/// getTargetNodeName() - This method returns the name of a target specific
/// DAG node.		/// DAG node.
const char *getTargetNodeName(unsigned Opcode) const override;		const char *getTargetNodeName(unsigned Opcode) const override;

[9 lines not shown]	TargetLoweringBase::LegalizeTypeAction getPreferredVectorAction(EVT VT)
const override {		const override {
if (VT.getScalarSizeInBits() % 8 == 0)		if (VT.getScalarSizeInBits() % 8 == 0)
return TypeWidenVector;		return TypeWidenVector;
return TargetLoweringBase::getPreferredVectorAction(VT);		return TargetLoweringBase::getPreferredVectorAction(VT);
}		}

bool useSoftFloat() const override;		bool useSoftFloat() const override;

		// A jump table is emitted as lwax -> mtctr -> bctr. All three of those
		// are high latency instructions, so we don't want to emit jump tables for
		// switches with fewer than 5 cases as the jump table overhead will outweigh
		// the benefit. Please note that a jump table will only be emitted for a
		// switch with so few cases if there are no calls or loop headers in any
		// of the case target blocks.
		// The lwax -> mtctr sequence takes 11 cycles whereas each
		// cmpli takes 2 cycles and can dispatch 4-wide. Assuming the
		// worst case - last branch is taken and linear comparisons, we can
		// (theoretically) complete 6 cmpli's in 3 cycles and then the 6 branches
		// in 8 cycles with a single cycle overlap for a total of 10 cycles.
		// However, in practice (as confirmed by benchmarking on a Power9 machine)
		// a jump table can be the better choice for switches as small as 5 cases
		// as long as there are no calls in the target blocks.
		unsigned getMinimumJumpTableEntries() const override {
		return 5;
		}
		/// Does the basic block use the count register (CTR)? This will happen if
		/// the block has a call or is terminated by a switch that might be lowered
		/// to a jump table.
		bool mightUseCTR(const BasicBlock *BB) const;

		bool isSuitableForJumpTable(const SwitchInst *SI, uint64_t NumCases,
		uint64_t Range) const override;

MVT getScalarShiftAmountTy(const DataLayout &, EVT) const override {		MVT getScalarShiftAmountTy(const DataLayout &, EVT) const override {
return MVT::i32;		return MVT::i32;
}		}

bool isCheapToSpeculateCttz() const override {		bool isCheapToSpeculateCttz() const override {
return true;		return true;
}		}

[562 lines not shown]

lib/Target/PowerPC/PPCISelLowering.cpp

[107 lines not shown]
static cl::opt<bool> DisablePPCUnaligned("disable-ppc-unaligned",		static cl::opt<bool> DisablePPCUnaligned("disable-ppc-unaligned",
cl::desc("disable unaligned load/store generation on PPC"), cl::Hidden);		cl::desc("disable unaligned load/store generation on PPC"), cl::Hidden);

static cl::opt<bool> DisableSCO("disable-ppc-sco",		static cl::opt<bool> DisableSCO("disable-ppc-sco",
cl::desc("disable sibling call optimization on ppc"), cl::Hidden);		cl::desc("disable sibling call optimization on ppc"), cl::Hidden);

STATISTIC(NumTailCalls, "Number of tail calls");		STATISTIC(NumTailCalls, "Number of tail calls");
STATISTIC(NumSiblingCalls, "Number of sibling calls");		STATISTIC(NumSiblingCalls, "Number of sibling calls");
		STATISTIC(NumCallsToIsSuitableForJT,
		"Number of times a switch is checked for JT eligibility");
		STATISTIC(NumSkippedAlreadyExaminedSwitches,
		"Number of switches previously proven not eligible for JT");
		STATISTIC(NumSkippedAlreadyVisitedCases,
		"Number of cases not examined for CTR use due to previous visits");
		STATISTIC(NumSwitchesNotSuitableForJT,
		"Number of switch instructions not suitable for jump tables");

static bool isNByteElemShuffleMask(ShuffleVectorSDNode *, unsigned, int);		static bool isNByteElemShuffleMask(ShuffleVectorSDNode *, unsigned, int);

// FIXME: Remove this once the bug has been fixed!		// FIXME: Remove this once the bug has been fixed!
extern cl::opt<bool> ANDIGlueBug;		extern cl::opt<bool> ANDIGlueBug;

		unsigned PPCTargetLowering::NumCasesForJTWithCalls = 64;

PPCTargetLowering::PPCTargetLowering(const PPCTargetMachine &TM,		PPCTargetLowering::PPCTargetLowering(const PPCTargetMachine &TM,
const PPCSubtarget &STI)		const PPCSubtarget &STI)
: TargetLowering(TM), Subtarget(STI) {		: TargetLowering(TM), Subtarget(STI) {
// Use _setjmp/_longjmp instead of setjmp/longjmp.		// Use _setjmp/_longjmp instead of setjmp/longjmp.
setUseUnderscoreSetJmp(true);		setUseUnderscoreSetJmp(true);
setUseUnderscoreLongJmp(true);		setUseUnderscoreLongJmp(true);

// On PPC32/64, arguments smaller than 4/8 bytes are extended, so all		// On PPC32/64, arguments smaller than 4/8 bytes are extended, so all
[995 lines not shown]	if (Subtarget.hasAltivec() || Subtarget.hasQPX())
getMaxByValAlign(Ty, Align, Subtarget.hasQPX() ? 32 : 16);		getMaxByValAlign(Ty, Align, Subtarget.hasQPX() ? 32 : 16);
return Align;		return Align;
}		}

bool PPCTargetLowering::useSoftFloat() const {		bool PPCTargetLowering::useSoftFloat() const {
return Subtarget.useSoftFloat();		return Subtarget.useSoftFloat();
}		}

		bool PPCTargetLowering::mightUseCTR(const BasicBlock *BB) const {
		// Skip blocks known to use the CTR.
		if (BlocksKnownToUseCTR.count(BB)) {
		NumSkippedAlreadyVisitedCases++;
		return true;
		}
		if (BlocksKnownNotToUseCTR.count(BB)) {
		NumSkippedAlreadyVisitedCases++;
		return false;
		}
		for (auto &Inst : *BB) {
		// Benchmarking shows that there really isn't any difference between
		// direct and indirect calls when it comes to jump table profitability.
		if (isa<CallInst>(&Inst)) {
		BlocksKnownToUseCTR.insert(JTCallBack(BB, this));
		return true;
		} else if (const SwitchInst *SI = dyn_cast<SwitchInst>(&Inst)) {
		// Does the block have a switch instruction large enough to actually
		// be lowered to a jump table? If so, conservatively assume that one
		// will be lowered to a jump table.
		if ((SI->getNumCases() >= getMinimumJumpTableEntries())) {
		BlocksKnownToUseCTR.insert(JTCallBack(BB, this));
		return true;
		}
		}
		}
		BlocksKnownNotToUseCTR.insert(JTCallBack(BB, this));
		return false;
		}

		// This function is called a huge number of times so we try to limit the number
		// of traversals of all basic blocks. Just as an example, a bootstrap build
		// calls this function 11,822,805 times on 7554 unique SwitchInst instances. We
		// conservatively re-check any SwitchInst's that were previously suitable for a
		// jump table to ensure that some transformation hasn't made it unsuitable.
		bool PPCTargetLowering::isSuitableForJumpTable(const SwitchInst *SI,
		uint64_t NumCases,
		uint64_t Range) const {
		NumCallsToIsSuitableForJT++;
		if (!TargetLoweringBase::isSuitableForJumpTable(SI, NumCases, Range))
		return false;
		efriedma: Not thread-safe. (We support using multiple LLVMContexts at the same time in a process.)
		nemanjai (Author): Ah, OK. I'll make it a data member of the class.
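		Editor's sketch of the fix proposed in the thread above: keep the bookkeeping in the object instead of in shared static state. This is not the patch's code. `LoweringSketch` and its members are hypothetical, and the sketch assumes each compilation thread works with its own lowering object.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

// Hypothetical stand-in for a target lowering object. All state lives in the
// instance, so two instances never share (or race on) the same containers.
class LoweringSketch {
  // `mutable` because the query is logically const; the set and the counter
  // are bookkeeping only.
  mutable std::set<const void *> KnownNoJT;
  mutable uint64_t NumQueries = 0;

public:
  // Returns true if this switch was already recorded as unsuitable.
  bool isKnownUnsuitable(const void *Switch) const {
    ++NumQueries;
    return KnownNoJT.count(Switch) != 0;
  }

  // Records a switch that should not be lowered to a jump table.
  void markUnsuitable(const void *Switch) const { KnownNoJT.insert(Switch); }

  uint64_t numQueries() const { return NumQueries; }
};
```

		Marking a switch on one `LoweringSketch` leaves every other instance untouched. Per-instance state does not make a single object safe to share between threads; it only removes the process-wide sharing that the comment objects to.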

		// Jump tables are fine with very large switch statements.
		bool LargeSwitch = SI->getNumCases() >= NumCasesForJTWithCalls;
		if (LargeSwitch) {
		BlocksKnownToUseCTR.insert(JTCallBack(SI->getParent(), this));
		BlocksKnownNotToUseCTR.erase(SI->getParent());
		return true;
		}

		// If we already know that we don't want to lower this SwitchInst to
		// a jump table, no need to check further.
		if (KnownNoJTSwitches.count(SI)) {
		NumSkippedAlreadyExaminedSwitches++;
		return false;
		}

		// At this point we know we're not looking at a huge switch, so a CTR
		// use in any of the cases makes this SI not suitable for a jump table.
		// FIXME: a switch shouldn't use a jump table only if a case that is
		// actually visited has a use of the CTR. Perhaps branch probability
		// should be considered here in the future (i.e. if the branch probability
		// is way less than 1 / NumCases).
		for (auto It : SI->cases()) {
		if (mightUseCTR(It.getCaseSuccessor())) {
		NumSwitchesNotSuitableForJT++;
		KnownNoJTSwitches.insert(JTCallBack(SI, this));
		efriedma: This is worst-case quadratic (N switches times M instructions in the successor); you might need to cache the mightUseCTR computation.
		nemanjai (Author): Well, this is linear in the sum of instructions in the successors of all the cases in the switch, since no block is visited twice. If a CTR use is found in a block, we return, so we definitely won't revisit it. If no CTR use is found in a block, it is not visited again because we skip it in the `if` block above. I don't think I can make it any faster than that since I have to check each instruction for using the CTR. Or did you have something else in mind? Perhaps you mean across multiple invocations of `isSuitableForJumpTable()` on the same `SwitchInst`?
		efriedma: I mean across multiple invocations of isSuitableForJumpTable for different switches (assuming we don't break the critical edge, which I don't think we do unconditionally).
		nemanjai (Author): Ah, now I see what you mean. Sure, computing this once and caching it the way I plan to cache `KnownNoJTSwitches` would certainly be an improvement there. I'll do that if there are no objections.
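		Editor's sketch of the caching agreed on in the thread above, so that a successor block shared by several switches is scanned only once. This is not the patch's implementation. `CTRUseCache`, `BlockId`, and the scan callback are hypothetical; the patch itself works with basic block pointers and callback handles.

```cpp
#include <cassert>
#include <functional>
#include <unordered_map>
#include <utility>

// Hypothetical block handle used only for this illustration.
using BlockId = int;

// Memoizes the per-block "might use the CTR" answer across queries.
class CTRUseCache {
  std::unordered_map<BlockId, bool> Known;
  std::function<bool(BlockId)> Scan; // the expensive instruction walk

public:
  explicit CTRUseCache(std::function<bool(BlockId)> ScanFn)
      : Scan(std::move(ScanFn)) {}

  // Returns the cached answer if there is one; otherwise scans the block
  // once and remembers the result.
  bool mightUseCTR(BlockId BB) {
    auto It = Known.find(BB);
    if (It != Known.end())
      return It->second;
    bool Result = Scan(BB);
    Known.emplace(BB, Result);
    return Result;
  }

  // Must be called when a block is modified or deleted, otherwise the cached
  // answer may be stale.
  void invalidate(BlockId BB) { Known.erase(BB); }
};
```

		With this shape, N switches that share a successor cost one scan plus N - 1 lookups. The hard part in a real pass is invalidation: any transformation that adds a call or loop to a block has to drop its entry.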
		DEBUG(dbgs() << "In function: " <<
		SI->getParent()->getParent()->getName());
		DEBUG(dbgs() << "\nSwitch instruction:\n");
		DEBUG(SI->dump());
		DEBUG(dbgs() << "Not suitable for a jump table due to possible CTR "
		"use in successor:\n");
		DEBUG(It.getCaseSuccessor()->dump());
		return false;
		}
		}
		// The containing block now uses the CTR.
		BlocksKnownToUseCTR.insert(JTCallBack(SI->getParent(), this));
		BlocksKnownNotToUseCTR.erase(SI->getParent());
		return true;
		}

		const char *PPCTargetLowering::getTargetNodeName(unsigned Opcode) const {
		switch ((PPCISD::NodeType)Opcode) {
		case PPCISD::FIRST_NUMBER: break;
		case PPCISD::FSEL: return "PPCISD::FSEL";
		case PPCISD::FCFID: return "PPCISD::FCFID";
		case PPCISD::FCFIDU: return "PPCISD::FCFIDU";
		case PPCISD::FCFIDS: return "PPCISD::FCFIDS";
		case PPCISD::FCFIDUS: return "PPCISD::FCFIDUS";

test/CodeGen/PowerPC/NoJTsDueToCTRUse.ll

				; RUN: llc -verify-machineinstrs -debug-only=ppc-lowering < %s > /dev/null 2>%t && FileCheck <%t %s
				; REQUIRES: asserts
				declare void @llvm.ppc.mtctr.i64(i64)

				define fastcc void @case2() {
				; CHECK: In function: case2
				; CHECK: Switch instruction:
				; CHECK: switch i32 %0, label %sw.default [
				; CHECK: Not suitable for a jump table due to possible CTR use in successor:
				; CHECK: sw.bb20:
				entry:
				switch i16 undef, label %sw.default [
				i16 54, label %sw.bb
				i16 55, label %sw.bb3
				i16 57, label %sw.bb14
				i16 68, label %sw.bb20
				i16 67, label %sw.bb27
				i16 58, label %sw.bb36
				i16 134, label %sw.bb45
				i16 61, label %sw.bb54
				i16 63, label %sw.bb82
				i16 62, label %while.body.i486
				i16 69, label %while.body.i499
				i16 59, label %while.body.i512
				i16 66, label %sw.bb102
				]

				sw.bb: ; preds = %entry
				unreachable

				sw.bb3: ; preds = %entry
				unreachable

				sw.bb14: ; preds = %entry
				unreachable

				sw.bb20: ; preds = %entry
				br label %land.rhs.i258

				land.rhs.i258: ; preds = %while.body.i262, %sw.bb20
				%cur.addr.019.i255.idx = phi i64 [ 1, %sw.bb20 ], [ %cur.addr.019.i255.add, %while.body.i262 ]
				br label %while.body.i262

				while.body.i262: ; preds = %land.rhs.i258
				%cur.addr.019.i255.add = add nuw nsw i64 %cur.addr.019.i255.idx, 1
				%cmp2.i261 = icmp ult i64 %cur.addr.019.i255.idx, 2045
				br i1 %cmp2.i261, label %land.rhs.i258, label %safe_concat.exit264

				safe_concat.exit264: ; preds = %while.body.i262
				unreachable

				sw.bb27: ; preds = %entry
				unreachable

				sw.bb36: ; preds = %entry
				unreachable

				sw.bb45: ; preds = %entry
				unreachable

				sw.bb54: ; preds = %entry
				unreachable

				sw.bb82: ; preds = %entry
				unreachable

				while.body.i486: ; preds = %entry
				unreachable

				while.body.i499: ; preds = %entry
				unreachable

				while.body.i512: ; preds = %entry
				ret void

				sw.bb102: ; preds = %entry
				unreachable

				sw.default: ; preds = %entry
				unreachable
				}

				define void @case4() {
				; CHECK: In function: case4
				; CHECK: Switch instruction:
				; CHECK: switch i32 undef, label %sw.default660 [
				; CHECK: Not suitable for a jump table due to possible CTR use in successor:
				; CHECK: sw.bb466:
				; CHECK: tail call void undef(i8* undef)
				entry:
				switch i32 undef, label %sw.default660 [
				i32 0, label %sw.bb
				i32 20, label %sw.bb458
				i32 19, label %sw.bb444
				i32 15, label %sw.bb361
				i32 29, label %sw.bb44
				i32 13, label %sw.bb53
				i32 17, label %sw.bb377
				i32 16, label %sw.bb372
				i32 25, label %sw.bb535
				i32 24, label %sw.bb484
				i32 36, label %sw.bb366
				i32 14, label %sw.bb315
				i32 12, label %sw.bb310
				i32 30, label %sw.bb466
				]

				sw.bb: ; preds = %entry
				unreachable

				sw.bb44: ; preds = %entry
				unreachable

				sw.bb53: ; preds = %entry
				unreachable

				sw.bb310: ; preds = %entry
				unreachable

				sw.bb315: ; preds = %entry
				unreachable

				sw.bb361: ; preds = %entry
				unreachable

				sw.bb366: ; preds = %entry
				unreachable

				sw.bb372: ; preds = %entry
				unreachable

				sw.bb377: ; preds = %entry
				unreachable

				sw.bb444: ; preds = %entry
				unreachable

				sw.bb458: ; preds = %entry
				unreachable

				sw.bb466: ; preds = %entry
				tail call void undef(i8* undef) #1
				unreachable

				sw.bb484: ; preds = %entry
				unreachable

				sw.bb535: ; preds = %entry
				unreachable

				sw.default660: ; preds = %entry
				unreachable
				}

				define void @case5(i1 %in) {
				; CHECK: In function: case5
				; CHECK: Switch instruction:
				; CHECK: switch i32 %0, label %sw.default [
				; CHECK: Not suitable for a jump table due to possible CTR use in successor:
				; CHECK: sw.bb189: ; preds = %if.end, %if.end, %if.end, %if.end
				; CHECK: switch i32 %1, label %sw.default.i [
				entry:
				br i1 %in, label %cleanup, label %if.end

				if.end: ; preds = %entry
				switch i8 undef, label %sw.default [
				i8 0, label %undef_sstr
				i8 1, label %sw.bb31
				i8 2, label %sw.bb80
				i8 3, label %sw.bb124
				i8 4, label %sw.bb171
				i8 14, label %sw.bb171
				i8 5, label %sw.bb177
				i8 6, label %sw.bb183
				i8 10, label %sw.bb189
				i8 11, label %sw.bb189
				i8 12, label %sw.bb189
				i8 15, label %sw.bb189
				i8 13, label %sw.bb201
				]

				undef_sstr: ; preds = %sw.bb31, %if.end
				unreachable

				sw.bb31: ; preds = %if.end
				br label %undef_sstr

				sw.bb80: ; preds = %if.end
				unreachable

				sw.bb124: ; preds = %if.end
				unreachable

				sw.bb171: ; preds = %if.end, %if.end
				unreachable

				sw.bb177: ; preds = %if.end
				unreachable

				sw.bb183: ; preds = %if.end
				unreachable

				sw.bb189: ; preds = %if.end, %if.end, %if.end, %if.end
				switch i8 undef, label %sw.default.i [
				i8 0, label %sw.bb.i
				i8 1, label %sw.bb.i
				i8 2, label %sw.bb.i
				i8 3, label %sw.bb.i
				i8 4, label %sw.bb.i
				i8 5, label %sw.bb.i
				i8 6, label %sw.bb.i
				i8 7, label %sw.bb.i
				i8 8, label %sw.bb.i
				i8 9, label %sw.bb11.i
				i8 10, label %Perl_sv_reftype.exit
				i8 11, label %sw.bb28.i
				i8 12, label %sw.bb29.i
				i8 13, label %sw.bb30.i
				i8 14, label %sw.bb31.i
				i8 15, label %sw.bb32.i
				]

				sw.bb.i: ; preds = %sw.bb189, %sw.bb189, %sw.bb189, %sw.bb189, %sw.bb189, %sw.bb189, %sw.bb189, %sw.bb189, %sw.bb189
				unreachable

				sw.bb11.i: ; preds = %sw.bb189
				unreachable

				sw.bb28.i: ; preds = %sw.bb189
				unreachable

				sw.bb29.i: ; preds = %sw.bb189
				unreachable

				sw.bb30.i: ; preds = %sw.bb189
				unreachable

				sw.bb31.i: ; preds = %sw.bb189
				unreachable

				sw.bb32.i: ; preds = %sw.bb189
				unreachable

				sw.default.i: ; preds = %sw.bb189
				unreachable

				Perl_sv_reftype.exit: ; preds = %sw.bb189
				unreachable

				sw.bb201: ; preds = %if.end
				unreachable

				sw.default: ; preds = %if.end
				unreachable

				cleanup: ; preds = %entry
				ret void
				}

test/CodeGen/PowerPC/mcm-5.ll

		entry:
		%i.addr = alloca i32, align 4
		store i32 %i, i32* %i.addr, align 4
		%0 = load i32, i32* %i.addr, align 4
		switch i32 %0, label %sw.default [
		i32 3, label %sw.bb
		i32 4, label %sw.bb1
		i32 5, label %sw.bb2
		i32 6, label %sw.bb3
		i32 7, label %sw.bb
		i32 8, label %sw.bb1
		i32 9, label %sw.bb2
		i32 10, label %sw.bb3
		]

		sw.default: ; preds = %entry
		br label %sw.epilog

		sw.bb: ; preds = %entry
		%1 = load i32, i32* %i.addr, align 4
		%mul = mul nsw i32 %1, 7