This is an archive of the discontinued LLVM Phabricator instance.

RFC: faster isa<IntrinsicInst> (bugged tests?)
Abandoned · Public

Authored by escha on Oct 4 2015, 10:20 PM.

Details

Summary

isa<IntrinsicInst> actually performed a string comparison on the function name instead of just checking that the intrinsic ID wasn't not_intrinsic, which cost ~50 cycles per invocation instead of ~10. This change should vastly reduce the number of wasteful string comparisons in intrinsic-heavy code. Furthermore, match() was *already* using the ID-based check; we actually had two different sets of behavior going on!
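
To make the difference concrete, here is a rough sketch of the two behaviors as hypothetical free functions (illustrative only; the real code lives in IntrinsicInst's classof and in Function):

  #include "llvm/IR/Function.h"
  #include "llvm/IR/Intrinsics.h"

  // What isa<IntrinsicInst> effectively did: a string comparison on the name.
  static bool isIntrinsicByName(const llvm::Function &F) {
    return F.getName().startswith("llvm.");
  }

  // What this patch proposes: a single integer compare against the cached ID.
  static bool isIntrinsicByID(const llvm::Function &F) {
    return F.getIntrinsicID() != llvm::Intrinsic::not_intrinsic;
  }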

Unfortunately, this change seems to break three tests:

LLVM :: Bitcode/function-local-metadata.3.5.ll
LLVM :: Bitcode/metadata.3.5.ll
LLVM :: Transforms/GlobalOpt/metadata.ll

All of these use intrinsics like "llvm.foo" that don't really exist. Does anyone know what's up with this? Are the tests just bogus?

Diff Detail

Repository
rL LLVM

Event Timeline

escha updated this revision to Diff 36477. Oct 4 2015, 10:20 PM
escha retitled this revision from to RFC: faster isa<IntrinsicInst> (bugged tests?).
escha updated this object.
escha added reviewers: chandlerc, resistor.
escha set the repository for this revision to rL LLVM.
escha added a subscriber: llvm-commits.
escha added a comment. Oct 5 2015, 1:16 PM

Unfortunately, it seems like we have a deeper problem here. Some tests seem to rely on the ability to handle unknown intrinsics, but large parts of LLVM assume that not_intrinsic == not an intrinsic, even though sometimes it can mean unknown_intrinsic. There's a ton of inconsistencies here, and it doesn't seem easy to fix. I tried adding an unknown_intrinsic option, but that just opened an entire other can of worms because lots of code assumed that nonzero intrinsic ID == real, known intrinsic...
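
For example, a lot of code follows roughly this pattern (hypothetical helper, shown only to illustrate the assumption):

  #include "llvm/IR/Function.h"
  #include "llvm/IR/Intrinsics.h"

  // Any ID other than not_intrinsic is treated as a real, well-known
  // intrinsic, so there is no way to represent "llvm.-prefixed but unknown".
  static bool treatAsKnownIntrinsic(const llvm::Function &F) {
    switch (F.getIntrinsicID()) {
    case llvm::Intrinsic::not_intrinsic:
      return false; // conflates "not an intrinsic" with "unknown intrinsic"
    default:
      return true;  // assumes the nonzero ID maps to a known intrinsic
    }
  }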

silvas added a subscriber: silvas. Oct 5 2015, 4:57 PM

We should optimize getName().startswith("llvm.") on x86 into loading the address and length of the name, a bounds check, a 4-byte integer comparison, a 1-byte integer comparison, and cmp+jmp for each comparison. Assuming the name is in cache (which is probably not the case?) and the branch prediction is good (which it probably is?), that is < 10 cycles on a modern x86. Even on ARM where we can't use unaligned loads, it should still be pretty darn fast.
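
Concretely, something like the following sketch (illustrative only, not the actual generated code or LLVM source):

  #include "llvm/ADT/StringRef.h"
  #include <cstdint>
  #include <cstring>

  // A length check, one 4-byte integer compare ("llvm"), one 1-byte compare ('.').
  static bool startsWithLLVMDot(llvm::StringRef Name) {
    if (Name.size() < 5)
      return false;
    uint32_t Head, Want;
    std::memcpy(&Head, Name.data(), 4);
    std::memcpy(&Want, "llvm", 4);
    return Head == Want && Name[4] == '.';
  }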

What is actually taking ~50 cycles? How did you measure that?

(I'm not defending using a string comparison vs. not; just wondering why a comparison against a "short, constant string" is ending up so slow)

escha added a comment. Oct 5 2015, 5:06 PM

I stuck an rdtsc timer around code that traversed the function like this:

if (isa<CallInst>(I)) {
  START_TIMER;
  if (isa<IntrinsicInst>(I) &&
      cast<IntrinsicInst>(I).getIntrinsicID() == Intrinsic::assume) {
    // do a thing that will never happen
  }
  STOP_TIMER;
}

It was ~10 cycles per timer hit when using the ID-based approach, ~50 when using the current approach. The timer already subtracts out the cost of the timer itself and the effects of context switches (approximately).
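
For reference, the timer is roughly this shape (a minimal sketch assuming x86 with GCC/Clang; the real macros also subtract their own overhead, and TotalCycles is an assumed accumulator name):

  #include <cstdint>
  #include <x86intrin.h> // __rdtsc

  static uint64_t TotalCycles = 0; // assumed accumulator, dumped at the end of the run

  #define START_TIMER uint64_t StartCycles = __rdtsc()
  #define STOP_TIMER  (TotalCycles += __rdtsc() - StartCycles)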

I don't know why it was so slow, but I still don't think we should be constantly doing string comparisons either way, at least if we can avoid it.

silvas added a comment. Oct 5 2015, 5:33 PM

What is the resulting assembly code?
Also what processor are you measuring on?

I'm a bit wary of doing micro-optimizations like this with nothing but "seems faster on my machine" to go on. We probably want to identify the root cause (if only so that we can look for similar cases in our core classes).

Based on the information you've provided, the only difference on the scale of ~40 cycles that I can think of for this code is that following the pointer inside getName() is typically going to L3 (assuming a modern Intel core). If you simply duplicate the code inside START_TIMER and STOP_TIMER (and make sure the compiler doesn't de-duplicate it; e.g. pull it out into a noinline function), how much does the performance change?
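
For concreteness, the experiment could look roughly like this (hypothetical helper name, not code from this patch): call it twice back-to-back inside the timed region and see how the two costs compare.

  #include "llvm/IR/IntrinsicInst.h"

  // noinline so the compiler cannot merge the two back-to-back calls; if the
  // second call is much cheaper, the first was mostly paying for a cache miss.
  __attribute__((noinline))
  static bool isAssumeCall(const llvm::Instruction &I) {
    if (const auto *II = llvm::dyn_cast<llvm::IntrinsicInst>(&I))
      return II->getIntrinsicID() == llvm::Intrinsic::assume;
    return false;
  }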

escha added a comment. Oct 5 2015, 5:39 PM

This isn't merely a micro-optimization as I've discovered; it's actually (from some perspective) a bugfix. We use *both methods* of checking whether something is an intrinsic at various points in LLVM, inconsistently; if you believe checking the prefix is the correct method, we should change the places that rely on the ID to use that instead.

escha abandoned this revision. Oct 5 2015, 5:40 PM

Since people don't seem to be particularly interested in fixing broken+inconsistent behavior and believe performing string comparisons is a reasonable and performant way of checking class membership, I'm not going to bother with this.

silvas added a comment. Oct 5 2015, 6:27 PM

The inconsistency is clearly troubling and is worth fixing. This review may have gotten off to a bad start by focusing on the performance aspect. Any time correctness and performance get mixed into the same discussion this sort of thing tends to happen.

(also, I don't think anybody is really advocating the use of string comparisons)

@vedant: do you know if it is important for compatibility for us to use strings in the tests Fiona pointed out? Can we just switch those tests to use a real intrinsic?

  • Sean Silva

I recommend that you un-abandon this, and I'd like to hear Philip's opinion here. As I recall, he's spent some time looking at the time spent in string-matching related to intrinsics.

Could you please post this with full context, however?

reames edited edge metadata. Oct 8 2015, 2:07 PM
reames added a subscriber: reames.

My opinion is that a name does not an intrinsic make. Unless someone has an existing use case which requires the ability to implicitly add intrinsics, I would suggest we do the following:

  • Add a rule to the verifier that explicitly requires any function with an @llvm. prefix to have a known intrinsic ID. It should be an error to use an @llvm. prefix for anything which is not a well-known intrinsic.

  • All of the tests which are using an unknown intrinsic are modified to use @llvm.do_nothing, which exists exactly for this purpose.

Once those two steps are done, making isIntrinsic check the intrinsic ID seems entirely reasonable to me.
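
For illustration, such a verifier rule might look roughly like this (hypothetical helper; the real check would live in Verifier.cpp and report through its usual CheckFailed machinery):

  #include "llvm/IR/Function.h"
  #include "llvm/IR/Intrinsics.h"

  // Reject any "llvm."-prefixed function that has no known intrinsic ID.
  static bool hasValidIntrinsicName(const llvm::Function &F) {
    if (!F.getName().startswith("llvm."))
      return true; // ordinary function, nothing to check
    return F.getIntrinsicID() != llvm::Intrinsic::not_intrinsic;
  }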

p.s. Is there any reason that recalculateIntrinsicID needs to be public? I don't see an obvious one.

p.p.s. I would also be interested in seeing the code generated for the current comparison. We should be able to emit fast code for that, but it's an entirely separate issue from what we do in terms of defining intrinsics. Doing a string check just isn't needed here.

Philip

vsk added a subscriber: vsk. Oct 8 2015, 4:32 PM

+1 to both of Philip's suggestions.

For backwards compatibility, we should add logic to BitcodeReader that upgrades bogus intrinsics to llvm.do_nothing(). I think llvm::UpgradeIntrinsicCall() exists for this purpose.

(Sean, sorry I missed this earlier.)