Page MenuHomePhabricator

[InstSimplify] Simplify intrinsic comparisons with domain knoweldge
Needs ReviewPublic

Authored by jhuber6 on Oct 9 2021, 3:52 PM.

Details

Summary

This patch adds support for simplifying instrinstic comparisons using
domain knowledge. In this case, a comparison with the NVPTX instrinstic
returning the number of threads and the number of threads in the block
will always be true. We can fold this accordingly.

Diff Detail

Event Timeline

jhuber6 created this revision.Oct 9 2021, 3:52 PM
jhuber6 requested review of this revision.Oct 9 2021, 3:52 PM
Herald added a project: Restricted Project. · View Herald TranscriptOct 9 2021, 3:52 PM
fhahn added a subscriber: fhahn.Oct 11 2021, 4:32 AM
tra added inline comments.Oct 11 2021, 11:59 AM
llvm/lib/Analysis/InstructionSimplify.cpp
609

What if LLVM has been compiled without NVPTX back-end? I'm not sure that NVVM intrinsics will be available then.

Perhaps we should re-visit enabling NVVMIntrRange.cpp pass, again. This should make it possible for LLVM to figure this optimization, and more.

nikic added a subscriber: nikic.Oct 11 2021, 12:12 PM
nikic added inline comments.
llvm/lib/Analysis/InstructionSimplify.cpp
609

I believe intrinsics are always included, even if the target is disabled. But I also don't think we have precedent for target intrinsic handling in InstSimplify, so adding @spatel and @lebedev.ri for that. Though I don't really see a problem with it.

We do provide InstCombine hooks (instCombineIntrinsic in TTI), but those work directly on the intrinsic. You could use that to replace NVVMIntrRange I believe. Though I don't think that would cover the particular use-case here, because range metadata is not sufficient to derive this result.

jhuber6 added inline comments.Oct 11 2021, 12:13 PM
llvm/lib/Analysis/InstructionSimplify.cpp
609

Ranges would only give us an upper bound right? Maybe we could insert llvm.assume calls as wall as the ranges there, then .

I think intrinsic functions are available, but I haven't checked. We use them in OpenMPOpt which is in the default pipeline and I haven't heard any complains so maybe it's probably fine.

tra added inline comments.Oct 11 2021, 12:39 PM
llvm/lib/Analysis/InstructionSimplify.cpp
609

Ranges would only give us an upper bound right?

Yes, they do not provide any info about relationship between launch grid parameters.

Maybe we could insert llvm.assume calls as wall as the ranges there, then .

Something like that.

611

We could also return false for threadIdx.x == blockSize.x and true for !=.

Also, the optimization should apply to blockIdx and gridDim comparisons, too.

jhuber6 added inline comments.Oct 11 2021, 1:05 PM
llvm/lib/Analysis/InstructionSimplify.cpp
609

I think using llvm.assume would be a good solution in general if we can get it to work, might make all of these cases automatic. Do we want to go down that avenue or just stick with this as the more straightforward option.

Is there a reason the NVVMIntrRange.cpp isn't currently enabled? Seems straightforward enough.

611

We should also think about applying this to AMD, but I remember they didn't have great intrinsics for some of these like Nvidia does.

We should not add target-specific code to generic analysis/passes if we can avoid it. I realize there are still target-specific intrinsic references in instcombine and even constant folding, but those are considered mistakes.

Previous discussions about this were:
https://lists.llvm.org/pipermail/llvm-dev/2016-July/102317.html
https://lists.llvm.org/pipermail/llvm-dev/2020-June/142859.html

So I think it's correct -- at least currently -- that all target-specific intrinsic definitions are included whether you build all targets or not. But that's not ideal - if someone only cares about one particular target, they shouldn't be burdened with defs and code for other targets.

D81728 made instcombine more flexible by adding a TTI hook as mentioned in an earlier comment. Using that (even if it's in a hacky way that walks uses of the intrinsic) or a target-specific pass would be better than polluting a generic analysis with target-specific code.

tra added inline comments.Oct 11 2021, 1:57 PM
llvm/lib/Analysis/InstructionSimplify.cpp
609

Is there a reason the NVVMIntrRange.cpp isn't currently enabled? Seems straightforward enough.

It triggered odd regressions in tensorflow code that I was unable to find the root cause for.
With the pass providing only minor benefits, I've just got it disabled by default. I'll try to re-test with the pass enabled and see how it fares now.

tra added inline comments.Oct 11 2021, 2:22 PM
llvm/lib/Analysis/InstructionSimplify.cpp
609

We do provide InstCombine hooks (instCombineIntrinsic in TTI), but those work directly on the intrinsic. You could use that to replace NVVMIntrRange

Interesting. We could indeed add ranges metadata there. I'm just not sure it's the best place for that. In order to be usefuf, we want ranges metadata to be available early. Adding it as a side-effect of InstCombine seems a bit odd -- both because it's not an optimization and because we'd run it multiple times even though we only need to add metadata only once per intrinsic. I guess, ideally it should be up to the intrinsic itself to provide the value range, but that's not something that exists right now. I think a one-shot pass that we can schedule independently is a decent fit for the job.

Also, I think may have figured out why NVVMIntrRange was causing the problems. I suspect that with the new pass manager the pass may have been initialized with the default constructor and that might give incorrect range info for the newer GPUs.

jhuber6 added inline comments.Oct 11 2021, 2:41 PM
llvm/lib/Analysis/InstructionSimplify.cpp
609

Thanks, if that works then I can try implementing this functionality with assumptions there and avoid the intrinsic here.