This patch adds support for simplifying instrinstic comparisons using
domain knowledge. In this case, a comparison with the NVPTX instrinstic
returning the number of threads and the number of threads in the block
will always be true. We can fold this accordingly.
What if LLVM has been compiled without NVPTX back-end? I'm not sure that NVVM intrinsics will be available then.
Perhaps we should re-visit enabling NVVMIntrRange.cpp pass, again. This should make it possible for LLVM to figure this optimization, and more.
I believe intrinsics are always included, even if the target is disabled. But I also don't think we have precedent for target intrinsic handling in InstSimplify, so adding @spatel and @lebedev.ri for that. Though I don't really see a problem with it.
We do provide InstCombine hooks (instCombineIntrinsic in TTI), but those work directly on the intrinsic. You could use that to replace NVVMIntrRange I believe. Though I don't think that would cover the particular use-case here, because range metadata is not sufficient to derive this result.
Ranges would only give us an upper bound right? Maybe we could insert llvm.assume calls as wall as the ranges there, then .
I think intrinsic functions are available, but I haven't checked. We use them in OpenMPOpt which is in the default pipeline and I haven't heard any complains so maybe it's probably fine.
Yes, they do not provide any info about relationship between launch grid parameters.
Something like that.
We could also return false for threadIdx.x == blockSize.x and true for !=.
Also, the optimization should apply to blockIdx and gridDim comparisons, too.
I think using llvm.assume would be a good solution in general if we can get it to work, might make all of these cases automatic. Do we want to go down that avenue or just stick with this as the more straightforward option.
Is there a reason the NVVMIntrRange.cpp isn't currently enabled? Seems straightforward enough.
We should also think about applying this to AMD, but I remember they didn't have great intrinsics for some of these like Nvidia does.
We should not add target-specific code to generic analysis/passes if we can avoid it. I realize there are still target-specific intrinsic references in instcombine and even constant folding, but those are considered mistakes.
Previous discussions about this were:
So I think it's correct -- at least currently -- that all target-specific intrinsic definitions are included whether you build all targets or not. But that's not ideal - if someone only cares about one particular target, they shouldn't be burdened with defs and code for other targets.
D81728 made instcombine more flexible by adding a TTI hook as mentioned in an earlier comment. Using that (even if it's in a hacky way that walks uses of the intrinsic) or a target-specific pass would be better than polluting a generic analysis with target-specific code.
It triggered odd regressions in tensorflow code that I was unable to find the root cause for.
Interesting. We could indeed add ranges metadata there. I'm just not sure it's the best place for that. In order to be usefuf, we want ranges metadata to be available early. Adding it as a side-effect of InstCombine seems a bit odd -- both because it's not an optimization and because we'd run it multiple times even though we only need to add metadata only once per intrinsic. I guess, ideally it should be up to the intrinsic itself to provide the value range, but that's not something that exists right now. I think a one-shot pass that we can schedule independently is a decent fit for the job.
Also, I think may have figured out why NVVMIntrRange was causing the problems. I suspect that with the new pass manager the pass may have been initialized with the default constructor and that might give incorrect range info for the newer GPUs.