This is an archive of the discontinued LLVM Phabricator instance.

Differential D111500

[InstSimplify] Simplify intrinsic comparisons with domain knowledge
Needs Review · Public

Authored by jhuber6 on Oct 9 2021, 3:52 PM.

Details

Reviewers

jdoerfert
tra
spatel
lebedev.ri

Summary

This patch adds support for simplifying intrinsic comparisons using
domain knowledge. In this case, a less-than (or less-than-or-equal)
comparison between the NVPTX intrinsic returning the thread index
(`tid.x`) and the one returning the number of threads in the block
(`ntid.x`) will always be true. We can fold this accordingly.
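The fold relies on the fact that every valid launch has 0 <= tid.x < ntid.x. As a rough sanity check, here is a standalone sketch (not LLVM code) that enumerates every legal (tid.x, ntid.x) pair and reports which comparison predicates have a fixed result. `Pred`, `evalPred`, and `constantResult` are hypothetical names, and the 1024-thread limit along x is an assumption based on current CUDA hardware; the underlying fact does not depend on that bound.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Integer comparison predicates, modeled on (not taken from) LLVM's ICmpInst.
enum class Pred { EQ, NE, ULT, ULE, UGT, UGE, SLT, SLE, SGT, SGE };

// Evaluates `lhs <p> rhs` on 32-bit operands, using the signed or unsigned
// interpretation that the predicate calls for.
bool evalPred(Pred p, uint32_t lhs, uint32_t rhs) {
  const int32_t sl = static_cast<int32_t>(lhs);
  const int32_t sr = static_cast<int32_t>(rhs);
  switch (p) {
  case Pred::EQ:  return lhs == rhs;
  case Pred::NE:  return lhs != rhs;
  case Pred::ULT: return lhs < rhs;
  case Pred::ULE: return lhs <= rhs;
  case Pred::UGT: return lhs > rhs;
  case Pred::UGE: return lhs >= rhs;
  case Pred::SLT: return sl < sr;
  case Pred::SLE: return sl <= sr;
  case Pred::SGT: return sl > sr;
  case Pred::SGE: return sl >= sr;
  }
  return false; // not reached: every enumerator is handled above
}

// Returns the result of `tid <p> ntid` if it is identical for every launch
// with 1 <= ntid <= maxBlockX and 0 <= tid < ntid; otherwise nullopt.
std::optional<bool> constantResult(Pred p, uint32_t maxBlockX) {
  assert(maxBlockX >= 1 && "a block has at least one thread");
  std::optional<bool> seen;
  for (uint32_t ntid = 1; ntid <= maxBlockX; ++ntid) {
    for (uint32_t tid = 0; tid < ntid; ++tid) {
      const bool r = evalPred(p, tid, ntid);
      if (!seen)
        seen = r;
      else if (*seen != r)
        return std::nullopt; // the result depends on the launch
    }
  }
  return seen;
}
```

With `maxBlockX = 1024`, every predicate comes out constant: the less-than forms, the less-or-equal forms, and `NE` are always true, while `EQ` and the greater-than forms are always false. The patch as written only folds the less-than and less-or-equal cases.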

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

jhuber6 created this revision. Oct 9 2021, 3:52 PM

Herald added a subscriber: hiraditya. Oct 9 2021, 3:52 PM

jhuber6 requested review of this revision. Oct 9 2021, 3:52 PM

Herald added a project: Restricted Project. Oct 9 2021, 3:52 PM

Herald added a subscriber: llvm-commits.

Harbormaster completed remote builds in B127966: Diff 378481. Oct 9 2021, 4:16 PM

fhahn added a subscriber: fhahn. Oct 11 2021, 4:32 AM

tra added inline comments. Oct 11 2021, 11:59 AM

llvm/lib/Analysis/InstructionSimplify.cpp
Line 609: What if LLVM has been compiled without the NVPTX back-end? I'm not sure that NVVM intrinsics will be available then. Perhaps we should revisit enabling the NVVMIntrRange.cpp pass again. That should make it possible for LLVM to figure out this optimization, and more.

nikic added reviewers: spatel, lebedev.ri. Oct 11 2021, 12:06 PM

nikic added a subscriber: nikic. Oct 11 2021, 12:12 PM

nikic added inline comments.

llvm/lib/Analysis/InstructionSimplify.cpp
Line 609: I believe intrinsics are always included, even if the target is disabled. But I also don't think we have precedent for target intrinsic handling in InstSimplify, so I'm adding @spatel and @lebedev.ri for that, though I don't really see a problem with it. We do provide InstCombine hooks (instCombineIntrinsic in TTI), but those work directly on the intrinsic. You could use that to replace NVVMIntrRange, I believe. However, I don't think that would cover the particular use case here, because range metadata is not sufficient to derive this result.
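To see why range metadata alone cannot prove this fold, here is a minimal sketch that decides an unsigned less-than using only each operand's individual range. The bounds tid.x in [0, 1023] and ntid.x in [1, 1024] are illustrative assumptions (a 1024-thread limit along x), not necessarily what NVVMIntrRange emits, and `Range`/`ultFromRanges` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// A closed interval [lo, hi] of unsigned values: all that range metadata can
// say about one value, with no link to any other value.
struct Range {
  uint32_t lo;
  uint32_t hi;
};

// Decides `a <u b` from the two ranges alone, or returns nullopt when some
// values in the ranges satisfy the comparison and others do not.
std::optional<bool> ultFromRanges(Range a, Range b) {
  assert(a.lo <= a.hi && b.lo <= b.hi && "ranges must be non-empty");
  if (a.hi < b.lo)
    return true;  // even the largest a is below the smallest b
  if (a.lo >= b.hi)
    return false; // even the smallest a is at or above the largest b
  return std::nullopt;
}
```

For the assumed bounds the answer is `nullopt`: the ranges admit tid.x = 1000 together with ntid.x = 1, even though no real launch produces that pair. Proving the fold needs a fact that relates the two values to each other.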

jhuber6 added inline comments. Oct 11 2021, 12:13 PM

llvm/lib/Analysis/InstructionSimplify.cpp
Line 609: Ranges would only give us an upper bound, right? Maybe we could insert `llvm.assume` calls as well as the ranges there, then. I think intrinsic functions are available, but I haven't checked. We use them in OpenMPOpt, which is in the default pipeline, and I haven't heard any complaints, so it's probably fine.

tra added inline comments. Oct 11 2021, 12:39 PM

llvm/lib/Analysis/InstructionSimplify.cpp
Line 609:
> Ranges would only give us an upper bound right?

Yes, they do not provide any info about the relationship between launch grid parameters.

> Maybe we could insert llvm.assume calls as well as the ranges there, then.

Something like that.

Line 611: We could also return false for `threadIdx.x == blockSize.x` and true for `!=`. Also, the optimization should apply to `blockIdx` and `gridDim` comparisons.
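The generalization suggested here can be sketched as a small folding table. This is a standalone model, not LLVM code: `SReg` stands in for the `tid.x`/`ntid.x` and `ctaid.x`/`nctaid.x` special-register intrinsics (PTX's names for `threadIdx`/`blockDim` and `blockIdx`/`gridDim`), and all function names are hypothetical. It assumes each index is strictly less than its matching dimension and that both are non-negative and fit in a signed 32-bit integer, so signed and unsigned predicates agree.

```cpp
#include <cassert>
#include <optional>
#include <utility>

// Hypothetical stand-ins for NVVM special-register reads (not Intrinsic::IDs).
enum class SReg { TidX, NTidX, CtaIdX, NCtaIdX, Other };

enum class Pred { EQ, NE, ULT, ULE, UGT, UGE, SLT, SLE, SGT, SGE };

// The predicate that gives the same answer with the operands swapped.
Pred swapped(Pred p) {
  switch (p) {
  case Pred::ULT: return Pred::UGT;
  case Pred::ULE: return Pred::UGE;
  case Pred::UGT: return Pred::ULT;
  case Pred::UGE: return Pred::ULE;
  case Pred::SLT: return Pred::SGT;
  case Pred::SLE: return Pred::SGE;
  case Pred::SGT: return Pred::SLT;
  case Pred::SGE: return Pred::SLE;
  default:        return p; // EQ and NE are symmetric
  }
}

// True if `idx` is always strictly less than `dim` in a valid launch.
bool isIndexDimPair(SReg idx, SReg dim) {
  return (idx == SReg::TidX && dim == SReg::NTidX) ||
         (idx == SReg::CtaIdX && dim == SReg::NCtaIdX);
}

// Folds `lhs <p> rhs` when the operands are an index and its matching
// dimension (in either order); returns nullopt for anything else.
std::optional<bool> foldIndexDimCompare(Pred p, SReg lhs, SReg rhs) {
  if (isIndexDimPair(rhs, lhs)) { // put the index on the left
    std::swap(lhs, rhs);
    p = swapped(p);
  }
  if (!isIndexDimPair(lhs, rhs))
    return std::nullopt;
  switch (p) {
  case Pred::EQ:
    return false; // index < dim, so the two are never equal
  case Pred::NE:
  case Pred::ULT:
  case Pred::ULE:
  case Pred::SLT:
  case Pred::SLE:
    return true;
  default:
    return false; // UGT, UGE, SGT, SGE
  }
}
```

For example, `foldIndexDimCompare(Pred::UGT, SReg::NTidX, SReg::TidX)` is canonicalized to `tid.x <u ntid.x` and yields true, while an unrelated pair such as `tid.x` versus `nctaid.x` is left alone.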

jhuber6 added inline comments. Oct 11 2021, 1:05 PM

llvm/lib/Analysis/InstructionSimplify.cpp
Line 609: I think using `llvm.assume` would be a good solution in general if we can get it to work; it might make all of these cases automatic. Do we want to go down that avenue, or just stick with this as the more straightforward option? Is there a reason the NVVMIntrRange.cpp isn't currently enabled? Seems straightforward enough.
Line 611: We should also think about applying this to AMD, but I remember they didn't have good intrinsics for some of these, like Nvidia does.

We should not add target-specific code to generic analysis/passes if we can avoid it. I realize there are still target-specific intrinsic references in instcombine and even constant folding, but those are considered mistakes.

Previous discussions about this were:
https://lists.llvm.org/pipermail/llvm-dev/2016-July/102317.html
https://lists.llvm.org/pipermail/llvm-dev/2020-June/142859.html

So I think it's correct -- at least currently -- that all target-specific intrinsic definitions are included whether you build all targets or not. But that's not ideal: if someone only cares about one particular target, they shouldn't be burdened with definitions and code for other targets.

D81728 made instcombine more flexible by adding a TTI hook as mentioned in an earlier comment. Using that (even if it's in a hacky way that walks uses of the intrinsic) or a target-specific pass would be better than polluting a generic analysis with target-specific code.
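The layering argued for here can be sketched as follows: the generic simplifier only calls through a target interface, and only the target's implementation knows NVVM intrinsic names. `TargetHooks`, `NVPTXHooks`, and `simplifyIntrinsicCompare` are hypothetical; LLVM's actual hook (`instCombineIntrinsic` in TTI) has a different shape and is used by InstCombine, not InstSimplify.

```cpp
#include <cassert>
#include <optional>
#include <string>

enum class Pred { EQ, NE, LT, LE, GT, GE };

// Hypothetical per-target interface. The default knows nothing about any
// intrinsic, so targets that don't override it get no folding.
struct TargetHooks {
  virtual ~TargetHooks() = default;
  virtual std::optional<bool>
  foldIntrinsicCompare(Pred, const std::string & /*lhsIntrinsic*/,
                       const std::string & /*rhsIntrinsic*/) const {
    return std::nullopt;
  }
};

// Target-specific knowledge lives only in the target's implementation.
struct NVPTXHooks : TargetHooks {
  std::optional<bool>
  foldIntrinsicCompare(Pred p, const std::string &lhs,
                       const std::string &rhs) const override {
    if (lhs == "llvm.nvvm.read.ptx.sreg.tid.x" &&
        rhs == "llvm.nvvm.read.ptx.sreg.ntid.x" &&
        (p == Pred::LT || p == Pred::LE))
      return true;
    return std::nullopt;
  }
};

// Generic code: no NVPTX names, just a call through the interface.
std::optional<bool> simplifyIntrinsicCompare(const TargetHooks &hooks, Pred p,
                                             const std::string &lhs,
                                             const std::string &rhs) {
  return hooks.foldIntrinsicCompare(p, lhs, rhs);
}
```

A build that only cares about another target would supply its own `TargetHooks` (or the default), and the generic function never needs the NVPTX intrinsic definitions.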

tra added inline comments. Oct 11 2021, 1:57 PM

llvm/lib/Analysis/InstructionSimplify.cpp
Line 609:
> Is there a reason the NVVMIntrRange.cpp isn't currently enabled? Seems straightforward enough.

It triggered odd regressions in TensorFlow code that I was unable to find the root cause for. Since the pass provided only minor benefits, I just disabled it by default. I'll try to re-test with the pass enabled and see how it fares now.

tra added inline comments. Oct 11 2021, 2:22 PM

llvm/lib/Analysis/InstructionSimplify.cpp
Line 609:
> We do provide InstCombine hooks (instCombineIntrinsic in TTI), but those work directly on the intrinsic. You could use that to replace NVVMIntrRange

Interesting. We could indeed add range metadata there. I'm just not sure it's the best place for that. To be useful, we want range metadata to be available early. Adding it as a side effect of InstCombine seems a bit odd, both because it's not an optimization and because we'd run it multiple times even though we only need to add the metadata once per intrinsic. Ideally, it should be up to the intrinsic itself to provide the value range, but that's not something that exists right now. I think a one-shot pass that we can schedule independently is a decent fit for the job.

Also, I think I may have figured out why `NVVMIntrRange` was causing the problems. I suspect that with the new pass manager the pass may have been initialized with the default constructor, which might give incorrect range info for newer GPUs.

jhuber6 added inline comments. Oct 11 2021, 2:41 PM

llvm/lib/Analysis/InstructionSimplify.cpp
Line 609: Thanks. If that works, I can try implementing this functionality with assumptions there and avoid special-casing the intrinsics here.
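A minimal sketch of that direction, under stated assumptions: a target-specific, one-shot pass records the fact `tid.x <u ntid.x` (in IR, presumably an `icmp ult` of the two intrinsic results passed to `llvm.assume`), and the generic simplifier reasons only about recorded facts, never about intrinsic IDs. `KnownFacts`, `ValueId`, and `simplifyFromFacts` are hypothetical names, not LLVM APIs.

```cpp
#include <cassert>
#include <optional>
#include <set>
#include <utility>

using ValueId = int; // hypothetical handle for an SSA value

enum class Pred { EQ, NE, LT, LE, GT, GE };

// Facts of the form `a < b`, as a target-specific pass might record them.
class KnownFacts {
public:
  void assumeLT(ValueId a, ValueId b) { lessThan.insert({a, b}); }
  bool knownLT(ValueId a, ValueId b) const {
    return lessThan.count({a, b}) != 0;
  }

private:
  std::set<std::pair<ValueId, ValueId>> lessThan;
};

// Generic folding: consults only the recorded facts, not intrinsic IDs.
std::optional<bool> simplifyFromFacts(const KnownFacts &facts, Pred p,
                                      ValueId a, ValueId b) {
  if (facts.knownLT(a, b)) // a < b holds
    return p == Pred::LT || p == Pred::LE || p == Pred::NE;
  if (facts.knownLT(b, a)) // b < a holds
    return p == Pred::GT || p == Pred::GE || p == Pred::NE;
  return std::nullopt;
}
```

Because `simplifyFromFacts` never mentions NVPTX, the same logic would serve any target whose pass records similar facts.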

This review seems to be stuck/dead, consider abandoning if no longer relevant.

Herald added a project: Restricted Project. Jan 12 2023, 5:24 PM

Herald added a subscriber: StephenFan.

Revision Contents

Path                                            Size
llvm/lib/Analysis/InstructionSimplify.cpp       21 lines
llvm/test/Transforms/InstSimplify/intrinsic.ll  35 lines

Diff 378481

llvm/lib/Analysis/InstructionSimplify.cpp

@@ 33 lines not shown @@
 #include "llvm/Analysis/VectorUtils.h"
 #include "llvm/IR/ConstantRange.h"
 #include "llvm/IR/DataLayout.h"
 #include "llvm/IR/Dominators.h"
 #include "llvm/IR/GetElementPtrTypeIterator.h"
 #include "llvm/IR/GlobalAlias.h"
 #include "llvm/IR/InstrTypes.h"
 #include "llvm/IR/Instructions.h"
+#include "llvm/IR/IntrinsicsNVPTX.h"
 #include "llvm/IR/Operator.h"
 #include "llvm/IR/PatternMatch.h"
 #include "llvm/IR/ValueHandle.h"
 #include "llvm/Support/KnownBits.h"
 #include <algorithm>
 using namespace llvm;
 using namespace llvm::PatternMatch;

@@ 544 lines not shown @@ for (unsigned u = 0, e = PI->getNumIncomingValues(); u < e; ++u) {
     if (!V || (CommonValue && V != CommonValue))
       return nullptr;
     CommonValue = V;
   }
 
   return CommonValue;
 }
 
+static Constant *foldIntrinsicConstant(ICmpInst::Predicate Pred, Value *Op0,
+                                       Value *Op1, Type *RetTy) {
+  IntrinsicInst *Inst0 = dyn_cast<IntrinsicInst>(Op0);
+  IntrinsicInst *Inst1 = dyn_cast<IntrinsicInst>(Op1);
+
+  // fold %cmp = icmp slt i32 %tid, %ntid to true.
+  if (Inst0->getIntrinsicID() == Intrinsic::nvvm_read_ptx_sreg_tid_x &&
+      Inst1->getIntrinsicID() == Intrinsic::nvvm_read_ptx_sreg_ntid_x)
+    if (ICmpInst::isLE(Pred) || ICmpInst::isLT(Pred))
+      return ConstantInt::getTrue(RetTy);
+
+  return nullptr;
+}
+
 static Constant *foldOrCommuteConstant(Instruction::BinaryOps Opcode,
                                        Value *&Op0, Value *&Op1,
                                        const SimplifyQuery &Q) {
   if (auto *CLHS = dyn_cast<Constant>(Op0)) {
     if (auto *CRHS = dyn_cast<Constant>(Op1))
       return ConstantFoldBinaryOpOperands(Opcode, CLHS, CRHS, Q.DL);
 
     // Canonicalize the constant to the RHS if this is a commutative operation.
@@ 3,088 lines not shown @@ if (Value *V = ThreadCmpOverSelect(Pred, LHS, RHS, Q, MaxRecurse))
       return V;
 
   // If the comparison is with the result of a phi instruction, check whether
   // doing the compare with each incoming phi value yields a common result.
   if (isa<PHINode>(LHS) || isa<PHINode>(RHS))
     if (Value *V = ThreadCmpOverPHI(Pred, LHS, RHS, Q, MaxRecurse))
       return V;
 
+  // If the comparison is with two intrinsic instructions, try to fold them
+  // using domain knowledge.
+  if (isa<IntrinsicInst>(LHS) && isa<IntrinsicInst>(RHS))
+    if (Constant *C = foldIntrinsicConstant(Pred, LHS, RHS, ITy))
+      return C;
+
   return nullptr;
 }
 
 Value *llvm::SimplifyICmpInst(unsigned Predicate, Value *LHS, Value *RHS,
                               const SimplifyQuery &Q) {
   return ::SimplifyICmpInst(Predicate, LHS, RHS, Q, RecursionLimit);
 }
 
@@ 2,729 lines not shown @@

llvm/test/Transforms/InstSimplify/intrinsic.ll

This file was added.

; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
; RUN: opt < %s -instsimplify -S | FileCheck %s

define i32 @compare() {
; CHECK-LABEL: @compare(
; CHECK-NEXT: entry:
; CHECK-NEXT: br i1 true, label [[IF_THEN:%.*]], label [[IF_ELSE:%.*]]
; CHECK: if.then:
; CHECK-NEXT: br label [[RETURN:%.*]]
; CHECK: if.else:
; CHECK-NEXT: br label [[RETURN]]
; CHECK: return:
; CHECK-NEXT: [[RETVAL:%.*]] = phi i32 [ 1, [[IF_THEN]] ], [ 0, [[IF_ELSE]] ]
; CHECK-NEXT: ret i32 [[RETVAL]]
;
entry:
  %tid = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  %ntid = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
  %cmp = icmp slt i32 %tid, %ntid
  br i1 %cmp, label %if.then, label %if.else

if.then: ; preds = %entry
  br label %return

if.else: ; preds = %entry
  br label %return

return: ; preds = %if.else, %if.then
  %retval = phi i32 [ 1, %if.then ], [ 0, %if.else ]
  ret i32 %retval
}

declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()

declare i32 @llvm.nvvm.read.ptx.sreg.ntid.x()