This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] Limit folding of cast into PHI
Abandoned · Public

Authored by syzaara on Apr 8 2022, 10:16 AM.

Details

Summary

InstCombine folds a cast of a phi into the phi. Sometimes this isn't beneficial, because another user of the phi may also be a cast, such as a zext that feeds an add (or another operation that vectorization would consider for a reduction) followed by a truncate. Folding the cast into the phi can break this zext + add + truncate pattern, which otherwise could have been simplified into a narrower add and considered for loop vectorization.
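
For illustration, a minimal sketch of the pattern described above (hypothetical @before/@after functions written by the editor, based on the IR shown later in this review, not taken from the patch itself):

; zext + add + truncate on an i8 accumulator...
define i8 @before(i8 %acc, i8 %x) {
  %conv  = zext i8 %x to i32
  %conv1 = zext i8 %acc to i32
  %add   = add i32 %conv1, %conv
  %res   = trunc i32 %add to i8
  ret i8 %res
}
; ...which is equivalent to a single narrower add.
define i8 @after(i8 %acc, i8 %x) {
  %add = add i8 %acc, %x
  ret i8 %add
}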

Diff Detail

Event Timeline

syzaara created this revision. · Apr 8 2022, 10:16 AM
Herald added a project: Restricted Project. · View Herald Transcript · Apr 8 2022, 10:16 AM
Herald added a subscriber: hiraditya. · View Herald Transcript
syzaara requested review of this revision. · Apr 8 2022, 10:16 AM
Herald added a project: Restricted Project. · View Herald Transcript · Apr 8 2022, 10:16 AM

This seems like a vectorizer bug.

The reason it doesn't get vectorized is that the loop vectorizer sees a truncating and instruction in the def-use chain of the reduction phi. The IR that the vectorizer sees looks like this:

define internal fastcc void @do_one() unnamed_addr #1 {
entry:
  tail call void (i8*, ...) @obfuscate(i8* noundef bitcast ([8192 x i16]* @x to i8*)) #3
  br label %for.body

for.body:                                         ; preds = %entry, %for.body
  %a.010 = phi i32 [ 0, %entry ], [ %phi.cast, %for.body ]
  %i.09 = phi i64 [ 0, %entry ], [ %inc, %for.body ]
  %arrayidx = getelementptr inbounds [8192 x i16], [8192 x i16]* @x, i64 0, i64 %i.09
  %0 = load i16, i16* %arrayidx, align 2, !tbaa !6
  %conv = zext i16 %0 to i32
  %add = add nuw nsw i32 %a.010, %conv
  %inc = add nuw nsw i64 %i.09, 1
  %phi.cast = and i32 %add, 65535
  %exitcond.not = icmp eq i64 %inc, 8192
  br i1 %exitcond.not, label %for.end, label %for.body, !llvm.loop !10

for.end:                                          ; preds = %for.body
  %phi.cast.lcssa = phi i32 [ %phi.cast, %for.body ]
  tail call void (i8*, ...) @obfuscate(i8* noundef bitcast ([8192 x i16]* @x to i8*), i32 noundef signext %phi.cast.lcssa) #3
  ret void
}

The %phi.cast = and i32 %add, 65535 instruction is not an ADD so the %a.010 phi doesn't qualify as an add reduction.

LV: Not vectorizing: Found an unidentified PHI %a.010 = phi i32 [ 0, %entry ], [ %phi.cast, %for.body ]
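
For contrast, here is a rough sketch of a shape LV would recognize (an editor's illustration, not the output of this or any particular patch): if the accumulator were narrowed to i16, the truncating and disappears and the phi feeds a plain integer add reduction, with a zext of the final value needed only in for.end.

for.body:                                         ; preds = %entry, %for.body
  %a.010 = phi i16 [ 0, %entry ], [ %add, %for.body ]
  %i.09 = phi i64 [ 0, %entry ], [ %inc, %for.body ]
  %arrayidx = getelementptr inbounds [8192 x i16], [8192 x i16]* @x, i64 0, i64 %i.09
  %0 = load i16, i16* %arrayidx, align 2
  %add = add i16 %a.010, %0
  %inc = add nuw nsw i64 %i.09, 1
  %exitcond.not = icmp eq i64 %inc, 8192
  br i1 %exitcond.not, label %for.end, label %for.body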

Yep, I agree. The vectorizer just needs to be able to understand that such an and effectively truncates the bit width.

bmahjour added a comment. Edited · Apr 11 2022, 12:13 PM

The cleanest way I can think of to teach LoopVectorizer about this would be to introduce a whole new set of composite reduction operations of the form <op>-then-<lop> (e.g. RecurKind::AddThenAnd, RecurKind::MulThenAnd, RecurKind::OrThenAnd, and so on)...and that's just for combining logical and with the known integer reduction ops, so if we want to also support e.g. or, we'd need to double the number of additional recurrence kinds (and the extra logic that comes with them) again. The identity value would be determined from the <op>, and the <lop> would have to be applied when reducing the final vector into a single scalar upon loop exit.

@lebedev.ri is this what you had in mind or is there a better way to do it?

@lebedev.ri Can you please advise whether the above-described way is how we would implement this within the LoopVectorizer?

Sorry, lost track here. I'm not familiar enough with LV to recommend a solution,
but it sounds vaguely reasonable to me. But do you need the whole <op>-then-<lop> generality?
The only reason <op>-then-and is useful is that the and specifies
the effective bit width of the reduction, and if the high bits aren't demanded, arithmetic/logic ops can be losslessly performed at narrower bit widths:
i32 65535 + i32 65535 = i32 131070 = 0x1FFFE, while trunc(i32 65535 to i8) + trunc(i32 65535 to i8) = 0xFF + 0xFF = i8 0xFE (510 wrapped to 8 bits); note how the low 8 bits are the same.
Perhaps the solution should be built around tracking the demanded bit width?
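
Spelling the same point out in IR (an editor's sketch; @wide and @narrow are hypothetical names): when only the low 8 bits of the result are used, the add can be performed in i8 up front and the result is identical.

define i8 @wide(i32 %a, i32 %b) {
  %add = add i32 %a, %b
  %res = trunc i32 %add to i8
  ret i8 %res
}
define i8 @narrow(i32 %a, i32 %b) {
  %a8 = trunc i32 %a to i8
  %b8 = trunc i32 %b to i8
  %add = add i8 %a8, %b8
  ret i8 %add
}
; Both functions return the same value for all inputs: truncation keeps only
; the low bits, and addition never carries information from high bits down
; into low bits.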

If we think about this problem as an issue of determining the demanded bit width, I'm not sure why it would be the job of LV to understand it and treat it in a special way, when InstCombine could have determined it and generated code without the extra widening/truncating instructions. After all, InstCombine is meant to simplify the IR for downstream passes (not LV or other non-canonicalization passes). The one complication is the effect of nsw on that add. I'm not sure whether InstCombine aims to keep that flag and knowingly treats it as more beneficial than making other simplifications. I suspect it doesn't, but it would be good to verify.

syzaara updated this revision to Diff 435589. · Jun 9 2022, 9:19 AM

Fix testcase missing target datalayout.

I agree, it seems more natural to me for InstCombine to do the simplification rather than having LV trace through the demanded bits. I have verified that InstCombine isn't deliberately skipping the transformation in preference to keeping the nsw flag. It just happens that the phi is simplified first and we miss the opportunity to simplify the add into a narrower add.

@spatel Could you please provide some feedback about this change to InstCombine.

In general, making exceptions to transforms will not solve the general problem. I'm not sure if this is enough, but instcombine seems to miss a narrowing transform like this:
https://alive2.llvm.org/ce/z/hRy3rE

We do already have a variation of that fold, so it should be a small enhancement. I can take a shot at that.
A demanded bits solution within instcombine might be better, but I'm not seeing how to make that work in general since we have to create 3 instructions from the pattern.
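
The alive2 proof isn't reproduced in this archive, but based on the IR discussed later in this thread, the missed narrowing fold is presumably of roughly this shape (hypothetical @src/@tgt functions sketched by the editor):

define i32 @src(i8 %x, i8 %y) {
  %zx = zext i8 %x to i32
  %zy = zext i8 %y to i32
  %add = add i32 %zy, %zx
  %masked = and i32 %add, 255
  ret i32 %masked
}
define i32 @tgt(i8 %x, i8 %y) {
  %narrow = add i8 %y, %x
  %r = zext i8 %narrow to i32
  ret i32 %r
}
; The 'and' with 255 only demands the low 8 bits of the sum, so the add can be
; done in i8 and zero-extended afterwards.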

Deoptimizing in InstCombine in favour of LV sounds like an anti-pattern. Do you deoptimize in favour of PGO?

Maybe have an instcombine.rst: we optimize anything we can; there are some limits here; we delegate some optimizations to FOO; we never deoptimize in favour of other passes.

From what I understand, InstCombine already has the narrowing transformation, but there is an issue with the order in which instructions are visited and which optimization runs first.
For example, looking at the debug output with the changes in this patch:

IC: Visiting:   %conv2 = trunc i32 %add to i8
ICE: EvaluateInDifferentType converting expression type to avoid cast:   %conv2 = trunc i32 %add to i8
IC: Replacing   %conv2 = trunc i32 %1 to i8
with   %add = add i8 %a.0, %0

It converts:

%conv = zext i8 %0 to i32
%conv1 = zext i8 %a.0 to i32
%add = add nsw i32 %conv1, %conv
%conv2 = trunc i32 %add to i8

to:

%add = add i8 %a.0, %0

Without this patch, it first does the optimization to fold the cast into the phi when visiting %conv3 = zext i8 %a.0 to i32:

IC: Visiting:   %conv3 = zext i8 %a.0 to i32
  ADD DEFERRED:   %0 = phi i32  
  ADD DEFERRED:   %phi.cast = zext i8 %conv2 to i32
  IC: Replacing   %conv1 = zext i8 %0 to i32
      with   %a.0 = phi i32 [ 0, %entry ], [ %phi.cast, %for.body ]

And we no longer get the narrowing transformation:

%a.0 = phi i32 [ 0, %entry ], [ %phi.cast, %for.body ]
%conv = zext i8 %0 to i32
%add = add nuw nsw i32 %a.0, %conv 
%inc = add i64 %i.0, 1
%phi.cast = and i32 %add, 255

I am trying to disable the phi cast optimization so that InstCombine can do that narrowing transformation instead.

Oh, never mind, please ignore the above comment, I see what you are saying now! We can add a transformation which converts:

%a.0 = phi i32 [ 0, %entry ], [ %phi.cast, %for.body ]
%conv = zext i8 %0 to i32
%add = add nuw nsw i32 %a.0, %conv 
%phi.cast = and i32 %add, 255

Back to:

%a.0 = phi i32 [ 0, %entry ], [ %phi.cast, %for.body ]
%conv = trunc i32 %a.0 to i8
%add = add i8 %0, %conv
%conv2 = zext i8 %add to i32

So we don't need to disable the phi cast optimization, but will still get the i8 add.

spatel added a comment. · Jun 9 2022, 2:04 PM

Right - see if the motivating example looks better with:
afa192cfb604

nikic added a subscriber: nikic. Edited · Jun 9 2022, 2:27 PM

The problem with this patch is that it tries to carve out a special case for a specific use-case, which is something we really don't like in InstCombine...

...however, I think there is a real problem here, and it doesn't have anything to do with reductions or vectorization. What this transform does is convert op(phi(C, X)) into phi(op(C), op(X)), effectively just moving op from one block to another. This is usually profitable, because it reduces the number of paths on which op needs to be computed.

There are some exceptions to this. The first is if the X edge is critical, which is something against which the transform already guards. The second is if X is a loop backedge. In that case we either a) move an instruction from one point of the loop to another, or b) possibly even move an instruction from outside the loop into the loop.

The transform already has a related guard in https://github.com/llvm/llvm-project/blob/70d35fe1257e429266b83025997b400e9f79110e/llvm/lib/Transforms/InstCombine/InstructionCombining.cpp#L1180 which prevents the a) case, which could easily result in an infinite combine loop otherwise. But it does not handle the b) case. For the test case, the transform is actually rooted at %conv3 = zext i8 %a.0 to i32 in for.end (not %conv1 = zext i8 %a.0 to i32 in for.body), for which the reachability check doesn't do anything.

I think the proper fix here would be to adjust that reachability check, to prevent non-profitable transforms involving loop backedges. It should probably be checking reachability from PN->getParent(), not I.getParent().

Edit: Here's a simplified example where we just move a trunc from outside to inside a loop for no benefit: https://llvm.godbolt.org/z/7qdvaa6er
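
A sketch of that kind of case (an editor's illustration of the described problem; the godbolt example itself may differ in details): the trunc below is needed only once, after the loop, but folding it into the %last phi would create a trunc of %val inside the loop body, executed on every iteration.

define i8 @hoisted_trunc(i32* %p, i64 %n) {
entry:
  br label %loop
loop:
  %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
  ; %last carries the previous iteration's load; its only use is the trunc
  ; in %exit, and one incoming value is a constant, so the fold applies.
  %last = phi i32 [ 0, %entry ], [ %val, %loop ]
  %addr = getelementptr inbounds i32, i32* %p, i64 %iv
  %val = load i32, i32* %addr, align 4
  %iv.next = add i64 %iv, 1
  %done = icmp eq i64 %iv.next, %n
  br i1 %done, label %exit, label %loop
exit:
  %t = trunc i32 %last to i8
  ret i8 %t
}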

llvm/test/Transforms/InstCombine/phi-multiple-zext.ll:2

Please use update_test_checks.py...

nikic added a reviewer: nikic. · Jun 9 2022, 2:27 PM

That commit (afa192cfb604) causes breakage for me: one file which used to compile in around a second now hangs (it doesn't complete in many minutes, at least).

To reproduce, download https://martin.st/temp/remap-preproc.c and try to compile it with clang -target x86_64-w64-mingw32 -w -c -O2 remap-preproc.c.

nikic added a comment. · Jun 10 2022, 7:27 AM

I've put up https://reviews.llvm.org/D127499 as an alternative fix.

I reverted that commit. I do have an updated patch that accounts for the problem pattern that caused the infinite loop (blame constant expressions...again). Between that and D127499, I think it's safe to abandon this patch.

Yes, I tested the motivating example with D127499, and it does resolve the issue. I will abandon this patch now. Thank you.

syzaara abandoned this revision. · Jun 13 2022, 7:03 AM