This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] Limit folding of cast into PHI
Abandoned · Public

Authored by syzaara on Apr 8 2022, 10:16 AM.

Details

Summary

InstCombine folds a cast of a phi into the phi. Sometimes this isn't beneficial, because another user of the phi may also be a cast, such as a zext that feeds an add (or another operation that vectorization would consider for a reduction) followed by a truncate. Folding the cast into the phi can break this zext + add + truncate pattern, which otherwise could have been simplified into a narrower add and considered for loop vectorization.
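
For illustration, a minimal sketch of the pattern described above (hypothetical @before/@after functions written by the editor, based on the IR shown later in this review, not taken from the patch itself):

; zext + add + truncate on an i8 accumulator...
define i8 @before(i8 %acc, i8 %x) {
  %conv  = zext i8 %x to i32
  %conv1 = zext i8 %acc to i32
  %add   = add i32 %conv1, %conv
  %res   = trunc i32 %add to i8
  ret i8 %res
}
; ...which is equivalent to a single narrower add.
define i8 @after(i8 %acc, i8 %x) {
  %add = add i8 %acc, %x
  ret i8 %add
}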

Diff Detail

Event Timeline

syzaara created this revision. · Apr 8 2022, 10:16 AM
Herald added a project: Restricted Project. · View Herald Transcript · Apr 8 2022, 10:16 AM
Herald added a subscriber: hiraditya. · View Herald Transcript
syzaara requested review of this revision. · Apr 8 2022, 10:16 AM
Herald added a project: Restricted Project. · View Herald Transcript · Apr 8 2022, 10:16 AM

This seems like a vectorizer bug.

The reason it doesn't get vectorized is that the loop vectorizer sees a truncating and instruction in the def-use chain of the reduction phi. The IR that the vectorizer sees looks like this:

define internal fastcc void @do_one() unnamed_addr #1 {
entry:
  tail call void (i8*, ...) @obfuscate(i8* noundef bitcast ([8192 x i16]* @x to i8*)) #3
  br label %for.body

for.body:                                         ; preds = %entry, %for.body
  %a.010 = phi i32 [ 0, %entry ], [ %phi.cast, %for.body ]
  %i.09 = phi i64 [ 0, %entry ], [ %inc, %for.body ]
  %arrayidx = getelementptr inbounds [8192 x i16], [8192 x i16]* @x, i64 0, i64 %i.09
  %0 = load i16, i16* %arrayidx, align 2, !tbaa !6
  %conv = zext i16 %0 to i32
  %add = add nuw nsw i32 %a.010, %conv
  %inc = add nuw nsw i64 %i.09, 1
  %phi.cast = and i32 %add, 65535
  %exitcond.not = icmp eq i64 %inc, 8192
  br i1 %exitcond.not, label %for.end, label %for.body, !llvm.loop !10

for.end:                                          ; preds = %for.body
  %phi.cast.lcssa = phi i32 [ %phi.cast, %for.body ]
  tail call void (i8*, ...) @obfuscate(i8* noundef bitcast ([8192 x i16]* @x to i8*), i32 noundef signext %phi.cast.lcssa) #3
  ret void
}

The %phi.cast = and i32 %add, 65535 instruction is not an ADD so the %a.010 phi doesn't qualify as an add reduction.

LV: Not vectorizing: Found an unidentified PHI %a.010 = phi i32 [ 0, %entry ], [ %phi.cast, %for.body ]
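
For contrast, here is a rough sketch of a shape LV would recognize (an editor's illustration, not the output of this or any particular patch): if the accumulator were narrowed to i16, the truncating and disappears and the phi feeds a plain integer add reduction, with a zext of the final value needed only in for.end.

for.body:                                         ; preds = %entry, %for.body
  %a.010 = phi i16 [ 0, %entry ], [ %add, %for.body ]
  %i.09 = phi i64 [ 0, %entry ], [ %inc, %for.body ]
  %arrayidx = getelementptr inbounds [8192 x i16], [8192 x i16]* @x, i64 0, i64 %i.09
  %0 = load i16, i16* %arrayidx, align 2
  %add = add i16 %a.010, %0
  %inc = add nuw nsw i64 %i.09, 1
  %exitcond.not = icmp eq i64 %inc, 8192
  br i1 %exitcond.not, label %for.end, label %for.body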

Yep, I agree. The vectorizer just needs to be able to understand that such an and effectively truncates the bit width.

bmahjour added a comment. Edited · Apr 11 2022, 12:13 PM

The cleanest way I can think of to teach LoopVectorizer about this would be to introduce a whole new set of composite reduction operations of the form <op>-then-<lop> (e.g. RecurKind::AddThenAnd, RecurKind::MulThenAnd, RecurKind::OrThenAnd, and so on)...and that's just for combining logical and with the known integer reduction ops, so if we want to also support e.g. or, we'd need to double the number of additional recurrence kinds (and the extra logic that comes with them) again. The identity value would be determined from the <op>, and the <lop> would have to be applied when reducing the final vector into a single scalar upon loop exit.

@lebedev.ri is this what you had in mind or is there a better way to do it?

@lebedev.ri Can you please advise whether the above-described way is how we would implement this within the LoopVectorizer?

Sorry, lost track here. I'm not familiar enough with LV to recommend a solution,
but it sounds vaguely reasonable to me. But do you need the whole <op>-then-<lop> generality?
The only reason <op>-then-and is useful is that the and specifies
the effective bit width of the reduction, and if the high bits aren't demanded, arithmetic/logic ops can be losslessly performed at narrower bit widths:
i32 65535 + i32 65535 = i32 131070 = 0x1FFFE, while trunc(i32 65535 to i8) + trunc(i32 65535 to i8) = 0xFF + 0xFF = i8 0xFE (510 wrapped to 8 bits); note how the low 8 bits are the same.
Perhaps the solution should be built around tracking the demanded bit width?
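
Spelling the same point out in IR (an editor's sketch; @wide and @narrow are hypothetical names): when only the low 8 bits of the result are used, the add can be performed in i8 up front and the result is identical.

define i8 @wide(i32 %a, i32 %b) {
  %add = add i32 %a, %b
  %res = trunc i32 %add to i8
  ret i8 %res
}
define i8 @narrow(i32 %a, i32 %b) {
  %a8 = trunc i32 %a to i8
  %b8 = trunc i32 %b to i8
  %add = add i8 %a8, %b8
  ret i8 %add
}
; Both functions return the same value for all inputs: truncation keeps only
; the low bits, and addition never carries information from high bits down
; into low bits.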

If we think about this problem as an issue of determining the demanded bit width, I'm not sure why it would be the job of LV to understand it and treat it in a special way, when InstCombine could have determined it and generated code without the extra widening/truncating instructions. After all, InstCombine is meant to simplify the IR for downstream passes (not LV or other non-canonicalization passes). The one complication is the effect of nsw on that add. I'm not sure whether InstCombine aims to keep that flag and knowingly treats it as more beneficial than making other simplifications. I suspect it doesn't, but it would be good to verify.

syzaara updated this revision to Diff 435589. · Jun 9 2022, 9:19 AM

Fix testcase missing target datalayout.

I agree, it seems more natural to me for InstCombine to do the simplification rather than having LV trace through the demanded bits. I have verified that InstCombine isn't deliberately skipping the transformation in preference to keeping the nsw flag. It just happens that the phi is simplified first and we miss the opportunity to simplify the add into a narrower add.

@spatel Could you please provide some feedback about this change to InstCombine.

In general, making exceptions to transforms will not solve the general problem. I'm not sure if this is enough, but instcombine seems to miss a narrowing transform like this:
https://alive2.llvm.org/ce/z/hRy3rE

We do already have a variation of that fold, so it should be a small enhancement. I can take a shot at that.
A demanded bits solution within instcombine might be better, but I'm not seeing how to make that work in general since we have to create 3 instructions from the pattern.
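
The alive2 proof isn't reproduced in this archive, but based on the IR discussed later in this thread, the missed narrowing fold is presumably of roughly this shape (hypothetical @src/@tgt functions sketched by the editor):

define i32 @src(i8 %x, i8 %y) {
  %zx = zext i8 %x to i32
  %zy = zext i8 %y to i32
  %add = add i32 %zy, %zx
  %masked = and i32 %add, 255
  ret i32 %masked
}
define i32 @tgt(i8 %x, i8 %y) {
  %narrow = add i8 %y, %x
  %r = zext i8 %narrow to i32
  ret i32 %r
}
; The 'and' with 255 only demands the low 8 bits of the sum, so the add can be
; done in i8 and zero-extended afterwards.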

Deoptimizing in InstCombine in favour of LV sounds like an anti-pattern. Do you deoptimize in favour of PGO?

Maybe have an instcombine.rst: we optimize anything we can; there are some limits here; we delegate some optimizations to FOO; we never deoptimize in favour of other passes.

From what I understand, InstCombine already has the narrowing transformation, but there is an issue with the order in which instructions are visited and which optimization runs first.
For example, looking at the debug output with the changes in this patch:

IC: Visiting:   %conv2 = trunc i32 %add to i8
ICE: EvaluateInDifferentType converting expression type to avoid cast:   %conv2 = trunc i32 %add to i8
IC: Replacing   %conv2 = trunc i32 %1 to i8
with   %add = add i8 %a.0, %0

It converts:

%conv = zext i8 %0 to i32
%conv1 = zext i8 %a.0 to i32
%add = add nsw i32 %conv1, %conv
%conv2 = trunc i32 %add to i8

to:

%add = add i8 %a.0, %0

Without this patch, it first does the optimization to fold the cast into the phi when visiting %conv3 = zext i8 %a.0 to i32:

IC: Visiting:   %conv3 = zext i8 %a.0 to i32
  ADD DEFERRED:   %0 = phi i32  
  ADD DEFERRED:   %phi.cast = zext i8 %conv2 to i32
  IC: Replacing   %conv1 = zext i8 %0 to i32
      with   %a.0 = phi i32 [ 0, %entry ], [ %phi.cast, %for.body ]

And we no longer get the narrowing transformation:

%a.0 = phi i32 [ 0, %entry ], [ %phi.cast, %for.body ]
%conv = zext i8 %0 to i32
%add = add nuw nsw i32 %a.0, %conv 
%inc = add i64 %i.0, 1
%phi.cast = and i32 %add, 255

I am trying to disable the phi cast optimization so that InstCombine can do that narrowing transformation instead.

Oh, never mind, please ignore the above comment, I see what you are saying now! We can add a transformation which converts:

%a.0 = phi i32 [ 0, %entry ], [ %phi.cast, %for.body ]
%conv = zext i8 %0 to i32
%add = add nuw nsw i32 %a.0, %conv 
%phi.cast = and i32 %add, 255

Back to:

%a.0 = phi i32 [ 0, %entry ], [ %phi.cast, %for.body ]
%conv = trunc i32 %a.0 to i8
%add = add i8 %0, %conv
%conv2 = zext i8 %add to i32

So we don't need to disable the phi cast optimization, but will still get the i8 add.

spatel added a comment. · Jun 9 2022, 2:04 PM

Right - see if the motivating example looks better with:
afa192cfb604

nikic added a subscriber: nikic. Edited · Jun 9 2022, 2:27 PM

The problem with this patch is that it tries to carve out a special case for a specific use-case, which is something we really don't like in InstCombine...

...however, I think there is a real problem here, and it doesn't have anything to do with reductions or vectorization. What this transform does is convert op(phi(C, X)) into phi(op(C), op(X)), effectively just moving op from one block to another. This is usually profitable, because it reduces the number of paths on which op needs to be computed.

There are some exceptions to this. The first is if the X edge is critical, which is something against which the transform already guards. The second is if X is a loop backedge. In that case we either a) move an instruction from one point of the loop to another, or b) possibly even move an instruction from outside the loop into the loop.

The transform already has a related guard in https://github.com/llvm/llvm-project/blob/70d35fe1257e429266b83025997b400e9f79110e/llvm/lib/Transforms/InstCombine/InstructionCombining.cpp#L1180 which prevents the a) case, which could easily result in an infinite combine loop otherwise. But it does not handle the b) case. For the test case, the transform is actually rooted at %conv3 = zext i8 %a.0 to i32 in for.end (not %conv1 = zext i8 %a.0 to i32 in for.body), for which the reachability check doesn't do anything.

I think the proper fix here would be to adjust that reachability check, to prevent non-profitable transforms involving loop backedges. It should probably be checking reachability from PN->getParent(), not I.getParent().

Edit: Here's a simplified example where we just move a trunc from outside to inside a loop for no benefit: https://llvm.godbolt.org/z/7qdvaa6er
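
A sketch of that kind of case (an editor's illustration of the described problem; the godbolt example itself may differ in details): the trunc below is needed only once, after the loop, but folding it into the %last phi would create a trunc of %val inside the loop body, executed on every iteration.

define i8 @hoisted_trunc(i32* %p, i64 %n) {
entry:
  br label %loop
loop:
  %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
  ; %last carries the previous iteration's load; its only use is the trunc
  ; in %exit, and one incoming value is a constant, so the fold applies.
  %last = phi i32 [ 0, %entry ], [ %val, %loop ]
  %addr = getelementptr inbounds i32, i32* %p, i64 %iv
  %val = load i32, i32* %addr, align 4
  %iv.next = add i64 %iv, 1
  %done = icmp eq i64 %iv.next, %n
  br i1 %done, label %exit, label %loop
exit:
  %t = trunc i32 %last to i8
  ret i8 %t
}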

llvm/test/Transforms/InstCombine/phi-multiple-zext.ll:2

Please use update_test_checks.py...

nikic added a reviewer: nikic. · Jun 9 2022, 2:27 PM

That commit (afa192cfb604) causes breakage for me: one file which used to compile in around a second now hangs (it doesn't complete in many minutes, at least).

To reproduce, download https://martin.st/temp/remap-preproc.c and try to compile it with clang -target x86_64-w64-mingw32 -w -c -O2 remap-preproc.c.

nikic added a comment. · Jun 10 2022, 7:27 AM

I've put up https://reviews.llvm.org/D127499 as an alternative fix.

I reverted that commit. I do have an updated patch that accounts for the problem pattern that caused the infinite loop (blame constant expressions...again). Between that and D127499, I think it's safe to abandon this patch.

Yes, I tested the motivating example with D127499, and it does resolve the issue. I will abandon this patch now. Thank you.

syzaara abandoned this revision. · Jun 13 2022, 7:03 AM