This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] try to shrink bitwise logic with phi operand
Needs Review (Public)

Authored by spatel on Jan 12 2017, 12:52 PM.

Details

Summary

The test case is based on an example from:
http://www.gdcvault.com/play/1023026/Taming-the-Jaguar-x86-Optimization

bool ArraySearch(int count, uint32_t needle, uint32_t haystack[]) {
  bool found = false;
  for (int i = 0; i < count; ++i)
    found |= needle == haystack[i];
  return found;
}

...so I think the pattern could show up in any "all_of" or "any_of" style of loop. I checked that the loop vectorizer is ok with the change in this case. x86 scalar and vector codegen looks fine with it too.

I'm hoping that D27933 will be approved so we don't have to check the commuted variant (unary cast op is less complex than phi, so it goes to the right side).
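For reference, here is a hand-written sketch (not taken from the patch itself; names are illustrative) of the kind of narrowing this aims for, on the i8 form of the reduction:

```llvm
; before: the reduction is carried in i8
loop:
  %found = phi i8 [ 0, %entry ], [ %or, %loop ]
  %cmp = icmp eq i32 %elt, %needle
  %zext = zext i1 %cmp to i8
  %or = or i8 %zext, %found

; after: the phi and the logic op are shrunk to i1, and the zext
; sinks to the use outside the loop
loop:
  %found = phi i1 [ false, %entry ], [ %or.narrow, %loop ]
  %cmp = icmp eq i32 %elt, %needle
  %or.narrow = or i1 %cmp, %found
  ...
exit:
  %result = zext i1 %or.narrow to i8
```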

Diff Detail

Event Timeline

spatel updated this revision to Diff 84167. Jan 12 2017, 12:52 PM
spatel retitled this revision from to [InstCombine] try to shrink bitwise logic with phi operand.
spatel updated this object.
spatel added reviewers: efriedma, majnemer, hfinkel.
spatel added subscribers: llvm-commits, RKSimon.
efriedma edited edge metadata. Jan 16 2017, 2:56 PM

Would it be possible to come up with a more general optimization? Making integer operations in loops narrower can be useful (for vectorization etc.), but special-casing a boolean "or" reduction doesn't seem like something that will trigger frequently enough to matter.


My laziness of only including one test in the patch misled you. :)
Although 'any_of' and 'all_of' are my motivating cases, this patch is neither limited to 'or' nor to bools. I'll update the tests to show this.
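To make that concrete, here is one hypothetical variant (not one of the patch's actual tests; the function name is made up): the same reduction shape using '&' on a uint8_t accumulator instead of '|' on a bool, showing the pattern is neither 'or'- nor bool-specific.

```c
#include <stdint.h>

/* "all_of"-style reduction: '&' with an 8-bit accumulator.
   The loop body is 'all &= (needle == haystack[i])', the same
   logic-op-with-phi-operand shape as the 'or' example. */
uint8_t all_match(int count, uint32_t needle, uint32_t haystack[]) {
    uint8_t all = 1;
    for (int i = 0; i < count; ++i)
        all &= (uint8_t)(needle == haystack[i]);
    return all;
}
```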

spatel updated this revision to Diff 84777.Jan 17 2017, 4:32 PM

Patch updated:
Added more tests to show that this transform works with any bitwise logic op and with various data types.

Err, sorry, it looks like my comment wasn't really clear about what I was thinking about...

The optimization sort of has two weaknesses in its current state: one, the optimization doesn't work if the initial value isn't a constant; two, it only works if there's a single logical operation per loop iteration. The first could probably be fixed using known bits; I'm more worried about the second one... it's probably not something we would deal with in instcombine. Maybe that's not important.


For the first, yes, I considered adding phis in demanded bits simplify, but that seemed like overkill for the cases I'm seeing, so I thought a more specific solution was the way to go. For the second, if we have a chain of logic ops, then they should all shrink once we can shrink the first link in that chain. Ie, we already have very similar transforms in foldLogicCastConstant() and shrinkBitwiseLogic(). If you have an example of the pattern you're imagining, we can certainly try to see if the current group of shrinking transforms will handle it?
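For concreteness, one shape such a chain can take is a manually two-way-unrolled any_of, so each iteration has two logic ops feeding into each other (this sketch is illustrative; the function name is made up):

```c
#include <stdbool.h>
#include <stdint.h>

/* Two logic ops per iteration, the second consuming the result of
   the first -- mirroring a 2x-unrolled any_of loop. */
bool any_of_pairs(int count, uint32_t needle, uint32_t haystack[]) {
    bool found = false;
    int i = 0;
    for (; i + 1 < count; i += 2) {
        found |= needle == haystack[i];
        found |= needle == haystack[i + 1];
    }
    if (i < count)                    /* odd remainder element */
        found |= needle == haystack[i];
    return found;
}
```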

If you unroll your example loop once, it no longer gets recognized.

Another kind of similar pattern:

int array_max(int n, char *p) {
  int result = 0;
  for (int i = 0; i < n; ++i)
    result = result > p[i] ? result : p[i];
  return result;
}


Passing -funroll-loops on one of my original loops yields something monstrous. There's no way to solve a case like that without enhancing SimplifyDemandedUseBits()? Ie, we have to chase operands until we 'complete the circle' from incoming phi to logic op. And even that won't work if the unrolling is excessive enough to trip the recursive depth check in SimplifyDemandedBits. I'll abandon this patch if you think the more general solution is the way to go.

for.body:
  %indvars.iv = phi i64 [ %indvars.iv.unr, %for.body.preheader.new ], [ %indvars.iv.next.3, %for.body ]
  %all.010 = phi i8 [ %all.010.unr, %for.body.preheader.new ], [ %and.3, %for.body ]
  %arrayidx = getelementptr inbounds i32, i32* %haystack, i64 %indvars.iv
  %5 = load i32, i32* %arrayidx, align 4
  %cmp1 = icmp eq i32 %5, %needle
  %conv = zext i1 %cmp1 to i8
  %and = and i8 %conv, %all.010
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %arrayidx.1 = getelementptr inbounds i32, i32* %haystack, i64 %indvars.iv.next
  %6 = load i32, i32* %arrayidx.1, align 4
  %cmp1.1 = icmp eq i32 %6, %needle
  %conv.1 = zext i1 %cmp1.1 to i8
  %and.1 = and i8 %conv.1, %and
  %indvars.iv.next.1 = add nsw i64 %indvars.iv, 2
  %arrayidx.2 = getelementptr inbounds i32, i32* %haystack, i64 %indvars.iv.next.1
  %7 = load i32, i32* %arrayidx.2, align 4
  %cmp1.2 = icmp eq i32 %7, %needle
  %conv.2 = zext i1 %cmp1.2 to i8
  %and.2 = and i8 %conv.2, %and.1
  %indvars.iv.next.2 = add nsw i64 %indvars.iv, 3
  %arrayidx.3 = getelementptr inbounds i32, i32* %haystack, i64 %indvars.iv.next.2
  %8 = load i32, i32* %arrayidx.3, align 4
  %cmp1.3 = icmp eq i32 %8, %needle
  %conv.3 = zext i1 %cmp1.3 to i8
  %and.3 = and i8 %conv.3, %and.2
  %indvars.iv.next.3 = add nsw i64 %indvars.iv, 4
  %exitcond.3 = icmp eq i64 %indvars.iv.next.3, %wide.trip.count
  br i1 %exitcond.3, label %for.cond.cleanup.loopexit.unr-lcssa, label %for.body

int array_max(int n, char *p) {
  int result = 0;
  for (int i = 0; i < n; ++i)
    result = result > p[i] ? result : p[i];
  return result;
}
This case falls into the "don't break min/max" hole that thwarts all kinds of optimization...sigh. Ie, we have something like this in at least 5 places in InstCombine:

// If this is a select as part of a min/max pattern, don't simplify any
// further in case we break the structure.

Also, there's no bitwise logic here. We'd need to enhance select folding to catch it separately from this patch, or this is a good reason to pursue a more general SimplifyDemandedBits solution.