This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] Fold ~A - Min/Max(~A, O) -> Max/Min(A, ~O) - A
ClosedPublic

Authored by dmgreen on Sep 17 2018, 10:00 AM.

Details

Summary

This is an attempt to get out of a local-minimum that instcombine currently gets stuck in. We essentially combine two optimisations at once, ~a - ~b = b-a and min(~a, ~b) = ~max(a, b), only doing the transform if the result is at least neutral. This involves using IsFreeToInvert, which has been expanded a little to include selects that can be easily inverted.
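For illustration (a hand-written sketch of the fold, not one of the patch's actual tests; @src and @tgt are placeholder names), the ~A - Min(~A, ~B) case:

define i8 @src(i8 %a, i8 %b) {
  %na = xor i8 %a, -1
  %nb = xor i8 %b, -1
  %cmp = icmp ult i8 %na, %nb
  %min = select i1 %cmp, i8 %na, i8 %nb    ; umin(~a, ~b)
  %sub = sub i8 %na, %min                  ; ~a - umin(~a, ~b)
  ret i8 %sub
}

becomes

define i8 @tgt(i8 %a, i8 %b) {
  %cmp = icmp ugt i8 %a, %b
  %max = select i1 %cmp, i8 %a, i8 %b      ; umax(a, b)
  %sub = sub i8 %max, %a                   ; umax(a, b) - a; both nots and the min are gone
  ret i8 %sub
}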

This is trying to fix PR35875, using the ideas from Sanjay. It is a large improvement to one of our rgb to cmy kernels.

Diff Detail

Repository
rL LLVM

Event Timeline

dmgreen created this revision. Sep 17 2018, 10:00 AM
dmgreen added a comment. Edited Sep 18 2018, 10:32 AM

Thanks for pointing to D52070. I think I saw that (it gave me the idea for this), but hadn't realised it had come back out.

I'll rebase this once that's in.

dmgreen updated this revision to Diff 166727. Sep 24 2018, 12:01 PM

Turns out there wasn't any conflict, but I've tried to clean this up a little and add a few more tests.

I was hoping to find a more general solution for min/max with nots, but I'm not seeing it, so just a few nits in the inline comments.
Craig's been fighting infinite loops in this area, so let's see if he has any comments on the safety constraints.

lib/Transforms/InstCombine/InstCombineAddSub.cpp
1670 ↗(On Diff #166727)

typo: invertible

1683–1684 ↗(On Diff #166727)

The use constraints deserve a code comment.

lib/Transforms/InstCombine/InstCombineInternal.h
181 ↗(On Diff #166727)

typo: invertible

181–183 ↗(On Diff #165778)

This is similar to what I was imagining in a comment in D51964, but I'm still not sure if we need to special-case min/max patterns for the extra-uses.

The LHS->hasNUsesOrMore(3) hack in the caller might be enough...

I was hoping to find a more general solution for min/max with nots, but I'm not seeing it, so just a few nits in the inline comments.

I have another, I'm afraid, over in D52508.

Craig's been fighting infinite loops in this area, so let's see if he has any comments on the safety constraints.

Any suggestions on finding these? I've run the test suite and a bootstrap; I presume that wouldn't catch much?

lib/Transforms/InstCombine/InstCombineInternal.h
181–183 ↗(On Diff #165778)

The LHS->hasNUsesOrMore(3) isn't _meant_ as a hack. Unless you mean it's ugly? It should have 2 uses from the min/max, so 3+ means we wouldn't invert all uses.

The motivating case here (umin3_not_all_ops_extra_uses_invert_subs) is something like min(min(not(a), not(b)), not(c)), but with subs in there too. So there are two mins: one we are folding from, and another we are checking is freely invertible. That's what this is trying to catch.
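In IR, that shape is roughly the following (a hand-written sketch with made-up value names, not the exact test):

define void @umin3_sketch(i8 %a, i8 %b, i8 %c, i8* %p0, i8* %p1, i8* %p2) {
  %na = xor i8 %a, -1
  %nb = xor i8 %b, -1
  %nc = xor i8 %c, -1
  %cmp1 = icmp ult i8 %na, %nb
  %min1 = select i1 %cmp1, i8 %na, i8 %nb    ; inner umin, checked for free invertibility
  %cmp2 = icmp ult i8 %min1, %nc
  %min2 = select i1 %cmp2, i8 %min1, i8 %nc  ; outer umin, the one we fold from
  %s0 = sub i8 %na, %min2
  %s1 = sub i8 %nb, %min2
  %s2 = sub i8 %nc, %min2
  store i8 %s0, i8* %p0
  store i8 %s1, i8* %p1
  store i8 %s2, i8* %p2
  ret void
}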

I was reluctant to make IsFreeToInvert recursive, but that would make it more powerful and remove the need to special-case m_Not here.

dmgreen updated this revision to Diff 166946. Sep 25 2018, 10:40 AM

Added a comment and corrected spelling

spatel added inline comments. Sep 25 2018, 10:51 AM
lib/Transforms/InstCombine/InstCombineInternal.h
181–183 ↗(On Diff #165778)

Yeah - the hack comment was really about the ugliness, and the root cause of that is not having intrinsics for integer min/max...but it'd still be easier reading if we had hasNUsesOrLess()?

spatel accepted this revision. Sep 28 2018, 7:12 AM

Craig's been fighting infinite loops in this area, so let's see if he has any comments on the safety constraints.

Any suggestions on finding these? I've run the test suite and a bootstrap; I presume that wouldn't catch much?

That's a good first try, but the earlier cases escaped to the wild...though not for long. :)
We end up being able to reduce the problems to relatively short IR tests which seem obvious in hindsight, but I don't know how to spot them in advance any better than what we've done.
I don't have any other feedback, so LGTM.

This revision is now accepted and ready to land. Sep 28 2018, 7:12 AM
dmgreen added a comment. Edited Oct 2 2018, 2:49 AM

Thanks

Let me know if you see anything funny from this patch.

This revision was automatically updated to reflect the committed changes.

I'm seeing a regression on Goldmont and Silvermont CPUs in an RGB to CMYK conversion benchmark in 32-bit mode. What I've observed is that the 3 subtracts in the code all now have the same LHS register. X86 destroys the LHS of a subtract instruction, so we have to make copies before the subtracts. We're in 32-bit mode, so our 8-bit register choices are %al, %bl, %cl, %dl, %ah, %bh, %ch, %dh. Silvermont and Goldmont have bad partial register handling for writing high and low 8-bit registers. Unlike Sandy Bridge, Haswell, and Skylake, the high and low registers aren't renamed independently. I tried playing around with promoting everything to 32 bits to avoid the partial registers, but that was actually worse somehow.

@dmgreen what target are you using?

dmgreen added subscribers: craig.topper, spatel. Edited Oct 11 2018, 3:21 PM

Oh, no. That's not what I wanted to hear. I presume we are looking at the same bit of code!

This was intended to fix things on the rgb kernel, and did pretty well on our tests. I think it was a 45% increase on some CPUs, such as the Cortex-M0+. You can probably guess this was on Arm, in that case a Thumb1 target. It sounds like we were hitting pretty much the opposite conditions, with the old code doing badly around the selects of nots in our case. I had presumed the smaller IR would have produced better code for everyone.

We do end up converting the whole thing to i32 because i8 is not a legal type for our registers. We will do things with i8 for vectors on, say, AArch64 or v8-A, but the Cortex-M microcontrollers won't have that. I presume that Goldmont is in the same boat, not having vectorisation? (But, perhaps, supporting the instructions?) The vectorised code looked pretty good (especially on AArch64; I tried Skylake too, which was better, but had a lot of shuffling going on).

Um, in terms of fixes, I guess our alternatives are to make this instcombine dependent on the target, or to try to undo it somehow in the backend. My understanding is that the first option is not generally done in instcombine. Can someone give me some history there? Was it something that was decided as a rule, or just something that never came up? (I guess, also in this case, not supporting this fold because it happens to cause the extra subs in a random test isn't a great reason not to do it.)

The other option would be something like: if there are multiple subs with the same (LHS?) operand, and that whole thing is invertible (which may be easier said than done), we invert it all back using ~a - ~b = b - a (roughly as sketched below). I'm not sure how easy that is exactly. This case is a bit of a tangle of multiple uses.
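A hand-written sketch with made-up value names, assuming ~max is already needed elsewhere (as it is in the kernel):

  %max  = select i1 %c, i8 %x, i8 %y   ; umax(x, y)
  %nmax = xor i8 %max, -1              ; ~max, also stored out
  %s0   = sub i8 %max, %x              ; shared LHS %max
  %s1   = sub i8 %max, %y

would be turned back, via ~a - ~b = b - a, into

  %nx = xor i8 %x, -1
  %ny = xor i8 %y, -1
  %s0 = sub i8 %nx, %nmax              ; shared operand now on the RHS
  %s1 = sub i8 %ny, %nmax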
Dave

Goldmont does have SSE4.2, so it should support 128-bit vectors. Not sure why we didn't vectorize. Here's the IR for the loop that changed:

for.body8.us:                                     ; preds = %for.body8.us, %for.body.us
  %EritePtr.0134.us = phi i8* [ %call.i7, %for.body.us ], [ %incdec.ptr58.us, %for.body8.us ]
  %ReadPtr.0133.us = phi i8* [ %phi.call3.i, %for.body.us ], [ %incdec.ptr10.us, %for.body8.us ]
  %i.0132.us = phi i32 [ 0, %for.body.us ], [ %inc.us, %for.body8.us ]
  %incdec.ptr.us = getelementptr inbounds i8, i8* %ReadPtr.0133.us, i32 1
  %44 = load i8, i8* %ReadPtr.0133.us, align 1, !tbaa !14
  %incdec.ptr9.us = getelementptr inbounds i8, i8* %ReadPtr.0133.us, i32 2
  %45 = load i8, i8* %incdec.ptr.us, align 1, !tbaa !14
  %incdec.ptr10.us = getelementptr inbounds i8, i8* %ReadPtr.0133.us, i32 3
  %46 = load i8, i8* %incdec.ptr9.us, align 1, !tbaa !14
  %47 = icmp ugt i8 %44, %46
  %48 = select i1 %47, i8 %44, i8 %46
  %49 = icmp ugt i8 %48, %45
  %50 = select i1 %49, i8 %48, i8 %45
  %51 = xor i8 %50, -1
  %sub45.us = sub i8 %50, %44
  %sub49.us = sub i8 %50, %45
  %sub53.us = sub i8 %50, %46
  %incdec.ptr55.us = getelementptr inbounds i8, i8* %EritePtr.0134.us, i32 1
  store i8 %sub45.us, i8* %EritePtr.0134.us, align 1, !tbaa !14
  %incdec.ptr56.us = getelementptr inbounds i8, i8* %EritePtr.0134.us, i32 2
  store i8 %sub49.us, i8* %incdec.ptr55.us, align 1, !tbaa !14
  %incdec.ptr57.us = getelementptr inbounds i8, i8* %EritePtr.0134.us, i32 3
  store i8 %sub53.us, i8* %incdec.ptr56.us, align 1, !tbaa !14
  %incdec.ptr58.us = getelementptr inbounds i8, i8* %EritePtr.0134.us, i32 4
  store i8 %51, i8* %incdec.ptr57.us, align 1, !tbaa !14
  %inc.us = add nuw i32 %i.0132.us, 1
  %exitcond138 = icmp eq i32 %inc.us, %mul
  br i1 %exitcond138, label %for.cond5.for.end_crit_edge.us, label %for.body8.us

I think the fact that we're pretty register-constrained is hurting my attempt at promoting to 32 bits. We only had 8 32-bit registers to start with. We lost one to the load pointer, one to the store pointer, one to the loop index, and one to the stack pointer. One seems to be used by a loop around this one. So that left us 3 registers inside this loop to do the math with.

Theoretically, if I could turn the subtracts into add(negate(RHS), LHS), there should be less register pressure: the RHS of the subtract isn't used again, so the negate can happen without a copy, and then we can put it on the LHS of an add and have no issue overwriting it.
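For the first subtract in the loop above, that would be roughly (just a sketch of the idea; %neg is a made-up name, %44 and %50 are from the IR above):

  %neg = sub i8 0, %44          ; %44 (the RHS) has no later uses, so clobbering it is fine
  %sub45.us = add i8 %neg, %50  ; add commutes, so %50 (the shared LHS) needn't be copied first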

Yeah, that looks like similar IR to what I was looking at. The vectorised version on Skylake (https://godbolt.org/z/RBS2Os) has a lot of shuffling; perhaps that's deemed unprofitable on Goldmont?

I can agree that 8 registers are hard to deal with. Can you explain the "promoting everything to 32-bits"? Do you mean essentially zexts/truncs around the whole max/max/xor/sub block, roughly as sketched below? I gave that a try and the subs still seemed to be using bl's. (It uses cmp's, not branches, though, which looks better to my untrained eyes.)
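A rough sketch of the promotion I tried for the first output byte (the %...32 names are made up; %44/%45/%46 are the loads from the IR above):

  %a32 = zext i8 %44 to i32
  %b32 = zext i8 %45 to i32
  %c32 = zext i8 %46 to i32
  %cmp1 = icmp ugt i32 %a32, %c32
  %max1 = select i1 %cmp1, i32 %a32, i32 %c32
  %cmp2 = icmp ugt i32 %max1, %b32
  %max2 = select i1 %cmp2, i32 %max1, i32 %b32
  %sub32 = sub i32 %max2, %a32
  %sub45.us = trunc i32 %sub32 to i8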

Um, in terms of fixes, I guess our alternatives are to make this instcombine dependent on the target, or to try to undo it somehow in the backend. My understanding is that the first option is not generally done in instcombine. Can someone give me some history there? Was it something that was decided as a rule, or just something that never came up? (I guess, also in this case, not supporting this fold because it happens to cause the extra subs in a random test isn't a great reason not to do it.)

Instcombine is an early IR canonicalization pass. Its purpose is to reduce logically equivalent IR sequences to some common form (usually minimal instruction count). Often, the canonical/minimal IR form also happens to be the maximal perf form for a target, but there's no guarantee on that. It's the backend's job to transform the code for better perf based on target capabilities. So as long as this patch reduced the IR correctly, I don't think it's at fault (although we sometimes temporarily revert to avoid regressions while we fix the later passes). As always, consider an alternate scenario where the benchmark source was already in the logically equivalent form that this patch created. The perf problem already exists for that hypothetical benchmark independent of this patch, so we have to deal with the perf problem some other way.

Looking back at https://bugs.llvm.org/show_bug.cgi?id=35717#c14 ... is this the source for the loop that is causing problems:

void rgb_to_cmyk(char * restrict A, char * restrict B, unsigned I) {
    for (int i = 0; i < I; i++) {
      char xc = *A++;
      char xm = *A++;
      char xy = *A++;

      xc = 255-xc;
      xm = 255-xm;
      xy = 255-xy;

      char xk;
      if (xc < xm)
        xk = xc < xy ? xc : xy;
      else
        xk = xm < xy ? xm : xy;

      xc = xc - xk;
      xm = xm - xk;
      xy = xy - xk;

      *B++ = xk;
      *B++ = xc;
      *B++ = xm;
      *B++ = xy;
    }
}

That vectorizes for an AVX2 target, but not AVX1 or earlier, so we should see if the vectorizer cost model is behaving as expected before dealing with the backend problems?