Page MenuHomePhabricator

[CGP] form usub with overflow from sub+icmp
ClosedPublic

Authored by spatel on Feb 5 2019, 2:27 PM.

Details

Summary

The motivating x86 cases for forming the intrinsic are shown in PR31754 and PR40487:
https://bugs.llvm.org/show_bug.cgi?id=31754
https://bugs.llvm.org/show_bug.cgi?id=40487
...and those are shown in the IR test file and x86 codegen file.

Matching the usubo pattern is harder than uaddo because we have 2 independent values rather than a def-use.
I replicated the codegen tests for AArch64 to show that forming usubo should be generally good (and that lines up with the existing uaddo sibling transform).

There's a potential regression seen in the AMDGPU test file when trying to form a SAD op though, so I added a hack to try to avoid that.
If I'm seeing the PPC changes correctly, this is a small improvement on those tests.

Diff Detail

Repository
rL LLVM

Event Timeline

spatel created this revision.Feb 5 2019, 2:27 PM
spatel updated this revision to Diff 185419.Feb 5 2019, 3:05 PM

Patch updated:
No code changes, but the previous rev had a glitch in the IR test file.

dmgreen added a subscriber: dmgreen.Feb 6 2019, 4:49 AM

I'll apply this and try it to see what is changed in the test cases. I'll report back when that's done.

test/CodeGen/PowerPC/bdzlr.ll
57 ↗(On Diff #185419)

I have to look at the entire codegen here a little more closely. The reason I think this requires deeper investigation is that it is changing a branch from the CTR-decrementing bdzlr to the bnelr that doesn't touch the CTR. So I want to make sure we are not preventing the formation of CTR loops by this change.

spatel marked an inline comment as done.Feb 8 2019, 6:36 AM
spatel added inline comments.
test/CodeGen/PowerPC/bdzlr.ll
57 ↗(On Diff #185419)

Thanks. If this looks like a regression, then we're probably better off adding a TLI hook to guard against this (and the existing UADDO) transforms. That would let me remove the SAD hack and allow targets to enable this at their convenience.

D57833 also suggests that might be the better (more conservative) way forward.

@dmgreen @arsenm @nhaehnle - any preference/thoughts about that?

Hello. I ran some benchmarks and they were kind of all over the place, but on average a little down. The Arm backend seems to be quite opinionated about the uaddo's, using it's own nodes for a lot of things, so there might be some missing parts. There were enough ups in there to make us think that this can be good (and can show areas of improvement like in D57833), but we may need a few fixes first. Putting it behind a target hook in the meantime whilst we take a look does sound sensible.

I haven't ran any AArch64 tests, just Arm (we had some infrastructure failures that need some emergency fixes). It may be better there already.

spatel updated this revision to Diff 186286.Feb 11 2019, 10:28 AM

Patch updated:
Side-step controversy by adding a TLI hook. This should preserve existing behavior for uaddo and usubo for all targets, except x86 which now allows usubo formation. It seems likely that other targets like AArch64, AMDGPU, and PPC should allow usubo too, but they can do that selectively as follow-ups.

Herald added a project: Restricted Project. · View Herald TranscriptFeb 11 2019, 10:28 AM
Herald added a subscriber: hiraditya. · View Herald Transcript

Thanks. We will try and figure out some of the changes, see if we can get this enabled.

I'l otherwise leave this review to people more familiar with x86 :)

Independent of this patch, but just so everyone's aware - there's currently no consistency in the way we transform to the overflow intrinsics. We may transform to sadd.with.overflow as an IR canonicalization (no target checks):
https://godbolt.org/z/2ajU23

I don't think there's any place in the IR optimizer or backend that creates ssub.with.overflow from raw ops at the moment.

RKSimon accepted this revision.Feb 18 2019, 9:25 AM

LGTM - cheers

This revision is now accepted and ready to land.Feb 18 2019, 9:25 AM
This revision was automatically updated to reflect the committed changes.
Carrot added a subscriber: Carrot.Mar 14 2019, 11:45 AM

This patch causes 5% regression of one of our eigen benchmarks on Haswell.

The problem is when it combines the CMP in a hot block with SUB in a cold block into a single SUB in hot block, on a two address architecture like x86, if the operand of CMP has other uses, it needs to make an extra COPY before the original CMP, so there is one more instruction in hot block.

Another patch r355823 papered over the problem in our code, but it didn't fix the root cause.

The regression is only observed on Haswell, it doesn't impact Skylake.

This patch causes 5% regression of one of our eigen benchmarks on Haswell.

The problem is when it combines the CMP in a hot block with SUB in a cold block into a single SUB in hot block, on a two address architecture like x86, if the operand of CMP has other uses, it needs to make an extra COPY before the original CMP, so there is one more instruction in hot block.

Another patch r355823 papered over the problem in our code, but it didn't fix the root cause.

The regression is only observed on Haswell, it doesn't impact Skylake.

Thanks for letting me know. We could limit this transform based on profile metadata (or is there some other heuristic to determine that 1 block is hot and the other is cold?). In the absence of that information, I think this is the right theoretical optimization at this layer as shown in the improvements in the tests with this patch. If you have a test that shows the problem, I can take a look.

A bug https://bugs.llvm.org/show_bug.cgi?id=41129 is filed for the regression.
Thanks a lot for the investigation.
Please let me know if more information is required.