This is an archive of the discontinued LLVM Phabricator instance.

[CodeGen][X86] Expand vector USUBSAT to UMAX+SUB
ClosedPublic

Authored by nikic on Jan 12 2019, 4:12 AM.

Details

Summary

Related to https://bugs.llvm.org/show_bug.cgi?id=40123.

Rather than scalarizing, expand a vector USUBSAT into UMAX+SUB, which produces much better code for X86. I've updated the cost tables on the assumption that pmaxud has a cost of 1, not sure if that's correct. Agner lists it as latency 1, and recip throughput 0.5 or 1 depending on uarch.

Diff Detail

Event Timeline

nikic created this revision.Jan 12 2019, 4:12 AM
RKSimon added inline comments.Jan 12 2019, 4:34 AM
lib/CodeGen/SelectionDAG/LegalizeVectorOps.cpp
340

Add ISD::USUBSAT et al here as well?

lib/CodeGen/SelectionDAG/TargetLowering.cpp
5277–5278

Really pedantic, but this function name is massive - why not just TargetLowering::expandAddSubSat ?

5287

isOperationLegalOrCustom?

lib/Target/X86/X86ISelLowering.cpp
1224 ↗(On Diff #181438)

LowerADDSAT_SUBSAT supports AVX1 splitting, so why not:

setOperationAction(ISD::USUBSAT, MVT::v8i32, HasInt256 ? Legal : Custom);

Also, v4i64 should be ok as well.

nikic marked 4 inline comments as done.Jan 12 2019, 5:15 AM
nikic added inline comments.
lib/CodeGen/SelectionDAG/LegalizeVectorOps.cpp
340

It's already present a bit lower (L413).

lib/CodeGen/SelectionDAG/TargetLowering.cpp
5287

Extra diff allowing custom: https://gist.github.com/nikic/4c46634cec8f319e687c6b5cb0496648

This is presumably better than scalarizing, but I was wondering if a more explicit expansion would work better?

As another variant, this is what happens if I just fall through to the USUBO+SELECT expansion: https://gist.github.com/nikic/d989121f7f9898437a1255548c148904

lib/Target/X86/X86ISelLowering.cpp
1224 ↗(On Diff #181438)

I've added this to get the splitting, but I don't think v8i32 would be legal, as we need it to expand. Maybe hasInt256 ? Expand : Custom would make it more obvious what the intention here is?

RKSimon added inline comments.Jan 12 2019, 5:25 AM
lib/Target/X86/X86ISelLowering.cpp
1224 ↗(On Diff #181438)

Better for the existing x86 lowering code to handle it IMO - that's what we do for nearly all 256-bit integer code on AVX1 targets.

nikic marked 2 inline comments as done.Jan 12 2019, 6:07 AM
nikic added inline comments.
lib/Target/X86/X86ISelLowering.cpp
1224 ↗(On Diff #181438)

That's what this is intended to do. v8i32 is custom so lowering splits it into two v4i32s, which then get expanded.

If I change this to HasInt256 ? Legal : Custom, then I get:

LLVM ERROR: Cannot select: t5: v8i32 = usubsat t2, t4

Or am I misunderstanding what you have in mind here?

RKSimon added inline comments.Jan 12 2019, 1:59 PM
lib/Target/X86/X86ISelLowering.cpp
1224 ↗(On Diff #181438)

If you look at LowerABS just below LowerADDSAT_SUBSAT you'll see the approach that I took.

nikic updated this revision to Diff 181471.Jan 13 2019, 6:21 AM

Rename excessively long method name. Change handling of AVX splitting. Move AVX512 costs to right category (pmaxuq is in AVX512F, so these shouldn't be in the BW cateogory).

nikic marked 3 inline comments as done.Jan 13 2019, 6:24 AM
nikic added inline comments.
lib/CodeGen/SelectionDAG/TargetLowering.cpp
5277–5278

Agreed, I've renamed the method.

lib/Target/X86/X86ISelLowering.cpp
1224 ↗(On Diff #181438)

I've changed this to always Custom and changed the LowerADDSAT_SUBSAT implementation to look more like LowerABS. Is this right?

RKSimon added inline comments.Jan 13 2019, 6:58 AM
lib/CodeGen/SelectionDAG/TargetLowering.cpp
5277–5278

Thanks - if you can, please pull this out and commit this as a NFC straightaway.

nikic updated this revision to Diff 181477.Jan 13 2019, 9:45 AM
nikic marked an inline comment as done.

Rebase over NFC commit.

RKSimon added inline comments.Jan 13 2019, 10:22 AM
lib/CodeGen/SelectionDAG/TargetLowering.cpp
5287

Allowing legalorcustom looks OK to me - we can tweak per-target codegen in future commits if its useful.

Also, should we try to do this on scalars as well before defaulting to add/sub overflow? For instance AMDGPU i32 for instance should benefit.

lib/Target/X86/X86ISelLowering.cpp
1224 ↗(On Diff #181438)
setOperationAction(ISD::USUBSAT, MVT::v8i32, HasInt256 ? Legal : Custom);
RKSimon added inline comments.Jan 13 2019, 10:23 AM
lib/Target/X86/X86ISelLowering.cpp
1224 ↗(On Diff #181438)

Sorry, ignore that!

nikic updated this revision to Diff 181480.Jan 13 2019, 11:32 AM
nikic marked an inline comment as done.

Use LegalOrCustom. Use UMAX+SUB pattern for scalar as well. Remove no longer necessary X86 specific changes.

nikic marked 3 inline comments as done.Jan 13 2019, 11:36 AM
nikic added inline comments.
lib/CodeGen/SelectionDAG/TargetLowering.cpp
5287

Okay, I did both changes. I agree that preferring this expansion makes sense even for scalar.

I've also dropped the changes in X86ISelLowering. Now that custom is allowed here, they are no longer necessary. We'll just expand to a wide UMAX here that will get split (rather than the USUBSAT getting split and then each half expanded to UMAX).

X86 vXi64 costs?

nikic marked an inline comment as done.Jan 14 2019, 5:59 AM

X86 vXi64 costs?

I've added vXi64 costs only for AVX512, as pmaxuq wasn't available previously. Should I also add costs based on the broadcast+2*xor+cmpgt+blend+sub sequence we get if there is no native umax?

RKSimon accepted this revision.Jan 14 2019, 7:21 AM

X86 vXi64 costs?

I've added vXi64 costs only for AVX512, as pmaxuq wasn't available previously. Should I also add costs based on the broadcast+2*xor+cmpgt+blend+sub sequence we get if there is no native umax?

Actually, thinking about this more - we'd be better off doing the non-legal umax cases in the fallback generic cost code that I mentioned in PR40123. I think that can wait for a followup patch.

LGTM.

This revision is now accepted and ready to land.Jan 14 2019, 7:21 AM
This revision was automatically updated to reflect the committed changes.
nikic reopened this revision.Jan 14 2019, 2:24 PM

Gah, apparently I didn't run SLPVectorizer tests, which of course also change. Getting a WARNING: Found conflicting asm under the same prefix: 'AVX2' warning when regenerating those, so I've reverted for now and will look again tomorrow.

This revision is now accepted and ready to land.Jan 14 2019, 2:24 PM

Gah, apparently I didn't run SLPVectorizer tests, which of course also change. Getting a WARNING: Found conflicting asm under the same prefix: 'AVX2' warning when regenerating those, so I've reverted for now and will look again tomorrow.

I've updated those SLP tests to fix the conflict, please rebase and regenerate+check the SLP tests again

nikic updated this revision to Diff 181794.Jan 15 2019, 8:33 AM

Rebase and update SLPVectorizer tests.

RKSimon accepted this revision.Jan 15 2019, 8:48 AM

LGTM - thanks

This revision was automatically updated to reflect the committed changes.