Page MenuHomePhabricator

[x86] lower extracted fadd/fsub to horizontal vector math
ClosedPublic

Authored by spatel on Dec 21 2018, 9:41 AM.

Details

Summary

This would show up if we fix horizontal reductions to narrow as they go along, but it's an improvement for size and/or Jaguar (fast-hops) independent of that.

We need to do this late to not interfere with other pattern matching of larger horizontal sequences.

I'm guessing we would extend this to integer ops too, but I figured I better stop here and get feedback on the initial FP part in case I've missed something.

Diff Detail

Repository
rL LLVM

Event Timeline

spatel created this revision.Dec 21 2018, 9:41 AM
RKSimon added inline comments.Mon, Dec 31, 7:20 AM
lib/Target/X86/X86ISelLowering.cpp
18324 ↗(On Diff #179303)

We already have shouldCombineToHorizontalOp

spatel marked an inline comment as done.Mon, Dec 31, 9:30 AM
spatel added inline comments.
lib/Target/X86/X86ISelLowering.cpp
18324 ↗(On Diff #179303)

Right - I’m moving/renaming it here because it’s used in both combining and lowering after this patch. I can make that an NFC pre-commit.

RKSimon added inline comments.Mon, Dec 31, 11:54 AM
lib/Target/X86/X86ISelLowering.cpp
18324 ↗(On Diff #179303)

Yes, please do the NFC pre-commit first

spatel updated this revision to Diff 179798.Tue, Jan 1, 10:14 AM

Patch updated:
No functional diff from the previous rev, but rebased after pre-committing the move/rename of the helper with rL350193.

One thing I'm not sure about - do we want to enable this for AVX512-sized vectors too? Ie, can we assert that the extract is from 128/256/512-bit instead of checking that explicitly?

RKSimon added inline comments.Wed, Jan 2, 2:55 AM
lib/Target/X86/X86ISelLowering.cpp
18339 ↗(On Diff #179798)

Do both ops have to be one use for this to be useful?

18358 ↗(On Diff #179798)

Is it useful to support any suitable neighbouring pairs of extracted values?

18364 ↗(On Diff #179798)

Is this important? I think its what you were asking about AVX512 (which doesn't have HADDPS). I'd start by adding avx512f/avx512vl tests to haddsub-undef.ll

spatel marked 6 inline comments as done.Wed, Jan 2, 9:00 AM
spatel added inline comments.
lib/Target/X86/X86ISelLowering.cpp
18339 ↗(On Diff #179798)

It's not as clearly a win, but I'll ease that constraint, and we can decide if it looks better.

18358 ↗(On Diff #179798)

Yes, but arguably the other patterns are less likely (the 0/1 case is what I'm seeing from adjusting the expansion of horizontal reductions). Let me make that a TODO.

18364 ↗(On Diff #179798)

Right - we would want to treat 512-bit extract identically to 256-bit. Ie, it doesn't really matter that there's no 512 version of haddps, we want to narrow it to a 128-bit vector first either way.

spatel updated this revision to Diff 179860.Wed, Jan 2, 9:14 AM
spatel marked 3 inline comments as done.

Patch updated:

  1. Ease extra uses constraint - we might still be able to avoid a shuffle/extract.
  2. Allow 512-bit vectors.
  3. Add TODO to for other extract index pairs.
spatel marked an inline comment as done.Thu, Jan 3, 10:11 AM
spatel added inline comments.
test/CodeGen/X86/haddsub.ll
1034 ↗(On Diff #179860)

This wasn't clear from the patch update comment: we have 512-bit tests and AVX512 RUN lines. However, the output is identical to regular AVX1 because we're just extracting the low 128-bits.

RKSimon accepted this revision.Thu, Jan 3, 10:43 AM

LGTM - please can you add AVX512 tests to haddsub-undef.ll - we also need AVX-SLOW/AVX-FAST common prefixes as well

This revision is now accepted and ready to land.Thu, Jan 3, 10:43 AM
spatel added a comment.Thu, Jan 3, 3:18 PM

LGTM - please can you add AVX512 tests to haddsub-undef.ll - we also need AVX-SLOW/AVX-FAST common prefixes as well

Rearranged/added tests as preliminary commits:
rL350356
rL350357
rL350358
rL350362
rL350364

This revision was automatically updated to reflect the committed changes.
spatel reopened this revision.Thu, Jan 3, 4:29 PM

Reopening - I reverted with rL350373.
I didn't realize the custom legalization would affect the cost models used by the vectorizers; there are potentially IR test changes that go with this.

This revision is now accepted and ready to land.Thu, Jan 3, 4:29 PM
spatel planned changes to this revision.Thu, Jan 3, 4:29 PM
spatel updated this revision to Diff 180181.Thu, Jan 3, 5:11 PM

Patch updated:
Include cost model and SLP test changes. Should we be overriding the cost of scalar FADD/FSUB, so we don't have these diffs?

Without that, it looks like we fall back to this questionable logic:

// If the operation is custom lowered, then assume that the code is twice
// as expensive.
This revision is now accepted and ready to land.Thu, Jan 3, 5:11 PM
RKSimon requested changes to this revision.Fri, Jan 4, 1:08 AM

Patch updated:
Include cost model and SLP test changes. Should we be overriding the cost of scalar FADD/FSUB, so we don't have these diffs?

Yes - please add the relevant cost changes so the slp/cost test changes go away.

Without that, it looks like we fall back to this questionable logic:

// If the operation is custom lowered, then assume that the code is twice
// as expensive.

That is some VERY questionable logic.....

This revision now requires changes to proceed.Fri, Jan 4, 1:08 AM
spatel updated this revision to Diff 180245.Fri, Jan 4, 7:25 AM

Patch updated:
Add cost model overrides for FADD/FSUB to preserve existing behavior.
I suspect that these should be adjusted, and we should include more opcodes to avoid the default trap, but that's another patch. This maintains the current state, so no test diffs outside of codegen.

spatel marked an inline comment as done.Fri, Jan 4, 7:50 AM
spatel added inline comments.
lib/Target/X86/X86TargetTransformInfo.cpp
839–842 ↗(On Diff #180245)

Looking harder at Agner's numbers - I didn't remember that P4 actually had 2-cycle throughput for addsd/subsd. So this is correct assuming the baseline for SSE2 is a P4 (willamette, prescott).

But the baseline chip for SSE1 below here is a P3, and that had 1-cycle throughput for addss/subss according to Agner.

I'll fix the comments.

spatel updated this revision to Diff 180247.Fri, Jan 4, 8:07 AM

Patch updated:
Updated cost model changes - for SSE2 (P4), the default 2-cycle throughput cost matches Agner.
For SSE1 (P3), the default 2-cycle throughput cost does not match a P3 implementation (but hopefully nobody cares at this point).

spatel updated this revision to Diff 180252.Fri, Jan 4, 9:05 AM

Patch updated:
Rebased after cost model enhancements in rL350403. So now this patch is once again independent of any IR diffs.

RKSimon accepted this revision.Fri, Jan 4, 9:19 AM

LGTM thanks

This revision is now accepted and ready to land.Fri, Jan 4, 9:19 AM
This revision was automatically updated to reflect the committed changes.