This is an archive of the discontinued LLVM Phabricator instance.

[DAGCombiner][x86] add transform/hook to load a scalar directly for use in a vector binop
Needs Review · Public

Authored by spatel on Aug 31 2018, 12:38 PM.

Details

Summary

This is the reversal for the proposed IR canonicalization in D50992 (insert+vector op --> scalar op+insert).

I've enabled x86 in the minimal way because this looks like a close call for recent Intel CPUs. They have fast (1 uop / 1-cycle latency) transfers from GPR to *MM according to Agner / llvm-mca. And the current AVX2 code is likely too broadcast-happy, as seen in the test diffs.

So this doesn't exercise the recent broadcast improvements from D51125 / D51186 yet.

Enabling more opcodes/types (the test file covers all 18 IR-equivalent binops) looks tricky. For example, we should have gotten the 'and' with i32 test, but v4i32 is promoted to v2i64. Pre-SSE4, we don't have pmulld, so we can't uniformly allow isLegalOrCustom().
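
For reference, the IR shape being reversed looks roughly like this (a minimal sketch; function names and constants are illustrative, not copied from the test file):

; illustrative sketch - names and constants are hypothetical
define <4 x i32> @add_op1_constant(i32* %p) nounwind {
  %x = load i32, i32* %p
  %b = add i32 %x, 42
  %r = insertelement <4 x i32> undef, i32 %b, i32 0
  ret <4 x i32> %r
}

; the analogous 'and' case has the same shape, but it is currently missed
; because v4i32 is promoted to v2i64 on x86, as noted above
define <4 x i32> @and_op1_constant(i32* %p) nounwind {
  %x = load i32, i32* %p
  %b = and i32 %x, 42
  %r = insertelement <4 x i32> undef, i32 %b, i32 0
  ret <4 x i32> %r
}

With this patch, the first function should compile to a vector load of the scalar plus a vector add against a constant-pool value, rather than scalar load + scalar add + a GPR-to-XMM move.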

Diff Detail

Event Timeline

spatel created this revision · Aug 31 2018, 12:38 PM
spatel edited the summary of this revision · Sep 2 2018, 6:46 AM

This looks reasonable to me, some nits.

include/llvm/CodeGen/TargetLowering.h
535

You probably want to add some explanatory comment here.

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
15147

Do we need to check that BO.getOpcode() is actually valid for VecVT?
Or are we ok with scalarization?

test/CodeGen/X86/load-scalar-as-vector.ll
23

Why is there a vpbroadcastd here, compared to the AVX1 version?

spatel marked 2 inline comments as done · Sep 12 2018, 3:27 PM
spatel added inline comments.
include/llvm/CodeGen/TargetLowering.h
535

Oops - yes, I forgot that.

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
15147

Scalarization would almost certainly defeat the point of this transform, but I think it's better to have all of those checks in one place - in the target hook.

test/CodeGen/X86/load-scalar-as-vector.ll
23

For AVX2 targets, we assume that a load+broadcast costs no more than a full vector load, but it saves data space, so that's the default. I suspect that we need to revisit that per-CPU and see if that's actually true.

spatel marked 2 inline comments as done · Sep 12 2018, 3:31 PM
spatel added inline comments.
test/CodeGen/X86/load-scalar-as-vector.ll
23

Oops - in this case, we loaded a scalar though, so there must be something going wrong with our undef knowledge.

spatel updated this revision to Diff 165173 · Sep 12 2018, 3:49 PM

Patch updated:
Added documentation comment for the new target hook.

My main concern is the growth in the constant pool due to vector constants.

I agree that it's a trade-off, but I don't think we can make an informed decision about it at this level. We can fine-tune the hook in this patch if we see regressions, but we're still just guessing at the eventual global outcome.
So without that kind of knowledge, we have to look at the micro-motivation.

Here's llvm-mca output using scalar math on btver2 (Jaguar):

[0,0]     DeeeeeER  .    ..   movl	(%rdi), %eax
[0,1]     D=====eER .    ..   addl	$42, %eax
[0,2]     .D=====eeeeeeeeER   movd	%eax, %xmm0  <--- move to vector register has 8-cycle latency?

vs. vector op:

[0,0]     DeeeeeER  .   movd	(%rdi), %xmm0
[0,1]     D==eeeeeeER   paddd	32(%rip), %xmm0

So doing the transform seems like the right choice here?

andreadb added inline comments · Sep 21 2018, 10:20 AM
lib/CodeGen/SelectionDAG/DAGCombiner.cpp
15103–15105

I wonder if this should be disabled for optsize/minsize.
On x86, this is going to increase .text, and it also potentially introduces a new constant in the pool.

mov (%rdi), %eax
add $0x2a, %eax
vmovd %eax, %xmm0
retq

is 10 bytes.

vmovd (%rdi), %xmm0
vpaddd 0x0(%rip), %xmm0, %xmm0  ## 4 bytes PC relative relocation
retq

is 13 bytes.

Plus, 16B in the constant pool (if we didn't see that constant before)...

spatel marked an inline comment as done · Sep 21 2018, 11:07 AM
spatel added inline comments.
lib/CodeGen/SelectionDAG/DAGCombiner.cpp
15103–15105

Yes, that seems like a good constraint for x86. I'm not sure that other targets would want the same limitation though, so I think we should make that another option in the target hook itself rather than here in generic code.

spatel updated this revision to Diff 166519 · Sep 21 2018, 11:11 AM
spatel marked an inline comment as done.

Patch updated:
Disallow the transform on x86 if we're optimizing for size. Test added at rL342755.
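
As a rough sketch of the kind of coverage this refers to (illustrative only; the actual test added at rL342755 may differ), the transform should now be skipped when the function is marked optsize:

; hypothetical optsize test case, not copied from the actual test
define <4 x i32> @add_op1_constant_optsize(i32* %p) optsize {
  %x = load i32, i32* %p
  %b = add i32 %x, 42
  %r = insertelement <4 x i32> undef, i32 %b, i32 0
  ret <4 x i32> %r
}

Here we expect to keep the scalar load/add/move sequence instead of creating a 16-byte constant-pool entry.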

RKSimon added inline comments · Sep 29 2018, 3:50 AM
test/CodeGen/X86/load-scalar-as-vector.ll
5

Please can you rebase, since I added AVX512 tests at rL342773, and possibly add some tests with 256/512-bit vectors as well?

spatel added inline comments · Oct 1 2018, 9:32 AM
test/CodeGen/X86/load-scalar-as-vector.ll
5

Sure - there's no difference for AVX512 with the current set of tests on this patch, but there should be after rL343491.

spatel updated this revision to Diff 167752 · Oct 1 2018, 10:02 AM

Patch updated:
No code changes, but I added 256/512-bit versions of the 128-bit tests that are affected by this patch.

The new results show that we only transform in cases where the ops are legal for the type. That is, this transform does not fire after we split the big vectors into the smaller legal vector type supported by the target, because at that point there is no insert_vector_elt; it's a scalar_to_vector post-legalization.

So we can view the wider cases as conservatively ok here, but we should handle the scalar_to_vector pattern as a follow-up. Or we might say that we don't want this transform to ever fire for bigger vector types? It's harder to justify this transform with the wider vectors if those instructions are executed as multi-uop sequences and/or have effects that we can't model (throttling the clock rate). OTOH, the transform is currently limited to insertion into element 0, so using wide vector ops in these tests isn't necessary in the first place.
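
For concreteness, the wider variants look roughly like this (an illustrative sketch, not copied from the test file); the insert still targets element 0, so an xmm-width op would be sufficient in principle:

; hypothetical 256-bit variant of the 128-bit test above
define <8 x i32> @add_op1_constant_v8i32(i32* %p) nounwind {
  %x = load i32, i32* %p
  %b = add i32 %x, 42
  %r = insertelement <8 x i32> undef, i32 %b, i32 0
  ret <8 x i32> %r
}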

For an insert into lane zero, it's never profitable to generate ymm/zmm instructions; we can just generate an xmm instruction and implicitly extend the result.

Right - that's what I was suggesting at the end of my last reply. So do we want to put this patch on hold to improve the undef/splat behaviors seen here first, or proceed here and file bugs for those problems?

What's the status here?

spatel updated this revision to Diff 206063 · Jun 21 2019, 1:30 PM

Patch rebased - at first glance, we haven't made much progress on the issues raised months ago despite dozens of narrowing/shrinking/demanded bits/demanded elts patches.

Herald added a project: Restricted Project · Jun 21 2019, 1:30 PM
Herald added a subscriber: hiraditya
lebedev.ri requested changes to this revision · Jul 10 2019, 3:30 PM

(just removing from review queue)

This revision now requires changes to proceed · Jul 10 2019, 3:30 PM
spatel updated this revision to Diff 269880 · Jun 10 2020, 9:41 AM

Rebased to apply cleanly to current master, but the test diffs remain unchanged. We are still creating vector ops that are too wide.

This seems easier to implement in vector-combine now that we have it. We will need to make sure that we are not fighting the recently added scalarization transforms.

lebedev.ri resigned from this revision · Jan 12 2023, 5:15 PM

This review seems to be stuck/dead; consider abandoning it if it is no longer relevant.

Herald added a project: Restricted Project · Jan 12 2023, 5:15 PM