This is an archive of the discontinued LLVM Phabricator instance.

[x86] form broadcast of scalar memop even with >1 use
ClosedPublic

Authored by spatel on Feb 5 2020, 1:43 PM.

Details

Summary

The logic change that isn't visible in the textual diff occurs because MayFoldLoad() is defined like this:

static bool MayFoldLoad(SDValue Op) {
  // Note the one-use restriction: a load whose value has more than one
  // user is never considered foldable here.
  return Op.hasOneUse() && ISD::isNormalLoad(Op.getNode());
}
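
For illustration only, here's a minimal sketch of how that restriction could be relaxed for broadcasts; the name mayFoldLoadIntoBroadcast is hypothetical and this is not the committed code:

#include "llvm/CodeGen/SelectionDAGNodes.h"
using namespace llvm;

// Hypothetical variant: a broadcast-from-memory re-reads the scalar
// directly, so extra users of the loaded value don't force us to keep
// a separate full-width load alive. Drop the one-use requirement.
static bool mayFoldLoadIntoBroadcast(SDValue Op) {
  // Still require a plain (non-extending, non-indexed) load.
  return ISD::isNormalLoad(Op.getNode());
}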

The test diffs here all seem ok to me on screen/paper, but it's hard to know if that will lead to universally better perf for all targets. For example, if a target implements broadcast from mem as multiple uops, we would have to weigh the potential reduction of instructions and register pressure vs. possible increase in number of uops. I don't know if we can make a truly informed decision on this at compile-time.

The motivating case that I'm looking at in PR42024:
https://bugs.llvm.org/show_bug.cgi?id=42024
...resembles the diff in extract-concat.ll, but we're not going to change the larger example there without at least one other fix.

Diff Detail

Event Timeline

spatel created this revision. Feb 5 2020, 1:43 PM
Herald added a project: Restricted Project. Feb 5 2020, 1:43 PM
spatel marked an inline comment as done. Feb 5 2020, 1:47 PM
spatel added inline comments.
llvm/test/CodeGen/X86/vector-reduce-fadd.ll
1109

I haven't stepped through this or the next test diff to see why it changed. Anyone know why SSE2 and SSE4.1 would differ here?

spatel marked an inline comment as done. Feb 6 2020, 11:48 AM
spatel added inline comments.
llvm/test/CodeGen/X86/extract-concat.ll
143–147

I looked closer at this example to see how AVX1 already gets the broadcasts - it's just luck and/or broken logic: we apply the use checks inconsistently.

The test uses i64 elements, but v4i64 shuffles get legalized to v4f64 with AVX1 by bitcasting the load.

So even though we clearly have a load with multiple uses:

t5: v4i64,ch = load<(load 32 from %ir.p)> t0, t2, undef:i64
  t40: v4i64 = vector_shuffle<0,0,0,0> t5, undef:v4i64
  t41: v4i64 = vector_shuffle<1,1,1,1> t5, undef:v4i64
...

After the bitcast is added, it is the *bitcast* node that subsequently has >1 use. But when we lower the shuffle, we peek through bitcasts and find that the load itself only has the one bitcast user.
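
As a hedged sketch (the assumed shape, not the exact lowering code), the lucky path looks something like this: the one-use check is applied to the load found by peeking through bitcasts, so the intervening single-use bitcast hides the multiple shuffle users:

#include "llvm/CodeGen/SelectionDAGNodes.h"
using namespace llvm;

// Illustration only: V is the shuffle source, e.g. the v4f64 bitcast
// of the v4i64 load t5 above.
static bool loadLooksSingleUse(SDValue V) {
  // peekThroughBitcasts() skips the bitcast and returns the load...
  SDValue Src = peekThroughBitcasts(V);
  if (!ISD::isNormalLoad(Src.getNode()))
    return false;
  // ...whose only user is that bitcast, so hasOneUse() succeeds even
  // though the bitcast itself feeds several shuffles.
  return Src.hasOneUse();
}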

If we modify this test to use <n x double> types, we get much worse codegen for AVX1:

	vmovapd	(%rdi), %ymm0
	vmovddup	%ymm0, %ymm1    ## ymm1 = ymm0[0,0,2,2]
	vperm2f128	$17, %ymm0, %ymm1, %ymm2 ## ymm2 = ymm1[2,3,2,3]
	vpermilpd	$15, %ymm0, %ymm0 ## ymm0 = ymm0[1,1,3,3]
	vperm2f128	$17, %ymm0, %ymm0, %ymm3 ## ymm3 = ymm0[2,3,2,3]
	vmovapd	(%rdi), %xmm1
	vmovddup	%xmm1, %xmm0    ## xmm0 = xmm1[0,0]
	vinsertf128	$1, %xmm0, %ymm0, %ymm0
	vpermilpd	$3, %xmm1, %xmm1 ## xmm1 = xmm1[1,1]
	vinsertf128	$1, %xmm1, %ymm1, %ymm1

But this patch would solve that problem by eliminating the use check - we'd get the ideal four broadcast instructions regardless of float/int element type.

spatel marked an inline comment as done. Feb 7 2020, 9:53 AM
spatel added inline comments.
llvm/test/CodeGen/X86/vector-reduce-fadd.ll
1109

I stepped through this with debug spew - with SSE4.1, we are using X86ISD::MOVDDUP to splat an f64 from memory. That node eventually gets combined away though.

This is seemingly a win because we removed two instructions from the SSE4.1 version: an explicit movapd load and an unpckhpd.

RKSimon accepted this revision. Feb 14 2020, 8:22 AM

LGTM. In the near future I think we might need to investigate supporting target memory ops in EltsFromConsecutiveLoads; we already have PR32535, since some ZEXT_LOAD cases could be merged back into a wider load. VBROADCAST_LOAD might need to be looked at as well, since we now create it more aggressively.

This revision is now accepted and ready to land. Feb 14 2020, 8:22 AM
This revision was automatically updated to reflect the committed changes.

This appears to cause suboptimal codegen on a relatively simple pattern:

https://bugs.llvm.org/show_bug.cgi?id=45378