This is an archive of the discontinued LLVM Phabricator instance.

[x86, SSE] AVX1 PR28129
ClosedPublic

Authored by dtemirbulatov on Apr 23 2017, 11:33 PM.

Details

Summary

Further perf tests on Jaguar indicate that:

vxorps  %ymm0, %ymm0, %ymm0
vcmpps  $15, %ymm0, %ymm0, %ymm0

is consistently faster (by about 9%) than:

vpcmpeqd  %xmm0, %xmm0, %xmm0
vinsertf128  $1, %xmm0, %ymm0, %ymm0

Testing equivalent code on a SandyBridge (E5-2640) shows it to be slightly (~3%) faster as well.
OK to commit? This change only applies to AVX1.

Diff Detail

Repository
rL LLVM

Event Timeline

dtemirbulatov created this revision. Apr 23 2017, 11:33 PM

Here is the test case:

define <4 x double> @cmp256_domain(<4 x double> %a) {
  %cmp = fcmp oeq <4 x double> zeroinitializer, zeroinitializer
  %sext = sext <4 x i1> %cmp to <4 x i64>
  %mask = bitcast <4 x i64> %sext to <4 x double>
  %add = fadd <4 x double> %a, %mask
  ret <4 x double> %add
}
So we are changing "immAllOnesV" for 256-bit vectors only on AVX1 targets.
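
For reference, a rough sketch of the kind of .td pattern being discussed - HasAVX1Only, immAllOnesV, AVX_SET0 and VCMPPSYrri are existing X86 .td names, but the exact form below is illustrative, not necessarily the committed code:

let Predicates = [HasAVX1Only] in
  // Materialize a 256-bit all-ones value with vxorps + vcmpps $15 (TRUE)
  // rather than vpcmpeqd + vinsertf128; zeroing the source register first
  // breaks the false dependency on its previous contents.
  def : Pat<(v8i32 immAllOnesV),
            (VCMPPSYrri (AVX_SET0), (AVX_SET0), 0xf)>;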

spatel added inline comments.
lib/Target/X86/X86InstrSSE.td
7754–7755 ↗(On Diff #96347)

This comment should be updated to match the new code.

Is it correct that this pattern won't apply to most integer code for an AVX target because that would already be legalized to v4i32/v2i64? If that's true, I think it's also worth mentioning here.

I'm imagining cases like this:

define <8 x i32> @cmpeq_v8i32(<8 x i32> %a) nounwind {
  %cmp = icmp eq <8 x i32> %a, <i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1>
  %res = sext <8 x i1> %cmp to <8 x i32>
  ret <8 x i32> %res
}

define <8 x i32> @cmpne_v8i32(<8 x i32> %a) nounwind {
  %cmp = icmp ne <8 x i32> %a, <i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1>
  %res = sext <8 x i1> %cmp to <8 x i32>
  ret <8 x i32> %res
}

define <8 x i32> @sub1_v8i32(<8 x i32> %a) nounwind {
  %add = add <8 x i32> %a, <i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1>
  ret <8 x i32> %add
}
7758 ↗(On Diff #96347)

It's not clear why we require a zero operand. Would a dummy (undef) register also work? Should we allow that when optimizing for size so the vxorps is not needed?

test/CodeGen/X86/vector-pcmp.ll
156–158 ↗(On Diff #96347)

That's an interesting case - one that we probably can't answer at the DAG level. Would it be better to use two 128-bit vpxor instructions instead of incurring a potential domain-crossing penalty with the one 256-bit vxorps?

spatel added a subscriber: llvm-commits.
RKSimon added inline comments. Apr 24 2017, 8:50 AM
lib/Target/X86/X86InstrSSE.td
7758 ↗(On Diff #96347)

There isn't a fast path for vcmptrue (despite it ignoring the inputs) on Jaguar/SandyBridge - I mentioned in https://bugs.llvm.org/show_bug.cgi?id=28129#c8 that using undef vars causes dependency regressions. Zeroing the register breaks the dependency.
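
To make the trade-off concrete, a sketch (register choice is illustrative):

# zeroed input: the vxorps breaks the dependency on ymm0's previous value
vxorps  %ymm0, %ymm0, %ymm0
vcmpps  $15, %ymm0, %ymm0, %ymm0   # predicate 15 = TRUE, yields all-ones

# undef input: saves the vxorps, but the compare still waits on whatever
# last wrote ymm0 - a false dependency
vcmpps  $15, %ymm0, %ymm0, %ymm0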

RKSimon added inline comments. Apr 25 2017, 3:23 AM
lib/Target/X86/X86InstrSSE.td
7754–7755 ↗(On Diff #96347)

This comment should be updated to match the new code.

+1

test/CodeGen/X86/vector-pcmp.ll
156–158 ↗(On Diff #96347)

Do you mean this?

vextractf128 $1, %ymm0, %xmm1
vpxor %xmm2, %xmm2, %xmm2
vpcmpgtb %xmm1, %xmm2, %xmm1
vpcmpgtb %xmm0, %xmm2, %xmm0
vpcmpeqd %xmm2, %xmm2, %xmm2
vpxor %xmm2, %xmm1, %xmm1
vpxor %xmm2, %xmm0, %xmm0
vinsertf128 $1, %xmm1, %ymm0, %ymm0
spatel added inline comments. Apr 25 2017, 7:34 AM
test/CodeGen/X86/vector-pcmp.ll
156–158 ↗(On Diff #96347)

Yes - I remember reading somewhere (and not sure how widely this applies) that the 'insertX128' insts may not actually have domain-crossing penalties. The other variable in this mix (thinking about Jaguar here) is that the 256-bit ops may be cracked and double-pumped anyway, so if we have that + domain-crossing penalty, then the two 128-bit insts should be faster?

RKSimon added inline comments. Apr 25 2017, 8:07 AM
test/CodeGen/X86/vector-pcmp.ll
156–158 ↗(On Diff #96347)

A quick hot loop test suggests that the old vpcmpeqd+vinsertf128+xor approach takes 8cy, the 256-bit xor+vcmptrueps+xor approach takes 7cy and the 128-bit vpcmpeqd+2*xor takes 6cy on Jaguar.
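
Spelled out, the three sequences under test look roughly like this (my reading; registers are illustrative):

# old: 128-bit all-ones, widen, then 256-bit xor (8cy)
vpcmpeqd    %xmm2, %xmm2, %xmm2
vinsertf128 $1, %xmm2, %ymm2, %ymm2
vxorps      %ymm2, %ymm0, %ymm0

# this patch: 256-bit all-ones via zeroed vcmptrueps, then 256-bit xor (7cy)
vxorps      %ymm2, %ymm2, %ymm2
vcmpps      $15, %ymm2, %ymm2, %ymm2
vxorps      %ymm2, %ymm0, %ymm0

# split: 128-bit all-ones plus two 128-bit xors on the halves (6cy)
vpcmpeqd    %xmm2, %xmm2, %xmm2
vpxor       %xmm2, %xmm0, %xmm0
vpxor       %xmm2, %xmm1, %xmm1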

It might be worth looking at splitting some 256-bit bitwise operations whose operands are concatenations of 128-bit values, but I don't think that should get in the way of this patch.

spatel added inline comments. Apr 25 2017, 9:10 AM
test/CodeGen/X86/vector-pcmp.ll
156–158 ↗(On Diff #96347)

Agreed - the splitting problem is separate:
https://bugs.llvm.org/show_bug.cgi?id=32790

RKSimon edited edge metadata. Apr 26 2017, 3:00 AM

Almost there - the only other improvement I can think of would be to use vcmptrueps(undef, undef) for OptForSize - it doesn't break the dependency, so we'd take a slight perf regression in exchange for dropping the vxorps.

zvi edited edge metadata. Apr 26 2017, 2:17 PM

Here's what IACA reports for Sandybridge:

Before:

| Num Of |          Ports pressure in cycles            |    |
|  Uops  | 0 - DV |  1  | 2 - D | 3 - D |  4  |  5  |    |
|   1    |        | 0.3 |       |       |     | 0.7 |    | vpcmpeqd xmm1, xmm1, xmm1
|   1    |        |     |       |       |     | 1.0 |    | vinsertf128 ymm1, ymm1, xmm1, 0x1
|   1    |  1.0   |     |       |       |     |     | CP | vblendps ymm0, ymm1, ymm0, 0xaa
|   1    |        | 1.0 |       |       |     |     | CP | vcvtdq2ps ymm0, ymm0

After:

| Num Of |          Ports pressure in cycles            |    |
|  Uops  | 0 - DV |  1  | 2 - D | 3 - D |  4  |  5  |    |
|   1*   |        |     |       |       |     |     |    | vxorps ymm1, ymm1, ymm1
|   1    |  1.0   |     |       |       |     |     |    | vcmpps ymm1, ymm1, ymm1, 0xf
|   1    |  0.5   |     |       |       |     | 0.5 | CP | vblendps ymm0, ymm1, ymm0, 0xaa
|   1    |        | 1.0 |       |       |     |     | CP | vcvtdq2ps ymm0, ymm0

* - instruction micro-ops not bound to a port

Updated after review.

Another update after an issue with -Os.

Updated after the -Os issue was cleared.

spatel edited edge metadata. Apr 29 2017, 12:56 PM

This looks almost done. I have one question about -Os vs. -Oz, and one nit about a code comment.

lib/Target/X86/X86InstrSSE.td
489 ↗(On Diff #97091)

Should this be "OptForMinSize"? I.e., I was just looking at some other code patterns, and we are inserting xorps/xorpd even at -Os.

In fact, we're doing that even at -Oz which I think is a mistake.

7760–7761 ↗(On Diff #97091)

I'd add a little more explanation here just to make it clearer:
"To create a 256-bit all ones value, we use VCMPTRUEPS and zero out the fake register operand to avoid false dependencies."

spatel added inline comments. Apr 30 2017, 8:11 AM
lib/Target/X86/X86InstrSSE.td
489 ↗(On Diff #97091)

After thinking about this again, I'm now wondering why we would make the zero reg input a part of the patterns here. Are there any other td patterns that do this? For the most part, I think we handle the xor generation using ExecutionDepsFix.

Rebased and changed OptForSize to OptForMinSize.

spatel added inline comments. May 12 2017, 12:55 PM
test/CodeGen/X86/all-ones-vector.ll
1 ↗(On Diff #98799)

Many cosmetic diffs have been introduced because you used a different script. Please use "update_llc_test_checks.py" to regenerate these and upload the patch again. I need to "fix" the other script to not work with llc. :)
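
For anyone following along, the invocation is something like this (build path is illustrative):

$ utils/update_llc_test_checks.py --llc-binary=build/bin/llc test/CodeGen/X86/all-ones-vector.ll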

Updated tests after review with the proper update_llc_test_checks.py.

RKSimon accepted this revision. May 12 2017, 3:06 PM

LGTM - with a future patch to investigate using ExecutionDepsFix.

This revision is now accepted and ready to land. May 12 2017, 3:06 PM
This revision was automatically updated to reflect the committed changes.