This is an archive of the discontinued LLVM Phabricator instance.

[X86][AArch64][NFC] Add tests for vector masked merge unfolding
ClosedPublic

Authored by lebedev.ri on Apr 24 2018, 6:32 AM.

Details

Summary

This is PR37104.

PR6773 will introduce an IR canonicalization that is likely bad for the final assembly.
Previously, andps+andnps / bsl would be generated (see @out).
After that canonicalization, they would no longer be generated (see @in).
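For context, a minimal sketch of what the two shapes look like in IR (simplified; the actual tests in this patch cover many more types and operand orders):

  ; The unfolded masked merge: (x & m) | (y & ~m).
  ; This is the form that lowers to andps+andnps+orps on x86 and bsl on AArch64.
  define <4 x i32> @out(<4 x i32> %x, <4 x i32> %y, <4 x i32> %mask) {
    %mx = and <4 x i32> %x, %mask
    %notmask = xor <4 x i32> %mask, <i32 -1, i32 -1, i32 -1, i32 -1>
    %my = and <4 x i32> %y, %notmask
    %r = or <4 x i32> %mx, %my
    ret <4 x i32> %r
  }

  ; The canonicalized form: ((x ^ y) & m) ^ y. It computes the same value,
  ; but the backend no longer recognizes it as a masked merge.
  define <4 x i32> @in(<4 x i32> %x, <4 x i32> %y, <4 x i32> %mask) {
    %n0 = xor <4 x i32> %x, %y
    %n1 = and <4 x i32> %n0, %mask
    %r = xor <4 x i32> %n1, %y
    ret <4 x i32> %r
  }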

Also, I think one more piece is missing: shouldn't a blend instruction be used?

Diff Detail

Repository
rL LLVM

Event Timeline

lebedev.ri created this revision.Apr 24 2018, 6:32 AM

Also, I think one more piece is missing: shouldn't a blend instruction be used?

If you're thinking of pblendvb or similar SSE4.1 instructions: "The mask bits are the most significant bit in each...element."
So those aren't bitwise selects; they're element-size selects. There might be some flavor of AVX512 that has a bsl equivalent? I still don't know AVX512.
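To illustrate the distinction with a hypothetical example (not from the tests in this patch): a bitwise merge whose mask elements are neither all-ones nor all-zeros can't be expressed as a per-element or per-byte blend at all:

  ; Every lane takes the low nibble of each byte from %x and the high nibble
  ; from %y. blendvps/pblendvb select whole elements/bytes based only on the
  ; most significant bit of each mask element/byte, so they don't implement a
  ; bitwise merge like this one; the and/andn/or sequence (or a true bitwise
  ; select) is needed.
  define <4 x i32> @nibble_merge(<4 x i32> %x, <4 x i32> %y) {
    %lo = and <4 x i32> %x, <i32 252645135, i32 252645135, i32 252645135, i32 252645135>
    %hi = and <4 x i32> %y, <i32 -252645136, i32 -252645136, i32 -252645136, i32 -252645136>
    %r = or <4 x i32> %lo, %hi
    ret <4 x i32> %r
  }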

I'm usually in favor of more thorough regression testing, but this patch is where computing reality takes over for me. :)
The x86 file is at least 10x bigger than I'd prefer. This level of testing permutation is not sustainable. At some point, running 'make check' will take so long for x86 that people will stop doing it. This may already be happening... Personally, I used to run 'make check' for x86 with debug builds, but it takes too long now, so I only check with release+asserts, and even that takes too long on a portable.

This is all just my opinion. AFAIK, there are no official LLVM guidelines on this, so if the consensus is that this level of testing is better, I'll live with it.

But there are a few simple ways to shrink this patch:

  1. One illegal vector type (or 1 smaller-than-legal and 1 larger-than-legal) is fine to prove that we don't do something crazy; >1 doesn't add much value.
  2. x86 adds to the vector ISA with every chip, but there's little or no difference to this transform, so SSE1 and AVX1 (and 1 flavor of AVX512?) are all that's useful.
  3. Use multiple '--check-prefix' options per RUN to reduce the number of CHECK lines (I think '--check-prefixes' should work too). See existing vector test files for examples.
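For illustration, a rough sketch of the shared-prefix setup (the RUN lines and prefix names here are just an example, not the ones from this patch):

  ; RUN: llc -mtriple=x86_64-unknown-linux-gnu -mattr=+sse2 < %s | FileCheck %s --check-prefixes=CHECK,CHECK-SSE2
  ; RUN: llc -mtriple=x86_64-unknown-linux-gnu -mattr=+avx  < %s | FileCheck %s --check-prefixes=CHECK,CHECK-AVX
  ;
  ; Asm that is identical in both configs can be checked once under the shared
  ; CHECK prefix; only the instructions that differ need CHECK-SSE2 or
  ; CHECK-AVX lines, which can greatly reduce the number of check lines
  ; that update_llc_test_checks.py has to emit.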
lebedev.ri updated this revision to Diff 145470.May 7 2018, 8:38 AM

After actually implementing the vector pattern unfolding, it turned out that most of these tests weren't really interesting,
so I've cut them down somewhat.

spatel added inline comments.May 9 2018, 8:53 AM
test/CodeGen/X86/machine-cp.ll
13 ↗(On Diff #145470)

Unrelated whitespace diff? This patch is NFC already, so I think it's OK to keep this here, or split it out if I'm missing something.

test/CodeGen/X86/unfold-masked-merge-vector-variablemask.ll
2 ↗(On Diff #145470)

Why is this config interesting? IMO, it just distracts from the cases that we do care about, but I may not be seeing it.

731–733 ↗(On Diff #145470)

I still doubt the value of everything above here (illegal vector types), but if you think it shows/proves something...

lebedev.ri added inline comments.May 9 2018, 9:06 AM
test/CodeGen/X86/machine-cp.ll
13 ↗(On Diff #145470)

Yep. This function is affected by the dependent revision, so why not fix this space right away?
¯\_(ツ)_/¯

test/CodeGen/X86/unfold-masked-merge-vector-variablemask.ll
2 ↗(On Diff #145470)

I think we do want to check that we don't do anything stupid in the andn-less case,
much like the scalar case without BMI.

731–733 ↗(On Diff #145470)

I do think it does; note that these illegal vectors improve in D46528.

RKSimon added inline comments.May 9 2018, 9:10 AM
test/CodeGen/X86/unfold-masked-merge-vector-variablemask-const.ll
13 ↗(On Diff #145470)

Irrelevant to this patch, but it's interesting that SSE1 can't do the xorps+cmpps approach to generate all-ones (like we do on AVX1 for YMM).

test/CodeGen/X86/unfold-masked-merge-vector-variablemask.ll
2 ↗(On Diff #145470)

I agree, the non-SSE tests aren't very useful (and I'm a little dubious about SSE1 tbh).

What MIGHT be useful is an XOP pass:

; RUN: llc -mtriple=x86_64-unknown-linux-gnu -mattr=+sse,-sse2 < %s | FileCheck %s --check-prefixes=CHECK,CHECK-SSE,CHECK-SSE1
; RUN: llc -mtriple=x86_64-unknown-linux-gnu -mattr=+sse,+sse2 < %s | FileCheck %s --check-prefixes=CHECK,CHECK-SSE,CHECK-SSE2
; RUN: llc -mtriple=x86_64-unknown-linux-gnu -mattr=+xop < %s | FileCheck %s --check-prefixes=CHECK,CHECK-XOP

As XOP is the only X86 ISA with a bsl-style vector instruction (PCMOV), maybe add this to both x86 test files?
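A rough sketch of what such a run could check for the unfolded form (whether vpcmov is actually emitted for this exact pattern is an assumption here, not something confirmed in this review, and the operand order is deliberately left unchecked):

  ; RUN: llc -mtriple=x86_64-unknown-linux-gnu -mattr=+xop < %s | FileCheck %s --check-prefixes=CHECK,CHECK-XOP
  define <4 x i32> @out(<4 x i32> %x, <4 x i32> %y, <4 x i32> %mask) {
  ; CHECK-XOP-LABEL: out:
  ; CHECK-XOP: vpcmov
  ; CHECK-XOP: retq
    %mx = and <4 x i32> %x, %mask
    %notmask = xor <4 x i32> %mask, <i32 -1, i32 -1, i32 -1, i32 -1>
    %my = and <4 x i32> %y, %notmask
    %r = or <4 x i32> %mx, %my
    ret <4 x i32> %r
  }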

lebedev.ri added inline comments.May 9 2018, 9:56 AM
test/CodeGen/X86/unfold-masked-merge-vector-variablemask.ll
2 ↗(On Diff #145470)

If I drop the baseline tests, which tests will verify what we do in that case?
I know I needed them when working on D46528; I don't think anything else tests that.

Will look into XOP..

  • Add X86 XOP test.
  • Correctly use --check-prefixes=, nice.
test/CodeGen/X86/unfold-masked-merge-vector-variablemask.ll
2 ↗(On Diff #145470)

BTW, is this --check-prefixes= magic documented somewhere?
In particular, why is CHECK,CHECK-SSE,CHECK-SSE2 the correct order that works,
while CHECK,CHECK-SSE,CHECK-SSE1,CHECK-SSE2 results in lost data?
I'm asking because I tried the second variant and it failed.
Now that I have seen the first variant, I think I understand the logic, but still...

spatel added inline comments.May 9 2018, 1:16 PM
test/CodeGen/X86/unfold-masked-merge-vector-variablemask-const.ll
39–41 ↗(On Diff #145958)

What's the reason for using loads in these tests?

test/CodeGen/X86/unfold-masked-merge-vector-variablemask.ll
2 ↗(On Diff #145470)

I didn't experiment with this particular case, but I think you hit a long-standing shortcoming in update_llc_test_checks.py:

# FIXME: We should use multiple check prefixes to common check lines. For
# now, we just ignore all but the last.

...so it only matches common lines based on the order that they are encountered/created.

lebedev.ri updated this revision to Diff 145997.May 9 2018, 1:43 PM

Use the load trick in test/CodeGen/X86/unfold-masked-merge-vector-variablemask.ll too, not just in test/CodeGen/X86/unfold-masked-merge-vector-variablemask-const.ll.
Thanks to @spatel for noticing the discrepancy.

test/CodeGen/X86/unfold-masked-merge-vector-variablemask-const.ll
39–41 ↗(On Diff #145958)

They cut down the number of check lines for these <4 x i32> tests under SSE1,
since that type can't be passed natively as an argument.
https://godbolt.org/g/azr1NQ
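A minimal sketch of the load-based form being described (the function name here is hypothetical; the real tests cover more operand combinations). Passing pointers and loading inside the function keeps the SSE1 argument-passing noise out of the checked asm:

  define <4 x i32> @out_v4i32_via_loads(<4 x i32>* %px, <4 x i32>* %py, <4 x i32>* %pmask) {
    %x = load <4 x i32>, <4 x i32>* %px, align 16
    %y = load <4 x i32>, <4 x i32>* %py, align 16
    %mask = load <4 x i32>, <4 x i32>* %pmask, align 16
    %mx = and <4 x i32> %x, %mask
    %notmask = xor <4 x i32> %mask, <i32 -1, i32 -1, i32 -1, i32 -1>
    %my = and <4 x i32> %y, %notmask
    %r = or <4 x i32> %mx, %my
    ret <4 x i32> %r
  }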

test/CodeGen/X86/unfold-masked-merge-vector-variablemask.ll
2 ↗(On Diff #145470)

Yes, I did read that tool's code, and I found where the problem happens:
/build/llvm/utils/UpdateTestChecks/common.py

def build_function_body_dictionary(function_re, scrubber, scrubber_args,
                                   raw_tool_output, prefixes, func_dict, verbose):
    ...
    for prefix in prefixes:
      if func in func_dict[prefix] and func_dict[prefix][func] != scrubbed_body:
        if prefix == prefixes[-1]:
          print('WARNING: Found conflicting asm under the '
                'same prefix: %r!' % (prefix,), file=sys.stderr)
        else:
          func_dict[prefix][func] = None   # <-- HERE
          continue

      func_dict[prefix][func] = scrubbed_body

For X86, I think you should remove the 8/16/32/64 vector width tests; I don't see them adding much (type legalization...). We'd be better off adding 256-bit vector widths.

we'd be better off adding 256-bit vector widths.

I initially had 256-bit and 512-bit vectors here too, but they did not seem to add any additional coverage
while adding a lot of lines, so I dropped them. But if you say they are needed, I can certainly re-add them.

For X86, I think you should remove the 8/16/32/64 vector width tests; I don't see them adding much (type legalization...).

That is basically the same question as (or the opposite of) adding wider vector tests.
Vector widths will keep increasing, and it's unlikely we'll test every single one of them.
But the narrower vector widths (8/16/32/64) will stay,
so I'm not yet seeing the logic as to why we'd want to test wider vectors but not narrower ones.

Add 256-bit vector tests back to the X86 tests, as suggested by @RKSimon.
I have kept the sub-128-bit vectors, for now anyway.

One observation I can make: under SSE1, the v8i32 cases are scalarized, not split into 2x v4i32.
It looks like a bug?

spatel accepted this revision.May 20 2018, 8:56 AM

One observation I can make: under SSE1, the v8i32 cases are scalarized, not split into 2x v4i32.
It looks like a bug?

I think that's the same thing I mentioned in D46528. We're not transforming to x86-specific opcodes early, so legalization does the safe thing for the illegal types - scalarization. Nobody cares about the perf of SSE1-only at this point, so it's not worth fixing IMO. That also means it's not worth testing so thoroughly here (and I think even x86 fans' eyeballs would fall out before they get anywhere close to confirming those entire asm sequences are correct!).

So I'd second Simon's opinion that we just remove the SSE1 run, but if you insist...LGTM. :)

This revision is now accepted and ready to land.May 20 2018, 8:56 AM

Thank you for the review!

One observation I can make: under SSE1, the v8i32 cases are scalarized, not split into 2x v4i32.
It looks like a bug?

I think that's the same thing I mentioned in D46528. We're not transforming to x86-specific opcodes early, so legalization does the safe thing for the illegal types - scalarization. Nobody cares about the perf of SSE1-only at this point, so it's not worth fixing IMO. That also means it's not worth testing so thoroughly here (and I think even x86 fans' eyeballs would fall out before they get anywhere close to confirming those entire asm sequences are correct!).

So I'd second Simon's opinion that we just remove the SSE1 run, but if you insist...LGTM. :)

Yeah, I'm conflicted.
I agree that it is not of much use, but then why do we have separate hasSSE1() and hasSSE2() knobs? :)
And if they are separate, and the tests clearly indicate that it matters here...

So I'll let this sit here for a bit longer and wait for @RKSimon to re-confirm that they should be dropped.

I agree that it is not of much use, but then why do we have separate hasSSE1() and hasSSE2() knobs? :)

The knobs are cemented with every bad architectural decision, and the compiler is required to honor that possibility...so x86 baggage keeps growing, but eventually we stop caring about the old stuff because there's no reward in that (just don't crash).

This revision was automatically updated to reflect the committed changes.
llvm/trunk/test/CodeGen/X86/unfold-masked-merge-vector-variablemask-const.ll