Also, I think one more piece is missing: shouldn't a blend instruction be used?
If you're thinking of pblendvb or similar SSE4.1 instructions: "The mask bits are the most significant bit in each...element."
So those aren't bitwise selects; they're element-size selects. There might be some flavor of AVX512 that has a bsl equivalent? I still don't know AVX512.
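For context, here is a minimal hand-written sketch of the variable-mask masked-merge pattern being discussed; the function below is an illustration, not copied from the test files:

```llvm
; Bitwise select: each bit of %mask independently picks the corresponding bit
; of %x (where the mask bit is 1) or of %y (where the mask bit is 0).
define <4 x i32> @masked_merge_v4i32(<4 x i32> %x, <4 x i32> %y, <4 x i32> %mask) {
  %mx = and <4 x i32> %x, %mask
  %notmask = xor <4 x i32> %mask, <i32 -1, i32 -1, i32 -1, i32 -1>
  %my = and <4 x i32> %y, %notmask
  %r = or <4 x i32> %mx, %my
  ret <4 x i32> %r
}
```

A pblendvb-style instruction only consults the top bit of each element, so it can cover this only when every mask element is known to be all-ones or all-zeros; a per-bit select such as XOP's PCMOV (or AArch64's bsl) matches the pattern directly.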
I'm usually in favor of more thorough regression testing, but this patch is where computing reality takes over for me. :)
The x86 file is at least 10x bigger than I'd prefer. This level of testing permutation is not sustainable. At some point, running 'make check' for x86 will take so long that people will stop doing it. This may already be happening... Personally, I used to run 'make check' for x86 with debug builds, but it takes too long now, so I only check with release+asserts, and even that takes too long on a portable.
This is all just my opinion. AFAIK, there are no official LLVM guidelines on this, so if the consensus is that this level of testing is better, I'll live with it.
But there are a few simple ways to shrink this patch:
- One illegal vector type (or one smaller-than-legal and one larger-than-legal) is fine to prove that we don't do something crazy; more than one doesn't add much value.
- x86 adds to the vector ISA with every chip, but there's little or no difference for this transform, so SSE1 and AVX1 (and one flavor of AVX512?) are all that's useful.
- Use multiple '--check-prefix' options per RUN line to reduce the number of CHECK lines (I think '--check-prefixes' should work too). See existing vector test files for examples; a rough sketch follows below.
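A rough sketch of the prefix-sharing idea (the triples, -mattr strings, and prefix names below are invented for illustration, not taken from the patch); asm that is identical under both configurations only needs to be checked once, under the shared CHECK prefix:

```llvm
; Two RUN lines that share the generic CHECK prefix and add one config-specific prefix each.
; RUN: llc -mtriple=x86_64-unknown-linux-gnu -mattr=+sse2 < %s | FileCheck %s --check-prefixes=CHECK,CHECK-SSE
; RUN: llc -mtriple=x86_64-unknown-linux-gnu -mattr=+avx < %s | FileCheck %s --check-prefixes=CHECK,CHECK-AVX
```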
After actually implementing the vector pattern unfolding, it turned out that most of these tests weren't really interesting, so I've cut them down somewhat.
test/CodeGen/X86/machine-cp.ll | ||
---|---|---|
13 ↗ | (On Diff #145470) | Unrelated white-space diff? This patch is NFC already, so I think it's OK to keep it here, or split it out if I'm missing something. |
test/CodeGen/X86/unfold-masked-merge-vector-variablemask.ll | ||
---|---|---|
2 ↗ | (On Diff #145470) | Why is this config interesting? IMO, it just distracts from the cases that we do care about, but I may not be seeing it. |
731–733 ↗ | (On Diff #145470) | I still doubt the value of everything above here (illegal vector types), but if you think it shows/proves something... |
test/CodeGen/X86/machine-cp.ll | ||
---|---|---|
13 ↗ | (On Diff #145470) | Yep. This function is affected by the dependent revision, so why not fix this space right away? |
test/CodeGen/X86/unfold-masked-merge-vector-variablemask.ll | ||
---|---|---|
2 ↗ | (On Diff #145470) | I think we do want to check that we don't do anything stupid in the andn-less case. |
731–733 ↗ | (On Diff #145470) | I do think it does; note that these illegal vectors improve in D46528. |
test/CodeGen/X86/unfold-masked-merge-vector-variablemask-const.ll | ||
---|---|---|
13 ↗ | (On Diff #145470) | Irrelevant to this patch, but it's interesting that SSE1 can't do the xorps+cmpps approach to generate all-ones (like we do on AVX1 for YMM). |
test/CodeGen/X86/unfold-masked-merge-vector-variablemask.ll | ||
---|---|---|
2 ↗ | (On Diff #145470) | I agree, the non-SSE tests aren't very useful (and I'm a little dubious about SSE1, tbh). What MIGHT be useful is a XOP pass: `; RUN: llc -mtriple=x86_64-unknown-linux-gnu -mattr=+sse,-sse2 < %s \| FileCheck %s --check-prefixes=CHECK,CHECK-SSE,CHECK-SSE1` As XOP is the only X86 ISA with a bsl-style vector instruction (PCMOV), maybe add this to both x86 test files? |
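For reference, a minimal sketch of what such an XOP RUN line could look like; the exact -mattr string and prefix names here are assumptions, not lines taken from the patch:

```llvm
; RUN: llc -mtriple=x86_64-unknown-linux-gnu -mattr=+xop < %s | FileCheck %s --check-prefixes=CHECK,CHECK-XOP
```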
- Add X86 XOP test.
- Correctly use --check-prefixes=, nice.
test/CodeGen/X86/unfold-masked-merge-vector-variablemask.ll | ||
---|---|---|
2 ↗ | (On Diff #145470) | BTW, is this --check-prefixes= magic documented somewhere? |
test/CodeGen/X86/unfold-masked-merge-vector-variablemask-const.ll | ||
---|---|---|
39–41 ↗ | (On Diff #145958) | What's the reason for using loads in these tests? |
test/CodeGen/X86/unfold-masked-merge-vector-variablemask.ll | ||
---|---|---|
2 ↗ | (On Diff #145470) | I didn't experiment with this particular case, but I think you hit a long-standing shortcoming in update_llc_test_checks.py: `# FIXME: We should use multiple check prefixes to common check lines. For now, we just ignore all but the last.` ...so it only matches common lines based on the order that they are encountered/created. |
Use the load trick in test/CodeGen/X86/unfold-masked-merge-vector-variablemask.ll too, not just in test/CodeGen/X86/unfold-masked-merge-vector-variablemask-const.ll.
Thanks to @spatel for noticing the discrepancy.
test/CodeGen/X86/unfold-masked-merge-vector-variablemask-const.ll | ||
---|---|---|
39–41 ↗ | (On Diff #145958) | They cut the count of CHECK lines for these <4 x i32> tests for SSE1. |
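A sketch of the load trick being referred to, with a hypothetical function (names and shape invented for illustration, in the typed-pointer IR syntax of that era): the <4 x i32> operands arrive through pointers and are loaded explicitly instead of being passed by value.

```llvm
define <4 x i32> @out_v4i32_loads(<4 x i32>* %px, <4 x i32>* %py, <4 x i32>* %pmask) nounwind {
  %x = load <4 x i32>, <4 x i32>* %px
  %y = load <4 x i32>, <4 x i32>* %py
  %mask = load <4 x i32>, <4 x i32>* %pmask
  %mx = and <4 x i32> %x, %mask
  %notmask = xor <4 x i32> %mask, <i32 -1, i32 -1, i32 -1, i32 -1>
  %my = and <4 x i32> %y, %notmask
  %r = or <4 x i32> %mx, %my
  ret <4 x i32> %r
}
```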
test/CodeGen/X86/unfold-masked-merge-vector-variablemask.ll | ||
---|---|---|
2 ↗ | (On Diff #145470) | Yes, I did read that tool, and did find where the problem happens (see the excerpt from `build_function_body_dictionary()` below). |

```python
def build_function_body_dictionary(function_re, scrubber, scrubber_args,
                                   raw_tool_output, prefixes, func_dict, verbose):
  ...
  for prefix in prefixes:
    if func in func_dict[prefix] and func_dict[prefix][func] != scrubbed_body:
      if prefix == prefixes[-1]:
        print('WARNING: Found conflicting asm under the '
              'same prefix: %r!' % (prefix,), file=sys.stderr)
      else:
        func_dict[prefix][func] = None   # <- HERE
      continue
    func_dict[prefix][func] = scrubbed_body
```
For X86 I think you should remove the 8/16/32/64-bit vector width tests; I don't see them adding much (type legalization...). We'd be better off adding 256-bit vector widths.
> We'd be better off adding 256-bit vector widths.
I initially had 256- and 512-bit vectors here too, but they did not seem to add any additional coverage
and added a lot of lines, so I dropped them. But if you say they are needed, I can certainly re-add them?
> For X86 I think you should remove the 8/16/32/64-bit vector width tests; I don't see them adding much (type legalization...)
That is basically the same question as (or the opposite of) adding wider vector tests.
The vector widths will keep increasing, and it's unlikely we'll test every single one of them.
But the narrower vector widths (8/16/32/64-bit) will stay,
so I'm not yet seeing the logic as to why we'd want to test wider vectors but not narrower ones.
Add 256-bit vector tests back to X86 tests, as suggested by @RKSimon.
I have kept the sub-128-bit vectors, for now anyway.
One observation I can make: under SSE1, the v8i32 cases are scalarized, not split into 2x v4i32.
That looks like a bug?
I think that's the same thing I mentioned in D46528. We're not transforming to x86-specific opcodes early, so legalization does the safe thing for the illegal types - scalarization. Nobody cares about the perf of SSE1-only at this point, so it's not worth fixing IMO. That also means it's not worth testing so thoroughly here (and I think even x86 fans' eyeballs would fall out before they get anywhere close to confirming those entire asm sequences are correct!).
So I'd second Simon's opinion that we just remove the SSE1 run, but if you insist...LGTM. :)
Thank you for the review!
> So I'd second Simon's opinion that we just remove the SSE1 run, but if you insist...LGTM. :)
Yeah, I'm conflicted.
I agree that it is not of much use, but then why do we have separate hasSSE1() and hasSSE2() knobs? :)
And if they are separate, and the tests clearly indicate that it matters here...
So I'll let this sit here for a bit longer and wait for @RKSimon to re-confirm that they should be dropped.
The knobs are cemented with every bad architectural decision, and the compiler is required to honor that possibility...so x86 baggage keeps growing, but eventually we stop caring about the old stuff because there's no reward in that (just don't crash).