Page MenuHomePhabricator

[IR] Rename the shufflevector's undef mask to poison
Needs ReviewPublic

Authored by aqjune on Jun 8 2021, 12:36 AM.

Details

Summary

This is a patch that renames shufflevector's undef mask to poison.

By D93818, shufflevector's undef mask isn't undef anymore; it returns poison instead.

%v = shufflevector <2 x i8> %x, <2 x i8> %y, <2 x i8> <i8 0, i8 poison>
; %v[0] = %x[0]
; %v[1] = poison

Since poison is more undefined than undef, this validates many existing transformations that we wanted to support.
Also, this allows more aggressive optimizations because poison is more propagative (e.g. poison & 0 = poison whereas undef & 0 != undef).

This patch updates shufflevector mask's printed string to be poison to match its new semantics.

This has changes in clang tests as well.
They are mainly about vector intrinsics being lowered into shufflevector.
The unused elements were filled with undef previously, but with this patch they are filled with poison.
Since they are unused elements anyway, I believe this isn't a functional change in fact.
But, I'm happy with this being double-checked by someone who works on these intrinsics as well.

Diff Detail

Event Timeline

aqjune created this revision.Jun 8 2021, 12:36 AM
aqjune requested review of this revision.Jun 8 2021, 12:36 AM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptJun 8 2021, 12:36 AM

I noted the cases where it looks like the undef->poison change might actually impact code using compiler intrinsic functions that have external specifications. The relevant specifications say the elements in question are "undefined", without really specifying what that means.

Currently, for the Intel intrinsics, we treat "undefined" as something more conservative than LLVM undef; see https://github.com/llvm/llvm-project/blob/d2012d965d60c3258b3a69d024491698f8aec386/clang/lib/CodeGen/CGBuiltin.cpp#L12483 . Maybe we should make the cast intrinsics more conservative to match. And maybe we should do the same for OpenCL. Would need to do some backend work to make sure we don't regress the generated code, though.

For __builtin_shufflevector, I think I'm fine with this changing the "-1" to mean poison; we don't have any external spec to conform to, and anyone explicitly passing -1 should know what they're doing. But maybe worth noting in the clang documentation.

clang/test/CodeGen/X86/avx-builtins.c
182

This change might be visible to user code.

clang/test/CodeGen/builtinshufflevector2.c
41

This might be visible to user code.

clang/test/CodeGenOpenCL/preserve_vec3.cl
27

This change might be visible to user code.

aqjune added a comment.EditedJun 8 2021, 7:50 PM

I noted the cases where it looks like the undef->poison change might actually impact code using compiler intrinsic functions that have external specifications. The relevant specifications say the elements in question are "undefined", without really specifying what that means.

Currently, for the Intel intrinsics, we treat "undefined" as something more conservative than LLVM undef; see https://github.com/llvm/llvm-project/blob/d2012d965d60c3258b3a69d024491698f8aec386/clang/lib/CodeGen/CGBuiltin.cpp#L12483 . Maybe we should make the cast intrinsics more conservative to match. And maybe we should do the same for OpenCL. Would need to do some backend work to make sure we don't regress the generated code, though.

Makes sense, the PR (https://llvm.org/PR32176) that is left at the comment says it should be something like freeze poison as well. (BTW, this means the current shufflevector lowering is already incorrect as well..)

Then, _mm256_castsi128_si256 should be lowered into something like this:

%fr = freeze <2 x i64> poison
shufflevector <2 x i64> %x, <2 x i64> %fr, <4 x i32> <i32 0, i32 1, i32 2, i32 3>

BTW, the Intel intrinsic guide for _mm256_castsi128_si256 ( https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm256_castsi128_si256&expand=628 ) says:

This intrinsic is only used for compilation and does not generate any instructions, thus it has zero latency.

We should teach the backend to understand the shufflevector+freeze form and lower it into an efficient assembly.

I *guess* teaching this form to the backend will be enough in terms of performance.
In practice, the frozen element won't be used in most of the cases; the middle-end's demanded elements analysis will trigger instcombine to almost always remove the freeze.

What do you think? If people agree with the shufflevector+freeze lowering, I'll create a separate patch that lowers this to the new freeze+shufflevector format (since it is already incorrect).

In practice, the frozen element won't be used in most of the cases; the middle-end's demanded elements analysis will trigger instcombine to almost always remove the freeze.

Well, in the cases it gets removed, it doesn't really matter what we use. It's likely if someone is reaching for these more obscure constructs, they're seeing that the compiler isn't doing what they want with more normal code, though.

What do you think? If people agree with the shufflevector+freeze lowering, I'll create a separate patch that lowers this to the new freeze+shufflevector format (since it is already incorrect).

Using "freeze poison" seems reasonable.

RKSimon added inline comments.Sep 20 2021, 3:36 AM
clang/test/CodeGen/X86/avx-builtins.c
182

Yes the length changing casts are worrying me as well - we could update the header to insert zero into the upper elements I suppose, in many cases these would be folded away by AVX ops implicitly zeroing the 128-bits. But we'd definitely have the potential for regressions.

1237

These look out of date - D109497 changes the loadu2 codegen to be a single 'concat' shuffle.

aqjune updated this revision to Diff 375095.Sep 26 2021, 6:36 AM

Resurrect mistakenly removed test statements

aqjune marked an inline comment as done.Sep 26 2021, 7:04 AM
aqjune added inline comments.
clang/test/CodeGen/X86/avx-builtins.c
182

I quickly skimmed through the headers in clang/lib/Headers and listed the functions calling __builtin_shufflevector with at least one -1 mask operand.
It seems there aren't very many, which is good news; I found 17 functions only (

).

But, correctly fixing these headers seems to require a lot of work.
Since using the zero vector can cause performance regressions, we need to use a frozen poison (undef) vector to encode a vector having unspecified bits.
A few months ago, I created D104790 to start using freeze(vector poison) for mm*_undefined* intrinsics. However, teaching the existing codebase to successfully deal with the frozen poison vector was a pretty tough job.
When it comes to fixing the headers, there is even no C intrinsic function that represents a frozen poison vector AFAIK.

I'll appreciate any idea or help in addressing this issue. :/

It seems llvm/lib/Target/X86/X86ISelLowering.cpp's LowerAVXCONCAT_VECTORS is relevant to efficient lowering of shufflevector %x, freeze(poison), mask.

Herald added a project: Restricted Project. · View Herald TranscriptJun 19 2022, 4:55 AM
Herald added subscribers: jsji, kosarev. · View Herald Transcript

It seems llvm/lib/Target/X86/X86ISelLowering.cpp's LowerAVXCONCAT_VECTORS is relevant to efficient lowering of shufflevector %x, freeze(poison), mask.

After patching LowerAVXCONCAT_VECTORS, lowering https://github.com/llvm/llvm-project/blob/main/llvm/test/CodeGen/X86/avx-intrinsics-fast-isel.ll#L257 generates:

vblendps  $15, %ymm0, %ymm0, %ymm0

To make it fully no-op, tablegen files must be edited. I gave it a try, .td compiled successfully, but weirdly - perhaps due to either incorrect use of tablegen's pattern matcher or some hidden rule that I didn't address - vblendps is still there. The written patch is as follows:
https://github.com/aqjune/llvm-project/commit/b4393e36b33ca08ce77ae662479ceaf9a76eab8b

One of relevant, edited parts:

// llvm/lib/Target/X86/X86InstrVecCompiler.td
  def : Pat<(VT (insert_subvector undef, subRC:$src, (iPTR 0))),
            (VT (INSERT_SUBREG (IMPLICIT_DEF), subRC:$src, subIdx))>;
+
+  def : Pat<(VT (insert_subvector (freeze undef), subRC:$src, (iPTR 0))),
+            (VT (INSERT_SUBREG (IMPLICIT_DEF), subRC:$src, subIdx))>;

I spent some time but couldn't figure out why it does not work.
Can someone tell me whether the pattern matching is being correctly used? Any help is appreciated.

nikic added a comment.Jun 27 2022, 2:04 AM

Which intrinsic are you working on here? If this is about the mm_undefined intrinsics, why do we need to change those from the current status quo of using a zero value instead of undef?

Which intrinsic are you working on here? If this is about the mm_undefined intrinsics, why do we need to change those from the current status quo of using a zero value instead of undef?

It is about the mm256_castpd128_pd256 intrinsic and its friends (clang/test/CodeGen/X86/avx-builtins.c, line 146).
It was previously using shufflevector with undef masks - since the results are poison, an alternative pattern as below is necessary to represent the intrinsic:

%a1 = freeze <2 x double> poison
%res = shufflevector <2 x double> %a0, <2 x double> %a1, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
nikic added a comment.Jun 27 2022, 2:36 AM

Which intrinsic are you working on here? If this is about the mm_undefined intrinsics, why do we need to change those from the current status quo of using a zero value instead of undef?

It is about the mm256_castpd128_pd256 intrinsic and its friends (clang/test/CodeGen/X86/avx-builtins.c, line 146).
It was previously using shufflevector with undef masks - since the results are poison, an alternative pattern as below is necessary to represent the intrinsic:

%a1 = freeze <2 x double> poison
%res = shufflevector <2 x double> %a0, <2 x double> %a1, <4 x i32> <i32 0, i32 1, i32 2, i32 3>

How sure are we that we cannot simply use poison elements here? I checked what the Intel compiler guide has to say on the topic, and it uses the following wording. For "undefined" style intrinsics:

This intrinsic returns a vector of eight single precision floating point elements. The content of the vector is not specified.

For "cast" style intrinsics:

The lower 128-bits of the 256-bit resulting vector contains the source vector values; the upper 128-bits of the resulting vector are undefined. This intrinsic does not introduce extra moves to the generated code

It's not really clear what "undefined" is supposed to mean here (and how it differs from "not specified").

Unless we're aware of a specific problems in this area, I think it's okay to start out with just doing the undef -> poison replacement, and possibly backtrack if there are real-world assumptions about the specific meaning of "undefined" in this context.

PR31524 (https://github.com/llvm/llvm-project/issues/31524) discusses about the lowering of such intrinsics.
According to the PR, it seems users consider undefined intrinsics as returning unknown but consistent bits.

If it works, my updates in the draft (https://github.com/aqjune/llvm-project/commit/b4393e36b33ca08ce77ae662479ceaf9a76eab8b) will improve backends dealing with freeze(poison), which will help lowering _mm256_cast* with zero copy.

aqjune added inline comments.Wed, Aug 10, 9:48 PM
clang/test/CodeGen/X86/avx-builtins.c
182

Okay, D130339 has finally been merged.

I will make a patch that updates the mm256_castsi128_si256 and its family functions to emit shufflevector with freeze(poison) operand.