This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Add BFE pattern matches for constrained shifts.
Needs RevisionPublic

Authored by Pierre-vh on Sep 12 2021, 9:57 PM.

Details

Summary

042e0883cbcd35641d60fd2d22105ac5c6a402f8 made clang conform to
https://www.khronos.org/registry/OpenCL/specs/3.0-unified/html/OpenCL_C.html#operators-shift

So, if a clang -x c ... emits:

%sub = sub i32 32, %bits
%shl = shl i32 %a, %sub

The equivalent clang -x cl ... after optimization can emit:

%sub = sub i32 0, %bits
%shl.mask = and i32 %sub, 31
%shl = shl i32 %a, %shl.mask

This change adds the BFE pattern match for the -x cl output.

Diff Detail

Event Timeline

abinavpp created this revision.Sep 12 2021, 9:57 PM
abinavpp requested review of this revision.Sep 12 2021, 9:57 PM
Herald added a project: Restricted Project. · View Herald TranscriptSep 12 2021, 9:57 PM
foad requested changes to this revision.Sep 13 2021, 2:00 AM

This will give the wrong result when $width is zero.

This revision now requires changes to proceed.Sep 13 2021, 2:00 AM

Can we more generally just fold out and of 31 since the shifts ignore the high bits anyway?

This will give the wrong result when $width is zero.

I'm not seeing the problem with 0?

This will give the wrong result when $width is zero.

I'm not seeing the problem with 0?

The introduced (srl (shl $src, (and (sub 0, $width), 31)), (and (sub 0, $width), 31))
on $width = 0 evaluates to (srl (shl $src, 0), 0) = $src != (v_bfe_u32 $src, 0, 0)

The original (srl (shl $src, (sub 32, $width)), (sub 32, $width)) on $width = 0
will evaluate to (srl (shl $src, 32), 32). Assuming the ISD::SHL's semantics is
the same as that of LLVM-IR's shl, (shl $src, 32) yields a poison value. If I
understand correctly, selecting v_bfe_u32 here is acceptable but not for the
pattern introduced in this revision.

I think we need the constrained shift pattern to be of the form (srl (shl $src,
(and (sub 32, $width), 31)), (and (sub 32, $width), 31)), which, on $width = 0,
will evaluate to (srl (shl $src, 32), 32) like the original pattern.

I wrote the (sub 0, $width) version since inst-combine, on visiting (and (sub
32, $width), 31), transforms the sub using SimplifyDemandedInstructionBits()
to (sub 0, $width). You can see this in the compilation of
https://github.com/RadeonOpenCompute/ROCm/issues/989 's test.

I'm not sure if there's a reasonable way for inst-combine's visitAnd() to stop
doing this so that we can match the BFE pattern correctly.

Can we more generally just fold out and of 31 since the shifts ignore the high bits anyway?

Did you mean https://alive2.llvm.org/ce/z/wE_G9a ?

foad added a comment.Sep 14 2021, 1:49 AM

This will give the wrong result when $width is zero.

I'm not seeing the problem with 0?

If $width is zero then the patterns being matched return $src unchanged, but the V_BFE instruction returns zero.

Can we more generally just fold out and of 31 since the shifts ignore the high bits anyway?

Did you mean https://alive2.llvm.org/ce/z/wE_G9a ?

I guess you meant https://reviews.llvm.org/D110231

Pierre-vh commandeered this revision.Feb 1 2023, 12:28 AM
Pierre-vh added a reviewer: abinavpp.
Pierre-vh added a subscriber: Pierre-vh.

I'm going to try and finish this one. @foad, @arsenm do you remember what needed to be done here to get this in a land-able state?

Herald added a project: Restricted Project. · View Herald TranscriptFeb 1 2023, 12:28 AM
foad added a comment.Feb 1 2023, 12:34 AM

I'm going to try and finish this one. @foad, @arsenm do you remember what needed to be done here to get this in a land-able state?

I don't remember any more than what I wrote in previous comments.