This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Add constrained shift pattern matches.
ClosedPublic

Authored by abinavpp on Sep 22 2021, 4:27 AM.

Details

Summary

The motivation for this is clang's conformance to
https://www.khronos.org/registry/OpenCL/specs/3.0-unified/html/OpenCL_C.html#operators-shift
which makes clang emit (<shift> a, (and b, <width> - 1)) for a <shift> b
in OpenCL, where a is an integer of bit width <width>. For example, for a
32-bit a, a << b is emitted as (shl a, (and b, 31)).

Diff Detail

Event Timeline

abinavpp created this revision.Sep 22 2021, 4:27 AM
abinavpp requested review of this revision.Sep 22 2021, 4:27 AM
Herald added a project: Restricted Project.Sep 22 2021, 4:27 AM

Also should test globalisel

llvm/lib/Target/AMDGPU/SIInstructions.td
2520 ↗(On Diff #374181)

Should also handle scalar cases. Also consider handling 32-bit and 16-bit shifts

llvm/test/CodeGen/AMDGPU/shift-opts.ll
13–14 ↗(On Diff #374181)

I would prefer to split each case into a separate function. Should also test the scalar cases

foad added a comment.Sep 22 2021, 7:05 AM

Would it be better to do this as some kind of DAGCombine, so that other patterns involving shifts (like for v_lshl_add etc instructions) can take advantage of it too? I wonder how other targets do this. I see X86 has isUnneededShiftMask() which is used in PatFrags for shift instruction selection.

llvm/lib/Target/AMDGPU/SIInstructions.td
2520 ↗(On Diff #374181)

And 64-bit.

Would it be better to do this as some kind of DAGCombine, so that other patterns involving shifts (like for v_lshl_add etc instructions) can take advantage of it too? I wonder how other targets do this. I see X86 has isUnneededShiftMask() which is used in PatFrags for shift instruction selection.

A combine would be painful because then we have to have custom nodes with these semantics. IIRC X86ISelDAGToDAG does the same basic thing

foad added a comment.Sep 22 2021, 9:01 AM

Would it be better to do this as some kind of DAGCombine, so that other patterns involving shifts (like for v_lshl_add etc instructions) can take advantage of it too? I wonder how other targets do this. I see X86 has isUnneededShiftMask() which is used in PatFrags for shift instruction selection.

A combine would be painful because then we have to have custom nodes with these semantics. IIRC X86ISelDAGToDAG does the same basic thing

Fair enough. How about some handy patfrags that we can use in all isel patterns that involve shifts?

Would it be better to do this as some kind of DAGCombine, so that other patterns involving shifts (like for v_lshl_add etc instructions) can take advantage of it too? I wonder how other targets do this. I see X86 has isUnneededShiftMask() which is used in PatFrags for shift instruction selection.

A combine would be painful because then we have to have custom nodes with these semantics. IIRC X86ISelDAGToDAG does the same basic thing

Fair enough. How about some handy patfrags that we can use in all isel patterns that involve shifts?

That should work

abinavpp updated this revision to Diff 375823.Sep 29 2021, 3:17 AM

Rebased; did not address all review comments since I'm not sure about the
best approach.

I'm still torn between the generic fold approach and the PatFrags approach.

I think the PatFrags approach makes sense, but I'm a bit worried about the
large-scale substitution we would have to do across all the generic shift
opcode references in the Target/AMDGPU .td files.

As @foad mentioned, it would be neater to do (<shift> a, (and b, <width> - 1))
-> (<shift> a, b) during AMDGPU-specific compilation while the <shift> is still
generic, so that other generic-shift-based pattern matches (like the v_lshl_add
pattern match) also benefit. The problem here is that we would be violating the
generic shift's semantics and, as @arsenm mentioned, creating custom nodes would
be painful.

Is there a situation where this fold could go wrong in AMDGPU compilation?
Also, is there a pre-isel, AMDGPU-specific, TableGen way of expressing
(<shift> a, (and b, <width> - 1)) -> (<shift> a, b) that works for both
GlobalISel and SelectionDAG?

abinavpp added inline comments.Sep 29 2021, 3:20 AM
llvm/lib/Target/AMDGPU/SIInstructions.td
2529 ↗(On Diff #375823)

Why can't I see the OPC_MorphNodeTo1 entry in the MatcherTable array of the
generated GenDAGISel.inc for the 64-bit patterns?

abinavpp planned changes to this revision.Sep 29 2021, 3:39 AM
foad added inline comments.Sep 29 2021, 7:00 AM
llvm/lib/Target/AMDGPU/SIInstructions.td
2516 ↗(On Diff #375823)

!sub(!shl(1, width), 1), surely?

foad added a comment.Sep 29 2021, 7:50 AM

I think the PatFrags approach makes sense, but I'm a bit worried about the
large-scale substitution we would have to do across all the generic shift
opcode references in the Target/AMDGPU .td files.

I was hoping we could define something like this (I'm not sure of the exact syntax):

def ShiftAmount32 : PatFrags<(ops node:$src), [(i32 $src), (and (i32 $src), 31)]>;

This would match either a raw shift amount or a shift amount ANDed with 31. Then we could change every pattern that currently contains (shl $a, $b) to (shl $a, (ShiftAmount32 $b)) instead. Yes this will touch every pattern that mentions a generic shift opcode, but I think it's the right thing to do.
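To make that concrete, here is a rough sketch of one way it could look (the names, exact syntax, and the example instruction are assumptions for illustration, and this variant wraps the whole shift rather than just the amount; it is not necessarily the form that gets committed):

// A 32-bit shl that accepts either a plain shift amount or one that has
// already been masked with 31.
def cshl_32 : PatFrags<(ops node:$src0, node:$src1),
  [(shl node:$src0, node:$src1),
   (shl node:$src0, (and node:$src1, (i32 31)))]>;

// Purely illustrative usage: a pattern that today matches
// (shl i32:$a, i32:$b) would be rewritten to use the fragment, so that
// "a << (b & 31)" selects the same instruction as "a << b" without
// materializing the mask.
def : GCNPat <
  (i32 (cshl_32 i32:$a, i32:$b)),
  (V_LSHLREV_B32_e64 $b, $a)
>;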

As @foad mentioned, it would be neater to do (<shift> a, (and b, <width> - 1))
-> (<shift> a, b) during AMDGPU-specific compilation while the <shift> is still
generic, so that other generic-shift-based pattern matches (like the v_lshl_add
pattern match) also benefit. The problem here is that we would be violating the
generic shift's semantics and, as @arsenm mentioned, creating custom nodes would
be painful.

I agree with Matt that this was not a good idea.

abinavpp added inline comments.Sep 29 2021, 6:49 PM
llvm/lib/Target/AMDGPU/SIInstructions.td
2516 ↗(On Diff #375823)

This mask is used to compute the shift amount modulo the width using an and, i.e. 'RHS % width'. See clang's ScalarExprEmitter::ConstrainShiftValue(), which is invoked by shift codegen under if (CGF.getLangOpts().OpenCL).
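For example, with width = 32 the mask is 31 (0b11111), so for b = 37 the emitted (and b, 31) yields 5, which is exactly 37 % 32.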

foad added inline comments.Sep 30 2021, 1:18 AM
llvm/lib/Target/AMDGPU/SIInstructions.td
2516 ↗(On Diff #375823)

You are right of course, sorry for the noise.

abinavpp updated this revision to Diff 381841.Oct 24 2021, 10:34 PM

Changed to the PatFrags approach.

foad added a comment.Oct 25 2021, 7:35 AM

LGTM, thanks!

Possible future improvement: apply this to other instructions like s_bfm and v_alignbit, which have an operand that works like a shift amount.

llvm/lib/Target/AMDGPU/AMDGPUInstructions.td
249

Possible future improvement: instead of ignoring only an AND with this exact mask, ignore an AND with any value that has all of those bits set, but possibly more. For example, you could ignore an AND with 0xFF here. X86's isUnneededShiftMask does this.
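A sketch of what that could look like (this assumes a helper predicate mirroring X86's isUnneededShiftMask; the name and the exact form are assumptions, not the committed follow-up):

// Accept (and x, mask) whenever mask has at least the low 5 bits set, so
// masks such as 31, 63 or 0xFF are all treated as no-ops for a 32-bit
// shift amount.
def csh_mask_32 : PatFrag<(ops node:$src0), (and node:$src0, imm), [{
  return isUnneededShiftMask(N, 5); // assumed helper, as in X86ISelDAGToDAG
}]>;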

foad accepted this revision.Oct 25 2021, 7:35 AM
This revision is now accepted and ready to land.Oct 25 2021, 7:35 AM
This revision was automatically updated to reflect the committed changes.
abinavpp added inline comments.Nov 8 2021, 6:28 PM
llvm/lib/Target/AMDGPU/AMDGPUInstructions.td
249

See D113448.