Page MenuHomePhabricator

[AMDGPU][GlobalISel] Add Shift/Shufflevector Combine
AbandonedPublic

Authored by Pierre-vh on Sep 26 2022, 3:55 AM.

Details

Reviewers
arsenm
Summary

Folds G_LSHR + G_SHUFFLEVECTOR away when the shift amount
is 1/2 of the destination scalar size, and the mask starts with 1.

Depends on D134870.

Diff Detail

Event Timeline

Pierre-vh created this revision.Sep 26 2022, 3:55 AM
Herald added a project: Restricted Project. · View Herald TranscriptSep 26 2022, 3:55 AM
Pierre-vh requested review of this revision.Sep 26 2022, 3:55 AM
Herald added a project: Restricted Project. · View Herald TranscriptSep 26 2022, 3:55 AM

Another thing worth thinking about is whether we should really be using legal vector_shuffle at all. The DAG path doesn't use it. Given that we have such a restricted set of shuffles (only 0, 1), it may still make sense to keep lowering to simple BUILD_VECTORs with 2 elements

llvm/lib/Target/AMDGPU/AMDGPURegBankCombiner.cpp
357–364

Should use mi_match

384

eraseFromParent

Pierre-vh updated this revision to Diff 462900.Sep 26 2022, 7:23 AM
Pierre-vh marked 2 inline comments as done.

Comments

I haven't touched the legalization of G_SHUFFLE_VECTOR I think

llvm/lib/Target/AMDGPU/AMDGPURegBankCombiner.cpp
357–364

It doesn't look like there is a m_GShuffleVector, do I need to add it myself? I'm not too familiar with MIPatternMatch

arsenm added inline comments.Sep 26 2022, 7:26 AM
llvm/lib/Target/AMDGPU/AMDGPURegBankCombiner.cpp
357–364

Yes. This one is also a bit strange with the shuffle mask operand

Use mi_match

foad added inline comments.Sep 27 2022, 1:54 AM
llvm/lib/Target/AMDGPU/AMDGPURegBankCombiner.cpp
338–340

This doesn't make sense to me:

  • The original has 0s in its high bits but the replacement does not.
  • AMDGPU is little-endian so (G_BITCAST (G_SHUFFLE_VECTOR %a, %b, shufflemask(1, ?))) puts the high bits of %a in the low bits of the result - and then the G_LSHR shifts them out completely.
369

"... of the SHUFFLE op"?

377

LHS and RHS could have more than two elements, in which case I would expect this bitcast to fail machine verification. Can you add a test for that case?

Pierre-vh updated this revision to Diff 463161.Sep 27 2022, 2:54 AM
Pierre-vh marked 3 inline comments as done.

Comments

llvm/lib/Target/AMDGPU/AMDGPURegBankCombiner.cpp
338–340

This is really a headache to understand for me, but IIRC endianness doesn't affect values themselves, but how they're represented in memory? (https://stackoverflow.com/questions/7184789/does-bit-shift-depend-on-endianness)

So I think that G_SHUFFLE_VECTOR (1,?) puts the low bits in the high bits of the results, and the LSHR puts them back in place.

%a:(<2 x s16>) = [0xDEAD, 0xBEEF]
%b:(<2 x s16>) = undef
%c = G_SHUFFLE_VECTOR %a, %b, shufflemask(1,0)
; %c = [0xBEEF, 0xDEAD] 
%d = G_LSHR %c, 16
; %d = [0x0000, 0xBEEF]

So the lower 16 words stay the same, but indeed a difference before/after the combine is that the high words have 0000 if we leave the combine, but now they have "junk".

Maybe an additional trunc is needed in between or something?

What do you think @arsenm ?

377

Good catch, I'll enforce that the operands must be the same size as the dest.

Pierre-vh updated this revision to Diff 463162.Sep 27 2022, 3:05 AM

Fix type check

foad added inline comments.Sep 27 2022, 3:14 AM
llvm/lib/Target/AMDGPU/AMDGPURegBankCombiner.cpp
338–340

I think you need to check for a mask of (?,0) instead of (1,?).

Pierre-vh added inline comments.Sep 27 2022, 3:28 AM
llvm/lib/Target/AMDGPU/AMDGPURegBankCombiner.cpp
338–340

I think the mask is fine, for instance, in the test this is trying to fix, there's those two shufflevectors that need to be folded away to match the DAG codegen

%3:_(<2 x s16>) = G_SHUFFLE_VECTOR %0:_(<2 x s16>), %4:_, shufflemask(1, 0)
%37:_(s32) = G_BITCAST %3:_(<2 x s16>)
%33:_(s32) = G_CONSTANT i32 16
%38:_(s32) = G_LSHR %37:_, %33:_(s32)
%29:_(s16) = G_TRUNC %38:_(s32)

%5:_(<2 x s16>) = G_SHUFFLE_VECTOR %2:_(<2 x s16>), %4:_, shufflemask(1, 1)
%32:_(s32) = G_BITCAST %5:_(<2 x s16>)
%34:_(s32) = G_LSHR %32:_, %33:_(s32)
%21:_(s16) = G_TRUNC %34:_(s32)

Though to make this fully correct I think I need to take the trunc into account, so I will do that.

Pierre-vh updated this revision to Diff 463180.Sep 27 2022, 3:47 AM

Take trunc into account

Also note using legal shuffle_vector is a difference from the DAG we could go back on. Ultimately it's probably fewer total combines to only have to think about 16-bit shifts

arsenm added inline comments.Sep 28 2022, 9:54 AM
llvm/lib/Target/AMDGPU/AMDGPURegBankCombiner.cpp
338–339

Should break up this comment to avoid wrapping the pattern bit

368

LHS and RHS have to have the same type

394

Need to call the observer if modifying the argument list

llvm/test/CodeGen/AMDGPU/GlobalISel/combine-trunc-shift-shufflevector.mir
273

Could use a test where the shuffle has multiple users

Pierre-vh marked 4 inline comments as done.

Comments

foad added inline comments.Sep 29 2022, 1:37 AM
llvm/lib/Target/AMDGPU/AMDGPURegBankCombiner.cpp
338–340

This is really a headache to understand for me, but IIRC endianness doesn't affect values themselves, but how they're represented in memory?

It affects bitcasts. See https://llvm.org/docs/LangRef.html#vector-type:

A bitcast from a vector type to a scalar integer type will see the elements being packed together (without padding). The order in which elements are inserted in the integer depends on endianess. For little endian element zero is put in the least significant bits of the integer, and for big endian element zero is put in the most significant bits.

Pierre-vh abandoned this revision.Sep 30 2022, 4:57 AM

Not the right kind of fix + too controversial of a change for a relatively small regression in mad_mix GISel.