This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Use V_PERM to match buildvectors when inputs are not canonicalized (i.e. can't use V_PACK)
Closed, Public

Authored by jrbyrnes on Sep 22 2022, 11:26 AM.

Details

Summary

If we cannot prove that the f16 operands of a buildvector are canonicalized, then we cannot lower it into a V_PACK. In this scenario, we would previously lower into some combination of and (SDWA), shr, and or. This patch allows matching into V_PERM instead -- which uses an additional SGPR (or encodes the select mask as a literal in the instruction itself) but has lower VALU latency.
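For illustration, a rough C model of what the two lowerings compute when packing two 16-bit values (held in the low halves of 32-bit registers) into one dword. The helper names and the 0x05040100 selector are only shorthand for this sketch, not necessarily what the patterns emit, and the byte-select semantics follow my reading of the ISA docs (only selectors 0-7 are modeled):

  #include <stdint.h>

  /* Old lowering, roughly: mask, shift, and OR -- three dependent VALU ops. */
  static uint32_t pack_with_shift_and_or(uint32_t lo_src, uint32_t hi_src) {
      uint32_t lo = lo_src & 0xffffu;   /* and (SDWA) */
      uint32_t hi = hi_src << 16;       /* shift */
      return hi | lo;                   /* or */
  }

  /* New lowering: a single V_PERM_B32. src0 supplies bytes 4-7 of the 64-bit
   * input, src1 supplies bytes 0-3; each selector byte in 'sel' picks one. */
  static uint32_t perm_b32(uint32_t s0, uint32_t s1, uint32_t sel) {
      uint64_t bytes = ((uint64_t)s0 << 32) | s1;
      uint32_t d = 0;
      for (int i = 0; i < 4; ++i) {
          uint32_t b = (sel >> (8 * i)) & 0xff;
          if (b > 7)            /* special selectors (>= 8) not modeled here */
              continue;
          d |= (uint32_t)((bytes >> (8 * b)) & 0xff) << (8 * i);
      }
      return d;
  }

  /* pack_with_shift_and_or(lo, hi) == perm_b32(hi, lo, 0x05040100) */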

Change-Id: Ifa4a74fdb81ef44f22ba490c7fdf81ec8aebc945

Diff Detail

Event Timeline

jrbyrnes created this revision. Sep 22 2022, 11:26 AM
Herald added a project: Restricted Project. · View Herald Transcript · Sep 22 2022, 11:26 AM
jrbyrnes requested review of this revision. Sep 22 2022, 11:26 AM
Herald added a project: Restricted Project. · View Herald Transcript · Sep 22 2022, 11:26 AM
jrbyrnes edited the summary of this revision. (Show Details) Sep 22 2022, 12:18 PM
jrbyrnes edited the summary of this revision. (Show Details)
arsenm added inline comments. Sep 22 2022, 12:21 PM
llvm/lib/Target/AMDGPU/SIInstructions.td
2810–2812

Should we just replace this?

2817

Also should cover the integer cases?

llvm/test/CodeGen/AMDGPU/pack.v2f16.ll
2–1

Switching to generated checks should be a separate pre-commit

Hey Matt, thanks for the comments. I'll address them soon; for now I'll add you as a reviewer.

rampitec added inline comments. Sep 22 2022, 12:54 PM
llvm/test/CodeGen/AMDGPU/pack.v2f16.ll
2–1

GCN is misleading here. Use something like GFX7_8

jrbyrnes updated this revision to Diff 462543. Sep 23 2022, 10:26 AM
jrbyrnes marked 2 inline comments as done.

Precommit generated tests pack.v2f16.ll, rebase

jrbyrnes updated this revision to Diff 462545. Sep 23 2022, 10:32 AM

Fix attributes in test

arsenm added inline comments. Sep 23 2022, 11:50 AM
llvm/lib/Target/AMDGPU/SIInstructions.td
2810–2812

This is looking dead

llvm/test/CodeGen/AMDGPU/v_perm_non_canon.ll
3 ↗(On Diff #462545)

Probably should move this in with vector_shuffle.packed.ll

4 ↗(On Diff #462545)

Should use addrspace(1)*

36 ↗(On Diff #462545)

Separate functions for each tested shuffle.

Also needs versions using i16.

jrbyrnes updated this revision to Diff 462630. Sep 23 2022, 5:29 PM
jrbyrnes marked 2 inline comments as done.

Address review comments.

Added some patterns to account for the cases where we are trying to concat:
v0[0]:v1[1]
v0[1]:v1[0]

Removed some seemingly dead patterns after introducing those.

Pushing as is for potential feedback; still somewhat of a WIP.
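In 32-bit register terms (vector element 0 living in the low 16 bits of the packed dword), the two cross-lane concat shapes look like the sketch below; which source plays which role depends on the buildvector operand order, so treat the a/b naming as an assumption:

  #include <stdint.h>

  /* Packed result takes the low half of one source and the high half of the
   * other, in the two possible arrangements. */
  static uint32_t hi_of_b_lo_of_a(uint32_t a, uint32_t b) {
      return (b & 0xffff0000u) | (a & 0x0000ffffu);   /* D = b.hi : a.lo */
  }
  static uint32_t lo_of_b_hi_of_a(uint32_t a, uint32_t b) {
      return (b << 16) | (a >> 16);                   /* D = b.lo : a.hi */
  }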

jrbyrnes marked 3 inline comments as done. Sep 23 2022, 5:32 PM
arsenm added inline comments. Sep 26 2022, 7:52 AM
llvm/lib/Target/AMDGPU/SIInstructions.td
2785–2794

Can use a class or foreach over the types to avoid repeating the same pattern twice

foad added a comment. Sep 27 2022, 2:01 AM

Can't you use v_alignbit for all the cases where you need the upper 16 bits of one register and the lower 16 bits of the other? It should be smaller than v_perm because the shift amount (16) is an inline constant.

jrbyrnes updated this revision to Diff 463311. Sep 27 2022, 12:26 PM
jrbyrnes marked an inline comment as done.

Precommit generated test + Rebase

Consolidate patterns into foreach

Lower to V_ALIGNBIT for D = V[1].low : V[0].hi

> Can't you use v_alignbit for all the cases where you need the upper 16 bits of one register and the lower 16 bits of the other? It should be smaller than v_perm because the shift amount (16) is an inline constant.

Hey, thanks for the good suggestion! I think this will only work for the case where we want V[1].low : V[0].hi.

In the case where we want V[1].hi : V[0].low, we can't lower to V_ALIGNBIT_B32 $V0, $V1, 16 because that would incorrectly put the bits from $V0 as the MSBs in the dest. On the other hand, V_ALIGNBIT_B32 $V1, $V0, 16 correctly has the bits from $V1 as the MSBs, but they are the lower 16 (and the lower 16 of the dest come from the higher 16 of $V0).
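To spell that out, a small C model of v_alignbit_b32 as I read the ISA docs (src0 supplies the high dword of the 64-bit shift input; treat this as a sketch, not the definitive semantics):

  #include <stdint.h>

  /* v_alignbit_b32 d, s0, s1, s2  =>  d = (uint32_t)({s0,s1} >> (s2 & 31)) */
  static uint32_t alignbit_b32(uint32_t s0, uint32_t s1, uint32_t s2) {
      uint64_t wide = ((uint64_t)s0 << 32) | s1;
      return (uint32_t)(wide >> (s2 & 31));
  }

  /* alignbit_b32(v1, v0, 16): MSBs = V1's lower 16, LSBs = V0's upper 16
   *   -> exactly the wanted V[1].low : V[0].hi result.
   * alignbit_b32(v0, v1, 16): MSBs = V0's lower 16, LSBs = V1's upper 16
   *   -> not the wanted V[1].hi : V[0].low result. */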

foad added a comment. Sep 28 2022, 1:26 AM

> In the case where we want V[1].hi : V[0].low, we can't lower to V_ALIGNBIT_B32 $V0, $V1, 16 because that would incorrectly put the bits from $V0 as the MSBs in the dest. On the other hand, V_ALIGNBIT_B32 $V1, $V0, 16 correctly has the bits from $V1 as the MSBs, but they are the lower 16 (and the lower 16 of the dest come from the higher 16 of $V0).

Good point. You could use V_BFI_B32 but I guess that is no better or worse than V_PERM_B32.

>> In the case where we want V[1].hi : V[0].low, we can't lower to V_ALIGNBIT_B32 $V0, $V1, 16 because that would incorrectly put the bits from $V0 as the MSBs in the dest. On the other hand, V_ALIGNBIT_B32 $V1, $V0, 16 correctly has the bits from $V1 as the MSBs, but they are the lower 16 (and the lower 16 of the dest come from the higher 16 of $V0).

> Good point. You could use V_BFI_B32 but I guess that is no better or worse than V_PERM_B32.

One small point in favor of BFI is that the bitmask you need is more likely to be CSEable for unrelated uses.
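For reference, a sketch of what v_bfi_b32 computes (src0 is the mask; again just my model of the ISA semantics) and why the 0xffff mask covers the remaining case:

  #include <stdint.h>

  /* v_bfi_b32 d, s0, s1, s2  =>  d = (s0 & s1) | (~s0 & s2) */
  static uint32_t bfi_b32(uint32_t s0, uint32_t s1, uint32_t s2) {
      return (s0 & s1) | (~s0 & s2);
  }

  /* bfi_b32(0x0000ffff, v0, v1): low half from V0, high half from V1,
   * i.e. D = V[1].hi : V[0].low -- and 0xffff is a constant that unrelated
   * code is far more likely to also want than a perm selector like
   * 0x7060100. */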

jrbyrnes updated this revision to Diff 463635. Sep 28 2022, 11:35 AM

Use V_BFI for V[1].hi : V[0].low. This allows for a bitmask which is more likely to be reused by other instructions (0xffff vs 0x7060100), potentially enabling other optimizations (e.g. CSE).

> One small point in favor of BFI is that the bitmask you need is more likely to be CSEable for unrelated uses.

Thanks, good point. Changed the pattern in favor of BFI.

jrbyrnes added inline comments. Sep 28 2022, 12:03 PM
llvm/test/CodeGen/AMDGPU/fast-unaligned-load-store.private.ll
241

This seems illegal to me -- using an SGPR and a literal as operands to a VALU instruction. Looking into it.

rampitec added inline comments. Sep 28 2022, 12:05 PM
llvm/test/CodeGen/AMDGPU/fast-unaligned-load-store.private.ll
241

0 is an inline literal and is free.

foad added inline comments. Sep 29 2022, 1:09 AM
llvm/test/CodeGen/AMDGPU/fast-unaligned-load-store.private.ll
241

As a code quality thing, this could have been optimized to v_and_b32 v1, 0xffff0000, v0

jrbyrnes marked 2 inline comments as done. Sep 29 2022, 11:11 AM
jrbyrnes added inline comments.
llvm/test/CodeGen/AMDGPU/fast-unaligned-load-store.private.ll
241

Stas -- I see, thanks!

Jay -- Interesting, I'll look into what's going on with the literal. As a side note, the codegen is actually not good for this particular test: it seems to me the whole test can be combined into a 32-bit load. D133584 should be extended to handle these i16s, in which case this whole test will be optimized into a load.

jrbyrnes marked an inline comment as done. Sep 29 2022, 11:11 AM
jrbyrnes marked an inline comment as not done.
arsenm added inline comments. Sep 29 2022, 11:39 AM
llvm/test/CodeGen/AMDGPU/fast-unaligned-load-store.private.ll
241

This could only be a 16-bit load if unaligned access is enabled (and I think we previously decided that doing unaligned 16-bit loads was probably worse than byte loads). The load question is orthogonal to how the bit masking should have been emitted

jrbyrnes updated this revision to Diff 464096. Sep 29 2022, 4:44 PM
jrbyrnes marked 2 inline comments as done.

Add a pattern to select V_AND v1, 0xffff0000 in the case where the buildvector produces bits V1.hi : 0.

Add test coverage for the pattern.
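That case in the same style of C sketch (the register naming here is illustrative):

  #include <stdint.h>

  /* buildvector whose packed result is  V1.hi : 0  -- a single AND does it. */
  static uint32_t hi_and_zero(uint32_t v1) {
      return v1 & 0xffff0000u;   /* v_and_b32 dst, 0xffff0000, v1 */
  }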

llvm/test/CodeGen/AMDGPU/fast-unaligned-load-store.private.ll
241

Right -- good to know about the decision to use byte loads. I agree it is a bit off topic for this review.

arsenm accepted this revision. Sep 30 2022, 9:02 AM

LGTM

This revision is now accepted and ready to land. Sep 30 2022, 9:02 AM
jrbyrnes updated this revision to Diff 464779. Oct 3 2022, 12:45 PM

Rebased to trunk and made the necessary test modifications. NFC