This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU][ISEL] Directly custom lower INSERT_SUBVECTOR instead of via INSERT_VECTOR_ELT
AbandonedPublic

Authored by hsmhsm on Apr 28 2022, 10:45 PM.

Details

Summary

When the target vector size is <= 64, we can directly custom lower INSERT_SUBVECTOR
instead of taking it through INSERT_VECTOR_ELT.

Custom lowering of INSERT_SUBVECTOR for the cases where target vector size > 64 is not
really happening at the moment. We need to handle it in a generic way for all possible
allowed vec sizes. This patch is a kind of pre-checkin patch to handle it.

Diff Detail

Event Timeline

hsmhsm created this revision.Apr 28 2022, 10:45 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 28 2022, 10:45 PM
hsmhsm requested review of this revision.Apr 28 2022, 10:45 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 28 2022, 10:45 PM
hsmhsm edited the summary of this revision. (Show Details)Apr 29 2022, 1:08 AM
hsmhsm added a reviewer: tpr.
foad added inline comments.Apr 29 2022, 1:33 AM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
5801

I don't understand this. CurIdx is a ConstantSDNode here, so InsertVecElt will not do anything useful - it just returns SDValue().

hsmhsm added inline comments.Apr 29 2022, 1:41 AM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
5801

Hmm, I did not realize it.

But, I was told that returning SDValue() means treat it as legal.

And, hence, no default expansion takes place which otherwise would introduce stack access?

I myself do not understand it - I need to further explore it in detail.

hsmhsm marked an inline comment as not done.Apr 29 2022, 3:30 AM
hsmhsm added inline comments.
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
5801

Here is what happens - let's consider INSERT_VECTOR_ELT lowering using below example.

define amdgpu_kernel void @insertelement_v2f16_0(<2 x i16> addrspace(1)* %out, <2 x i16> %a) {
  %vecins = insertelement <2 x i16> %a, i16 100, i32 0
  store <2 x i16> %vecins, <2 x i16> addrspace(1)* %out, align 16
  ret void
}

Here, since the index is constant, we skip custom lowering (via bit manipulation), which trigger default expansion, thereby, INSERT_VECTOR_ELT will expand to VECTOR_SHUFFLE which builds new vector with the required element being properly inserted. In this case, we land-up efficiently selecting S_PACK_LL_B32_B16 instruction. So the final assembly looks like below for gfx90a.

s_load_dword s2, s[4:5], 0x8
s_load_dwordx2 s[0:1], s[4:5], 0x0
v_mov_b32_e32 v0, 0
s_waitcnt lgkmcnt(0)
s_pack_lh_b32_b16 s2, 0x64, s2
v_mov_b32_e32 v1, s2
global_store_dword v0, v1, s[0:1]
s_endpgm

When the index is not constant, we cannot take VECTOR_SHUFFLE path, since we cannot do scalar_to_vector of dynamic index. Hence we take custom lowering via bit manipulation, and we land-up getting below assembly for gfx90a which looks bit inefficient compare to earlier one.

s_load_dword s2, s[4:5], 0x8
s_load_dwordx2 s[0:1], s[4:5], 0x0
v_mov_b32_e32 v0, 0
s_waitcnt lgkmcnt(0)
s_and_b32 s2, s2, 0xffff0000
s_or_b32 s2, s2, 0x64
v_mov_b32_e32 v1, s2
global_store_dword v0, v1, s[0:1]
s_endpgm

Now, coming to this change w.r.t direct custom lowering of INSERT_SUBVECTOR handling, I think, I have totally missed above reasoning. I need to relook into it.

hsmhsm marked an inline comment as not done.Apr 29 2022, 3:44 AM
hsmhsm marked an inline comment as not done.Apr 29 2022, 3:49 AM
hsmhsm added inline comments.
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
5801

Nevertheless, for dynamic index we need above custom lowering, otherwise, there will be stack access as below.

define amdgpu_kernel void @foo(<2 x i16> addrspace(1)* %out, <2 x i16> %a, i32 %idx) {
  %vecins = insertelement <2 x i16> %a, i16 100, i32 %idx
  store <2 x i16> %vecins, <2 x i16> addrspace(1)* %out, align 16
  ret void
}

With default expansion of INSERT_VECTOR_ELT, we get below assembly for gfx90a.

s_add_u32 s0, s0, s7
s_load_dword s6, s[4:5], 0xc
s_load_dword s7, s[4:5], 0x8
s_addc_u32 s1, s1, 0
v_mov_b32_e32 v0, 4
s_load_dwordx2 s[4:5], s[4:5], 0x0
s_waitcnt lgkmcnt(0)
s_and_b32 s6, s6, 1
v_mov_b32_e32 v1, s7
s_lshl_b32 s6, s6, 1
buffer_store_dword v1, off, s[0:3], 0 offset:4
v_or_b32_e32 v0, s6, v0
v_mov_b32_e32 v1, 0x64
buffer_store_short v1, v0, s[0:3], 0 offen
buffer_load_dword v0, off, s[0:3], 0 offset:4
v_mov_b32_e32 v1, 0
s_waitcnt vmcnt(0)
global_store_dword v1, v0, s[4:5]
s_endpgm
hsmhsm abandoned this revision.Apr 29 2022, 3:55 AM
hsmhsm marked an inline comment as not done.

This patch helped me to understand - what is going on with custom lowering of few vector operations, but, otherwise, the change within this patch itself for INSERT_SUBVECTOR does not any make sense. Hence abandoning it. We be coming up with update patch(es).