This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Add amdgcn_s_buffer_load_imm intrinsic
AbandonedPublic

Authored by piotr on Oct 31 2022, 3:02 AM.

Details

Reviewers
foad
arsenm
rampitec
kosarev
Group Reviewers
Restricted Project
Summary

Add int_amdgcn_s_buffer_load_imm instinsic for gfx9+, similar
to the existing int_amdgcn_s_buffer_load, but with immediate
instruction offset.

This exposes an immediate field of the instruction to the front-ends,
and can potentially help generate better code, especially in cases
of complex address expressions where the offset is located in
a different block than the load.

Basic handling also added in the global-isel path with the fall-back
to the old intrinsic. It is not clear at this point, whether
the new intrinsic will help global-isel.

Diff Detail

Event Timeline

piotr created this revision.Oct 31 2022, 3:02 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 31 2022, 3:02 AM
piotr requested review of this revision.Oct 31 2022, 3:02 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 31 2022, 3:02 AM
piotr added reviewers: foad, arsenm, rampitec, kosarev, Restricted Project.Oct 31 2022, 3:05 AM
piotr updated this revision to Diff 471956.Oct 31 2022, 4:05 AM

Updated instcombine code as well.

What's the problem with using a constant offset with the existing intrinsic?

llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
4915

Extra newline

piotr updated this revision to Diff 472523.Nov 2 2022, 1:09 AM

Removed extra newline.

piotr added a comment.Nov 2 2022, 1:22 AM

What's the problem with using a constant offset with the existing intrinsic?

That would work fine, but the problem is that the existing intrinsic has just one offset field and relies on the isel to extract the const part to utilize the instruction offset immediate field. The new intrinsic in addition to the existing scalar offset exposes the constant part that lets us use the immediate field of the instruction directly. So the new intrinsic would let us include "1024" as an immediate in the example below:

%load = call i32 @llvm.amdgcn.s.buffer.load.imm.i32(<4 x i32> %desc, i32 %index, i32 1024, i32 0)

(I know there are new tests with a constant for soffset and 0 for the immediate offset field, but these are to ensure the generated code is the same as with the old intrinsic. The use case with two constant offsets is really not a real-world case - the main use case is a scalar offset with separate constant offset).

I think the question is really: what IR examples are there that could use scalar loads with immediate offsets but don't because instruction selection fails to extract the constant; and why does extracting the constant fail?

piotr added a comment.EditedNov 2 2022, 6:33 AM

I think the question is really: what IR examples are there that could use scalar loads with immediate offsets but don't because instruction selection fails to extract the constant; and why does extracting the constant fail?

All cases of isel not being able to extract the constant I looked at were due to the nodes being scattered over different basic blocks.

Ah, I see. Maybe add a comment in IntrinsicsAMDGPU.td that we should aim to get rid of the intrinsic again after we've transitioned to GlobalISel?

nhaehnle added inline comments.Nov 2 2022, 7:19 AM
llvm/include/llvm/IR/IntrinsicsAMDGPU.td
1017–1019

Document whether there is an upper bound to the offset argument, or whether codegen will handle it.

piotr updated this revision to Diff 472647.Nov 2 2022, 9:23 AM

Extended comments.

piotr edited the summary of this revision. (Show Details)Nov 2 2022, 9:24 AM
arsenm added a comment.Nov 2 2022, 3:32 PM

I think the question is really: what IR examples are there that could use scalar loads with immediate offsets but don't because instruction selection fails to extract the constant; and why does extracting the constant fail?

All cases of isel not being able to extract the constant I looked at were due to the nodes being scattered over different basic blocks.

This is the kind of case that CodeGenPrepare works around for addressing mode matching. I'd rather add that sort of optimization rather than changing the IR by adding new intrinsics to workaround this

piotr added a comment.EditedNov 3 2022, 12:58 AM

I think the question is really: what IR examples are there that could use scalar loads with immediate offsets but don't because instruction selection fails to extract the constant; and why does extracting the constant fail?

All cases of isel not being able to extract the constant I looked at were due to the nodes being scattered over different basic blocks.

This is the kind of case that CodeGenPrepare works around for addressing mode matching. I'd rather add that sort of optimization rather than changing the IR by adding new intrinsics to workaround this

I did consider extending CodeGenPrepare instead. The advantage of using the intrinsic is that we can run the code through the optimizer (CodeGenPrepare is run very late), which yields even better code.

While the GlobalISel may not need the new intrinsic, one could argue that adding the intrinsic is not a workaround and it should have been implemented for gfx9 already alongside other changes. The new SMEM format features (including the separate immediate offset field) were originally omitted - see https://github.com/llvm/llvm-project/issues/38652. It was only about a couple of months ago, when @kosarev fixed this by adding support for the new format in a series of commits.

piotr added a comment.Nov 14 2022, 4:34 AM

Another problem with codegenprepare is that it does not understand our intrinsic, so it does not treat the i32 offset field as a part of address expression, or in any special way.

We do operate on loads with fat pointers in the (part of) front-end, but we convert them to intrinsics before entering codegen. The aspiration is to do the conversion later (or skip the intrinsics altogether) effectively adding support for fat pointers in the backend. But we are not there yet, and that would surely have some other issues, which we are not aware of yet.

What is the current state of this?

piotr added a comment.Mar 16 2023, 1:48 AM

What is the current state of this?

I am now convinced that this is not the right fix.

There is no easy solution available - need to experiment more. Matt suggested taking advantage of addressing mode matching in CodeGenPrepare, but this pass doesn't work for our intrinsics.

piotr abandoned this revision.Mar 16 2023, 1:48 AM