This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Add amdgcn_sched_group_barrier builtin
ClosedPublic

Authored by kerbowa on Jun 20 2022, 12:13 AM.

Details

Summary

This builtin allows the creation of custom scheduling pipelines on a per-region
basis. Like the sched_barrier builtin, it is intended to be used for testing, in
situations where the default scheduler heuristics cannot be improved, or in
critical kernels where users are trying to get performance that is close to
handwritten assembly. Using these builtins naturally requires extra work from
the kernel writer to maintain the desired behavior.

The builtin can be used to create groups of instructions, called "scheduling
groups", whose relative ordering is enforced by the scheduler.
__builtin_amdgcn_sched_group_barrier takes three parameters. The first parameter
is a mask that determines the types of instructions to synchronize around and
add to a scheduling group. These instructions are selected from the bottom up,
starting from the sched_group_barrier's location, during instruction scheduling.
The second parameter is the number of matching instructions that will be
associated with this sched_group_barrier. The third parameter is an identifier
that indicates which other sched_group_barriers this one should be synchronized
with. Note that multiple sched_group_barriers must be added in order for them to
be useful, since they only synchronize with other sched_group_barriers: only
scheduling groups with a matching third parameter have any enforced ordering
between them.
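
For reference, all three arguments must be integer constants, roughly:

void __builtin_amdgcn_sched_group_barrier(int mask, int size, int sync_id)

The mask bits follow the encoding used by the matching
llvm.amdgcn.sched.group.barrier intrinsic; the values below are reproduced here
for convenience and should be cross-checked against the in-tree AMDGPU
documentation:

0x0001: ALU
0x0002: VALU
0x0004: SALU
0x0008: MFMA/WMMA
0x0010: all VMEM
0x0020: VMEM read
0x0040: VMEM write
0x0080: all DS
0x0100: DS read
0x0200: DS write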

As an example, the code below asks the scheduler to create a pipeline of 1
VMEM_READ instruction, followed by 1 VALU instruction, followed by 5 MFMA
instructions, followed by 1 VMEM_READ instruction, followed by 3 VALU
instructions, followed by 2 VMEM_WRITE instructions:
// 1 VMEM_READ
__builtin_amdgcn_sched_group_barrier(32, 1, 0)
// 1 VALU
__builtin_amdgcn_sched_group_barrier(2, 1, 0)
// 5 MFMA
__builtin_amdgcn_sched_group_barrier(8, 5, 0)
// 1 VMEM_READ
__builtin_amdgcn_sched_group_barrier(32, 1, 0)
// 3 VALU
__builtin_amdgcn_sched_group_barrier(2, 3, 0)
// 2 VMEM_WRITE
__builtin_amdgcn_sched_group_barrier(64, 2, 0)
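
As a rough usage sketch (an OpenCL C kernel with illustrative names, not code
taken from this patch), the calls are simply placed in the region whose
schedule they should shape:

// Minimal sketch: request a 1 VMEM_READ -> 1 VALU -> 1 VMEM_WRITE pipeline,
// all in sync group 0, matched bottom up from the barriers' location.
kernel void scale_pipelined(global const float *x, global float *y, float a) {
  size_t i = get_global_id(0);
  float v = x[i];   // VMEM read
  float r = a * v;  // VALU
  y[i] = r;         // VMEM write
  __builtin_amdgcn_sched_group_barrier(32, 1, 0); // 1 VMEM_READ
  __builtin_amdgcn_sched_group_barrier(2, 1, 0);  // 1 VALU
  __builtin_amdgcn_sched_group_barrier(64, 1, 0); // 1 VMEM_WRITE
}

Whether the scheduler can realize the requested pipeline still depends on the
dependencies in the region, as the inline comments below discuss.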

Diff Detail

Event Timeline

kerbowa created this revision. Jun 20 2022, 12:13 AM
Herald added a project: Restricted Project. Jun 20 2022, 12:13 AM
kerbowa requested review of this revision. Jun 20 2022, 12:13 AM
Herald added projects: Restricted Project, Restricted Project. Jun 20 2022, 12:13 AM

Somewhat WIP; needs more tests and cleanup. Posted for dependent work.

antc added a subscriber: antc. Jun 20 2022, 11:56 PM

Hey Austin -- I like the removal of canAddMIs. In the original design, I was leaving open the possibility for users to pass in canAddMIs rather than a mask / SchedGroup name, but it looks like this isn't the direction we're going, and the classification functions defined in a general canAddMI make things easier.

I see this is a WIP, but I've added some thoughts I had from reading it over. I may have more as I use the design for my patch.

llvm/lib/Target/AMDGPU/AMDGPUIGroupLP.cpp
207

I find it confusing that SchedBarrier uses inversion while SchedGroupBarrier doesn't.

314

As in the update to IGroupLP.cpp in trunk, it seems we are not supposed to use hasValue.

357

Not possible to have unsized groups?

459

If both types of barriers are present, the SchedBarriers are handled first. However, if there is a conflict between SchedBarrier and SchedGroupBarrier, should SchedBarrier always get priority? Maybe SchedBarrier should only handle groups not present in SchedGroupBarrier?

llvm/test/CodeGen/AMDGPU/sched-group-barrier-pre-RA.mir
105

I think you are aware of this issue, but the mutation's ability to match the pipeline depends on which instructions go into which group (when an instruction can be mapped to multiple groups).

If we had SchedGroups of: 2 VMEM_READ, 1 VALU, 1 MFMA, 2 VMEM_READ,

and an initial schedule of: VMEMR, VALU, VMEMR, MFMA, VMEMR, with a dependency between the middle VMEMR and the MFMA,

then initSchedGroup will add the middle VMEMR to the last VMEM_READ group, but we could get a more accurate pipeline by adding it to the first group.
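
Expressed with the builtin, using the mask values from the summary's example (purely illustrative), that scenario corresponds to a barrier sequence like:

// 2 VMEM_READ
__builtin_amdgcn_sched_group_barrier(32, 2, 0)
// 1 VALU
__builtin_amdgcn_sched_group_barrier(2, 1, 0)
// 1 MFMA
__builtin_amdgcn_sched_group_barrier(8, 1, 0)
// 2 VMEM_READ
__builtin_amdgcn_sched_group_barrier(32, 2, 0)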

arsenm added inline comments. Jul 5 2022, 10:43 AM
clang/test/SemaOpenCL/builtins-amdgcn-error.cl
70

Test error for each argument?

kerbowa updated this revision to Diff 445965. Jul 19 2022, 3:44 PM

Fix some bugs. Add better pipeline fitting. Address comments.

jrbyrnes accepted this revision. Jul 28 2022, 9:37 AM
This revision is now accepted and ready to land. Jul 28 2022, 9:37 AM
This revision was landed with ongoing or failed builds. Jul 28 2022, 10:43 AM
This revision was automatically updated to reflect the committed changes.
uabelho added inline comments.
llvm/lib/Target/AMDGPU/AMDGPUIGroupLP.cpp
314

Compiling with gcc, I get a warning that this function is unused.
I'm also wondering: there seem to be both a const and a non-const version of the isFull method now, but they are identical. Perhaps the non-const version could be removed?

kerbowa marked an inline comment as done. Jul 30 2022, 7:48 AM
kerbowa added inline comments.
llvm/lib/Target/AMDGPU/AMDGPUIGroupLP.cpp
314

Removed in 7898426a72, thanks!