This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU][IGLP] Add iglp_opt(1) strategy for single wave gemms
ClosedPublic

Authored by jrbyrnes on May 3 2023, 10:19 AM.

Details

Summary

This adds the IGLP strategy for single-wave GEMMs. The SchedGroup pipeline is laid out in multiple phases, each corresponding to a distinct pattern present in GEMM kernels. The resilience of the optimization depends on the IR (as seen by pre-RA scheduling) continuing to exhibit these patterns (as defined by instruction class and dependencies) in their current relative ordering.

The kernels of interest have these specific phases:
NT: 1, 2a, 2c
NN: 1, 2a, 2b
TT: 1, 2b, 2c
TN: 1, 2b

The general approach taken was to use a long SchedGroup pipeline; this gives the scheduler less latitude to make poor decisions. To resolve the challenge of correctly fitting these long pipelines, we leverage the rules infrastructure to help the solver.
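As a rough sketch of the idea (a simplified model with hypothetical names, not the actual code in AMDGPUIGroupLP.cpp), a rules-assisted SchedGroup pipeline can be thought of as an ordered list of groups, each with an instruction-class mask, a capacity, and extra predicates ("rules") that a candidate instruction must satisfy before it can be assigned to the group:

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Simplified stand-ins for the real instruction categories (assumption:
// the actual pass matches SUnits against SchedGroupMask classes).
enum class InstrClass { DS_READ, DS_WRITE, MFMA, VMEM_READ, V_PERM };

struct Instr {
  InstrClass Class;
  int Id;
};

// A SchedGroup holds a class mask, a capacity, and extra rules that a
// candidate must satisfy before it can be assigned to the group.
struct SchedGroup {
  InstrClass Class;
  unsigned Size;
  std::vector<std::function<bool(const Instr &)>> Rules;
  std::vector<Instr> Assigned;

  bool canAdd(const Instr &I) const {
    if (I.Class != Class || Assigned.size() >= Size)
      return false;
    for (const auto &R : Rules)
      if (!R(I))
        return false;
    return true;
  }
};

// Greedily fit each instruction into the first group that accepts it.
void fitPipeline(std::vector<SchedGroup> &Pipeline,
                 const std::vector<Instr> &Instrs) {
  for (const Instr &I : Instrs)
    for (SchedGroup &G : Pipeline)
      if (G.canAdd(I)) {
        G.Assigned.push_back(I);
        break;
      }
}
```

The rules narrow which instructions a group will accept, so even a greedy fit is less likely to place an instruction in the wrong slot of a long pipeline.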

Diff Detail

Event Timeline

jrbyrnes created this revision. May 3 2023, 10:19 AM
Herald added a project: Restricted Project. · View Herald Transcript · May 3 2023, 10:19 AM
jrbyrnes requested review of this revision. May 3 2023, 10:19 AM
jrbyrnes updated this revision to Diff 520742. May 9 2023, 10:25 AM

Rebase + Supersede DemoOpt

kerbowa added inline comments. Jun 4 2023, 11:04 PM
llvm/lib/Target/AMDGPU/AMDGPUIGroupLP.cpp
906–1096

Extra private.

1147

For this type of idiom, wouldn't you normally use a visited set? This could also be improved by tracking all the VMEM_READs and their associated DS_WRITEs with perms via some map?

1154

Could both of these find_if calls before the if statement be hoisted to the outer loop?

1160

It looks like the assumption is that each V_PERM/DS_WRITE pair will have one VMEM load associated with it, since the loop stops at the first VMEM found. Is this correct?

1166

If there are no VMEM successors, can you mark the instruction as visited, or else just move this before the inner loop as I recommended above?

1176

I think the calculation for DSWWithSharedVMEMCount could be simplified by iterating over DSWithPerms once, adding each found VMEM to a list or map with a count that increases by 1 for each DSWithPerm paired with that VMEM. That would avoid some of the repeated traversals of the preds/succs of the same instructions.
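The single-pass counting suggested here can be sketched like this (generic C++ with hypothetical types; the actual code operates on SUnit pred/succ edges rather than precomputed ID lists):

```cpp
#include <cassert>
#include <map>
#include <vector>

// Hypothetical stand-in: each DS_WRITE-with-V_PERM records the IDs of
// the VMEM loads found among its predecessors.
using VMEMId = int;
struct DSWithPerm {
  std::vector<VMEMId> VMEMPreds;
};

// One pass to bump a per-VMEM use counter, then one pass to count the
// DS_WRITEs whose VMEM load is shared with at least one other DS_WRITE.
int countDSWWithSharedVMEM(const std::vector<DSWithPerm> &DSWithPerms) {
  std::map<VMEMId, int> UseCount;
  for (const DSWithPerm &DSW : DSWithPerms)
    for (VMEMId V : DSW.VMEMPreds)
      ++UseCount[V];

  int Shared = 0;
  for (const DSWithPerm &DSW : DSWithPerms)
    for (VMEMId V : DSW.VMEMPreds)
      if (UseCount[V] > 1) {
        ++Shared; // this DS_WRITE shares a VMEM load with another
        break;
      }
  return Shared;
}
```

Each instruction's predecessor list is walked a constant number of times, instead of re-traversing preds/succs for every pair of DS_WRITEs.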

jrbyrnes updated this revision to Diff 528645. Jun 5 2023, 5:31 PM
jrbyrnes marked 6 inline comments as done.

Thanks @kerbowa for pointing out this loop.

Rework loop to optimize + explicitly check for complete match & encode assumptions.

+Rebase

Looks correct to me; my only concern is the performance of the rule implementations, but I'm not sure how critical that is if this is only targeting a few small kernels.

llvm/lib/Target/AMDGPU/AMDGPUIGroupLP.cpp
1177

Again, would caching these results help, or is it unlikely to be a bottleneck?

1224

Will these rules be called relatively often? Would it help to cache the first 4 MFMAs in cases like this?

1225

Does it need to be a direct pred of the MFMA, or does it only need to be able to be scheduled before it? Could you use reachability in the DAG here instead of searching the preds like this?

jrbyrnes updated this revision to Diff 530954. Jun 13 2023, 9:56 AM
jrbyrnes marked 2 inline comments as done.

Like you, I am mindful of the compile-time costs, but not overly concerned since it is only relevant to a few blocks.

That said, the cache idea is a cheap way to get some compile-time improvements, so I've implemented that feature.

kerbowa added inline comments. Jun 25 2023, 11:23 PM
llvm/lib/Target/AMDGPU/AMDGPUIGroupLP.cpp
139

Looks good, but this seems a bit vague as a field in SchedGroup. Can these just be captured in the lambdas, since the way each rule would use a cache may be different?

jrbyrnes updated this revision to Diff 534669. Jun 26 2023, 11:30 AM

Move caches into rule scope.

kerbowa added inline comments. Jul 5 2023, 8:22 PM
llvm/lib/Target/AMDGPU/AMDGPUIGroupLP.cpp
116

Can you make this into a class or struct? We may want to add more to the Rule class than just the classifier and a cache.

140

Unused.

jrbyrnes updated this revision to Diff 537758. Jul 6 2023, 9:00 AM

InstructionRule class

kerbowa added inline comments. Jul 6 2023, 11:38 AM
llvm/lib/Target/AMDGPU/AMDGPUIGroupLP.cpp
113–114

Put this in the class so that it can access the cache?

115

What if this were a member function of InstructionRule, so that rules could be reused? Or do you think the rules will always be so specific that it is not needed, and that common functionality should just live in some functions somewhere?

jrbyrnes updated this revision to Diff 537874. Jul 6 2023, 2:07 PM

Convert function_ref to functor-like classes
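The functor-like rule shape discussed above can be sketched roughly as follows (a simplified, hypothetical model; the actual InstructionRule class in AMDGPUIGroupLP.cpp has a different interface and operates on SUnits):

```cpp
#include <cassert>
#include <unordered_set>

// Functor-like rule base class: each rule owns its cache, so expensive
// DAG queries can be memoized across invocations of apply().
class InstructionRule {
protected:
  // Per-rule cache of instruction IDs already known to satisfy the rule.
  std::unordered_set<int> Cache;

public:
  virtual bool apply(int InstrId) = 0;
  virtual ~InstructionRule() = default;
};

// Example rule: accepts even-numbered instructions, caching hits so the
// "expensive" check runs at most once per instruction.
class IsEvenRule final : public InstructionRule {
public:
  int Misses = 0; // counts evaluations that bypassed the cache

  bool apply(int InstrId) override {
    if (Cache.count(InstrId))
      return true;
    ++Misses; // the expensive check happens only on a cache miss
    if (InstrId % 2 != 0)
      return false;
    Cache.insert(InstrId);
    return true;
  }
};
```

Compared with plain function_refs, a class gives each rule a natural home for its cache (and any future per-rule state), which is what the review converged on.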

jrbyrnes updated this revision to Diff 537877. Jul 6 2023, 2:09 PM

Remove unintended errs()

kerbowa accepted this revision. Jul 9 2023, 1:23 PM

LGTM

This revision is now accepted and ready to land. Jul 9 2023, 1:23 PM
This revision was landed with ongoing or failed builds. Jul 13 2023, 12:04 PM
This revision was automatically updated to reflect the committed changes.