This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU: Add pass to optimize reqd_work_group_size
Closed, Public

Authored by arsenm on May 17 2018, 3:45 AM.

Details

Summary

Eliminate loads from the dispatch packet when they will have
a known value.

Also pattern match the code used by the library to handle partial
workgroup dispatches, which isn't necessary if reqd_work_group_size
is used.
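
As a rough illustration of the pattern described above, the sketch below models one dimension of the local-size computation in plain C++. The struct and function names (DispatchDim, localSizeGeneral, localSizeFolded) are hypothetical and do not come from the patch or the device library. The clamp-to-remainder form is an assumption about how the library handles a partial trailing workgroup.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical model of the dispatch packet fields for one dimension.
struct DispatchDim {
  std::uint32_t workgroup_size; // workgroup size requested at launch
  std::uint32_t grid_size;      // total number of work-items
};

// General form: the last workgroup may be partial, so the local size is
// clamped to the number of work-items remaining in the grid.
// Assumes group_id * workgroup_size < grid_size (a valid group id).
std::uint32_t localSizeGeneral(const DispatchDim &d, std::uint32_t group_id) {
  std::uint32_t remaining = d.grid_size - group_id * d.workgroup_size;
  return std::min(remaining, d.workgroup_size);
}

// When reqd_work_group_size fixes the size and workgroups are known to be
// uniform, the loads and the clamp are unnecessary: the result is the
// compile-time constant.
constexpr std::uint32_t localSizeFolded(std::uint32_t reqd_size) {
  return reqd_size;
}
```

For a grid that is an exact multiple of the required size, both functions agree for every group id, which is why the general form can be replaced by the constant.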

Event Timeline

arsenm created this revision. May 17 2018, 3:45 AM
arsenm updated this revision to Diff 147285. May 17 2018, 3:59 AM

Also handle -cl-uniform-work-group-size attribute

As far as I understand, it is only applicable if either:

  • reqd_work_group_size is used and the program is compiled with -cl-uniform-work-group-size, or
  • reqd_work_group_size is used and the program is compiled with -cl-std less than 2.0.
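
The two conditions above can be written as a single predicate. This is only a sketch of the reviewer's reasoning, not the check the pass performs. KernelInfo and canFoldPartialGroupCheck are hypothetical names, and representing -cl-std by its major version is an assumption made for brevity.

```cpp
// Hypothetical summary of what is known about a kernel at compile time.
struct KernelInfo {
  bool has_reqd_work_group_size;     // kernel carries reqd_work_group_size
  bool uniform_work_group_size_flag; // built with -cl-uniform-work-group-size
  int cl_std_major;                  // 1 for CL1.x, 2 for CL2.0
};

// The partial-workgroup handling can be dropped only when the workgroup size
// is fixed AND workgroups are guaranteed uniform, either by the explicit
// flag or because the standard version is below 2.0.
bool canFoldPartialGroupCheck(const KernelInfo &k) {
  if (!k.has_reqd_work_group_size)
    return false;
  return k.uniform_work_group_size_flag || k.cl_std_major < 2;
}
```

Under this reading, a CL2.0 kernel with reqd_work_group_size but without the uniform flag does not qualify, since non-uniform workgroups are permitted there.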

Potentially other languages can benefit from it as well, subject to their language standards.

It might be easier for an FE to simply call a simplified function, but an FE cannot solve the problem of calls from a non-kernel function. Since you are writing a whole pass for this, it makes sense to address that case as well.

arsenm updated this revision to Diff 147447. May 18 2018, 1:52 AM

Account for difference between 1.2 and 2.0 wrt uniform-work-group-size

This revision is now accepted and ready to land. May 18 2018, 1:20 PM
arsenm closed this revision. May 18 2018, 2:49 PM

r332771