This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Order pos exports before param exports
ClosedPublic

Authored by critson on May 9 2020, 4:36 AM.

Details

Summary

Modify export clustering DAG mutation to move position exports
before other exports types.

Diff Detail

Event Timeline

critson created this revision.May 9 2020, 4:36 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 9 2020, 4:36 AM
critson updated this revision to Diff 263014.May 9 2020, 4:50 AM

Fix failing test.

critson updated this revision to Diff 263100.May 10 2020, 8:39 PM

Simplify the code and improve scheduling by removing unnecessary
dependencies edges on exports.

foad added inline comments.May 11 2020, 1:15 AM
llvm/lib/Target/AMDGPU/AMDGPUExportClustering.cpp
38

Can you say why it's beneficial to do position exports first?

41

It's a shame that this sorting is O(n^2) but I guess it's not a problem because the average chain length will be 2?

You could probably do this sorting in a cute way with std::partition_copy if you felt inclined: https://en.cppreference.com/w/cpp/algorithm/partition_copy

82–85

Why are the barrier edges there in the first place? Either the exports can be reordered, so the barrier edges should not be there; or they can't be, so we shouldn't ignore the barrier edges!

critson marked 2 inline comments as done.May 11 2020, 5:49 AM
critson added inline comments.
llvm/lib/Target/AMDGPU/AMDGPUExportClustering.cpp
38

I will add comment.

41

This sort is O(n), it passes through the list only once moving elements to the top as it goes.

82–85

At a high level, what is a barrier dependency on an export?
They get introduced because of intrinsics which access memory, etc.
Consider the following:

call void @llvm.amdgcn.exp.f32(i32 32, i32 15, float 1.0, float 1.0, float 1.0, float 1.0, i1 false, i1 false)
call void @llvm.amdgcn.exp.f32(i32 33, i32 15, float 1.0, float 1.0, float 1.0, float 0.5, i1 false, i1 false)
%load = call float @llvm.amdgcn.raw.buffer.load.f32(<4 x i32> undef, i32 %idx, i32 0, i32 0)
call void @llvm.amdgcn.exp.f32(i32 12, i32 15, float 0.0, float 0.0, float 0.0, float %load, i1 true, i1 false)

The load forces ordering between the exports either side of it. I do not think there is a hardware motivation for this?

foad added inline comments.May 11 2020, 7:07 AM
llvm/lib/Target/AMDGPU/AMDGPUExportClustering.cpp
41

Yeah but each move is done with an erase and an insert, which both have to move the all the remaining items in the vector.

82–85

EXP and EXP_DONE instructions have mayStore=1 and it seems reasonable that the scheduler doesn't have any information about what they're storing or where, so it adds barriers to stop them from being reordered. In particular they satisfy isGlobalMemoryObject in ScheduleDAGInstrs.cpp.

So I think it's fine to remove the existing barriers and add your own, if you know more about it than the scheduler :-)

critson updated this revision to Diff 263375.May 12 2020, 12:49 AM
  • Improve efficiency of sorting position exports
  • Comment on why position exports should come before others
  • Add test for reordering over loads
foad accepted this revision.May 12 2020, 5:17 AM

LGTM!

This revision is now accepted and ready to land.May 12 2020, 5:17 AM
This revision was automatically updated to reflect the committed changes.