Page MenuHomePhabricator

[libomptarget][devicertl] Remove branches around setting parallelLevel
ClosedPublic

Authored by JonChesterfield on Jul 9 2021, 7:25 AM.

Details

Summary

Simplifies control flow to allow store/load forwarding

This change folds two basic blocks into one, leaving a single store to parallelLevel.
This is a step towards spmd kernels with sufficiently aggressive inlining folding
the loads from parallelLevel and thus discarding the nested parallel handling
when it is unused.

Transform:

int threadId = GetThreadIdInBlock();
if (threadId == 0) {
  parallelLevel[0] = expr;
} else if (GetLaneId() == 0) {
  parallelLevel[GetWarpId()] = expr;
}
// =>
if (GetLaneId() == 0) {
  parallelLevel[GetWarpId()] = expr;
}
// because
unsigned GetLaneId() { return GetThreadIdInBlock() & (WARPSIZE - 1);}
// so whenever threadId == 0, GetLaneId() is also 0.

That replaces a store in two distinct basic blocks with as single store.

A more aggressive follow up is possible if the threads in the warp/wave
race to write the same value to the same address. This is not done as
part of this change.

if (GetLaneId() == 0) {
  parallelLevel[GetWarpId()] = expr;
}
// =>
parallelLevel[GetWarpId()] = expr;
// because
unsigned GetWarpId() { return GetThreadIdInBlock() / WARPSIZE; }
// so GetWarpId will index the same element for every thread in the warp
// and, because expr is lane-invariant in this case, every lane stores the
// same value to this unique address

Diff Detail

Event Timeline

JonChesterfield requested review of this revision.Jul 9 2021, 7:25 AM
JonChesterfield created this revision.
Herald added a project: Restricted Project. · View Herald TranscriptJul 9 2021, 7:25 AM

Introducing write races is something I'd prefer to avoid. Is there any measurable improvement through this change?

JonChesterfield added a comment.EditedJul 9 2021, 9:40 AM

It cleans up the IR a lot but performance change is in the noise on amdgpu. I suspect our main bottlenecks are elsewhere. I'd be interested to hear how it changes a recent nvptx card.

My working theory is that once all the branches that are introduced by the openmp runtime have been optimised out, the IR will look much like it does under cuda and perform much the same, except for overheads in the host runtime.

Two options to avoid the race are to take only the first transform, which doesn't currently fold the loads but might do after some other optimisations improve, or to use a relaxed atomic store to make the race well defined (and probably emit exactly the same ISA).

I'll split this into the definitely good and racy subsections, see if I can get opt to drop parallelLevel for spmd without the latter.

  • drop second transform
JonChesterfield edited the summary of this revision. (Show Details)Jul 9 2021, 10:14 AM
JonChesterfield edited the summary of this revision. (Show Details)

Ping. Trivial change, makes codegen better on both targets, step on the path to eliminating spmd overhead. Can we have this?

This revision is now accepted and ready to land.Jul 12 2021, 5:08 PM