Update the reference publication for the SyncDependenceAnalysis and Divergence Analysis. Fix phrasing, formatting. Add comments on reducible loop limitation.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
Not related to the changes proposed in this review, but it seems the PDT is not used by the implementation. It might be good to take out that dependency.
llvm/lib/Analysis/SyncDependenceAnalysis.cpp | ||
---|---|---|
103 | Good to see both these points spelled out. We are currently working on an implementation that works with irreducible control flow. It's still a work in progress, but involves the new "CycleInfo" being introduced in D112696. I do believe that the single pass of DFA is a strength that need not be lost when handling irreducible control flow. CycleInfo provides a predictable way to work around the lack of a unique header. One just needs to take extra care about entering the same irreducible loop multiple times when constructing ModifiedPO. | |
149–151 | Is it correct to say that this is not critical for correctness? From what I understood, if other blocks (beyond the loop exit) got interleaved, the implementation will work, but unnecessarily visit and skip the interleaved blocks when walking backwards on the ModifiedPO. |
Not related to the changes proposed in this review, but it seems the PDT is not used by the implementation. It might be good to take out that dependency.
All good points. PDT and the comment on strict loop compactness are leftovers. I will remove PDT in a followup patch.
llvm/lib/Analysis/SyncDependenceAnalysis.cpp | ||
---|---|---|
103 | Regarding irreducibility, the unclear part to me always was whether the analysis result (detected joins, synchronization points) in IR still represents the synchronization we will end up getting with the final binary on actual hardware. D85603 should help here. However, if a kernel is irreducible and doesn't use the tools of the patch, the documentation says that reconvergence should be maximal, as early as possible - this may be ambiguous in irreducible control (IIUC implementation defined with cycles). This is less of an issue with reducible loops as there is a (mostly unspoken) mutual understanding where synchronization happens and so all transformations will abide to that. This is starting to fade however as more recent hardware breaks with the traditions on synchronization. | |
149–151 | Correct. Exit blocks must have a lower number than the loop header in the traversal (to account for the virtual edge from header to exit). The PO should take care of all other dependences. It may help with efficiency though. |
So will you be rephrasing the compactness in this patch? While I am convinced that it does not affect correctness, I have not thought about whether it affects accuracy, i.e., whether the lack of compactness may result in some places being marked divergent when they aren't.
llvm/lib/Analysis/SyncDependenceAnalysis.cpp | ||
---|---|---|
103 | Indeed, D85603 is a big part of the work being done for the AMDGPU backend. As part of its cleanup, I am hoping to make some conservative statements about the default behaviour when the intrinsics are not being used. The minimal goal is to gracefully handle irreducible control flow, unlike the current DA implementation where we just give up and assume everything is divergent. More as soon as I am in a confident place to make specific statements. |
I won't touch the comment on loop compactness in this patch. The change does modify how we think about this, which shouldn't happen in the same commit as an updated reference.
Giving up loop compactness (not the order constraints on the loop header/exits though!) shouldn't even change the result of the analysis tbh.
Good to see both these points spelled out. We are currently working on an implementation that works with irreducible control flow. It's still a work in progress, but involves the new "CycleInfo" being introduced in D112696. I do believe that the single pass of DFA is a strength that need not be lost when handling irreducible control flow. CycleInfo provides a predictable way to work around the lack of a unique header. One just needs to take extra care about entering the same irreducible loop multiple times when constructing ModifiedPO.