Working on ROCm at AMD
User Details
- User Since
- Oct 17 2012, 11:25 AM
Nov 7 2020
Nov 6 2020
Nov 3 2020
Oct 8 2020
LGTM. Thanks!
Sep 16 2020
Sep 13 2020
LGTM!
Sep 7 2020
LGTM, to the extent of my high-level understanding of the analysis. I would recommend waiting a few days and give others a chance to finish the review. I do have some nitpicks, but feel free to decide which ones warrant a change.
Aug 13 2020
Aug 12 2020
Aug 11 2020
Aug 6 2020
Jul 27 2020
Jul 12 2020
If the language interface insists on fixing the wave size, then I think the correct solution is to implement this in the header based on a wave-size macro (which we're desperately missing). The library implementation should be responsible for inserting the extension to 64 bits for wave32.
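To illustrate the split suggested above, here is a minimal host-side sketch. The macro name `__AMDGCN_WAVEFRONT_SIZE` is an assumption for illustration only (the comment notes that such a macro is missing today), and this is plain C++ modeling the library-side widening, not device code:

```cpp
#include <cstdint>

// Hypothetical wave-size macro; the name is assumed, not an existing
// definition. Defaulting to 32 here purely for the sketch.
#ifndef __AMDGCN_WAVEFRONT_SIZE
#define __AMDGCN_WAVEFRONT_SIZE 32
#endif

// Library-side responsibility described above: the language interface
// fixes a 64-bit mask type, so on wave32 the 32-bit ballot result is
// zero-extended before being handed back to the caller.
static inline uint64_t widen_ballot(uint64_t raw) {
#if __AMDGCN_WAVEFRONT_SIZE == 32
  // Only the low 32 bits are meaningful on wave32; zero-extend them.
  return raw & 0xffffffffu;
#else
  return raw;  // wave64: already a full 64-bit mask
#endif
}
```

The point of the design is that the header dispatches on the macro once, and callers never see a wave-size-dependent type.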
Jul 11 2020
LGTM!
Jul 10 2020
From the description: "This patch inserts the Lowerswitch pass in an appropriate place to ensure any dead blocks resulting from the transformation will be removed"
LGTM!
@cdevadas reported that other parts of the AMDGPU backend are also affected by the unreachable blocks being produced in the switch lowering. Instead of fixing each such pass separately, it seems the best way forward is to put the block elimination earlier in the flow. @cdevadas is already working on such a change.
Jul 9 2020
Jul 8 2020
The documentation for HIP __ballot seems to indicate that the user does not have to explicitly specify the warp size. How is that achieved with these new builtins? Can this be captured in a lit test here?
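For reference, a host-side model of what `__ballot` computes (this is plain C++ for illustration, not the device builtin): lane i contributes bit i of the mask when its predicate is non-zero. The question above is exactly that the user-facing interface hides `wave_size`; on a wave32 device only the low 32 bits of the result would ever be populated.

```cpp
#include <cstdint>
#include <cstddef>

// Illustrative model of ballot semantics: one predicate per lane,
// one bit per lane in the result mask.
static uint64_t model_ballot(const int *pred, size_t wave_size) {
  uint64_t mask = 0;
  for (size_t lane = 0; lane < wave_size; ++lane)
    if (pred[lane])
      mask |= (uint64_t)1 << lane;
  return mask;
}
```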
Jun 18 2020
Jun 17 2020
filed a bug; improved the description; added a failing testcase
Jun 16 2020
@simoll, I've been staring at the code for some time now, and I am no longer sure that I have a proper grasp on the problem. It seems my fix works, but not for the reasons that I thought it should. Running the unmodified analysis on "join-at-loop-exit.ll" shows the following debug output:
eliminate DeferredTerminators; also a small refactor for readability
Jun 15 2020
Jun 14 2020
Jun 9 2020
Jun 4 2020
Abandoned in favour of enabling the workarounds in the AMDGPU backend: https://reviews.llvm.org/D81211
Jun 3 2020
May 31 2020
This looks really good. Thanks a lot for bringing it this far! Please do check the #include for LoopInfo.h before submitting.
May 29 2020
LGTM, but please wait for approval from @arsenm
Actually, the question really is about why inc/dec are needed as separate operations either as IR intrinsics or Clang builtins. Why not just expose a __builtin_amdgcn_atomicrmw that takes a scope, and map it to the LLVM atomicrmw? That would be way cleaner. The language can provide convenience functions for inc/dec that internally call the rmw builtin.
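To make concrete why inc/dec are more than a plain add/sub, here are pure-function models of the wrapping inc/dec semantics the AMDGCN intrinsics provide (shown as ordinary functions, not atomics, and based on my reading of the intrinsic semantics; a convenience wrapper over a generic rmw builtin would still expose exactly this behavior):

```cpp
#include <cstdint>

// Wrapping increment: counts up to `bound`, then wraps to 0.
static uint32_t uinc_wrap(uint32_t old, uint32_t bound) {
  return (old >= bound) ? 0 : old + 1;
}

// Wrapping decrement: counts down from `bound`, wrapping back to
// `bound` when the counter is 0 or already above the bound.
static uint32_t udec_wrap(uint32_t old, uint32_t bound) {
  return (old == 0 || old > bound) ? bound : old - 1;
}
```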
The commit description needs fixing. These are not "llvm instructions" they are "AMDGCN intrinsics". They don't exist in the LangRef. Also, I would recommend inverting the first line of the description: "Introduce Clang builtins that are mapped to AMDGCN intrinsics".
May 27 2020
LGTM!
LGTM!
May 26 2020
May 22 2020
LGTM!
I think the patch is okay, but have a couple of doubts about the tests.
May 21 2020
I support this enhancement. Thanks for doing it!
May 20 2020
I am NOT familiar with this transformation. But I believe that the patch is well-formed, and I was able to follow the logic. The tests also demonstrate that it works as intended. LGTM.
May 18 2020
May 17 2020
TBH, I spent a small amount of time visualizing the new testcase, but couldn't figure out the problem, nor the solution. I see that the check for currentLoopSize is moved up the control flow a bit, but I didn't see why. A few LLVM_DEBUG lines would be very helpful to trace what the structurizer is doing. Also, can the new test case be reduced further? There are a couple of simple chains, and a few branches. Are all of them necessary? The main problem seems to be an inner loop with just two blocks where both blocks exit. I don't know whether both these properties matter. For example, will the problem happen with a single block that loops on itself?
Test update: Moved the convergent call to where it is actually relevant.
May 16 2020
May 15 2020
May 11 2020
May 10 2020
LGTM. It would be nice if you can move the following tests to the new file and demonstrate that they are fixed. But that need not hold up this review any longer.
- test/Transforms/StructurizeCFG/AMDGPU/backedge-id-bug.ll
- test/Transforms/StructurizeCFG/AMDGPU/backedge-id-bug-xfail.ll
May 7 2020
May 6 2020
I have bad news. This fix does not help with the test needs-fr-ule.ll, i.e., the resulting CFG is incorrect if I remove the unify-loop-exits pass. The fix-irreducible pass produces an inner loop with the header "irr.guard" and a latch "while.cond47". This is inside an outer loop with header "while.cond" and many backedges. When I run the structurizer with the current fix on this graph, the backedge of the inner loop is lost. Instead we see a spurious loop between two new blocks Flow5 and Flow4, which is governed by incorrect boolean constants when in fact they should have been %Pred variables. That information is lost along the edge Flow8 to Flow5.
May 5 2020
The implementation looks okay to me. The new comments help a lot. Waiting for tests mentioned in the comments so far ... one new test (sibling loops where the DFS jumps from one to the other before finishing the first loop) and updates to existing tests.