User Details
- User Since
- Feb 21 2017, 5:58 PM
Sun, Sep 24
Thu, Sep 21
I don't quite understand the change. Now that the counters are in an unknown state, shouldn't we wait for the values to arrive so that we can safely access the registers holding the arguments (which may be loaded from memory in the caller)? I think we definitely need these waits at the entry of the cs_chain* functions.
Aug 21 2023
LGTM
Jul 19 2023
I think a more accurate description of the real problem is: sinking a vgpr def that has a loop-variant sgpr source out of a divergent loop is wrong for AMDGPU. For other cases, it is still legal to move it out of the loop. The problem is not specific to nested loops; it also applies to a single loop. The test case has been over-simplified and does not show the original problem (in both the .ll and .mir tests). We can keep the code as it is now, but I still think it is better to update the tests to show the real problem so that it is easy to revisit the issue.
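As a rough illustration of the underlying issue (hypothetical IR, not the test from this review): inside the loop the induction value is uniform on every iteration, but lanes leave the divergent loop at different times, so the per-lane copy of that value into a vgpr has to happen inside the loop; sinking it below the loop would hand every lane the value from the wave's final iteration.

define amdgpu_ps float @temporal_divergence(i32 %tid) {
entry:
  br label %loop
loop:
  %i = phi i32 [ 0, %entry ], [ %i.next, %loop ]   ; uniform (sgpr) on each iteration
  %i.next = add i32 %i, 1
  %exit = icmp ugt i32 %i.next, %tid               ; divergent exit condition
  br i1 %exit, label %end, label %loop
end:
  ; Each lane sees the %i of its own last iteration, so the vgpr copy of %i must
  ; stay inside the loop; sinking it here would be wrong for lanes that exited early.
  %f = uitofp i32 %i to float
  ret float %f
}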
Jul 18 2023
Jun 29 2023
Jun 7 2023
I think we definitely should do this! Just one inline comment.
Jun 6 2023
Jun 4 2023
Thanks for the new version, looks good to me.
May 29 2023
May 18 2023
May 15 2023
May 12 2023
May 11 2023
fix the issue instead of removing the optimization.
May 10 2023
It should help because SelectionDAG::isBaseWithConstantOffset knows how to match OR (if the known bits do not overlap) as well as ADD.
Do you mean we can optimize shl(or) with the help of known bits? I think I agree with you, but I am not sure whether it is really helpful in practical cases. The isBaseWithConstantOffset you mentioned is specifically designed to work on stack slot accesses, and I am not sure such patterns can be observed more broadly. I also think that kind of optimization should be done separately, probably in the common LLVM code, and the lit tests should be redesigned accordingly. I would rather fix the problematic transformation first. Does that sound ok to you?
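For reference, a minimal sketch of the pattern I understand isBaseWithConstantOffset to match (hypothetical IR, not from this review): when the known bits of the two or operands cannot overlap, the or is equivalent to an add of a constant offset.

define i32 @or_as_add(i64 %x, ptr %p) {
  %off = shl i64 %x, 4                        ; low 4 bits of %off are known zero
  %off.plus8 = or i64 %off, 8                 ; disjoint bits, so this equals add i64 %off, 8
  %addr = getelementptr i8, ptr %p, i64 %off.plus8
  %v = load i32, ptr %addr, align 4           ; base-plus-constant-offset address
  ret i32 %v
}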
May 9 2023
May 5 2023
I guess you are observing a code generation bug for AMDGPU? Is it replacing a phi like %1:vgpr = phi %0:sgpr with %1:vgpr = COPY %0:sgpr? If that is the case, I do not think this sounds like a root fix.
Apr 28 2023
Apr 27 2023
Apr 26 2023
LGTM
Apr 25 2023
Why can't I reproduce the assertion? I have tried the latest version and a version from several days ago.
Apr 13 2023
Apr 6 2023
Apr 5 2023
If not-taken conditional branches are cheap, then we could do something like this. It has only one taken branch, taken once we have finished handling all the active lanes.
// Inclusive plus-scan v0 into v1. Also leaves the result of the plus-reduction in s3.
s_mov s0, exec
s_mov s3, 0              // accumulator
// repeat this section 32 or 64 times:
s_ff1 s1, s0             // find lowest remaining active lane
s_cmp_eq s1, -1
s_cbranch_scc1 end
s_bitset0 s0, s1
v_readlane s2, v0, s1
s_add s3, s2
v_writelane v1, s3, s1
// end of repeated section
end:
Apr 3 2023
Mar 22 2023
VAL = ...   // VGPR
RES = ...   // FINAL result of scan, active lanes will write to this VGPR
sum = 0;    // SGPR, holds the partial sum
for (int lane = 0; lane < 64; lane++) {
  if (IsActive(lane)) {                   // check whether the lane is active or not
    elementToAdd = readlane(VAL, lane);   // SGPR, read the value we want to add from VAL at lane id
    sum = sum + elementToAdd;             // SGPR, update the partial sum
    writelane(RES, sum, lane);            // write the sum (SGPR) to VGPR RES at lane
  }
}
The idea here is a dangerous way to program our GPU. Please check the comment below to see why we should not do this.
A possible safe way is to do something like:
// all the active threads should enter the loop.
bool hasUnprocessedLane;
do {
  hasUnprocessedLane = true;
  get_first_active_lane();
  if (is_first_active_lane) {
    // only the first active lane will go here, other threads will skip to the loop end.
    // do the work for this active lane
    hasUnprocessedLane = false;
  }
} while (hasUnprocessedLane);
hasUnprocessedLane is there so that the first active lane, which is the one processed in this iteration, exits the loop while the remaining lanes iterate again.
Mar 9 2023
LGTM
I am not sure what your question is:
- A cycle is assumed to be divergent because it is irreducible. But operations that are always uniform need not be assumed to be divergent. That is the case here.
- A cycle has a divergent exit. A value that is always uniform may still be divergent at its uses. That is handled separately by temporal divergence.
I was asking about the second. Thanks for the explanation.
Mar 8 2023
Mar 7 2023
Feb 27 2023
Thank you for the fix! LGTM
Feb 21 2023
Feb 19 2023
Although I wanted the optimization for the undef case to be brought back in a more well-defined way, I think it is acceptable to match the existing divergence analysis. This would help with the transition to uniform analysis, so LGTM. Please also address Jay's suggestion. Could you also add a FIXME around the code, something like: reporting uniform for undef inputs should ideally be done in a more well-defined way; in a general CFG it might be broken?
I think it is reasonable to match the divergence analysis behavior regarding undef. The other problem, that isUniform() returns true for a divergent branch instruction, makes me wonder: is replacing uses of divergence analysis with uniform analysis one by one really the best way? Although I am optimistic about the quality of uniform analysis, I think it may be more helpful to replace all the occurrences of divergence analysis at once and fix all the bugs that are uncovered. Ideally, we would have very few test changes. The reason is that one specific pass may not have enough test coverage; fixing the bugs after switching all the uses of divergence analysis to uniform analysis would make us more confident that we are less likely to cause regressions. Any different idea?
Feb 17 2023
I think for this specific case we should report %8 as uniform, and the branch should also be uniform. But there seems to be something wrong in the uniform analysis: if you try opt -passes='print<uniformity>' with the test here, both the condition %8 and the conditional branch br i1 %8, ... are reported as divergent, yet the isUniform() query returns true for the branch instruction. If the condition is divergent, I think the branch should also be divergent. Another issue I want to point out is that the uniform analysis has a subtle difference from the divergence analysis. There is a comment in https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/AMDGPU/AMDGPURewriteUndefForPHI.cpp#L13 explaining the issue. I think we need to close that gap before switching to uniform analysis, otherwise we will regress code generation. The last time we discussed this, I think we wanted some target-specific option in uniform analysis to match this behavior with the divergence analysis.
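For context, a minimal hypothetical example of the kind of pattern the linked comment describes (not the actual test in this review): a phi with one undef incoming value, reached through a divergent branch, which is where the two analyses can disagree.

define amdgpu_ps float @phi_with_undef(i32 inreg %uniform.val, i32 %tid) {
entry:
  %cond = icmp eq i32 %tid, 0                 ; depends on the thread id, so divergent
  br i1 %cond, label %then, label %merge
then:
  br label %merge
merge:
  ; One incoming value is undef; whether this phi is reported as uniform is
  ; exactly where the divergence analysis and the uniform analysis can differ.
  %phi = phi i32 [ %uniform.val, %then ], [ undef, %entry ]
  %f = bitcast i32 %phi to float
  ret float %f
}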
Feb 5 2023
Feb 2 2023
Feb 1 2023
I revisited the problem. I agree we are allowed to duplicate such intrinsics, given that the intrinsic lowering already supports their use in unstructured CFG. But the case shows that duplicating such intrinsics might not help generate better code. By making them non-duplicable, we get simpler usage of the control flow intrinsics. In testing on a large set of graphics shaders, the change causes no code generation differences.
rebase with more comment
Jan 10 2023
address review comments.
Jan 9 2023
Use alignment from alloca.
use cast instead of dyn_cast.
Jan 8 2023
Thanks for the careful review @arsenm, I have addressed most of the comments. Please take a second look. Thanks!
address review comments.
rebase and ping.
Update the change to use a new bit in TSFlags
Jan 3 2023
I think it is good to go. Thanks!
ping
Jan 2 2023
I don't think just checking for a switch terminator would work. My suggestion is not to handle switch here. The pass was never designed to work with switch terminators, and it would take non-trivial work to support them in this pass. Since we already lower switch terminators before this pass, I don't think it is important to support them here now.
Dec 25 2022
Dec 22 2022
Dec 21 2022
Dec 19 2022
Dec 14 2022
Dec 12 2022
Yes.
This causes the following tests to fail due to insertion of dummy blocks:
Dec 11 2022
Thanks for working on this, but I think it is more reasonable to fix the issue earlier. The failing function shows a problem in our CFG lowering passes for AMDGPU. The CFG lowering is done by several passes; I know it is very tricky and fragile now, so it is not easy to determine what the right fix should be. For this specific case, the input IR does not have a return block and thus bypasses the StructurizeCFG pass. The AMDGPUUnifyDivergentExits pass that runs before StructurizeCFG already has some ability to handle infinite loops, but this case was skipped because of the check if (PDT.root_size() <= 1): here BB3 is a root in the PostDominatorTree, but it is not a true exit. I would suggest we change this. If there is only one root in the PostDominatorTree, check whether it ends with a return/unreachable instruction before skipping the transformation. If the single root in the PDT ends with a branch instruction, we should continue the transformation, insert a dummy return block, and insert a never-taken control flow edge with a trick like br i1 true, ..., DummyReturn. By doing this, we should be able to lower the CFG correctly.
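A rough sketch of the resulting IR (hypothetical, not the actual test case): the back edge is rewritten into a conditional branch on a constant true, so the edge to the dummy return block is never taken at run time but still gives the PostDominatorTree a real exit.

define amdgpu_kernel void @infinite_loop(ptr addrspace(1) %out) {
entry:
  br label %loop
loop:
  store i32 0, ptr addrspace(1) %out
  ; was: br label %loop
  br i1 true, label %loop, label %DummyReturn   ; never-taken edge to the dummy exit
DummyReturn:
  ret void
}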
Dec 4 2022
I think the best place to solve the motivating problem is in the backend, where the load instructions have reached their final form, so the fix could help more cases. A load in LLVM IR is still subject to being split or combined.