User Details
- User Since
- Apr 10 2015, 3:10 PM (441 w, 2 d)
Fri, Sep 22
Looks good to me. I also tried some additional tests and they also worked as expected.
LGTM. The part I'm wondering about is in tryAllocPreloadSGPRs, the part about multiple arguments in the same SGPR. I think there is code in https://reviews.llvm.org/D158579 that handles this case as well.
Mon, Aug 28
In your test case, does the addLoopCarriedDependences function add a Barrier edge between the load and store? If so, that should prevent the case described in the comment.
Jul 12 2023
Jun 29 2023
May 15 2023
Here are some observations about the performance regression that we see once the unrolling is disabled by the vectorizer.
Apr 12 2023
Feb 14 2023
Dec 20 2022
Dec 19 2022
Nov 29 2022
Nov 16 2022
Hi @ssarda, thanks for the test case. That helps to understand the problem. Which appears to be the following MIR,
Nov 6 2022
Nov 4 2022
I changed the patch based upon @ruiling comment. He's correct that my patch doesn't catch the case when there is a uniform exit and a divergent exit. I looked at adding a HasDivergentBranch condition suggested in his comments, but that seemed to be overly conservative and created a lot of lit test changes.. Instead, the patch checks for is there is any exit block (return/unreachable) that is not uniformly reached. If that condition occurs, then all returns and unreachable blocks are unify. That fixes the test case, and only effects a couple of existing test cases. Two of those test cases, bool-legalization and skip-if-dead, are different because unreachable blocks are now added only if there is a divergent exit block. I think this conservatively will create a unified exit block correctly, but I'm interested in feedback.
Oct 31 2022
Oct 30 2022
Oct 28 2022
Based upon reviewer feedback, this update changes the name of the command-line option, and passes the command-line value as a parameter to CreateControlFlowHub
Oct 27 2022
Oct 23 2022
Oct 21 2022
Hi @ssarda, can you share a test case that is fixed by your patch?
Oct 18 2022
Hi @ssarda, I assume the case in question is for a .new use on a physical register? It sounds like the issue is that the use appears before the definition in the linear list of instructions after the pipeliner because orderDependences doesn't order them properly? If so, that's a bug. I'm surprised that doesn't cause other issues. For example, there is no guarantee that the use will appear in the same packet as the def. if they aren't ordered correctly.
Oct 11 2022
Remove \n from assert.
I think the changes in this patch, moving the handling to tied operands to CloneMachineInstr and removing the code
from ModuleSchedule.cpp, is a good change. I'm sure there is a test case for this in the Hexagon internal repo. Adding
@sgundapa in case he is able to check? I don't remember why I added this code to MachinePipeliner rather than
change CloneMachineInstr, other than I was trying to quickly fix a pipeliner bug and wasn't sure what effect changing
CloneMachineInstr would cause. But it does make more sense to put it in CloneMachineInstr.
Oct 10 2022
I'd like to try to move this patch by Ruiling forward. Using integer values instead of
boolean values for regions with many outgoing blocks is beneficial by reducing
register pressure. Using boolean vlaues requires a live value for each outgoing block,
but using an integer requires one only. In loops with lot of outgoing blocks, we see
a significant reduction in register pressure and faster compile-time.
Just rebasing. No change intended.
Oct 3 2022
Sep 26 2022
Sep 23 2022
Aug 30 2022
LGTM - thanks!
Aug 26 2022
LGTM. Thanks for adding this patch!
Aug 24 2022
This looks good to me. The patch enables the compiler to generate s_load when the user writes code that divides threadIdx.x by the wavefront size. The only suggestion I have is to add some test cases showing explicitly that the amdgpu.uniform metadata is added via the divergence analysis. (by the AnnotateUniformValues pass). The test cases provided rely upon that working correctly, though they show the end result rather than the steps needed to get the result.
Aug 16 2022
Aug 8 2022
Thanks for the fix. LGTM.
Aug 5 2022
Aug 4 2022
Is anything else needed for this patch or any additional review comments?
Aug 2 2022
This is an interesting bug. The patch does fix a flaw with the current Phi generation by the pipeliner. I think the approach here is a good one. It is a little difficult to understand the reason needed for the two different VRMaps though (much of the code in generatePhis is difficult to understand). But, the two maps do reduce some of the complexity in the code, which is good. I'll accept it once the test case comment is resolved. Thanks for fix the issue!
Aug 1 2022
Jul 29 2022
A couple more tests are needed. A test for for the image intrinsic. Also an assembler test for the v_illegal encoding.
Jul 27 2022
Jul 25 2022
Jul 14 2022
Jun 28 2022
Jun 27 2022
Jun 25 2022
Hi @critson. Thanks for reducing the test case. I don't see an error with the test case though, so I may be missing something. Or, perhaps I need to see the complete test case? Here's my understanding. With the patch, control flow still converges at TrueExit. It's just that the Exit3 and Exit4 blocks have moved in the CFG so that they appear prior to loop.exit.guard rather than after. Following the exit paths from F, G, and C. I think that should be ok?
Jun 24 2022
Hi @critson, thanks for pointing out the issue, and a potential fix. I'm looking at it now.
Jun 22 2022
Jun 19 2022
Updates to the comments in the improve-order.ll test, based up reviewer suggestions.
Jun 16 2022
Thanks for making this change. It's an interesting case. We handled a related issue, with subregisters, by adding code to preprocessPhiNodes to create copies. I don't have a problem with approach in this patch. There may be other cases in this code that need a similar change? Not sure...
Jun 14 2022
The rationale is for library code that looks like:
void example(float *p, float v) {
if (_ISA_verison == 9008 || __ISA_verison == 9010) global_atomic_fadd(p, v); else generic_atomic_fadd(p, v);
}
Jun 13 2022
No changes with this patch. It's been a while since this patch was added. I was investigating an internal test failure, which turns out to be unrelated to this patch. So, I'd like to move forward with reviewing this again.
May 24 2022
May 20 2022
This looks good. Just a couple of minor comments.
Apr 24 2022
Apr 22 2022
Hi @sgundapa, just checking if you made a similar change that wasn't upstreamed?
Apr 14 2022
Made several changes based upon suggestions and to fix a correctness issue:
- The new function, reorderNodes, must appear before collectInfos because collectInfos collects branch predicate information based upon the order of the nodes, so the wrong branch information was used after nodes were reordered.
- Changed name of the function, from improveNodeOrder to reorderNodes because changing the order is an improvement under certain conditions.
- Only reorder the nodes for large regions. The size of the region can be changed with a command-line option. The issues fixed by this patch are seen only with very large regions, so this flag limits the impact of this patch. I'm open to other ways to do this.
- Since reorderNodes appears before collectInfos, it no longer can use Visited. Instead, reorderNodes keeps track of the basic blocks in Order and uses that as part of the criteria to reorder. We only want to reorder a node that has a predecessor in Order.
Apr 11 2022
Apr 9 2022
Messed up the last revision by including another patch. This hopefully fixes it. The only new change is an additional test case, three_loops, added to test/Transforms/UnifyLoopExits/reduce_guards.ll
Added a test to show the case when we need to move the guard block predecessor from an outer loop to an inner loop. Three nested loops are needed for this. Initially, the guard block predecessor is added to the outer loop, but then need to change to the middle loop.
Apr 7 2022
Changes based upon review comments.
Apr 6 2022
It's great to see extensions to the pipeliner so that it works on more targets and loops that don't assume a hardware loop. A little while back a couple of functions, see shoulgnoreForPipelining, were added to support regular loops. However, I don't see that function used. Perhaps that effort was never completely finished? Do we need to get rid of that function, or try to combine the work you've done with that work?
Feb 22 2022
Changes based upon review. Don't just skip over insertelements with a
non-constant index. Instead, fall through to include additional cost
calculation.
Feb 20 2022
Jan 31 2022
There are some clang-format issues to be addressed before committing. Other than that, the patch LGTM. Thanks!
Jan 7 2022
Looks good to me.
Looks good to me. Thanks!
Dec 23 2021
Dec 22 2021
Changes so that the VGPR to AGPR spill slot can be removed if stack slot coloring has not allocated multiple objects to the same slot. Added a couple extra test cases to check for different cases when the slot can/can't be removed.