- User Since
- Oct 10 2019, 12:42 PM (142 w, 5 d)
Thu, Jun 23
Wed, Jun 22
Address review comments
Tue, Jun 21
Addressed review comments: Use ranges in for loop and split generateWaitcntInstBefore to reuse its second half for generateWaitcntBlockEnd.
Also changed applyPreexistingWaitcnt to take a MachineBasicBlock::instr_iterator instead of a MachineBasicBlock::iterator to match the caller.
Jun 1 2022
May 30 2022
Addressed review comments and fixed debug build.
May 26 2022
Add support for nested loops, and fixes the case where we have no terminator but preexisting s_waitcnt instructions. I still need to address @foad comments.
Mar 22 2022
@piotr I ran compile time testing and the patch has no significant impact. Worst case is 1.1% and is the only one above 1%. Average is below 0.1%.
Mar 21 2022
Mar 10 2022
Rebased and addressed @foad comment.
Mar 3 2022
@foad I updated the patch. It is a lot simpler than the previous, and it fixes both the GFX9 and GFX10 cases. But it may still have a significant impact on compile time, I cannot think of another way to do that without visiting all the instructions from the loops :(
Feb 17 2022
I agree that the code is a lot more complicated with this patch than it was with the previous patch. I think this improvement cannot be implemented simply by looking at the WaitcntBrackets without requiring all this refactoring.
So this means we have two choices:
- Original patch: Before inserting the waitcnts, visit all the loop instructions a single time (no fixed-point) until we visit all the instructions, or until we find an instruction that invalidates the optimization. Depending on what we found, flush in preheaders or not before generating the waitcnts. This is a lot simpler.
- New patch with refactoring: No need to visit the instructions before inserting the waitcnt, but we need to compute two brackets for each block, and keep two waitcnt lists until we decide which one we want to generate. This is a lot more complicated.
The original motivation to work on 2. was concerns about compile-time impact of 1., but because we need to compute two brackets for each block, I actually don't think that 2. is more efficient. Both the GFX9 and GFX10 improvements can be implemented with either 1. or 2. So for the sake of simplicity, I think I should revert back to the original patch.
Feb 16 2022
Feb 15 2022
Refactoring of the pass. We compute the brackets for both the flushed and non-flushed versions of each outer loop until we finish visiting the loop or until we decide it is not worth to flush in the preheader. Instead of generating the waitcnts, we add them to two separate lists (for the flushed and non-flushed versions). After all the blocks are visited, we generate one of the two lists depending on the decision we made.
Dec 20 2021
@arsenm the idea is to rotate the processing loop (runOnMachineFunction) of the SIInsertWaitcnt pass, so we can make decisions from predecessors, not the actual machine IR loop containing the waitcnt:
That makes sense, thanks for the comments. So maybe I can start by trying to implement the idea of rotating the processing loop and making a decision from the predecessors and see if that solution would be enough to fix the most common cases we observed. Then, if control flow inside loops is really an issue with that solution, I can try the other solution based on GPR liveness, but I also don't like the idea of having another fixed-point loop that could significantly degrade compilation time.
Dec 14 2021
May 12 2021
May 11 2021
Revert changes improving the pass implementation, these changes will be addressed in another patch.
May 7 2021
Add the TargetRegisterInfo::getAllocatableSets method to avoid calling getReservedRegs multiple times when we want allocatable sets for multiple register classes.
May 5 2021
May 4 2021
Apr 30 2021
Enable the passes only from -O2 or when explicitly enabled with flags. Also update the test case to ensure the passes are run for -O1 when explicitly enabled.
Apr 29 2021
Remove checks for the "Pass Arguments:" lines in the test case.
Apr 28 2021
Add test case
Apr 27 2021
Apr 26 2021
Update patch so it applies upstream.
Feb 11 2021
Specialize DenseMapInfo directly with the PPC TOC key type instead of VariantKind.
Feb 10 2021
Generate uninitialized external TLS data with the .csect directive instead of .comm
Feb 9 2021
I forgot to link the commit with this differential, but the patch has been committed: https://github.com/llvm/llvm-zorg/commit/45c4f238dc6fd9855b8578aa3ca3b8db336efb7e
Feb 8 2021
Add support for variables with internal linkage.
Feb 5 2021
Feb 4 2021
Jan 25 2021
Rebase and reduce PPC test case.
Jan 22 2021
LGTM, I only have a minor comment.
Thanks for addressing the comments, LGTM now.
Jan 20 2021
Gentle ping :)
Jan 15 2021
Use isISA3_1 instead of hasP10Vector and add P9 run line in test case.
Jan 12 2021
Thanks a lot for working on that Amy ! I have some comments on the patch.
Jan 11 2021
Jan 5 2021
Remove unrelated NFC change, simplify lane mask computation and MIR code
Dec 16 2020
Rebase and fix comment
Dec 15 2020
Dec 8 2020
Good catch, thanks for fixing that.
Dec 4 2020
LGTM, just a minor comment regarding the multiclass.
Dec 3 2020
Dec 2 2020
Update patch so it applies to ToT
Dec 1 2020
Nov 26 2020
Why do we need these flags for PLXVP and PSTXVP ? Please, add a test case showing why this is required.
Nov 24 2020
Nov 23 2020
Nov 19 2020
Nov 18 2020
Nov 16 2020