- User Since
- May 3 2019, 8:38 AM (230 w, 3 d)
May 15 2023
Jan 10 2023
LGTM with some minor comments.
Dec 15 2022
Nov 16 2022
The two changes (Direction == '*' and checking the dep vector before doing the interchange) may add safety, but also cause tests to fail.
Think of the * as being in another function that calls a function containing the <, < loops.
Nov 15 2022
This is definitely an improvement and I'm glad to see this change, but please see my inline comments.
Nov 2 2022
This looks better now, but the problem of "endless interchange" is still not addressed. Per the discussion in the loop opt call, we should only use CacheCost when it was able to calculate a valid cost and the loops can be sorted based on a "strictly less than" ordering relationship. Only when the CacheCost result is indeterminate (eg. two loops have equal cost) or when it is unable to calculate the cost (eg. due to delinearization issues) should we fall back to the isProfitableForVectorization() function.
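The decision order described above might be sketched as follows (illustrative only; the function and parameter names are hypothetical, not LLVM's actual interchange code):

```cpp
#include <optional>

// Returns true if the two loops should be interchanged. std::nullopt
// models a CacheCost that could not be computed (e.g. delinearization
// failed); equal costs are treated as indeterminate.
bool shouldInterchange(std::optional<long> CostCurrentOrder,
                       std::optional<long> CostInterchanged,
                       bool ProfitableForVectorization) {
  // Trust CacheCost only when both costs are valid and strictly ordered,
  // so the pass cannot flip the loops back and forth endlessly.
  if (CostCurrentOrder && CostInterchanged &&
      *CostCurrentOrder != *CostInterchanged)
    return *CostInterchanged < *CostCurrentOrder;
  // Indeterminate or unavailable cost: fall back to the vectorization
  // heuristic.
  return ProfitableForVectorization;
}
```

With ties or unavailable costs always routed to the fallback, the same pair of loops can never be interchanged twice in a row on cost grounds alone.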
Nov 1 2022
The current output from loop interchange is wrong, but it could be corrected if we move the vector.reduce and the store to the for.inc19.i block; ie:
middle.block:                             ; preds = %for.inc19.i
  %inc17.i = add nuw nsw i16 %j.010.i, 1
  %exitcond12.not.i = icmp eq i16 %inc17.i, 4
  br i1 %exitcond12.not.i, label %test.exit, label %for.j
This would collapse two dimensions of the dependency matrix into one without providing a way to recover the individual dependency vectors, and as you pointed out in the description it could make some interchangeable cases look illegal. Another example is if we have [S, >, =] and [S, <, =], we can safely interchange the two inner loops (by factoring out the S from the beginning of both vectors), but after merging them we get [S, <>, =] and now the inner two loops can no longer be interchanged.
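The lossy step can be seen with a small helper (a toy model of the merge, not LLVM's actual representation of the dependence matrix):

```cpp
#include <string>
#include <vector>

// Element-wise merge of two direction vectors into one summary vector:
// equal entries are kept, differing entries are unioned (e.g. "<" and
// ">" become "<>"). The individual vectors cannot be recovered afterwards.
std::vector<std::string>
mergeDirections(const std::vector<std::string> &A,
                const std::vector<std::string> &B) {
  std::vector<std::string> Merged;
  for (size_t I = 0; I < A.size(); ++I)
    Merged.push_back(A[I] == B[I] ? A[I] : A[I] + B[I]);
  return Merged;
}
```

Merging [S, <, =] and [S, >, =] yields [S, <>, =]; the fact that each original vector on its own permitted interchanging the two inner loops is lost.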
Oct 18 2022
Oct 17 2022
The proper profitability analysis for loop interchange uses CacheCost, although there are cases where it may be unable to analyze certain accesses (eg due to delinearization issues). It would be good to understand why we don't go through the CacheCost path in your use case.
Oct 11 2022
Oct 7 2022
As I understand it, integer divide by zero is considered undefined behaviour in C, while the same may not be true in other languages. Furthermore, the presence of side effects may depend on other configurations, including target hardware (eg. default treatment on Power vs x86). LLVM seems to distinguish the concept of "undefined behaviour" from "undefined value", treating the former as more consequential than the latter. It currently treats div-by-zero as undefined behaviour, but that may be an overly pessimistic treatment, as demonstrated in this review. Baking assumptions about the source language or target hardware into the LLVM IR gets us into the situation where we have to sacrifice performance for some combinations to ensure functional correctness for others. To allow more flexibility we could either leave the IR neutral and let the optimizer decide based on config info (eg. TTI), or separate the undefined behaviour/value semantics from the udiv/urem instructions (eg. using an instruction flag). I think this revision takes the first approach and what you are suggesting is the second. I agree the second approach is cleaner and might be necessary given the historic assumptions made in this regard, although it would be a larger effort.
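For reference, this is the source-level guard that keeps division defined in C/C++ (minimal example; returning 0 for a zero divisor is just an arbitrary policy for illustration):

```cpp
// C17 6.5.5p5 (and likewise C++): division by zero is undefined
// behaviour, so the optimizer may assume the divisor is non-zero.
// An explicit guard keeps behaviour defined for every input.
int safeDiv(int X, int Y) {
  if (Y == 0)
    return 0; // policy choice: define div-by-zero as 0 at the source level
  return X / Y;
}
```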
Sep 21 2022
Aug 22 2022
Aug 3 2022
Jul 27 2022
Jul 25 2022
Jul 21 2022
Since the normalize() function is conceptually an operation that applies to the result (as opposed to the analysis itself), it makes more sense for it to be a member of the Dependence class. Client code, such as loop interchange, would then do something like this:
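(The snippet that followed did not survive extraction; the following is a hedged reconstruction against the Dependence API as proposed in this review — exact signatures may differ.)

```
// Assumes normalize(ScalarEvolution *) is a member of Dependence.
if (std::unique_ptr<Dependence> D =
        DI->depends(Src, Dst, /*PossiblyLoopIndependent=*/true)) {
  if (D->isDirectionNegative())
    D->normalize(SE); // flip the dependence in place
  // ... query D->getDirection(Level) as before ...
}
```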
Jun 29 2022
Jun 28 2022
Could you please add the crashing IR from https://reviews.llvm.org/D123720 as an additional test? It still seems to crash with this patch!
Jun 27 2022
The description should explain why this change is necessary.
Jun 24 2022
The cost overflow is a known issue (see https://github.com/llvm/llvm-project/issues/55233). I wonder if specifying a larger cache line size (eg. -cache-line-size=128 or 256) would work around the ubsan failures for now. Otherwise we'd have to fix 55233 before recommitting this, I guess.
Jun 23 2022
The users of WIDEN-INDUCTION in the above recipe are the FIRST-ORDER-RECURRENCE-PHI ir<%prev.011> = phi ir<0>, ir<%indvars15> and EMIT vp<%6> = first-order splice ir<%prev.011> ir<%indvars15>. The first one is a VPFirstOrderRecurrencePHIRecipe and the second one is a VPInstruction, neither of which overrides usesScalars to return true. It doesn't seem safe to make usesScalars return true for these types of recipes, so perhaps we still need to check IV->needsVectorIV()?
Jun 22 2022
Hi @fhahn , this looks like a regression from your commit below:
Jun 16 2022
Jun 10 2022
This option would also be available in llc, so the following in the description isn't quite accurate:
Jun 9 2022
Thanks for the comment, I see your concern. Would you like me to continue D127342, or would you like me to keep the cache line size option under LoopCacheCost.cpp as we did in this patch? To avoid breaking computers with no cache, I could update CLS in LoopCacheCost.cpp to the following (I may need to change the wording a little bit):

  CLS = DefaultCacheLineSize.getNumOccurrences() > 0 ? DefaultCacheLineSize
                                                     : TTI.getCacheLineSize();

and pass -cache-line-size=64 to all loop interchange tests. This way the buildbot test failures can be avoided, and in the meantime a real-world computer with no cache won't break.
Should the setting be applied directly in TTI::getCacheLineSize?
Jun 8 2022
https://reviews.llvm.org/D110973 has landed. This should be safe to re-commit now. Thanks!
Some additional comments:
Jun 6 2022
Jun 3 2022
For the ones that previously had x86 I can reset them to x86, and for the ones that do not have a target triple I can remove it.
Having all tests depend on a single target is not ideal, because this limits the number of bots/configurations that run the tests. I think we need a way to keep most of the tests independent of the PowerPC or any other target and only have tests that explicitly test the cost-modeling depend on the target.
Jun 2 2022
Rebased and added Michael's test case.
Following the discussion in the comments, removed the expandPredicationInUnfoldedLoadStore() function (to be re-added in a follow-up patch).
May 31 2022
...With LNICM the load of z is again hoisted into entry, and since entry is the preheader of the original outer loop, after interchange load of z is moved into for.body3.preheader, which is the new outer loop preheader....before this patch, the entry block is not considered part of the loop ...
May 30 2022
May 26 2022
Other possible fixes include modifying the gep instructions in Transforms/LICM/lnicm.ll to make the mem accesses delinearizable and analyzable. I'd be fine with any possible solution. I'd appreciate it if you could let me know your thoughts, @bmahjour.
May 25 2022
The following crashes even with this patch:
May 24 2022
May 19 2022
Other than the inline comments, this looks good to me.
Ok, LGTM. Thanks!
May 17 2022
IIRC the reason for calling formLCSSARecursively was to make sure we can handle live outs and reductions properly, which we seem to still be able to do (based on LIT test updates). The call to simplifyLoop was to ensure we don't trip any assumptions about loop simplify form when processing the epilogue loop. It may not be an issue now, but would it be a problem, say if a hypothetical utility function common to both paths wants to assert that the loop is in simplified form? Is keeping simplifyLoop() harmless to VPlan's modeling of exit values?
May 16 2022
Although the problem reported in PR55233 is worsened by D123400, the underlying issue existed even before that (as illustrated by the example in PR55233), so this patch won't completely solve the overflow problem, although it would make it a bit less likely to occur. I think a proper solution to PR55233 would need to address the overflow problem itself (eg by increasing the range of values that the cost model can represent).
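One way to address the overflow itself is saturation rather than a wider type (a sketch of the general idea only, not a proposal for the cost model's actual representation):

```cpp
#include <cstdint>
#include <limits>

// Multiply two non-negative cost components, clamping to the maximum
// representable value instead of wrapping on overflow.
int64_t satMulCost(int64_t A, int64_t B) {
  if (A != 0 && B > std::numeric_limits<int64_t>::max() / A)
    return std::numeric_limits<int64_t>::max(); // saturate on overflow
  return A * B;
}
```

Saturation trades exactness at the extreme end of the range for well-defined comparisons, which is usually acceptable when the costs are only used for relative ordering.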
May 13 2022
Fix test2 to make sure when D71539 is applied, it fails without this patch and passes with it.
May 5 2022
May 4 2022
May 3 2022
Used getLoopDisposition and improved tests.
@Meinersbur Thanks for the test case that fails without D71539. The test case I reported was bugpoint reduced and that's why the undefs are there. I'll update D110973 with a test that doesn't use undef and that passes today but would fail with D71539. I'll also add your test to it.
May 2 2022
Sorry for the delay, as I didn't get a chance to commit this before I went on vacation.
Apr 20 2022
Added a test for the multi-store case I mentioned above.
This has been superseded by https://reviews.llvm.org/D71539
Apr 19 2022
Apr 18 2022
What I did was to take the stride into account as a second component: for each loop, take the maximum stride over all reference groups to get the final stride. Presumably this could resolve the motivating problem too.
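That scheme might look roughly like the following (hypothetical names; this is not LLVM's actual LoopCacheAnalysis code):

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// A loop's sort key: cache cost first, and the maximum stride over all
// of its reference groups as a tie-breaking second component.
using LoopKey = std::pair<long, long>; // (CacheCost, MaxStride)

LoopKey makeLoopKey(long CacheCost, const std::vector<long> &GroupStrides) {
  long MaxStride = 0;
  for (long S : GroupStrides)
    MaxStride = std::max(MaxStride, S);
  return {CacheCost, MaxStride};
}

// Lexicographic comparison: equal-cost loops are still strictly ordered
// whenever their maximum strides differ.
bool lessKey(const LoopKey &A, const LoopKey &B) { return A < B; }
```

The second component only matters on cost ties, so the existing cost-based ordering is preserved where it was already decisive.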