- User Since
- Feb 10 2016, 2:11 PM (371 w, 4 d)
Mar 5 2020
Since you discovered this problem with non-default pass pipeline, then I do not need to see your test. With the default llvm passes, it was easy for me to find tests with load where the patch can be beneficial because of limitation in alias analysis preventing some optimizations and this patch can clean up things. But, I was not able back then to find meaningful test with non-load where this patch can help. It is great that you have a test (with your non-default passes) where this patch can be useful. It seems that both of us tested this code very well on ARM. Also, theoretically removing an extra instruction should be beneficial for other architectures too.
The patch looks good to me. It is a generalization of the original patch. The only thing missing here is to describe the testing results or post them here (even though on paper, getting rid of a truncate instruction should be beneficial).
Sep 19 2019
What if the code is written with intrinsic but using mul, reduce (say similar to last test in this patch), then this patch will optimize that into dot product instruction. So, for legacy code that was written with old intrinsic, then this patch will remain useful even after dot product is implemented in the vectorizer.
Sep 18 2019
There are few things missing in current work such as indexed dot product or what they call s/udot (vector, by element) in the ARM document (no need to do it now but a comment about that would help). There is also sve dot product. We need to port this code to SVE.
Mar 6 2019
Mar 1 2019
Feb 28 2019
Feb 27 2019
Feb 22 2019
Sep 7 2018
Sep 6 2018
Aug 23 2018
Jul 27 2018
Some cleaning and update to comments.
Jul 25 2018
Jul 24 2018
Jul 13 2018
Jul 10 2018
Mar 22 2018
Mar 20 2018
Mar 19 2018
add LLVM codegen tests as suggested in the reviews.
Mar 16 2018
Was not able to update this particular review with the new code, So I created a new one in https://reviews.llvm.org/D44591
Mar 9 2018
Mar 8 2018
Feb 22 2018
Feb 12 2018
Committed as r324940 and r324912
Feb 8 2018
Question about the failures: I am now wondering if this means we were and still are missing tests?
Feb 6 2018
Jan 19 2018
Jan 11 2018
Jan 10 2018
Jan 5 2018
Dec 21 2017
Dec 18 2017
Dec 11 2017
Code has been up-streamed in https://reviews.llvm.org/rL320123 and https://reviews.llvm.org/rL320204.
Thank you for providing several very useful feedback to improve this patch. In the future (not very soon), I am planning to make this pass more complete by 1) implementing the existing vector element sub-pass in the same way as optimize interleaved store instructions sub-pass for consistency reason (put all rewrite rules and not just one, caching, instruction replacement table, etc.) 2) Addressing the ST3 instruction inefficiency assuming I can find an efficient combination of instructions to replace ST3. I hope I can put you as the reviewer.
Dec 8 2017
Dec 7 2017
Nov 29 2017
Added caching mechanism to save decisions per target of whether to go into the interleaved store replacement pass or not.
Nov 21 2017
Nov 16 2017
Nov 10 2017
Code reuse and simplification due to the use of a table that has instruction replacement info.
Nov 7 2017
Nov 6 2017
Adding subtarget as part of key and simplifying some code.
Nov 2 2017
Oct 31 2017
Made caching mechanism for instruction replacement decisions to work across functions (instead of within function only) by declaring SIMDInstrTable as a global variable.
Oct 30 2017
Hmmmm.... this seems to be a tough one to find the right tradeoffs....
My personal opinions/observations:
- Making ShouldExitEarly stop the whole pass even if some of the optimizations would trigger is a hack that makes this pass operate in a non-obvious way and that will bite us in the future.
- I guess there's a chance even more peephole-y patterns will be added to this pass in the future, which would make ShouldExitEarly even more expensive to check for all rewriting rules implemented. But also, if that function would take short cuts and heuristics, the pass would operate in more surprising and non-obvious ways.
The above made me wonder if to keep compile-time under control, it wouldn't be better to integrate this in one of the existing frameworks for peepholes rather than in a pass of its own. And then I looked at D21571 where this pass got introduced and saw such approaches were probably tried and rejected in early version of that review.
Right now, my best guess of the easiest way forward is to make ShouldExitEarly compute the profitability of all rewrite rules implemented, and let it cache the results of that analysis for each subtarget it encounters. That way, the cost of this computation should be amortized out over all functions the pass will handle, resulting in not having too much impact on compile time?
Having said that, I don't know of another pass that has a caching mechanism like this. But I also haven't tried looking for such an example. I'm not sure if there's going to be a gotcha in trying to cache those results somehow.
What do you think?
A version that addresses the latest comments and in particular a more complete version of shouldExitEarly().
Oct 25 2017
I thought the "shouldExitEarly" method aims to check if any of the transformations the pass can do can be profitable for the target at all.
My understanding is that this can be checked by just examining the instruction latencies, not having to look at the actual code being compiled at all. I assume that the saving of compile time comes from the fact that "shouldExitEarly" needs to do an amount of work proportional to the number of rewriting rules implemented, not proportional to the amount of code being compiled.
So, for the pass to work as expected, i.e. always run when for the target at least one of the transformations is profitable, I think shouldExitEarly should check all rewriting rules implemented.
Does that make sense, or am I misunderstanding something here?
Thanks. I assume you confirmed there are no performance regressions at all in the test-suite?
Oct 13 2017
But my main question at this point is: did you benchmark this? What are the results of that? On which cores/benchmarks?
Oct 11 2017
Kristof, thanks again for the new feedback. Uploaded a new version.
Oct 9 2017
Kristof, Thank you for the feedback. Very useful comments to make the patch more readable and more maintainable.
I addressed your comments with the new version.
Sep 29 2017
Sep 25 2017
Jun 20 2017
Jun 19 2017
Jun 16 2017
Ok, I will upload the combination of the two patches here.
Jun 15 2017
Jun 13 2017
Jun 1 2017
May 30 2017
May 15 2017
Removed unnecessary f32 intrinsic tests; removed tabs