User Details
- User Since: Jan 22 2015, 9:14 AM
Apr 22 2015
Apr 20 2015
Apr 9 2015
Apr 7 2015
OK to commit as is?
Apr 2 2015
Added a comment about the canonicalization and about the issue with the orthogonal flags -fp-contract=fast and -enable-unsafe-fp-math. Thanks!
Mar 26 2015
As discussed with Hal, using the isel instruction can be beneficial with the current infrastructure because it reduces the number of basic blocks and enables some late optimizations. Ultimately, we should keep the select instructions and expand them into branches only after those optimizations have run. In the meantime, we will keep using isel by default and provide a -misel/-mno-isel option in Clang.
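As an illustration of the trade-off (this snippet is mine, not part of the review), a conditional like the one below typically becomes a select in IR; the PPC backend can then lower it either to a single-block isel sequence or to a compare-and-branch sequence spread across several basic blocks:

```cpp
// Hypothetical example, not taken from the patch under discussion.
// The ternary below is usually turned into an IR select, which the PPC
// backend can lower either to the isel instruction (no extra basic blocks)
// or to a compare-and-branch sequence (extra basic blocks).
int clampToZero(int X) {
  return (X < 0) ? 0 : X;
}
```

With the option described above, -mno-isel would presumably force the branch expansion, while the isel form stays the default.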
Mar 24 2015
Added a comment (FIXME) about transforming single-precision operations into double-precision ones.
Mar 23 2015
I'll commit this patch this week if no one is particularly against it ;-)
Mar 18 2015
Mar 11 2015
Mar 9 2015
Use lambdas when it helps to avoid code duplication and when convenient.
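A generic illustration of what this means in practice (not code from the patch): a small local lambda replaces two copies of the same logic.

```cpp
#include <cstddef>
#include <vector>

// Generic illustration only; the names here are made up for the example.
// The duplicated "count elements above a limit" logic is written once as a
// local lambda and reused for both inputs.
std::size_t countAboveLimit(const std::vector<int> &Lo,
                            const std::vector<int> &Hi, int Limit) {
  auto countAbove = [Limit](const std::vector<int> &V) {
    std::size_t N = 0;
    for (int X : V)
      if (X > Limit)
        ++N;
    return N;
  };
  return countAbove(Lo) + countAbove(Hi);
}
```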
Mar 6 2015
Committed revision 231528. Thanks for your help!
Mar 4 2015
Here is the patch implementing the proposed solution. Thanks!
Thanks for those useful comments! I incorporated them and also refactored the separate functions. I think the logic for deciding between FMAD and FMA is clearer now. Tell me what you think.
Mar 3 2015
Mar 2 2015
Hi Hal,
Feb 23 2015
I benchmarked this patch (without the multiply-add nonsense) on POWER8 and got the following speedups:
Feb 20 2015
Feb 19 2015
This patch unrolls large loops containing reductions and implements the cost function discussed here. It also adds enableAggressiveFMAFusion to TTI to fine-tune the heuristics.
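A self-contained sketch of what such a hook could look like from the target's side; the hook's exact home (TTI versus the target-lowering interface), name, and parameters here are assumptions for illustration, not the committed form.

```cpp
// Hedged sketch, not the actual LLVM interface: a per-target knob that the
// unrolling/fusion heuristics can query.
struct TargetHeuristics {
  // Conservative default: don't assume chained FMAs are cheap.
  virtual bool enableAggressiveFMAFusion() const { return false; }
  virtual ~TargetHeuristics() = default;
};

struct Power8Heuristics : TargetHeuristics {
  // A POWER8-style target with multiple FMA-capable pipelines can opt in,
  // letting the heuristics fuse and unroll more aggressively.
  bool enableAggressiveFMAFusion() const override { return true; }
};
```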
Feb 12 2015
Feb 11 2015
There is a separate register-pressure heuristic, and it already uses a different TTI interface to get the number of available registers. See the calculateRegisterUsage() function.
I think we might want to separate the current single number into two numbers: one for ILP and one for latency. But I'm not exactly sure what you're suggesting.
Okay, this sounds reasonable, please provide a patch and we'll benchmark it.
Feb 10 2015
Let me try to explain the rationale behind the proposed cost function: UF = UF * CriticalPathLength / LoopLength
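A minimal sketch of that arithmetic, assuming unsigned operands; the zero-length guard and the clamp are mine, not part of the quoted formula:

```cpp
#include <algorithm>
#include <cstdint>

// Minimal sketch transcribing UF = UF * CriticalPathLength / LoopLength.
// The zero-length guard and the clamp to [1, MaxUF] are assumptions added
// for illustration; they are not part of the quoted formula.
unsigned adjustUnrollFactor(unsigned UF, unsigned CriticalPathLength,
                            unsigned LoopLength, unsigned MaxUF = 8) {
  if (LoopLength == 0)
    return UF;
  // The larger the critical path is relative to the whole loop body (e.g. a
  // loop dominated by one long reduction chain), the more of the original UF
  // is kept, since extra unrolling is what exposes independent chains.
  uint64_t Adjusted = (uint64_t)UF * CriticalPathLength / LoopLength;
  return (unsigned)std::clamp<uint64_t>(Adjusted, 1, MaxUF);
}
```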
Hi Michael,
Maybe I'm not understanding exactly what you're proposing. Are you going to calculate the critical path length in units of instructions, or using the throughput costs, or using some latency measure?
Full-context patch
I don't think that just ignoring SmallLoopCost for all loops with reductions will fly ;) -- but I think that adjusting the UF threshold in a more intelligent way certainly makes sense.
Thanks for working on this, but I don't quite understand the logic (stacking the latency of the two pipelines seems odd to me). How did you tune this?
Feb 9 2015
Interesting point. I could eventually investigate that if I were to apply the current patch to LoopStrengthReduce. As Hal and I decided to use the existing functionality in the loop vectorizer, we will rely on the existing heuristics.
Feb 6 2015
Here is a simpler solution: when the inner loop contains reductions and gets unrolled, the loop vectorizer should unroll the outer loop and break dependencies. For the code below, that does not happen because the loop isn't considered 'small' anymore. Attached is a patch which changes the heuristics in the vectorizer's unroller and gives a 2x speedup for this code on POWER8. If it LGTY, I will add it as a regression test.
Feb 4 2015
Feb 3 2015
As explained in my last email, the regular loop unroller (LoopUnroll.cpp) does not break dependencies in reduction chains; only the loop vectorizer/unroller (LoopVectorize.cpp) does. The problem with the latter is that the code which breaks dependencies and the code which performs unrolling are tightly coupled. So, if the loop was already unrolled by the first unrolling pass, reductions aren't optimized by the loop vectorizer/unroller.
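To make the dependency-breaking part concrete, here is an illustrative before/after pair (mine, not the attached patch or its test case). Note that reassociating a floating-point reduction like this is only legal under relaxed FP semantics, which is presumably why it lives with the vectorizer's reduction handling rather than with the plain unroller.

```cpp
// Illustrative only; not the patch or its regression test.
double sumPlain(const double *A, int N) {
  double Sum = 0.0;
  for (int I = 0; I < N; ++I)
    Sum += A[I]; // serial chain: each add waits on the previous one
  return Sum;
}

double sumInterleaved(const double *A, int N) {
  // Unrolled by 4 with separate accumulators: the four chains are
  // independent, so the FP pipelines can overlap them, at the cost of
  // three extra live registers in the loop.
  double S0 = 0.0, S1 = 0.0, S2 = 0.0, S3 = 0.0;
  int I = 0;
  for (; I + 3 < N; I += 4) {
    S0 += A[I];
    S1 += A[I + 1];
    S2 += A[I + 2];
    S3 += A[I + 3];
  }
  for (; I < N; ++I) // scalar remainder
    S0 += A[I];
  return (S0 + S1) + (S2 + S3);
}
```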
Jan 26 2015
Jan 22 2015
Thanks for the feedback. This makes sense and I agree. When the unrolling factor is N, N-1 additional registers are live across the loop body, so we could typically have some limit (depending on the target and/or register pressure, as you suggest) and apply the optimization to the loop only partially in some cases.