
ohsallen (Olivier Sallenave)
User

Projects

User does not belong to any projects.

User Details

User Since
Jan 22 2015, 9:14 AM (426 w, 5 d)

Recent Activity

Apr 22 2015

ohsallen committed rL235508: Fixed logic to enable complex FMA formation.
Apr 22 2015, 7:10 AM

Apr 20 2015

ohsallen committed rL235344: Refactoring and enhancement to FMA combine.
Apr 20 2015, 1:32 PM

Apr 9 2015

ohsallen committed rL234513: Refactoring and enhancement to FMA combine.
Apr 9 2015, 10:58 AM
ohsallen committed rL234509: Added flag to disable isel instruction on PPC target. Using regular branches…
Apr 9 2015, 10:41 AM

Apr 7 2015

ohsallen added a comment to D8050: Refactor and enhance FMA combine.

Ok for commit as it is?

Apr 7 2015, 8:28 AM

Apr 2 2015

ohsallen updated the diff for D8050: Refactor and enhance FMA combine.

Added comment about the canonicalization and the issue with orthogonal flags -fp-contract=fast and -enable-unsafe-fp-math. Thanks!

Apr 2 2015, 1:27 PM
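For readers unfamiliar with why FMA formation is gated behind floating-point flags, here is a minimal standalone C++ sketch. It is not taken from D8050; the function names `mulThenAdd` and `fused` are hypothetical. It shows that a fused multiply-add rounds once while a separate multiply and add round twice, so contracting one form into the other can change the result.

```cpp
#include <cmath>

// Hypothetical illustration, not code from the patch.
// Multiply, round to double, then add. The volatile temporary forces the
// product to be rounded and stored, so the compiler cannot contract the
// two operations into a single fused instruction.
double mulThenAdd(double A, double B, double C) {
  volatile double Product = A * B;
  return Product + C;
}

// Fused multiply-add: A * B + C is computed with a single rounding.
double fused(double A, double B, double C) {
  return std::fma(A, B, C);
}
```

With A = 1 + 2^-27, B = 1 - 2^-27 and C = -1, the exact product is 1 - 2^-54, which rounds to 1.0 in double precision. The two-step version therefore returns 0, while the fused version returns -2^-54.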
ohsallen added a comment to D8050: Refactor and enhance FMA combine.

Hi Mehdi,

Apr 2 2015, 12:41 PM

Mar 26 2015

ohsallen added a comment to D8260: Do not use isel on P7 and P8.

As discussed with Hal, using the isel instruction can be beneficial with the current infrastructure because it reduces the number of basic blocks and enables some late optimizations. Ultimately, we should keep the select instructions and expand them into branches after those optimizations have run. In the meantime, we will keep using isel by default and provide a -misel/-mno-isel option in Clang.

Mar 26 2015, 7:11 AM

Mar 24 2015

ohsallen updated the diff for D8050: Refactor and enhance FMA combine.

Added a comment (FIXME) about transforming single-precision operations into double-precision ones.

Mar 24 2015, 10:58 AM

Mar 23 2015

ohsallen added a comment to D8050: Refactor and enhance FMA combine.

I'll commit this patch this week if no one is particularly against it ;-)

Mar 23 2015, 1:38 PM

Mar 18 2015

ohsallen updated the diff for D8260: Do not use isel on P7 and P8.

Thanks for the comments, Hal. Here is a version with a change in the processor definitions of P7/P8. The logic has been moved into enableEarlyIfConversion(), and the optsize attribute is considered. I had to change the signature of that function for that purpose.

Mar 18 2015, 12:04 PM

Mar 11 2015

ohsallen retitled D8260 to Do not use isel on P7 and P8.
Mar 11 2015, 10:42 AM

Mar 9 2015

ohsallen updated the diff for D8050: Refactor and enhance FMA combine.

Use lambdas when it helps to avoid code duplication and when convenient.

Mar 9 2015, 9:46 AM

Mar 6 2015

ohsallen added a comment to D7514: Break dependencies in large loops containing reductions (LoopVectorize).

Committed revision 231528. Thanks for your help!

Mar 6 2015, 3:17 PM
ohsallen committed rL231528: Do not restrict interleaved unrolling to small loops, depending on the target.
Mar 6 2015, 3:14 PM

Mar 4 2015

ohsallen updated the diff for D7514: Break dependencies in large loops containing reductions (LoopVectorize).

Here is the patch implementing the proposed solution. Thanks!

Mar 4 2015, 3:55 PM
ohsallen updated the diff for D8050: Refactor and enhance FMA combine.

Thanks for those useful comments! I incorporated them and also refactored the separate functions. I think the behavior regarding whether we should choose FMAD or FMA is clearer now. Tell me what you think.

Mar 4 2015, 2:19 PM

Mar 3 2015

ohsallen retitled D8050 to Refactor and enhance FMA combine.
Mar 3 2015, 5:56 PM

Mar 2 2015

ohsallen added a comment to D7514: Break dependencies in large loops containing reductions (LoopVectorize).

Hi Hal,

Mar 2 2015, 3:42 PM

Feb 23 2015

ohsallen added a comment to D7514: Break dependencies in large loops containing reductions (LoopVectorize).

I benchmarked this patch (without the multiply-add nonsense) on POWER8 and got the following speedups:

Feb 23 2015, 9:46 AM

Feb 20 2015

ohsallen added a comment to D7514: Break dependencies in large loops containing reductions (LoopVectorize).

It seems like, in general, you want a way to measure the latency of some chain of instructions (other than just counting them). This is a general problem, and I recommend going after that issue as follow-up work.

Feb 20 2015, 9:41 AM

Feb 19 2015

ohsallen updated the diff for D7514: Break dependencies in large loops containing reductions (LoopVectorize).

This patch unrolls large loops containing reductions. The cost function discussed here was implemented. Added enableAggressiveFMAFusion to TTI to fine-tune the heuristics.

Feb 19 2015, 10:11 AM

Feb 12 2015

ohsallen committed rL229027: Check interleaving without relying on debug output.
Feb 12 2015, 6:16 PM
ohsallen added a comment to D7503: Tune TTI getMaxInterleaveFactor for POWER8.

I ran benchmarks on the P7 today, and I'm fine with this change.

Feb 12 2015, 3:02 PM
ohsallen committed rL228973: Change max interleave factor to 12 for POWER7 and POWER8.
Feb 12 2015, 3:00 PM

Feb 11 2015

ohsallen added a comment to D7514: Break dependencies in large loops containing reductions (LoopVectorize).

There is a separate register-pressure heuristic, and it already uses a different TTI interface to get the number of available registers. Look at the calculateRegisterUsage() function.

Feb 11 2015, 2:41 PM
ohsallen added a comment to D7514: Break dependencies in large loops containing reductions (LoopVectorize).

I think we might want to separate the current single number into two numbers: one for ILP and one for latency. But I'm not exactly sure what you're suggesting.

Feb 11 2015, 1:49 PM
ohsallen added a comment to D7514: Break dependencies in large loops containing reductions (LoopVectorize).

Okay, this sounds reasonable, please provide a patch and we'll benchmark it.

Feb 11 2015, 10:55 AM

Feb 10 2015

ohsallen added a comment to D7514: Break dependencies in large loops containing reductions (LoopVectorize).

Let me try to explain the rationale behind the proposed cost function: UF = UF * CriticalPathLength / LoopLength

Feb 10 2015, 5:17 PM
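The cost function quoted in the comment above can be sketched as a small standalone C++ helper. This is only an illustration of the arithmetic, not the code from D7514: the function name `scaleUnrollFactor`, the use of integer division, and the clamp to a minimum factor of 1 are all assumptions.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper, not LLVM's API. Scales an unroll factor UF by the
// fraction of the loop body that lies on the critical path:
//   UF = UF * CriticalPathLength / LoopLength
// Lengths are counted in whatever unit the caller chooses (for example,
// instructions). Clamping the result to at least 1 is an assumption.
unsigned scaleUnrollFactor(unsigned UF, unsigned CriticalPathLength,
                           unsigned LoopLength) {
  assert(LoopLength > 0 && "loop must contain at least one instruction");
  unsigned Scaled = UF * CriticalPathLength / LoopLength; // integer division
  return std::max(1u, Scaled);
}
```

Under these assumptions, a loop of 12 instructions with a 3-instruction critical path scales a factor of 8 down to 2, while a loop whose body is entirely on the critical path keeps the full factor.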
ohsallen added a comment to D7514: Break dependencies in large loops containing reductions (LoopVectorize).

Hi Michael,

Feb 10 2015, 12:30 PM
ohsallen added a comment to D7514: Break dependencies in large loops containing reductions (LoopVectorize).

Maybe I'm not understanding exactly what you're proposing. Are you going to calculate the critical path length in units of instructions, or using the throughput costs, or using some latency measure?

Feb 10 2015, 11:53 AM
ohsallen updated the diff for D7514: Break dependencies in large loops containing reductions (LoopVectorize).

Full-context patch

Feb 10 2015, 10:59 AM
ohsallen added a comment to D7514: Break dependencies in large loops containing reductions (LoopVectorize).

I don't think that just ignoring SmallLoopCost for all loops with reductions will fly ;) -- but, I think that adjusting the UF threshold in a more-intelligent way certainly makes sense.

Feb 10 2015, 10:52 AM
ohsallen added a comment to D7503: Tune TTI getMaxInterleaveFactor for POWER8.

Thanks for working on this, but I don't quite understand the logic (stacking the latency of the two pipelines seems odd to me). How did you tune this?

Feb 10 2015, 7:55 AM

Feb 9 2015

ohsallen updated subscribers of D7514: Break dependencies in large loops containing reductions (LoopVectorize).
Feb 9 2015, 11:50 AM
ohsallen updated subscribers of D7503: Tune TTI getMaxInterleaveFactor for POWER8.
Feb 9 2015, 11:50 AM
ohsallen retitled D7514 to Break dependencies in large loops containing reductions (LoopVectorize).
Feb 9 2015, 10:44 AM
ohsallen retitled D7503 to Tune TTI getMaxInterleaveFactor for POWER8.
Feb 9 2015, 8:11 AM
ohsallen added a comment to D7128: Optimize unrolled reductions in LoopStrengthReduce.

Interesting point. I could eventually investigate that if I were to apply the current patch to LoopStrengthReduce. As Hal and I decided to use the existing functionality in the loop vectorizer, we will rely on the existing heuristics.

Feb 9 2015, 7:42 AM

Feb 6 2015

ohsallen added a comment to D7128: Optimize unrolled reductions in LoopStrengthReduce.

Here is a simpler solution: when the inner loop contains reductions and gets unrolled, the loop vectorizer should unroll the outer loop and break dependencies. For the code below, it does not happen because the loop isn't considered 'small' anymore. Attached is a patch which changes the heuristics in the vectorizer unroller, and gives a 2x speedup for this code on POWER8. If it LGTY, I will add it as a regression test.

Feb 6 2015, 11:32 AM

Feb 4 2015

ohsallen added a comment to D7128: Optimize unrolled reductions in LoopStrengthReduce.

I don't understand the problem you're trying to highlight. The loop unroller is run in two places within the standard optimization pipeline. The first place is 'early', within the inliner-driven CGSCC pass manager. When run early, it does *full* unrolling only. It is also run 'late', after the loop vectorizer, when it might also do target-directed partial unrolling. But this is after the loop vectorizer runs, so there should be no conflict.

Feb 4 2015, 9:14 AM

Feb 3 2015

ohsallen added a comment to D7128: Optimize unrolled reductions in LoopStrengthReduce.

As explained in my last email, the regular loop unroller (LoopUnroll.cpp) does not break dependencies in reduction chains. Only the loop vectorizer/unroller (LoopVectorize.cpp) does. The problem with the latter is that the code which breaks dependencies and the code which performs unrolling are tightly coupled. So, if the loop was already unrolled by the first unrolling pass, then reductions aren't optimized by the loop vectorizer/unroller.

Feb 3 2015, 3:54 PM
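To make "breaking dependencies in reduction chains" concrete, here is a hand-written C++ sketch of the source-level effect. It is an illustration only, not output from LoopVectorize.cpp; the function names and the interleave factor of 4 are assumptions.

```cpp
#include <cstddef>
#include <vector>

// Serial reduction: each addition depends on the previous one, so the
// loop is bound by the latency of a single chain of adds.
double sumSerial(const std::vector<double> &V) {
  double Sum = 0.0;
  for (double X : V)
    Sum += X;
  return Sum;
}

// Hypothetical hand-interleaved version with four independent
// accumulators. The four chains can execute in parallel, and the partial
// sums are combined once after the loop.
double sumInterleaved4(const std::vector<double> &V) {
  double S0 = 0.0, S1 = 0.0, S2 = 0.0, S3 = 0.0;
  std::size_t I = 0;
  const std::size_t N = V.size();
  for (; I + 4 <= N; I += 4) {
    S0 += V[I];
    S1 += V[I + 1];
    S2 += V[I + 2];
    S3 += V[I + 3];
  }
  for (; I < N; ++I) // remainder elements
    S0 += V[I];
  return (S0 + S1) + (S2 + S3);
}
```

The interleaved form reassociates the additions, so for floating-point data it can return a slightly different value than the serial loop. Unrolling alone, without splitting the accumulator, leaves the single dependency chain intact.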

Jan 26 2015

ohsallen added a comment to D7128: Optimize unrolled reductions in LoopStrengthReduce.

I agree, this needs a register-pressure threshold. Also, I thought that the loop vectorizer would also perform this transformation as part of its interleaved unrolling capability. Does it not? If not, perhaps it really belongs there (and the vectorizer already has register pressure heuristics)?

Jan 26 2015, 3:07 PM

Jan 22 2015

ohsallen added a comment to D7128: Optimize unrolled reductions in LoopStrengthReduce.

Thanks for the feedback. This makes sense and I agree. When the unrolling factor is N, N-1 additional registers are live in the loop range, so typically we could have some limit (depending on the target and/or register pressure as you suggest) and partially apply the optimization on the loop in some cases.

Jan 22 2015, 11:36 AM
ohsallen retitled D7128 to Optimize unrolled reductions in LoopStrengthReduce.
Jan 22 2015, 9:34 AM