- User Since
- Jul 20 2015, 10:25 AM (426 w, 4 d)
Oct 13 2017
Dec 13 2016
Thanks for the link, Sanjay. Yes, I was just about to comment on this in the other review, as I just got confirmation.. The info in that link is right. The h/w shufflers cross both domains after IVB, and therefore don't suffer the bypass penalty when switching domains through such instructions (perm/shuf/unpack...).
Dec 12 2016
I'm working on getting some confirmation on the latest ones, but most current Core architectures suffer a 1-clk penalty switching between fp and int domains. This doesn't include the Atom line, which can do it for free.
Dec 9 2016
Dec 8 2016
That was timely and convenient, Sanjay.. Thanks :)
Nov 28 2016
lgtm.. At some point, we'll need to add support for the different "lakemont" processors so that we don't affect them all with changes like this.
Oct 17 2016
Oct 6 2016
I'll add my 2 cents here, since I did spend some time looking into the underlying x86 architectural reasons for this regression.
Apr 27 2016
In general, I'm not a big fan of blindly aligning loops (on any boundary), as this can cause random effects, both positive and negative.
It's very simple to come up with examples where aligning a loop will cause regressions on certain architectures, specifically in loops that have control flow.
Apr 22 2016
Latest changes lgtm..
Apr 6 2016
I'd prefer we used a different way to control this, like perhaps another option.
Apr 4 2016
Mar 30 2016
Thanks, Sanjay.. This lgtm.
Mar 29 2016
Depending on the cost of imull, you can avoid it with:
The memset-2.ll tests look quite awkward in the way they splat the byte value into an XMM reg; imul isn't generally cheap.
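For reference, a minimal C sketch of the byte splat being discussed, assuming a 32-bit scalar splat done before the value is moved into the XMM reg. The function names are made up for illustration, and the shift/or variant is just one common multiply-free alternative, not necessarily the sequence suggested in the review.

```c
#include <stdint.h>

/* Hypothetical helper: splat a byte across 32 bits with one multiply.
   This is the form that typically lowers to an imul by 0x01010101. */
static uint32_t splat_mul(uint8_t b) {
    return (uint32_t)b * 0x01010101u;
}

/* Hypothetical helper: the same splat using only shifts and ORs,
   trading the multiply for two shift/or pairs. */
static uint32_t splat_shift_or(uint8_t b) {
    uint32_t v = b;   /* 0x000000BB */
    v |= v << 8;      /* 0x0000BBBB */
    v |= v << 16;     /* 0xBBBBBBBB */
    return v;
}
```

Both helpers return the same value for every byte input; which one is cheaper depends on the target's multiply latency.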
Mar 9 2016
lgtm for being an improvement over the existing code.
Feb 29 2016
Even if the branch is mispredicted, we may not lose anything, because a float/double fdiv takes more cycles than the branch misprediction penalty.
And if the branch is predicted correctly, we can hide the fdiv's execution cycles.
Feb 24 2016
I would like to hear from Zia and Mitch before proceeding with this patch because of #7. It would be great to have more perf data. But like I said earlier in the thread, I don't know what our policy is in a situation where a heuristic has lost value for test-suite but still has benefits for other cases.
Feb 22 2016
Feb 15 2016
Feb 14 2016
Feb 12 2016
Feb 9 2016
Thanks, David, for the review.. And thanks, Sean, for the additional data and analysis on your workloads. I'm glad it showed positive results for you.
Feb 7 2016
Feb 5 2016
I'd like to echo Sanjay's concern regarding bypassing of the CMP heuristics.
Jan 21 2016
Jan 11 2016
Right, and I believe that's all taken care of, unless I'm missing something.
Jan 4 2016
Dec 16 2015
Thanks, Hans.. I changed one of the vectors to SmallVector, and kept the other as a vector, since it's fixed and only gets malloc'ed once. I ran a few benchmarks on the first one and sampled around 3000 frames to see that size 8 captures ~90% of the cases.
Also, another quick comment regarding minimizing stack frame size before my changes:
Dec 15 2015
I'd disable the expansion under minsize, at least.. Otherwise, lgtm.
Dec 14 2015
Dec 11 2015
Thanks for all the reviews.. I think I've addressed most of the lower-level concerns with an updated patch.
Dec 9 2015
Oct 22 2015
Oct 21 2015
Thanks, Simon.. Will do.
Oct 20 2015
I've incorporated your comments.
Oct 19 2015
New patch includes changes to use "isConstOrConstSplat" to consolidate the 2 constant checks, as suggested.
Oct 16 2015
Thanks for the follow up and explanation, Sanjay. lgtm..
Oct 15 2015
Thanks, Simon... Let me work on that, and hopefully that will also answer Sanjay's last question regarding needing a vector version in the lit test, which I didn't comment on earlier.
Thanks Elena and Sanjay for the review and comments.
New patch with changes made based on code review comments. Factored out profitability code, and beefed up lit test.
Just another +1 on Quentin's comment regarding it being nice to see fewer extends in DAGCombine to expose more opportunities.
Oct 14 2015
Sep 17 2015
Sep 2 2015
The overlapping is interesting.. With a 0 mod 16 rsp, if we didn't overlap, we would split a cache line. At a high level, it looks like it's a reasonable strategy (without knowing all the cases in which it'll trigger). The extra immediate generation is, however, weird.
Aug 24 2015
Aug 19 2015
Aug 7 2015
Totally understand and agree with your position, and I also appreciate your patience with me.
Aug 6 2015
Aug 3 2015
OK, back from vacation.. Thanks for the additional reviews and comments. Some updates:
Jul 24 2015
I've addressed all of your comments (I hope).
Jul 22 2015
Removed a tab that snuck into the last update.
Applied Michael's suggestions.