Oct 13 2017
Dec 13 2016
In D27692#621297, @RKSimon wrote:
> So does that mean both this and D27684 can be committed safely?

For recent hardware it makes no difference (and D27684 possibly saves a few instruction bytes). For older hardware we still save cycles compared to performing the extra shuffles, and we should fix up the domain switches where possible to help a little more.
Thanks for the link, Sanjay. Yes, I was just about to comment on this in the other review, as I just got confirmation.. The info in that link is right. The h/w shufflers cross both domains after IVB, and therefore don't suffer the bypass penalty when switching through such instructions (perm/shuf/unpack...).
Dec 12 2016
I'm working on getting some confirmation on the latest ones, but most current Core architectures suffer a 1-clk penalty switching between fp and int domains. This doesn't include the Atom line, which can do it for free.
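To make the penalty concrete, here's a minimal illustration (not code from any of these patches) of the two domains using SSE intrinsics: the FP-domain and int-domain shuffles compute the same permutation, but routing FP data through the integer one pays the bypass delay on the affected cores.

```cpp
#include <immintrin.h>

// FP-domain shuffle (shufps): no domain switch when operating on floats.
__m128 splat_lane0_fp(__m128 v) {
  return _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0));
}

// Int-domain shuffle (pshufd): same permutation, but on pre-IVB cores
// feeding float data through it costs the ~1-clk bypass penalty noted
// above; on IVB and later the shufflers cross both domains for free.
__m128 splat_lane0_int(__m128 v) {
  return _mm_castsi128_ps(
      _mm_shuffle_epi32(_mm_castps_si128(v), _MM_SHUFFLE(0, 0, 0, 0)));
}
```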
Dec 9 2016
Dec 8 2016
That was timely and convenient, Sanjay.. Thanks :)
Nov 28 2016
lgtm.. At some point, we'll need to add support for the different "lakemont" processors so that we don't affect them all with changes like this.
Oct 17 2016
Hi Michael,
Oct 6 2016
I'll add my 2 cents here, since I did spend some time looking into the underlying x86 architectural reasons for this regression.
Apr 27 2016
In general, I'm not a big fan of blindly aligning loops (on any boundary), as this can cause random effects, both positive and negative.
It's very simple to come up with examples where aligning a loop will cause regressions on certain architectures, specifically in loops that have control flow.
Apr 22 2016
Latest changes lgtm..
Apr 6 2016
I'd prefer we used a different way to control this, like perhaps another option.
Apr 4 2016
Mar 30 2016
Thanks, Sanjay.. This lgtm.
Mar 29 2016
Depending on the cost of imull, you can avoid it with:
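(As a sketch of one standard multiply-free splat idiom; the exact replacement would depend on the target:)

```cpp
#include <cstdint>

// Splat one byte across 32 bits without the multiply. The imul form is
//   splat = b * 0x01010101;
// Two shift+or steps produce the same result with cheaper operations.
uint32_t splatByte(uint8_t b) {
  uint32_t v = b;
  v |= v << 8;   // 0x000000AB -> 0x0000ABAB
  v |= v << 16;  // 0x0000ABAB -> 0xABABABAB
  return v;
}
```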
The memset-2.ll tests look quite awkward in the way they splat the byte value into an XMM reg; imul isn't generally cheap.
Mar 9 2016
lgtm for being an improvement over the existing code.
Feb 29 2016
Even if the branch prediction fails, we may not lose anything, because an fdiv on float/double takes more cycles than the branch misprediction penalty.
And if the branch prediction is correct, we can hide the fdiv's execution cycles.
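As a hypothetical sketch of that trade-off (the names and the guard condition are illustrative, not from the patch):

```cpp
// If 'den' is usually 1.0 and the branch predicts well, the fdiv is
// skipped entirely; if it mispredicts, the penalty is still smaller
// than the divide's latency, per the reasoning above.
double scaleBy(double num, double den) {
  if (den == 1.0)
    return num;      // well-predicted fast path: fdiv latency hidden
  return num / den;  // slow path: mispredict cost < fdiv cost
}
```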
Feb 24 2016
I would like to hear from Zia and Mitch before proceeding with this patch because of #7. It would be great to have more perf data. But like I said earlier in the thread, I don't know what our policy is in a situation where a heuristic has lost value for test-suite but still has benefits for other cases.
Feb 22 2016
In D15393#355355, @Gerolf wrote:
> Hi Zia,
> do you have HSW performance numbers for this change? An internal bot is logging a >10% regression for MultiSource/Benchmarks/Ptrdist/ks/ks https://smooshbase.apple.com/perf/db_default/v4/nts/graph?highlight_run=114068&plot.746=313.746.3 (O3 flto) pinned to this change.
> Thanks!
> Gerolf
Feb 15 2016
Feb 14 2016
Feb 12 2016
Feb 9 2016
Thanks, David, for the review.. And thanks, Sean, for the additional data and analysis on your workloads. I'm glad it showed positive results for you.
Feb 7 2016
In D16907#345277, @spatel wrote:
> cc'ing some Intel folks both for clarity on the ABI doc and for the potential perf impact.
Feb 5 2016
I'd like to echo Sanjay's concern regarding bypassing of the CMP heuristics.
Jan 21 2016
Jan 11 2016
Right, and I believe that's all taken care of, unless I'm missing something.
Hi Joerg,
Ping..
Jan 4 2016
Ping..
Dec 16 2015
Thanks, Hans.. I changed one of the vectors to a SmallVector and kept the other as a vector, since it's fixed and only gets malloc'ed once. I ran a few benchmarks on the first one and sampled around 3000 frames; an inline size of 8 captures ~90% of the cases.
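For reference, a minimal sketch of that split (the element types and names are placeholders, not from the patch):

```cpp
#include "llvm/ADT/SmallVector.h"
#include <vector>

struct Frames {
  // Sampled ~3000 frames: an inline size of 8 avoids the heap in ~90%
  // of cases, so this one became a SmallVector.
  llvm::SmallVector<int, 8> Hot;

  // Fixed size and malloc'ed exactly once, so a plain vector is fine.
  std::vector<int> Fixed;
};
```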
Also, another quick comment regarding minimizing stack frame size before my changes:
Dec 15 2015
I'd disable the expansion under minsize, at least.. Otherwise, lgtm.
Dec 14 2015
Hi Sanjay,
Dec 11 2015
Thanks for all the reviews.. I think I've addressed most of the lower-level concerns with an updated patch.
Dec 9 2015
In D15393#306468, @hans wrote:
> This looks very interesting.
> Maybe update the commit message and/or add to the comments in the file why this is beneficial for code size, since it wasn't obvious to me at least. Do I understand correctly that the idea is that addressing objects closer to %esp generally requires less code because it can use a tighter addressing mode?
In D15393#306456, @rnk wrote:
> Oops, I hit submit too early.
> Our current default stack layout algorithm optimizes for stack frame size without accounting for code size from frame offsets. I'm worried that your algorithm may reorder all of the potentially large allocations outside of the 256 bytes of stack that we can address with a smaller offset, and unnecessarily increase stack frame sizes.
> I wonder if we can rephrase this as a weighted bin packing problem, where we only have one bin and it has size ~128 bytes, or the max one-byte displacement from FP/SP. The object weights would be the use counts, and the goal is to put as many uses into the bin as possible. There's probably a good approximate dynamic programming algorithm that we could use for that.
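For what it's worth, a rough sketch of that formulation as a classic 0/1 knapsack (all names hypothetical; capacity and weights per the description above):

```cpp
#include <algorithm>
#include <vector>

// One bin of ~128 bytes (the disp8-addressable window from FP/SP);
// weights are object sizes, values are use counts. Maximize the number
// of uses that end up with a short encoding.
struct StackObj {
  unsigned Size;     // bytes occupied (the weight)
  unsigned UseCount; // uses that would get a 1-byte displacement (the value)
};

unsigned maxShortEncodedUses(const std::vector<StackObj> &Objs,
                             unsigned Capacity = 128) {
  std::vector<unsigned> Best(Capacity + 1, 0); // Best[c]: max uses in c bytes
  for (const StackObj &O : Objs) {
    if (O.Size == 0 || O.Size > Capacity)
      continue;
    for (unsigned C = Capacity; C >= O.Size; --C)
      Best[C] = std::max(Best[C], Best[C - O.Size] + O.UseCount);
  }
  return Best[Capacity];
}
```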
Oct 22 2015
Oct 21 2015
Thanks, Simon.. Will do.
Oct 20 2015
I've incorporated your comments.
Oct 19 2015
New patch includes changes to use "isConstOrConstSplat" to consolidate the 2 constant checks, as suggested.
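For context, a sketch of the consolidated check (the surrounding helper is hypothetical): ISD::isConstOrConstSplat returns the ConstantSDNode for either a scalar constant or a constant splat vector, so one call covers both cases.

```cpp
#include "llvm/CodeGen/SelectionDAGNodes.h"
using namespace llvm;

// Hypothetical example: one query handles both the scalar-constant and
// splat-vector-constant cases that previously needed separate checks.
static bool isConstPowerOfTwo(SDValue Op) {
  if (ConstantSDNode *C = ISD::isConstOrConstSplat(Op))
    return C->getAPIntValue().isPowerOf2();
  return false;
}
```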
Oct 16 2015
Thanks for the follow up and explanation, Sanjay. lgtm..
Oct 15 2015
Thanks, Simon... Let me work on that; hopefully that will also answer Sanjay's last question regarding needing a vector version in the lit test, which I didn't comment on earlier.
Thanks Elena and Sanjay for the review and comments.
New patch with changes made based on code review comments. Factored out the profitability code and beefed up the lit test.
Just another +1 on Quentin's comment regarding it being nice to see fewer extends in DAGCombine to expose more opportunities.
Oct 14 2015
Sep 17 2015
Sep 2 2015
The overlapping is interesting.. With a 0 mod 16 rsp, if we didn't overlap, we would split a cache line. At a high level, it looks like it's a reasonable strategy (without knowing all the cases in which it'll trigger). The extra immediate generation is, however, weird.
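To make the overlap concrete, a hand-written sketch (not the generated code) of clearing 17-32 bytes with two overlapping 16-byte stores:

```cpp
#include <cstddef>
#include <cstdint>
#include <immintrin.h>

// Two 16-byte stores cover any 17 <= N <= 32 without a narrower tail
// store; with a 16-byte-aligned start, the head store can never split
// a cache line, and only the overlapping tail store might.
void zeroSmall(uint8_t *P, size_t N) {
  __m128i Z = _mm_setzero_si128();
  _mm_storeu_si128(reinterpret_cast<__m128i *>(P), Z);          // [0, 16)
  _mm_storeu_si128(reinterpret_cast<__m128i *>(P + N - 16), Z); // [N-16, N)
}
```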
Aug 24 2015
Hi Sanjay,
Aug 19 2015
Hi Sanjay,
Aug 7 2015
Totally understand and agree with your position, and I also appreciate your patience with me.
Aug 6 2015
Hi Quentin,
Aug 3 2015
OK, back from vacation.. Thanks for the additional reviews and comments. Some updates:
Jul 24 2015
I've addressed all of your comments (I hope).
Jul 22 2015
Removed a tab that snuck into the last update.
Applied Michael's suggestions.