Page MenuHomePhabricator

Please use GitHub pull requests for new patches. Phabricator shutdown timeline

zansari (Zia Ansari)
User

Projects

User does not belong to any projects.

User Details

User Since
Jul 20 2015, 10:25 AM (426 w, 4 d)

Recent Activity

Oct 13 2017

zansari added a comment to D38313: [InstCombine] Introducing Aggressive Instruction Combine pass.

Hi Amjad,

Oct 13 2017, 4:59 PM

Dec 13 2016

zansari added a comment to D27692: [x86] use a single shufps when it can save instructions.

So does that mean both this and D27684 can be committed safely? For recent hardware it makes no difference (and D27684 possibly saves a few instruction bytes). For older hardware we still save cycles compared to performing the extra shuffles and we should fix up the domain switches where possible to help a little more.

Dec 13 2016, 11:24 AM
zansari added a comment to D27692: [x86] use a single shufps when it can save instructions.

Thanks for the link, Sanjay. Yes, I was just about to comment on this in the other review as i just got confirmation.. The info in that link is right. The h/w shufflers cross both domains after IVB, therefore, not suffering the bypass penalty when switching through such instructions (perm/shuf/unpack...).

Dec 13 2016, 10:10 AM

Dec 12 2016

zansari added a comment to D27684: [X86][SSE] Fix domains for VZEXT_LOAD type instructions.

I'm working on getting some confirmation on the latest ones, but most current Core architectures suffer a 1-clk penalty switching between fp and int domains. This doesn't include the Atom line, which can do it for free.

Dec 12 2016, 4:25 PM

Dec 9 2016

zansari committed rL289153: [InstSimplify] Add "X / 1.0" to SimplifyFDivInst..
[InstSimplify] Add "X / 1.0" to SimplifyFDivInst.
Dec 9 2016, 4:02 AM
zansari closed D27587: Add "X / 1.0" to SimplifyFDivInst by committing rL289153: [InstSimplify] Add "X / 1.0" to SimplifyFDivInst..
Dec 9 2016, 4:01 AM

Dec 8 2016

zansari updated the diff for D27587: Add "X / 1.0" to SimplifyFDivInst.

That was timely and convenient, Sanjay.. Thanks :)

Dec 8 2016, 1:07 PM
zansari retitled D27587: Add "X / 1.0" to SimplifyFDivInst from to Add "X / 1.0" to SimplifyFDivInst.
Dec 8 2016, 11:56 AM

Nov 28 2016

zansari accepted D26895: [X86][LMT] Restrict nop length to one.

lgmt.. At some point, we'll need to add support for the different "lakemont" processors so that we don't affect them all with changes like this.

Nov 28 2016, 11:55 AM

Oct 17 2016

zansari added a comment to D25350: [X86] Enable interleaved memory accesses by default.

Hi Michael,

Oct 17 2016, 1:32 PM

Oct 6 2016

zansari added a comment to D24593: Standford/Bubble sort code restructure.

I'll add my 2 cents here, since I did spend some time looking into the underlying x86 architectural reasons for this regression.

Oct 6 2016, 8:22 AM

Apr 27 2016

zansari added a comment to D19558: Codegen: [X86] Set preferred loop alignment to 32 bytes..

In general, I'm not a big fan of blindly aligning loops (on any boundary), as this can cause random effects +ve and -ve.
It's very simple to come up with examples where aligning a loop will cause regressions on certain architectures, specifically in loops that have control flow.

Apr 27 2016, 11:53 AM

Apr 22 2016

zansari added a comment to D19172: New optimization bisect implementation (now modeled on optnone handling).

latest changes lgmt..

Apr 22 2016, 3:07 PM

Apr 6 2016

zansari added a comment to D18782: Reduce unroll of constant bounds loop with TripCount that is not modulo of unroll factor..

I'd prefer we used a different way to control this, like perhaps another option.

Apr 6 2016, 2:32 PM

Apr 4 2016

zansari committed rL265337: Enable unroll for constant bound loops when TripCount is not modulo of unroll….
Enable unroll for constant bound loops when TripCount is not modulo of unroll…
Apr 4 2016, 12:30 PM
zansari closed D18290: Unroll of loops with constant bounds by committing rL265337: Enable unroll for constant bound loops when TripCount is not modulo of unroll….
Apr 4 2016, 12:30 PM

Mar 30 2016

zansari accepted D18566: [x86] use SSE/AVX ops for non-zero memsets (PR27100).

Thanks, Sanjay.. This lgmt.

Mar 30 2016, 2:10 PM

Mar 29 2016

zansari added a comment to D18566: [x86] use SSE/AVX ops for non-zero memsets (PR27100).

Depending on the cost of imull, you can avoid it with:

Mar 29 2016, 1:22 PM
zansari added a comment to D18566: [x86] use SSE/AVX ops for non-zero memsets (PR27100).
The memset-2.ll tests look quite awkward in the way they splat the byte value into an XMM reg; imul isn't generally cheap.
Mar 29 2016, 1:11 PM

Mar 9 2016

zansari accepted D18000: [x86] fix cost model inaccuracy for vector memory ops.

lgtm for being an improvement over the existing code.

Mar 9 2016, 11:34 AM

Feb 29 2016

zansari added a comment to D17288: [CodeGenPrepare] Do select to branch transform when cmp's operand is expensive..

Even if branch prediction is failed, we may not lost anything. Because fdiv float/double takes more cycles than branch prediction miss penalty.
And if branch prediction is correct, we can hide fdiv's execution cycles.

Feb 29 2016, 11:03 AM

Feb 24 2016

zansari accepted D16836: [CodeGenPrepare] Remove load-based heuristic.
Feb 24 2016, 3:47 PM
zansari added a comment to D16836: [CodeGenPrepare] Remove load-based heuristic.

I would like to hear from Zia and Mitch before proceeding with this patch because of #7. It would be great to have more perf data. But like I said earlier in the thread, I don't know what our policy is in a situation where a heuristic has lost value for test-suite but still has benefits for other cases.

Feb 24 2016, 3:46 PM

Feb 22 2016

zansari added a comment to D15393: [X86] Order the local stack symbols to improve code size and locality..

Hi Zia,

do you have HSW performance numbers for this change? An internal bot is logging a >10% regression for MultiSource/Benchmarks/Ptrdist/ks/ks https://smooshbase.apple.com/perf/db_default/v4/nts/graph?highlight_run=114068&plot.746=313.746.3 (O3 flto) pinned to this change.

Thanks!
Gerolf

Feb 22 2016, 10:00 AM

Feb 15 2016

zansari committed rL260917.
Feb 15 2016, 3:49 PM
zansari closed D15393: [X86] Order the local stack symbols to improve code size and locality. by committing rL260917.
Feb 15 2016, 3:48 PM
zansari added a comment to D15393: [X86] Order the local stack symbols to improve code size and locality..
Feb 15 2016, 3:42 PM

Feb 14 2016

zansari committed rL260876: Fixed non-NULL terminating array bug in SanitizerCommon.StartSubprocessTest….
Fixed non-NULL terminating array bug in SanitizerCommon.StartSubprocessTest…
Feb 14 2016, 9:16 PM
zansari closed D17228: Fix bug in SanitizerCommon.StartSubprocessTest causing flaky behavior. by committing rL260876: Fixed non-NULL terminating array bug in SanitizerCommon.StartSubprocessTest….
Feb 14 2016, 9:16 PM

Feb 12 2016

zansari retitled D17228: Fix bug in SanitizerCommon.StartSubprocessTest causing flaky behavior. from to Fix bug in SanitizerCommon.StartSubprocessTest causing flaky behavior..
Feb 12 2016, 4:12 PM

Feb 9 2016

zansari added inline comments to D15393: [X86] Order the local stack symbols to improve code size and locality..
Feb 9 2016, 9:13 PM
zansari updated the diff for D15393: [X86] Order the local stack symbols to improve code size and locality..

Thanks, David, for the review.. And thanks, Sean, for the additional data and analysis on your workloads. I'm glad it showed positive results for you.

Feb 9 2016, 8:56 PM

Feb 7 2016

zansari added a comment to D16907: [X86] Don't zero/sign-extend i1 or i8 return values to 32 bits (PR22532).

cc'ing some Intel folks both for clarity on the ABI doc and for the potential perf impact.

Feb 7 2016, 12:35 AM

Feb 5 2016

zansari added a comment to D16836: [CodeGenPrepare] Remove load-based heuristic.

I'd like to echo Sanjay's concern regarding bypassing of the CMP heuristics.

Feb 5 2016, 11:17 AM

Jan 21 2016

zansari added a reviewer for D15393: [X86] Order the local stack symbols to improve code size and locality.: majnemer.
Jan 21 2016, 10:53 AM

Jan 11 2016

zansari added a comment to D15393: [X86] Order the local stack symbols to improve code size and locality..

Right, and I believe that's all taken care of, unless I'm missing something.

Jan 11 2016, 10:22 AM
zansari added a comment to D15393: [X86] Order the local stack symbols to improve code size and locality..

Hi Joerg,

Jan 11 2016, 9:39 AM
zansari added a comment to D15393: [X86] Order the local stack symbols to improve code size and locality..

Ping..

Jan 11 2016, 9:08 AM

Jan 4 2016

zansari added a comment to D15393: [X86] Order the local stack symbols to improve code size and locality..

Ping..

Jan 4 2016, 7:04 AM

Dec 16 2015

zansari updated the diff for D15393: [X86] Order the local stack symbols to improve code size and locality..

Thanks, Hans.. I changed one of the vectors to SmallVector, and kept the other as a vector, since it's fixed and only gets malloc'ed once. I ran a few benchmarks on the first one and sampled around 3000 frames to see that size 8 captures ~90% of the cases.

Dec 16 2015, 2:05 PM
zansari added a comment to D15393: [X86] Order the local stack symbols to improve code size and locality..

Also, another quick comment regarding minimizing stack frame size before my changes:

Dec 16 2015, 10:53 AM

Dec 15 2015

zansari added a comment to D15294: [x86] inline calls to fmaxf / llvm.maxnum.f32 using maxss (PR24475).

I'd disable the expansion under minsize, at least.. Otherwise, lgtm.

Dec 15 2015, 12:04 PM

Dec 14 2015

zansari added a comment to D15294: [x86] inline calls to fmaxf / llvm.maxnum.f32 using maxss (PR24475).

Hi Sanjay,

Dec 14 2015, 12:45 PM

Dec 11 2015

zansari updated the diff for D15393: [X86] Order the local stack symbols to improve code size and locality..

Thanks for all the reviews.. I think I've address most of the lower level concerns with an updated patch.

Dec 11 2015, 1:21 PM

Dec 9 2015

zansari added a comment to D15393: [X86] Order the local stack symbols to improve code size and locality..
In D15393#306468, @hans wrote:

This looks very interesting.

Maybe update the commit message and/or add to the comments in the file why this is beneficial for code size, since it wasn't obvious for me at least. Do I understand correctly that the idea is that addressing objects closer to %esp generally requires less code because it can use a tighter addressing mode?

Dec 9 2015, 4:15 PM
zansari added a comment to D15393: [X86] Order the local stack symbols to improve code size and locality..
In D15393#306456, @rnk wrote:

Oops, I hit submit too early.

Our current default stack layout algorithm optimizes for stack frame size without accounting for code size from frame offsets. I'm worried that your algorithm may reorder all of the potentially large allocations outside of the 256 bytes of stack that we can address with a smaller offset, and unnecessarily increase stack frame sizes.

I wonder if we can rephrase this as a weighted bin packing problem, where we only have one bin and it has size ~128 bytes, or the max one byte displacement from FP/SP. The object weights would be the use counts, and the goal is to put as many uses into the bin as possible. There's probably a good approximate dynamic programming algorithm that we could use for that.

Dec 9 2015, 4:10 PM
zansari retitled D15393: [X86] Order the local stack symbols to improve code size and locality. from to [X86] Order the local stack symbols to improve code size and locality..
Dec 9 2015, 1:29 PM

Oct 22 2015

zansari committed rL251028: [X86] - Catch extra combine opportunities for redundant imuls..
[X86] - Catch extra combine opportunities for redundant imuls.
Oct 22 2015, 9:17 AM
zansari closed D13740: Catch combine opportunities for redundant imuls by committing rL251028: [X86] - Catch extra combine opportunities for redundant imuls..
Oct 22 2015, 9:16 AM

Oct 21 2015

zansari updated the diff for D13740: Catch combine opportunities for redundant imuls.

Thanks, Simon.. Will do.

Oct 21 2015, 8:21 AM

Oct 20 2015

zansari updated the diff for D13740: Catch combine opportunities for redundant imuls.

I've incorporated your comments.

Oct 20 2015, 12:13 PM

Oct 19 2015

zansari updated the diff for D13740: Catch combine opportunities for redundant imuls.

New patch includes changes to use "isConstOrConstSplat" to consolidate the 2 constant checks, as suggested.

Oct 19 2015, 5:57 PM

Oct 16 2015

zansari added a comment to D13757: [x86] promote 'add nsw' to a wider type to allow more combines.

Thanks for the follow up and explanation, Sanjay. lgtm..

Oct 16 2015, 10:42 AM

Oct 15 2015

zansari added a comment to D13740: Catch combine opportunities for redundant imuls.

Thanks, Simon... Let me work on that, and hopefully that will also answer Sanjay's last question regarding needing a vector version in the lit test, to which I didn't comment on earlier.

Oct 15 2015, 4:35 PM
zansari added a comment to D13740: Catch combine opportunities for redundant imuls.

Thanks Elena and Sanjay for the review and comments.

Oct 15 2015, 1:18 PM
zansari updated the diff for D13740: Catch combine opportunities for redundant imuls.

New patch with changes made based on code review comments. Factored out profitability code, and beefed up lit test.

Oct 15 2015, 1:10 PM
zansari added a comment to D13757: [x86] promote 'add nsw' to a wider type to allow more combines.

Just another +1 on Quentin's comment regarding it being nice to see fewer extends in DAGCombine to expose more opportunities.

Oct 15 2015, 12:17 PM

Oct 14 2015

zansari retitled D13740: Catch combine opportunities for redundant imuls from to Catch combine opportunities for redundant imuls.
Oct 14 2015, 1:28 PM

Sep 17 2015

zansari committed rL247901: Test commit..
Test commit.
Sep 17 2015, 9:52 AM

Sep 2 2015

zansari added a comment to D12543: [x86] fix allowsMisalignedMemoryAccesses() for 8-byte and smaller accesses.

The overlapping is interesting.. With a 0 mod 16 rsp, if we didn't overlap, we would split a cache line. At a high level, it looks like it's a reasonable strategy (without knowing all the cases in which it'll trigger). The extra immediate generation is, however, weird.

Sep 2 2015, 8:37 AM

Aug 24 2015

zansari added a comment to D12288: make fast unaligned memory accesses implicit with SSE4.2 or SSE4a.

Hi Sanjay,

Aug 24 2015, 1:14 PM

Aug 19 2015

zansari added a comment to D12154: [x86] invert logic for attribute 'FeatureFastUAMem'.

Hi Sanjay,

Aug 19 2015, 12:30 PM

Aug 7 2015

zansari updated the diff for D11363: Allow merging of immediates within a basic block for code size savings and reduced footprint..

Totally understand and agree with your position, and I also appreciate your patience with me.

Aug 7 2015, 4:47 PM

Aug 6 2015

zansari added a comment to D11363: Allow merging of immediates within a basic block for code size savings and reduced footprint..

Hi Quentin,

Aug 6 2015, 10:32 PM

Aug 3 2015

zansari updated the diff for D11363: Allow merging of immediates within a basic block for code size savings and reduced footprint..

OK, back from vacation.. Thanks for the additional reviews and comments. Some updates:

Aug 3 2015, 11:17 AM

Jul 24 2015

zansari updated the diff for D11363: Allow merging of immediates within a basic block for code size savings and reduced footprint..

I've addressed all of your comment (I hope).

Jul 24 2015, 2:19 PM

Jul 22 2015

zansari updated the diff for D11363: Allow merging of immediates within a basic block for code size savings and reduced footprint..

Removed a tab that snuck into last update.

Jul 22 2015, 10:42 PM
zansari updated the diff for D11363: Allow merging of immediates within a basic block for code size savings and reduced footprint..

Applied Michael's suggestions.

Jul 22 2015, 10:39 PM

Jul 20 2015

zansari updated D11363: Allow merging of immediates within a basic block for code size savings and reduced footprint..
Jul 20 2015, 12:44 PM
zansari updated D11363: Allow merging of immediates within a basic block for code size savings and reduced footprint..
Jul 20 2015, 12:35 PM
zansari updated D11363: Allow merging of immediates within a basic block for code size savings and reduced footprint..
Jul 20 2015, 12:33 PM
zansari retitled D11363: Allow merging of immediates within a basic block for code size savings and reduced footprint. from to Allow merging of immediates within a basic block for code size savings and reduced footprint..
Jul 20 2015, 12:29 PM