Page MenuHomePhabricator

az (Abderrazek Zaafrani)
User

Projects

User does not belong to any projects.

User Details

User Since
Feb 10 2016, 2:11 PM (371 w, 4 d)

Recent Activity

Mar 5 2020

az accepted D73059: [IndVarSimplify] Extend previous special case for load use instruction to any narrow type loop variant to avoid extra trunc instruction.
Mar 5 2020, 10:57 AM · Restricted Project
az added a comment to D73059: [IndVarSimplify] Extend previous special case for load use instruction to any narrow type loop variant to avoid extra trunc instruction.

Since you discovered this problem with non-default pass pipeline, then I do not need to see your test. With the default llvm passes, it was easy for me to find tests with load where the patch can be beneficial because of limitation in alias analysis preventing some optimizations and this patch can clean up things. But, I was not able back then to find meaningful test with non-load where this patch can help. It is great that you have a test (with your non-default passes) where this patch can be useful. It seems that both of us tested this code very well on ARM. Also, theoretically removing an extra instruction should be beneficial for other architectures too.

Mar 5 2020, 10:57 AM · Restricted Project
az added a comment to D73059: [IndVarSimplify] Extend previous special case for load use instruction to any narrow type loop variant to avoid extra trunc instruction.

The patch looks good to me. It is a generalization of the original patch. The only thing missing here is to describe the testing results or post them here (even though on paper, getting rid of a truncate instruction should be beneficial).

Mar 5 2020, 7:40 AM · Restricted Project

Sep 19 2019

az added a comment to D67645: [aarch64] add def-pats for dot product.

What if the code is written with intrinsic but using mul, reduce (say similar to last test in this patch), then this patch will optimize that into dot product instruction. So, for legacy code that was written with old intrinsic, then this patch will remain useful even after dot product is implemented in the vectorizer.

Sep 19 2019, 8:53 AM · Restricted Project

Sep 18 2019

az added a comment to D67645: [aarch64] add def-pats for dot product.

There are few things missing in current work such as indexed dot product or what they call s/udot (vector, by element) in the ARM document (no need to do it now but a comment about that would help). There is also sve dot product. We need to port this code to SVE.

Sep 18 2019, 10:30 AM · Restricted Project

Mar 6 2019

az committed rG5ced5961984b: [AArch64] Improve FP16 instruction selection for vector round and vector conver… (authored by az).
[AArch64] Improve FP16 instruction selection for vector round and vector conver…
Mar 6 2019, 12:32 PM
az committed rL355545: [AArch64] Improve FP16 instruction selection for vector round and vector conver….
[AArch64] Improve FP16 instruction selection for vector round and vector conver…
Mar 6 2019, 12:29 PM

Mar 1 2019

az created D58855: [AArch64] Improve FP16 instruction selection for vector round and vector convert from half instructions.
Mar 1 2019, 4:29 PM · Restricted Project

Feb 28 2019

az committed rGabfd10807ca6: [AArch64] Improve FP16 vector convert from short instructions. https://reviews. (authored by az).
[AArch64] Improve FP16 vector convert from short instructions. https://reviews.
Feb 28 2019, 12:22 PM
az committed rL355134: [AArch64] Improve FP16 vector convert from short instructions..
[AArch64] Improve FP16 vector convert from short instructions.
Feb 28 2019, 12:21 PM

Feb 27 2019

az committed rG2fc498a652dc: [AArch64] Generate FP16 vector compare instructions. https://reviews.llvm. (authored by az).
[AArch64] Generate FP16 vector compare instructions. https://reviews.llvm.
Feb 27 2019, 4:32 PM
az committed rL355050: [AArch64] Generate FP16 vector compare instructions..
[AArch64] Generate FP16 vector compare instructions.
Feb 27 2019, 4:30 PM

Feb 22 2019

az created D58563: [AArch64] Improve FP16 vector convert from short instructions.
Feb 22 2019, 4:38 PM · Restricted Project
az created D58561: [AArch64] Generate FP16 vector compare instructions.
Feb 22 2019, 4:28 PM · Restricted Project

Sep 7 2018

az committed rL341726: [SimplifyIndVar] Avoid generating truncate instructions with non-hoisted Laod….
[SimplifyIndVar] Avoid generating truncate instructions with non-hoisted Laod…
Sep 7 2018, 3:43 PM
az closed D49151: [SimplifyIndVar] Avoid generating truncate instructions with non-hoisted Load operand..
Sep 7 2018, 3:43 PM
az updated the diff for D49151: [SimplifyIndVar] Avoid generating truncate instructions with non-hoisted Load operand..
Sep 7 2018, 1:44 PM

Sep 6 2018

az updated the diff for D49151: [SimplifyIndVar] Avoid generating truncate instructions with non-hoisted Load operand..
Sep 6 2018, 2:07 PM
az added inline comments to D49151: [SimplifyIndVar] Avoid generating truncate instructions with non-hoisted Load operand..
Sep 6 2018, 10:50 AM
az updated the diff for D49151: [SimplifyIndVar] Avoid generating truncate instructions with non-hoisted Load operand..
Sep 6 2018, 10:48 AM

Aug 23 2018

az added a reviewer for D49151: [SimplifyIndVar] Avoid generating truncate instructions with non-hoisted Load operand.: sebpop.
Aug 23 2018, 2:10 PM

Jul 27 2018

az updated the diff for D49151: [SimplifyIndVar] Avoid generating truncate instructions with non-hoisted Load operand..

Some cleaning and update to comments.

Jul 27 2018, 5:20 PM
az created D49941: [ARM] Add ARMv8.2-A FP16 scalar intrinsic.
Jul 27 2018, 3:02 PM

Jul 25 2018

az added inline comments to D49151: [SimplifyIndVar] Avoid generating truncate instructions with non-hoisted Load operand..
Jul 25 2018, 4:58 PM
az added a comment to D49151: [SimplifyIndVar] Avoid generating truncate instructions with non-hoisted Load operand..

Your example doesn't really help make the case for this patch. The load in that test is actually loop-invariant; we just don't figure that out until after indvars transforms the induction variable. Probably LICM could be fixed to handle this case earlier. Then ultimately, the multiply goes away; the extra operation you're trying to get rid of is actually the sign-extension of a PHI node created by LSR. LSR and/or SCEV could probably be fixed so this produces an i64 PHI instead. Either of those fixes would be more straightforward and more obviously profitable.

Jul 25 2018, 4:45 PM
az updated the diff for D49151: [SimplifyIndVar] Avoid generating truncate instructions with non-hoisted Load operand..
Jul 25 2018, 4:39 PM

Jul 24 2018

az added a comment to D49151: [SimplifyIndVar] Avoid generating truncate instructions with non-hoisted Load operand..

Ping

Jul 24 2018, 8:14 AM

Jul 13 2018

az added inline comments to D49151: [SimplifyIndVar] Avoid generating truncate instructions with non-hoisted Load operand..
Jul 13 2018, 4:03 PM
az added a comment to D49151: [SimplifyIndVar] Avoid generating truncate instructions with non-hoisted Load operand..
Jul 13 2018, 3:59 PM
az updated the diff for D49151: [SimplifyIndVar] Avoid generating truncate instructions with non-hoisted Load operand..

How does the code generation actually change on aarch64? As far as I can tell, you're basically just changing an "sxtw #2" to an "lsl #2"; does that save a uop on A72?

Jul 13 2018, 3:55 PM

Jul 10 2018

az created D49151: [SimplifyIndVar] Avoid generating truncate instructions with non-hoisted Load operand..
Jul 10 2018, 12:55 PM

Mar 22 2018

az committed rC328277: [ARM] Add ARMv8.2-A FP16 vector intrinsic.
[ARM] Add ARMv8.2-A FP16 vector intrinsic
Mar 22 2018, 5:11 PM
az committed rL328277: [ARM] Add ARMv8.2-A FP16 vector intrinsic.
[ARM] Add ARMv8.2-A FP16 vector intrinsic
Mar 22 2018, 5:11 PM

Mar 20 2018

az committed rC328038: [AArch64] Add vmulxh_lane fp16 vector intrinsic.
[AArch64] Add vmulxh_lane fp16 vector intrinsic
Mar 20 2018, 1:40 PM
az committed rL328038: [AArch64] Add vmulxh_lane fp16 vector intrinsic.
[AArch64] Add vmulxh_lane fp16 vector intrinsic
Mar 20 2018, 1:40 PM
az committed rL328035: [AArch64] Add vmulxh_lane fp16 vector intrinsic.
[AArch64] Add vmulxh_lane fp16 vector intrinsic
Mar 20 2018, 1:30 PM

Mar 19 2018

az updated the diff for D44591: [AArch64] Add vmulxh_lane FP16 vector intrinsic.

add LLVM codegen tests as suggested in the reviews.

Mar 19 2018, 3:47 PM

Mar 16 2018

az added a comment to D44222: [AArch64] Add vmulxh_lane FP16 intrinsics.

Was not able to update this particular review with the new code, So I created a new one in https://reviews.llvm.org/D44591

Mar 16 2018, 4:03 PM
az created D44591: [AArch64] Add vmulxh_lane FP16 vector intrinsic.
Mar 16 2018, 3:44 PM

Mar 9 2018

az committed rC327189: [ARM] Add ARMv8.2-A FP16 vector intrinsic.
[ARM] Add ARMv8.2-A FP16 vector intrinsic
Mar 9 2018, 3:43 PM
az committed rL327189: [ARM] Add ARMv8.2-A FP16 vector intrinsic.
[ARM] Add ARMv8.2-A FP16 vector intrinsic
Mar 9 2018, 3:43 PM

Mar 8 2018

az added inline comments to D44222: [AArch64] Add vmulxh_lane FP16 intrinsics.
Mar 8 2018, 5:20 PM

Feb 22 2018

az created D43650: [ARM] Add ARMv8.2-A FP16 vector intrinsics.
Feb 22 2018, 4:02 PM

Feb 12 2018

az closed D42993: [AArch64] Fixes for ARMv8.2-A FP16 scalar intrinsic.

Committed as r324940 and r324912

Feb 12 2018, 2:09 PM
az committed rL324940: [AArch64] Fixes for ARMv8.2-A FP16 scalar intrinsic - clang portion.
[AArch64] Fixes for ARMv8.2-A FP16 scalar intrinsic - clang portion
Feb 12 2018, 1:28 PM
az committed rC324940: [AArch64] Fixes for ARMv8.2-A FP16 scalar intrinsic - clang portion.
[AArch64] Fixes for ARMv8.2-A FP16 scalar intrinsic - clang portion
Feb 12 2018, 1:28 PM
az committed rL324912: [AArch64] Fixes for ARMv8.2-A FP16 scalar intrinsic - llvm portion.
[AArch64] Fixes for ARMv8.2-A FP16 scalar intrinsic - llvm portion
Feb 12 2018, 9:37 AM

Feb 8 2018

az updated the diff for D42993: [AArch64] Fixes for ARMv8.2-A FP16 scalar intrinsic.

Question about the failures: I am now wondering if this means we were and still are missing tests?

Feb 8 2018, 3:40 PM

Feb 6 2018

az created D42993: [AArch64] Fixes for ARMv8.2-A FP16 scalar intrinsic.
Feb 6 2018, 4:08 PM

Jan 19 2018

az committed rC323006: [AArch64] Add ARMv8.2-A FP16 scalar intrinsics.
[AArch64] Add ARMv8.2-A FP16 scalar intrinsics
Jan 19 2018, 3:13 PM
az committed rL323006: [AArch64] Add ARMv8.2-A FP16 scalar intrinsics.
[AArch64] Add ARMv8.2-A FP16 scalar intrinsics
Jan 19 2018, 3:13 PM
az committed rL323005: [AArch64] Add ARMv8.2-A FP16 scalar intrinsics.
[AArch64] Add ARMv8.2-A FP16 scalar intrinsics
Jan 19 2018, 3:13 PM

Jan 11 2018

az added inline comments to D41792: [AArch64] Add ARMv8.2-A FP16 scalar intrinsics.
Jan 11 2018, 2:26 PM
az updated the diff for D41792: [AArch64] Add ARMv8.2-A FP16 scalar intrinsics.
Jan 11 2018, 1:50 PM

Jan 10 2018

az added inline comments to D41792: [AArch64] Add ARMv8.2-A FP16 scalar intrinsics.
Jan 10 2018, 4:54 PM
az updated the diff for D41792: [AArch64] Add ARMv8.2-A FP16 scalar intrinsics.
Jan 10 2018, 4:06 PM

Jan 5 2018

az created D41792: [AArch64] Add ARMv8.2-A FP16 scalar intrinsics.
Jan 5 2018, 4:41 PM

Dec 21 2017

az committed rC321301: [AArch64] Enable fp16 data type for the Builtin for AArch64 only..
[AArch64] Enable fp16 data type for the Builtin for AArch64 only.
Dec 21 2017, 12:11 PM
az committed rL321301: [AArch64] Enable fp16 data type for the Builtin for AArch64 only..
[AArch64] Enable fp16 data type for the Builtin for AArch64 only.
Dec 21 2017, 12:11 PM
az committed rC321294: [AARch64] Add ARMv8.2-A FP16 vector intrinsics.
[AARch64] Add ARMv8.2-A FP16 vector intrinsics
Dec 21 2017, 11:21 AM
az committed rL321294: [AARch64] Add ARMv8.2-A FP16 vector intrinsics.
[AARch64] Add ARMv8.2-A FP16 vector intrinsics
Dec 21 2017, 11:20 AM
az added a comment to D41360: [AARCH64] Enable fp16 data type for the Builtin in aarch64 only.

Looks reasonable to me.

Just checking, do we need those changes in CodeGen/arm_neon_intrinsics.c? I mean, trunk is already using i16 in these tests.

Dec 21 2017, 8:20 AM

Dec 18 2017

az created D41360: [AARCH64] Enable fp16 data type for the Builtin in aarch64 only.
Dec 18 2017, 9:27 AM

Dec 11 2017

az added a comment to D38196: [AArch64] Avoid interleaved SIMD store instructions for Exynos.

Code has been up-streamed in https://reviews.llvm.org/rL320123 and https://reviews.llvm.org/rL320204.

Dec 11 2017, 4:54 PM
az added a comment to D38196: [AArch64] Avoid interleaved SIMD store instructions for Exynos.

Thank you for providing several very useful feedback to improve this patch. In the future (not very soon), I am planning to make this pass more complete by 1) implementing the existing vector element sub-pass in the same way as optimize interleaved store instructions sub-pass for consistency reason (put all rewrite rules and not just one, caching, instruction replacement table, etc.) 2) Addressing the ST3 instruction inefficiency assuming I can find an efficient combination of instructions to replace ST3. I hope I can put you as the reviewer.

Dec 11 2017, 4:51 PM

Dec 8 2017

az committed rL320204: [AArch64] Rename AArch64VecorByElementOpt.cpp into AArch64SIMDInstrOpt.cpp to….
[AArch64] Rename AArch64VecorByElementOpt.cpp into AArch64SIMDInstrOpt.cpp to…
Dec 8 2017, 2:05 PM

Dec 7 2017

az committed rL320123: [AArch64] Avoid SIMD interleaved store instruction for Exynos..
[AArch64] Avoid SIMD interleaved store instruction for Exynos.
Dec 7 2017, 4:59 PM

Nov 29 2017

az added inline comments to D38196: [AArch64] Avoid interleaved SIMD store instructions for Exynos.
Nov 29 2017, 4:19 PM
az updated the diff for D38196: [AArch64] Avoid interleaved SIMD store instructions for Exynos.

Added caching mechanism to save decisions per target of whether to go into the interleaved store replacement pass or not.

Nov 29 2017, 3:40 PM

Nov 21 2017

az added a comment to D38196: [AArch64] Avoid interleaved SIMD store instructions for Exynos.

i

Nov 21 2017, 9:35 AM

Nov 16 2017

az added inline comments to D38196: [AArch64] Avoid interleaved SIMD store instructions for Exynos.
Nov 16 2017, 11:27 AM

Nov 10 2017

az added inline comments to D38196: [AArch64] Avoid interleaved SIMD store instructions for Exynos.
Nov 10 2017, 9:49 AM
az updated the diff for D38196: [AArch64] Avoid interleaved SIMD store instructions for Exynos.

Code reuse and simplification due to the use of a table that has instruction replacement info.

Nov 10 2017, 9:43 AM

Nov 7 2017

az added inline comments to D38196: [AArch64] Avoid interleaved SIMD store instructions for Exynos.
Nov 7 2017, 7:30 AM

Nov 6 2017

az added inline comments to D38196: [AArch64] Avoid interleaved SIMD store instructions for Exynos.
Nov 6 2017, 5:09 PM
az updated the diff for D38196: [AArch64] Avoid interleaved SIMD store instructions for Exynos.

Adding subtarget as part of key and simplifying some code.

Nov 6 2017, 5:06 PM

Nov 2 2017

az updated the diff for D38196: [AArch64] Avoid interleaved SIMD store instructions for Exynos.
Nov 2 2017, 3:10 PM

Oct 31 2017

az updated the diff for D38196: [AArch64] Avoid interleaved SIMD store instructions for Exynos.

Made caching mechanism for instruction replacement decisions to work across functions (instead of within function only) by declaring SIMDInstrTable as a global variable.

Oct 31 2017, 9:34 AM

Oct 30 2017

az added a comment to D38196: [AArch64] Avoid interleaved SIMD store instructions for Exynos.

Hmmmm.... this seems to be a tough one to find the right tradeoffs....
My personal opinions/observations:

  1. Making ShouldExitEarly stop the whole pass even if some of the optimizations would trigger is a hack that makes this pass operate in a non-obvious way and that will bite us in the future.
  2. I guess there's a chance even more peephole-y patterns will be added to this pass in the future, which would make ShouldExitEarly even more expensive to check for all rewriting rules implemented. But also, if that function would take short cuts and heuristics, the pass would operate in more surprising and non-obvious ways.

The above made me wonder if to keep compile-time under control, it wouldn't be better to integrate this in one of the existing frameworks for peepholes rather than in a pass of its own. And then I looked at D21571 where this pass got introduced and saw such approaches were probably tried and rejected in early version of that review.
Right now, my best guess of the easiest way forward is to make ShouldExitEarly compute the profitability of all rewrite rules implemented, and let it cache the results of that analysis for each subtarget it encounters. That way, the cost of this computation should be amortized out over all functions the pass will handle, resulting in not having too much impact on compile time?
Having said that, I don't know of another pass that has a caching mechanism like this. But I also haven't tried looking for such an example. I'm not sure if there's going to be a gotcha in trying to cache those results somehow.
What do you think?

Thanks,

Kristof

Oct 30 2017, 5:26 PM
az updated the diff for D38196: [AArch64] Avoid interleaved SIMD store instructions for Exynos.

A version that addresses the latest comments and in particular a more complete version of shouldExitEarly().

Oct 30 2017, 4:34 PM

Oct 25 2017

az added a comment to D38196: [AArch64] Avoid interleaved SIMD store instructions for Exynos.

I thought the "shouldExitEarly" method aims to check if any of the transformations the pass can do can be profitable for the target at all.
My understanding is that this can be checked by just examining the instruction latencies, not having to look at the actual code being compiled at all. I assume that the saving of compile time comes from the fact that "shouldExitEarly" needs to do an amount of work proportional to the number of rewriting rules implemented, not proportional to the amount of code being compiled.
So, for the pass to work as expected, i.e. always run when for the target at least one of the transformations is profitable, I think shouldExitEarly should check all rewriting rules implemented.

Does that make sense, or am I misunderstanding something here?

Oct 25 2017, 10:14 AM
az added a comment to D38196: [AArch64] Avoid interleaved SIMD store instructions for Exynos.

Thanks. I assume you confirmed there are no performance regressions at all in the test-suite?

Oct 25 2017, 8:05 AM

Oct 13 2017

az added a comment to D38196: [AArch64] Avoid interleaved SIMD store instructions for Exynos.

But my main question at this point is: did you benchmark this? What are the results of that? On which cores/benchmarks?

Oct 13 2017, 8:03 AM
az updated the diff for D38196: [AArch64] Avoid interleaved SIMD store instructions for Exynos.
Oct 13 2017, 7:58 AM

Oct 11 2017

az added inline comments to D38196: [AArch64] Avoid interleaved SIMD store instructions for Exynos.
Oct 11 2017, 4:22 PM
az updated the diff for D38196: [AArch64] Avoid interleaved SIMD store instructions for Exynos.

Kristof, thanks again for the new feedback. Uploaded a new version.

Oct 11 2017, 4:11 PM

Oct 9 2017

az updated the diff for D38196: [AArch64] Avoid interleaved SIMD store instructions for Exynos.

Kristof, Thank you for the feedback. Very useful comments to make the patch more readable and more maintainable.
I addressed your comments with the new version.

Oct 9 2017, 4:55 PM

Sep 29 2017

az added a comment to D38196: [AArch64] Avoid interleaved SIMD store instructions for Exynos.

ping

Sep 29 2017, 4:00 PM

Sep 25 2017

az updated subscribers of D38196: [AArch64] Avoid interleaved SIMD store instructions for Exynos.
Sep 25 2017, 5:02 AM
az created D38196: [AArch64] Avoid interleaved SIMD store instructions for Exynos.
Sep 25 2017, 5:02 AM

Jun 20 2017

az closed D32511: [AArch64] Add ARMv8.2-A FP16 vector intrinsics.

https://reviews.llvm.org/rL305820

Jun 20 2017, 12:05 PM
az committed rL305820: [AArch64] ADD ARMv.2-A FP16 vector intrinsics.
[AArch64] ADD ARMv.2-A FP16 vector intrinsics
Jun 20 2017, 11:56 AM
az closed D34161: [AArch64] Add ARMv8.2-A FP16 vector intrinsics - Continuation by committing rL305820: [AArch64] ADD ARMv.2-A FP16 vector intrinsics.
Jun 20 2017, 11:55 AM

Jun 19 2017

az updated the diff for D34161: [AArch64] Add ARMv8.2-A FP16 vector intrinsics - Continuation.

I am resubmitting the combination of D32511 and D34161 in here to avoid any confusion.

Jun 19 2017, 1:41 PM

Jun 16 2017

az added a comment to D34161: [AArch64] Add ARMv8.2-A FP16 vector intrinsics - Continuation.

Ok, I will upload the combination of the two patches here.

Jun 16 2017, 4:49 PM

Jun 15 2017

az added a comment to D34161: [AArch64] Add ARMv8.2-A FP16 vector intrinsics - Continuation.

Not sure if I am missing something, but the new regression tests arm-v8.2a-neon-intrinsics.c and aarch64-neon-intrinsics.c are failing for me (after first applying patch D32511 and then this one here in D34161). For example, the very first test in arm-v8.2a-neon-intrinsics.c checks that the function argument is <4 x half> %a, but <4 x half> %1 is generated. With anything other than -O0 you will get the %a (and then it becomes a tailcall).

Jun 15 2017, 12:10 PM

Jun 13 2017

az created D34161: [AArch64] Add ARMv8.2-A FP16 vector intrinsics - Continuation.
Jun 13 2017, 12:57 PM

Jun 1 2017

az committed rL304493: [AArch64] Add ARMv8.2-A FP16 vefctor intrinsics.
[AArch64] Add ARMv8.2-A FP16 vefctor intrinsics
Jun 1 2017, 4:23 PM

May 30 2017

az committed rL304259: Add latency info for Exynos interleaved Load/Store instructions..
Add latency info for Exynos interleaved Load/Store instructions.
May 30 2017, 5:21 PM

May 15 2017

az updated the diff for D32511: [AArch64] Add ARMv8.2-A FP16 vector intrinsics.

Removed unnecessary f32 intrinsic tests; removed tabs

May 15 2017, 4:11 PM