Page MenuHomePhabricator

jaykang10 (JinGu Kang)
User

Projects

User does not belong to any projects.

User Details

User Since
Jun 18 2014, 2:14 AM (414 w, 2 d)

Recent Activity

Mon, May 23

jaykang10 added a reverting change for rG42ebfa826947: Revert "[AArch64] Set maximum VF with shouldMaximizeVectorBandwidth": rGbb82f746129f: Revert "Revert "[AArch64] Set maximum VF with shouldMaximizeVectorBandwidth"".
Mon, May 23, 8:18 AM · Restricted Project, Restricted Project
jaykang10 committed rGbb82f746129f: Revert "Revert "[AArch64] Set maximum VF with shouldMaximizeVectorBandwidth"" (authored by jaykang10).
Revert "Revert "[AArch64] Set maximum VF with shouldMaximizeVectorBandwidth""
Mon, May 23, 8:18 AM · Restricted Project, Restricted Project
jaykang10 closed D118979: [AArch64] Set maximum VF with shouldMaximizeVectorBandwidth.
Mon, May 23, 8:18 AM · Restricted Project, Restricted Project
jaykang10 updated subscribers of D118979: [AArch64] Set maximum VF with shouldMaximizeVectorBandwidth.

I have checked the commit from https://reviews.llvm.org/D125918 has fixed the stage 2 build error with this patch.
Thanks for fixing it! @paulwalker-arm and @peterwaller-arm
Let me push this patch again.

Mon, May 23, 8:01 AM · Restricted Project, Restricted Project

Fri, Apr 29

jaykang10 added a comment to D118090: [ScalarEvolution] Handle <= and >= in non infinite loops.

I am also getting performance degradation from this patch. The backedge count with the max is considered as high cost and it blocks to rewrite exit value in IndVarSimplify pass. It affects LSR's decision to rewrite IV and it causes additional instructions in loop...
As far as I understand, the expression of backedge count describes (max(End,Start)-Start)/Stride normally. If the expression satisfies with (max(RHS,Start) > Start - Stride, the expression is refined simply.
In order to prove (max(RHS,Start) > Start - Stride, SCEV is using below code.

// Can we prove (max(RHS,Start) > Start - Stride?
if (isLoopEntryGuardedByCond(L, Cond, OrigStartMinusStride, OrigStart) &&
    isLoopEntryGuardedByCond(L, Cond, OrigStartMinusStride, OrigRHS)) {

Let's see the reduced example of @nikic on https://github.com/llvm/llvm-project/issues/54191 with above code.

; RUN: opt -S -passes='print<scalar-evolution>' < %s
define void @test(i64 %n) mustprogress {
entry:
  %guard = icmp sgt i64 %n, 1
  br i1 %guard, label %loop, label %exit
Fri, Apr 29, 5:07 AM · Restricted Project, Restricted Project

Apr 19 2022

jaykang10 added a comment to D118979: [AArch64] Set maximum VF with shouldMaximizeVectorBandwidth.

It looks the SVE target's data layout is missing 512-bit vector's alignment. The data layout does not mention the alignment of 512-bit vector so the alignment is same with its size. It is bigger than stack's alignment which is 128-bits and it causes stack re-alignment...

Apr 19 2022, 9:18 AM · Restricted Project, Restricted Project

Apr 13 2022

jaykang10 added a comment to D118979: [AArch64] Set maximum VF with shouldMaximizeVectorBandwidth.

um... I can reproduce the build error now... Maybe, there was something wrong with my build...
Thanks for help @omjavaid

Apr 13 2022, 8:48 PM · Restricted Project, Restricted Project
jaykang10 added a comment to D118979: [AArch64] Set maximum VF with shouldMaximizeVectorBandwidth.

um... I have tried to reproduce it on a64fx machine. It looks it is failed with the big number of parallel tasks of ninja build. When I run the failed command manually, it is passed... If I reduce the number of the parallel tasks with ninja build like ninja -j16, the build is also passed... it could be memory allocation error from the bigger memory allocation than before...
The stage2 build and check-all are passed on Cortex-A72 machine...
To be safe, for now, I would like to disable shouldMaximizeVectorBandwidth for SVE...
Let me update this patch with it.

I did have a completely difference experience, here are steps to reproduce:

Stage 1:

CC=$(pwd)/clang+llvm-13.0.1-aarch64-linux-gnu/bin/clang CXX=$(pwd)/clang+llvm-13.0.1-aarch64-linux-gnu/bin/clang++
cmake -G Ninja ../llvm/llvm \
	-DCMAKE_BUILD_TYPE=Release \
	-DLLVM_ENABLE_ASSERTIONS=True \
	'-DLLVM_LIT_ARGS='"'"'-v'"'"'' \
	-DCMAKE_INSTALL_PREFIX=../stage$1.install \
	-DCMAKE_C_COMPILER=$CC \
	-DCMAKE_CXX_COMPILER=$CXX \
	'-DCMAKE_C_FLAGS='"'"'-mcpu=a64fx -msve-vector-bits=512 -mllvm -treat-scalable-fixed-error-as-warning=false'"'"'' \
	'-DCMAKE_CXX_FLAGS='"'"'-mcpu=a64fx -msve-vector-bits=512 -mllvm -treat-scalable-fixed-error-as-warning=false'"'"'' \
       	-DLLVM_ENABLE_LLD=True \
	'-DLLVM_LIT_ARGS='"'"'-v -j12'"'"'' \
	'-DLLVM_ENABLE_PROJECTS=llvm;mlir;clang-tools-extra;compiler-rt;clang;lld;flang'
ninja

Stage 2:

CC=$(pwd)/stage1/bin/clang CXX=$(pwd)/stage1/bin/clang++
cmake -G Ninja ../llvm/llvm \
	-DCMAKE_BUILD_TYPE=Release \
	-DLLVM_ENABLE_ASSERTIONS=True \
	'-DLLVM_LIT_ARGS='"'"'-v'"'"'' \
	-DCMAKE_INSTALL_PREFIX=../stage$1.install \
	-DCMAKE_C_COMPILER=$CC \
	-DCMAKE_CXX_COMPILER=$CXX \
	'-DCMAKE_C_FLAGS='"'"'-mcpu=a64fx -msve-vector-bits=512 -mllvm -treat-scalable-fixed-error-as-warning=false'"'"'' \
	'-DCMAKE_CXX_FLAGS='"'"'-mcpu=a64fx -msve-vector-bits=512 -mllvm -treat-scalable-fixed-error-as-warning=false'"'"'' \
       	-DLLVM_ENABLE_LLD=True \
	'-DLLVM_LIT_ARGS='"'"'-v -j12'"'"'' \
	'-DLLVM_ENABLE_PROJECTS=llvm;mlir;clang-tools-extra;compiler-rt;clang;lld;flang'

ninja llvm-tblgen

/home/omair.javaid/work/llvm-test/stage2/bin/llvm-tblgen -gen-dag-isel -I /home/omair.javaid/work/llvm-test/llvm/llvm/lib/Target/PowerPC -I/home/omair.javaid/work/llvm-test/stage2/include -I/home/omair.javaid/work/llvm-test/llvm/llvm/include -I /home/omair.javaid/work/llvm-test/llvm/llvm/lib/Target -omit-comments /home/omair.javaid/work/llvm-test/llvm/llvm/lib/Target/PowerPC/PPC.td --write-if-changed -o lib/Target/PowerP
Apr 13 2022, 6:02 PM · Restricted Project, Restricted Project
jaykang10 added a comment to D118979: [AArch64] Set maximum VF with shouldMaximizeVectorBandwidth.

um... I have tried to reproduce it on a64fx machine. It looks it is failed with the big number of parallel tasks of ninja build. When I run the failed command manually, it is passed... If I reduce the number of the parallel tasks with ninja build like ninja -j16, the build is also passed... it could be memory allocation error from the bigger memory allocation than before...
The stage2 build and check-all are passed on Cortex-A72 machine...
To be safe, for now, I would like to disable shouldMaximizeVectorBandwidth for SVE...
Let me update this patch with it.

I think it would be best to figure out if there's actually an issue/miscompile before re-landing. As is, the patch should only enable shouldMaximizeVectorBandwidth already, right?

Apr 13 2022, 8:56 AM · Restricted Project, Restricted Project
jaykang10 added a comment to D118979: [AArch64] Set maximum VF with shouldMaximizeVectorBandwidth.

um... I have tried to reproduce it on a64fx machine. It looks it is failed with the big number of parallel tasks of ninja build. When I run the failed command manually, it is passed... If I reduce the number of the parallel tasks with ninja build like ninja -j16, the build is also passed... it could be memory allocation error from the bigger memory allocation than before...
The stage2 build and check-all are passed on Cortex-A72 machine...
To be safe, for now, I would like to disable shouldMaximizeVectorBandwidth for SVE...
Let me update this patch with it.

Apr 13 2022, 8:23 AM · Restricted Project, Restricted Project
jaykang10 added a comment to D118979: [AArch64] Set maximum VF with shouldMaximizeVectorBandwidth.

This broke LLVM AArch64 buildbot clang-aarch64-sve-vls-2stage
https://lab.llvm.org/buildbot/#/builders/176/builds/1515
llvm-tblgen crashes after applying this patch.

I have reverted it for now. Kindly have a look.

Apr 13 2022, 12:15 AM · Restricted Project, Restricted Project

Apr 5 2022

jaykang10 committed rG64b6192e8129: [AArch64] Set maximum VF with shouldMaximizeVectorBandwidth (authored by jaykang10).
[AArch64] Set maximum VF with shouldMaximizeVectorBandwidth
Apr 5 2022, 5:19 AM · Restricted Project, Restricted Project
jaykang10 closed D118979: [AArch64] Set maximum VF with shouldMaximizeVectorBandwidth.
Apr 5 2022, 5:19 AM · Restricted Project, Restricted Project
jaykang10 updated the diff for D118979: [AArch64] Set maximum VF with shouldMaximizeVectorBandwidth.

Following comments from @fhahn, updated patch.

Apr 5 2022, 4:05 AM · Restricted Project, Restricted Project
jaykang10 added inline comments to D118979: [AArch64] Set maximum VF with shouldMaximizeVectorBandwidth.
Apr 5 2022, 3:26 AM · Restricted Project, Restricted Project

Apr 4 2022

jaykang10 updated the diff for D118979: [AArch64] Set maximum VF with shouldMaximizeVectorBandwidth.

Following the comment from @dmgreen, updated code.

Apr 4 2022, 6:37 AM · Restricted Project, Restricted Project
jaykang10 added inline comments to D118979: [AArch64] Set maximum VF with shouldMaximizeVectorBandwidth.
Apr 4 2022, 6:32 AM · Restricted Project, Restricted Project
jaykang10 added inline comments to D118979: [AArch64] Set maximum VF with shouldMaximizeVectorBandwidth.
Apr 4 2022, 6:14 AM · Restricted Project, Restricted Project
jaykang10 added a comment to D123007: [AArch64] Increase cost of v2i64 multiplies.

Thank for patch. I am +1.
I hope other aarch64 folks are also happy with this cost.

Apr 4 2022, 2:37 AM · Restricted Project, Restricted Project

Mar 22 2022

jaykang10 added a comment to D121788: [AArch64] Increase MaxInterleaveFactor to 4.

Thanks for comment! @dmgreen

Do you have any performance results for changing this? I've not had much luck with trying it in the past, and it obviously can change quite a lot. It can certainly help in places, but I've found that if you turn it up too high you just end up over-unrolling loops, not getting into the fast loop body as much. It can obviously depend on the input code and loop counts tough. Perhaps it needs some better costmodelling?

I was able to see the overall performance number slightly up for an internal benchmark on neoverse-n1 but we would need to tune something like cost model according to micro architectures.

I was still hoping to get D118979 in because it should help quite a bit - and it on it's own increases the number of items processed per vector element, and this will increase it further. We have cleaned up quite a few of the places it doesn't do as well, there are just a few that have been stuck in review a while. Perhaps it makes sense to try and push that through, then re-evaluate this on top?

I agree with you. Let's visit this patch later.

Mar 22 2022, 8:51 AM · Restricted Project, Restricted Project

Mar 21 2022

jaykang10 added a comment to D121788: [AArch64] Increase MaxInterleaveFactor to 4.

Ping

Mar 21 2022, 9:28 AM · Restricted Project, Restricted Project

Mar 16 2022

jaykang10 requested review of D121788: [AArch64] Increase MaxInterleaveFactor to 4.
Mar 16 2022, 3:58 AM · Restricted Project, Restricted Project

Feb 10 2022

jaykang10 updated the diff for D118979: [AArch64] Set maximum VF with shouldMaximizeVectorBandwidth.

Following the comment of @paulwalker-arm, a option is added.

Feb 10 2022, 5:17 AM · Restricted Project, Restricted Project
jaykang10 added a comment to D118979: [AArch64] Set maximum VF with shouldMaximizeVectorBandwidth.

So I'm wondering why we don't just control the functionality with a default off command line flag. That way it's available for testing, including unit-tests, until such a point where code generation is at a point where is makes sense to default it to on. Is this a terribly idea?

There is an option -vectorizer-maximize-bandwidth to control the functionality in LV.

static cl::opt<bool> MaximizeBandwidth(
    "vectorizer-maximize-bandwidth", cl::init(false), cl::Hidden,
    cl::desc("Maximize bandwidth when selecting vectorization factor which "
             "will be determined by the smallest type in loop."));

Oh sure, but that affects all targets. Whereas I thought here we're talking about finding a migration path to enable it by default for AArch64 only.

Feb 10 2022, 4:23 AM · Restricted Project, Restricted Project
jaykang10 added a comment to D118979: [AArch64] Set maximum VF with shouldMaximizeVectorBandwidth.

So I'm wondering why we don't just control the functionality with a default off command line flag. That way it's available for testing, including unit-tests, until such a point where code generation is at a point where is makes sense to default it to on. Is this a terribly idea?

Feb 10 2022, 3:53 AM · Restricted Project, Restricted Project
jaykang10 added a comment to D118979: [AArch64] Set maximum VF with shouldMaximizeVectorBandwidth.

I'm missing a bit of rationale for this change. There is an interplay between having a wider VF or having a larger interleave factor. For 128bit vectors, an add <4 x i64> %x, %y will be legalized into two adds. Conceptually this is similar to vectorizing with <2 x i64> and having an interleave-factor of 2. I can imagine that interleaving in the loop-vectorizer leads to better code, because it avoids issues around type legalisation and may provide more opportunities for other IR passes to optimize the IR or move things around. If we always choose a wider VF I wonder if that may lead to poorer codegen because of type-legalization.

Is there a specific example where it's clearly an improvement to have a wider VF? And would choosing a larger unroll-factor help those cases?

Feb 10 2022, 2:37 AM · Restricted Project, Restricted Project
jaykang10 added a comment to D118979: [AArch64] Set maximum VF with shouldMaximizeVectorBandwidth.

The performance improvement from benchmarks is as below.

Are the numbers SPEC scores and negative Improvement values regressions? If that's the case, it seems like for the benchmarks you shared the impact is negative overall?

Feb 10 2022, 2:02 AM · Restricted Project, Restricted Project

Feb 9 2022

jaykang10 updated the diff for D118979: [AArch64] Set maximum VF with shouldMaximizeVectorBandwidth.

@paulwalker-arm As you can see, there are regression tests which are failed to generate scalable vector type with the shouldMaximizeVectorBandwidth.
Previously, I saw these regression tests were failed because of the cost so I returned false with SVE in shouldMaximizeVectorBandwidth...
I am not sure it is acceptable that LV selects VF 16 instead of scalable vector type on SVE...
If it is not acceptable, we could need to tune the cost on SVE and I would like to suggest to use shouldMaximizeVectorBandwidth for only neon until finishing the cost tune...

Feb 9 2022, 4:13 AM · Restricted Project, Restricted Project
jaykang10 added inline comments to D118979: [AArch64] Set maximum VF with shouldMaximizeVectorBandwidth.
Feb 9 2022, 2:35 AM · Restricted Project, Restricted Project

Feb 7 2022

jaykang10 added a comment to D118979: [AArch64] Set maximum VF with shouldMaximizeVectorBandwidth.

Hello. This option makes a lot of sense to me, for the way AArch64 lowers vectors with extensions in them. I have a downstream set of benchmarks which, whilst not amazing, do show changes in vectorization quite well. There are some great improvements, but some things are not looking so healthy. I will send you some details. We may need to work through improving some of them, either by fixing the costs or improving the codegen for larger than legal types.

Feb 7 2022, 3:16 AM · Restricted Project, Restricted Project

Feb 4 2022

jaykang10 requested review of D118979: [AArch64] Set maximum VF with shouldMaximizeVectorBandwidth.
Feb 4 2022, 2:20 AM · Restricted Project, Restricted Project

Dec 4 2021

jaykang10 planned changes to D111806: [LICM] Check the number of divergent paths from loop header to target block.
Dec 4 2021, 12:42 AM · Restricted Project

Dec 3 2021

jaykang10 added a comment to D115029: [AArch64] Try to fold lsl + ldr/str.
Dec 3 2021, 4:03 AM · Restricted Project
jaykang10 requested review of D115029: [AArch64] Try to fold lsl + ldr/str.
Dec 3 2021, 1:31 AM · Restricted Project

Nov 11 2021

jaykang10 planned changes to D113656: [LICM] Check a heuristic case to hoist load.

Ah... @a could contains the address of @a... Thanks for pointing out it. @fhahn @nikic

Nov 11 2021, 3:38 AM · Restricted Project
jaykang10 added a reviewer for D113656: [LICM] Check a heuristic case to hoist load: asbirlea.
Nov 11 2021, 2:02 AM · Restricted Project
jaykang10 requested review of D113656: [LICM] Check a heuristic case to hoist load.
Nov 11 2021, 2:01 AM · Restricted Project

Nov 5 2021

jaykang10 committed rGa7b1872593db: [AArch64] Fix a bug from a pattern for uaddv(uaddlp(x)) ==> uaddlv (authored by jaykang10).
[AArch64] Fix a bug from a pattern for uaddv(uaddlp(x)) ==> uaddlv
Nov 5 2021, 5:49 AM
jaykang10 closed D113263: [AArch64] Fix a bug from a pattern for uaddv(uaddlp(x)) ==> uaddlv.
Nov 5 2021, 5:48 AM · Restricted Project
jaykang10 updated the diff for D113263: [AArch64] Fix a bug from a pattern for uaddv(uaddlp(x)) ==> uaddlv.
Nov 5 2021, 5:03 AM · Restricted Project
jaykang10 added a comment to D113263: [AArch64] Fix a bug from a pattern for uaddv(uaddlp(x)) ==> uaddlv.

LGTM.
My only suggestion would be to drop the popcount test as the bug affects uaddlv4h_from_v8i8, but that's only a minor nit.

Nov 5 2021, 4:53 AM · Restricted Project
jaykang10 added a comment to D104236: [AArch64] Add a TableGen pattern to generate uaddlv from uaddlp and addv.

I have create a new review https://reviews.llvm.org/D113263.

Nov 5 2021, 4:39 AM · Restricted Project
jaykang10 requested review of D113263: [AArch64] Fix a bug from a pattern for uaddv(uaddlp(x)) ==> uaddlv.
Nov 5 2021, 4:38 AM · Restricted Project
jaykang10 added a comment to D104236: [AArch64] Add a TableGen pattern to generate uaddlv from uaddlp and addv.

[AArch64] Fix a bug from a pattern for uaddv(uaddlp(x)) ==> uaddlv

A pattern has selected wrong uaddlv MI. It should be as below.
uaddv(uaddlp(v8i8)) ==> uaddlv(v8i8)

It would be better as a new phabricator review, but the update looks good to me.

Nov 5 2021, 4:34 AM · Restricted Project
jaykang10 added a comment to D104236: [AArch64] Add a TableGen pattern to generate uaddlv from uaddlp and addv.

@LemonBoy If possible, can you check the last update of this patch solves the error from your side please?

Nov 5 2021, 4:23 AM · Restricted Project
jaykang10 updated the diff for D104236: [AArch64] Add a TableGen pattern to generate uaddlv from uaddlp and addv.

[AArch64] Fix a bug from a pattern for uaddv(uaddlp(x)) ==> uaddlv

Nov 5 2021, 4:20 AM · Restricted Project
jaykang10 added a comment to D104236: [AArch64] Add a TableGen pattern to generate uaddlv from uaddlp and addv.

@LemonBoy I am sorry for bug... It was my mistake... As @efriedma mentioned, it should be v8i18 rather than v4i16. Let me update it.

Nov 5 2021, 3:51 AM · Restricted Project

Nov 1 2021

jaykang10 updated subscribers of D111034: [AArch64] Optimize add/sub with immediate.

Sorry, my patch could cause the problem. I have pushed a bug patch https://reviews.llvm.org/D109963.
Please wait a couple of days to verify the bug patch and then rebase your patch.

I have rebased on your newest code and uploaded my patch, it works without any crashes on my localhost.

So if there is no concerns within several days, I will commit my new patch and wait for blames (if there will be). ^_^

Thanks for rebase and sorry for adding more codes to the pass.

If possible, please check the bootstrap build and check-all on AArch64 machine to minimize the possibility of errors from AArch64 crews side. If you already did it, I am +1. :)

I do not have an aarch64 machine, maybe I need to buy a raspberry-pi or macbook pro m1 for that. :)

Nov 1 2021, 4:10 AM · Restricted Project
jaykang10 added a comment to D111034: [AArch64] Optimize add/sub with immediate.

Sorry, my patch could cause the problem. I have pushed a bug patch https://reviews.llvm.org/D109963.
Please wait a couple of days to verify the bug patch and then rebase your patch.

I have rebased on your newest code and uploaded my patch, it works without any crashes on my localhost.

So if there is no concerns within several days, I will commit my new patch and wait for blames (if there will be). ^_^

Nov 1 2021, 3:14 AM · Restricted Project

Oct 25 2021

jaykang10 committed rGa50243625930: [AArch64] Remove redundant ORRWrs which is generated by zero-extend (authored by jaykang10).
[AArch64] Remove redundant ORRWrs which is generated by zero-extend
Oct 25 2021, 1:49 AM
jaykang10 closed D110841: [AArch64] Remove redundant ORRWrs which is generated by zero-extend.
Oct 25 2021, 1:49 AM · Restricted Project
jaykang10 added a comment to D110841: [AArch64] Remove redundant ORRWrs which is generated by zero-extend.

No error from bootstrap build and check-all on AArch64 machine.

Thanks. Knowing you were away for a few days I took the time to include this in some csmith testing which I happened to already be running for another reason. It didn't come up with any other problems related to this, which is a good sign (but doesn't mean something else might be wrong, just that csmith like code does OK). So LGTM.

Oct 25 2021, 1:09 AM · Restricted Project

Oct 20 2021

jaykang10 added a comment to D110841: [AArch64] Remove redundant ORRWrs which is generated by zero-extend.

Following comment of @dmgreen, updated patch.

  • Cleared kill flag of source operand of ORRWrs.
Oct 20 2021, 9:24 AM · Restricted Project
jaykang10 updated the diff for D110841: [AArch64] Remove redundant ORRWrs which is generated by zero-extend.

Following comment of @dmgreen, updated patch.

Oct 20 2021, 7:00 AM · Restricted Project
jaykang10 added a comment to D110841: [AArch64] Remove redundant ORRWrs which is generated by zero-extend.

Do we need to remove Kill flags from the uses of the register we are replacing? From something like this:

target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
target triple = "aarch64-none-none-eabi"

@e = dso_local local_unnamed_addr global i16 0, align 4
@b = dso_local local_unnamed_addr global i32 0, align 4
@d = dso_local local_unnamed_addr global i32 0, align 4
@c = dso_local local_unnamed_addr global i32* null, align 8
@a = dso_local local_unnamed_addr global i32 0, align 4

define i32 @i() {
entry:
  %0 = load i32, i32* @b, align 4
  %1 = trunc i32 %0 to i16
  %conv1 = and i16 %1, 255
  %2 = load i32, i32* @d, align 4
  %tobool.not = icmp eq i32 %2, 0
  br i1 %tobool.not, label %if.end, label %if.then

  if.then:                                          ; preds = %entry
  %conv2 = zext i16 %conv1 to i64
  %3 = inttoptr i64 %conv2 to i32*
  store i32* %3, i32** @c, align 8
  br label %if.end

  if.end:                                           ; preds = %if.then, %entry
  %4 = load i32, i32* @a, align 4
  %5 = trunc i32 %4 to i16
  %conv4 = or i16 %conv1, %5
  store i16 %conv4, i16* @e, align 4
  ret i32 0
}
Oct 20 2021, 6:32 AM · Restricted Project
jaykang10 updated the diff for D110841: [AArch64] Remove redundant ORRWrs which is generated by zero-extend.

Following comment of @dmgreen, updated patch.

Oct 20 2021, 5:22 AM · Restricted Project
jaykang10 added a comment to D110841: [AArch64] Remove redundant ORRWrs which is generated by zero-extend.

Thanks.

I assume you have checked through the AArch64 pseudo instructions (the ones before ABS_ZPmZ_B in build/lib/Target/AArch64/AArch64GenInstrInfo.inc) and they look OK? They will follow the same rules of producing zeroed upper bits for W register definitions.

Oct 20 2021, 4:22 AM · Restricted Project

Oct 19 2021

jaykang10 added a comment to D110841: [AArch64] Remove redundant ORRWrs which is generated by zero-extend.

I have checked bootstrap build and run check-all with the build. It looks OK.

Oct 19 2021, 7:02 AM · Restricted Project
jaykang10 updated the diff for D110841: [AArch64] Remove redundant ORRWrs which is generated by zero-extend.

Following comment of @dmgreen, updated patch.

Oct 19 2021, 2:29 AM · Restricted Project
jaykang10 added a comment to D110841: [AArch64] Remove redundant ORRWrs which is generated by zero-extend.

Do we need to add an extra TS bit, or can we just use GENERIC_OP_END?

As far as I understand the opcodes are always in the order: [TargetOpcodes, G_ opcodes, A64 Pseudos, A64 instructions]. Do we need to rule out A64 pseudos? If so can we check isPseudo().

Oct 19 2021, 12:48 AM · Restricted Project

Oct 18 2021

jaykang10 updated the diff for D110841: [AArch64] Remove redundant ORRWrs which is generated by zero-extend.

Following the comment of @efriedma, updated patch

Oct 18 2021, 9:13 AM · Restricted Project
jaykang10 added a comment to D111034: [AArch64] Optimize add/sub with immediate.

Sorry, my patch could cause the problem. I have pushed a bug patch https://reviews.llvm.org/D109963.
Please wait a couple of days to verify the bug patch and then rebase your patch.

Oct 18 2021, 1:02 AM · Restricted Project
jaykang10 committed rG3f0b178de21e: [AArch64] Fixed a bug on AArch64MIPeepholeOpt (authored by jaykang10).
[AArch64] Fixed a bug on AArch64MIPeepholeOpt
Oct 18 2021, 12:56 AM
jaykang10 added a comment to D110841: [AArch64] Remove redundant ORRWrs which is generated by zero-extend.

Sorry for Ping.

I was looking a bit last week, but didn't get very far into proving this is correct. Eli's suggestions sound good to me, and having a nice big comment explaining why it is valid would be good too.

Oct 18 2021, 12:39 AM · Restricted Project
jaykang10 added a comment to D109963: [AArch64] Split bitmask immediate of bitwise AND operation.

@jaykang10

@danielkiss If possible, can you check the compiler crash on your side with latest update of this patch please?

Done, Chrome built with the latest patch. Thanks!

Oct 18 2021, 12:31 AM · Restricted Project

Oct 16 2021

jaykang10 added a comment to D109963: [AArch64] Split bitmask immediate of bitwise AND operation.

The check-all with bootstrap build is ok on AArch64 machine. The reduced case is also passed.

Oct 16 2021, 4:08 AM · Restricted Project

Oct 15 2021

jaykang10 added a comment to D110841: [AArch64] Remove redundant ORRWrs which is generated by zero-extend.

I'm generally worried about allowing unknown instructions here.

Any AArch64 instruction that produces a 32-bit result zeros the high bits, yes. But some MachineInstr opcodes aren't really instructions in this sense. You've noted EXTRACT_SUBREG, PHI, and COPY specifically. Probably missing IMPLICIT_DEF. Not sure if there are other relevant instructions; any pseudo-instruction is potentially an issue, but auditing them for AArch64, the target-specific ones mostly don't produce GPR32.

In any case, I'd be happier if we had a bit to check for a "real" instruction, in TSFlags or something like that. I don't want to worry about modifying this in the future if, for example, we end up with a FREEZE MachineInstr.

Oct 15 2021, 3:19 PM · Restricted Project
jaykang10 updated the diff for D109963: [AArch64] Split bitmask immediate of bitwise AND operation.
Oct 15 2021, 3:10 PM · Restricted Project
jaykang10 updated the diff for D109963: [AArch64] Split bitmask immediate of bitwise AND operation.

Fixed a bug.

Oct 15 2021, 3:06 PM · Restricted Project
jaykang10 added inline comments to D109963: [AArch64] Split bitmask immediate of bitwise AND operation.
Oct 15 2021, 2:39 PM · Restricted Project
jaykang10 added a comment to D111806: [LICM] Check the number of divergent paths from loop header to target block.

JFYI, Roman is correct. LICM is a canonicalizing transform. This patch will never be accepted. If the hoisting was not profitable, machine-licm exists precisely to undo it in a cost based way.

As to your point about the profile driven piece, you are correct, that is inconsistent. That should never have landed, or been enabled. I'm digging through the history to figure out how that happened.

Oct 15 2021, 1:03 AM · Restricted Project

Oct 14 2021

jaykang10 added a comment to D110841: [AArch64] Remove redundant ORRWrs which is generated by zero-extend.

Sorry for Ping.

Oct 14 2021, 9:42 AM · Restricted Project
jaykang10 reclaimed D111806: [LICM] Check the number of divergent paths from loop header to target block.

In order to get more comments, I open this review again.

Oct 14 2021, 8:17 AM · Restricted Project
jaykang10 added a comment to D111806: [LICM] Check the number of divergent paths from loop header to target block.

I would like to mention one thing.

Oct 14 2021, 8:05 AM · Restricted Project
jaykang10 abandoned D111806: [LICM] Check the number of divergent paths from loop header to target block.

As it's stated in the pass description, it is a canonicalization transform, performed without a costmodel,
and it's expected to be undone in backend.

Oct 14 2021, 7:22 AM · Restricted Project
jaykang10 requested review of D111806: [LICM] Check the number of divergent paths from loop header to target block.
Oct 14 2021, 7:14 AM · Restricted Project

Oct 11 2021

jaykang10 updated the diff for D110841: [AArch64] Remove redundant ORRWrs which is generated by zero-extend.

Rebased

Oct 11 2021, 3:51 AM · Restricted Project

Oct 8 2021

jaykang10 added a reverting change for rGfc36fb4d23a5: Revert "Second Recommit "[AArch64] Split bitmask immediate of bitwise AND…: rG30caca39f401: Third Recommit "[AArch64] Split bitmask immediate of bitwise AND operation".
Oct 8 2021, 3:30 AM
jaykang10 committed rG30caca39f401: Third Recommit "[AArch64] Split bitmask immediate of bitwise AND operation" (authored by jaykang10).
Third Recommit "[AArch64] Split bitmask immediate of bitwise AND operation"
Oct 8 2021, 3:30 AM
jaykang10 added a comment to D109963: [AArch64] Split bitmask immediate of bitwise AND operation.

If there are still regressions with the update, please let me know.

Looks good on our side. I think you can reland now and see if anyone else still has failures with that.

Oct 8 2021, 3:11 AM · Restricted Project
jaykang10 added a comment to D109963: [AArch64] Split bitmask immediate of bitwise AND operation.

I am sorry for late. It took a time to find the problem and check the stage2 build again.

I was enable to reproduce the regressions with stage2 build. I have fixed them and I can see the regressions are gone.

@DavidSpickett @leonardchan @zatrazz If possible, can you check the regressions with latest update of this patch please?

If there are still regressions with the update, please let me know.

Oct 8 2021, 1:22 AM · Restricted Project
jaykang10 committed rG4c98070cce2a: [LoopBoundSplit] Handle the case in which exiting block is loop header (authored by jaykang10).
[LoopBoundSplit] Handle the case in which exiting block is loop header
Oct 8 2021, 1:15 AM
jaykang10 closed D110060: [LoopBoundSplit] Handle the case in which exiting block is loop header.
Oct 8 2021, 1:15 AM · Restricted Project

Oct 7 2021

jaykang10 added a comment to D109963: [AArch64] Split bitmask immediate of bitwise AND operation.

I am sorry for late. It took a time to find the problem and check the stage2 build again.

Oct 7 2021, 1:49 PM · Restricted Project
jaykang10 updated the diff for D109963: [AArch64] Split bitmask immediate of bitwise AND operation.

Fixed a bug

Oct 7 2021, 1:42 PM · Restricted Project

Oct 6 2021

jaykang10 added a comment to D109963: [AArch64] Split bitmask immediate of bitwise AND operation.

Sorry, let me try to reproduce the errors.

Oct 6 2021, 12:54 AM · Restricted Project

Oct 5 2021

jaykang10 added a comment to D109963: [AArch64] Split bitmask immediate of bitwise AND operation.

I've also reproduced it. Since check stage 1 passes my guess is that the clang built in stage 1 is emitting incorrect code.

Oct 5 2021, 7:36 AM · Restricted Project
jaykang10 added a comment to D110841: [AArch64] Remove redundant ORRWrs which is generated by zero-extend.

At instruction selection level, we check the operations which do not zero-out the high half of the 64-bit register using isDef32 function as below.

// Any instruction that defines a 32-bit result zeros out the high half of the
// register. Truncate can be lowered to EXTRACT_SUBREG. CopyFromReg may
// be copying from a truncate. But any other 32-bit operation will zero-extend
// up to 64 bits. AssertSext/AssertZext aren't saying anything about the upper
// 32 bits, they're probably just qualifying a CopyFromReg.
static inline bool isDef32(const SDNode &N) {
  unsigned Opc = N.getOpcode();
  return Opc != ISD::TRUNCATE && Opc != TargetOpcode::EXTRACT_SUBREG &&
         Opc != ISD::CopyFromReg && Opc != ISD::AssertSext &&
         Opc != ISD::AssertZext && Opc != ISD::AssertAlign &&
         Opc != ISD::FREEZE;
}

As you can see, the isDef32 checks fundamentally ISD::TRUNCATE within a basic block because the below pattern is matched with the ISD::TRUNCATE and EXTRACT_SUBREG does not guarantee the high 32 bits are zero.

// To truncate, we can simply extract from a subregister.
def : Pat<(i32 (trunc GPR64sp:$src)),
          (i32 (EXTRACT_SUBREG GPR64sp:$src, sub_32))>;

This patch checks the ORRWrs has EXTRACT_SUBREG in different basic block as operand because the existing pattern with isDef32 resolves case in which the operand is in same basic block.

That explains things in terms of DAG-ISel, but there are other instruction selectors and different optimization between then and here. (Plus the isDef32 has had so many bugs it's difficult to trust!)

We know that all (?) instructions that generate a W register under AArch64 will zero the upper bits of the X register. We seems to say in this patch that certain EXTRACT_SUBREG are not valid, COPY and PHI are currently excluded. Is that really all we have to worry about? Do we know that the top bits are always 0 for all other grp32 sources?

Oct 5 2021, 6:30 AM · Restricted Project
jaykang10 added a comment to D109963: [AArch64] Split bitmask immediate of bitwise AND operation.

This has caused a regression on stage2 buildbot [1], more specifically the test FAIL: Clang::p4-0x.cpp (but I think the other others might be related).

[1] https://lab.llvm.org/buildbot/#/builders/185/builds/543

um... I checked the bootstrap build before pushing this patch. Let me check it again.

Oct 5 2021, 4:32 AM · Restricted Project
jaykang10 added a comment to D109963: [AArch64] Split bitmask immediate of bitwise AND operation.

This has caused a regression on stage2 buildbot [1], more specifically the test FAIL: Clang::p4-0x.cpp (but I think the other others might be related).

[1] https://lab.llvm.org/buildbot/#/builders/185/builds/543

Oct 5 2021, 4:17 AM · Restricted Project
jaykang10 added a comment to D111034: [AArch64] Optimize add/sub with immediate.

It would be good to wait for review from others.

Oct 5 2021, 3:47 AM · Restricted Project

Oct 4 2021

jaykang10 updated the diff for D110841: [AArch64] Remove redundant ORRWrs which is generated by zero-extend.

Following comments of @dmgreen, updated patch.

  • Checked WZR
  • Added MIR tests
Oct 4 2021, 8:38 AM · Restricted Project
jaykang10 added a comment to D110841: [AArch64] Remove redundant ORRWrs which is generated by zero-extend.

Can you explain in more details what makes this valid? Does it depend on the top bits already being zero? What verifies that?

It might be useful to add mir tests too, to test specific cases that should/shouldn't be removed.

Oct 4 2021, 7:47 AM · Restricted Project
jaykang10 added inline comments to D110060: [LoopBoundSplit] Handle the case in which exiting block is loop header.
Oct 4 2021, 5:27 AM · Restricted Project
jaykang10 updated the diff for D110060: [LoopBoundSplit] Handle the case in which exiting block is loop header.

Following comments of @mkazantsev, updated patch

  • Split the patch into two
  • Find PHI with exiting condition from pre-loop correctly
Oct 4 2021, 5:19 AM · Restricted Project
jaykang10 added inline comments to D110060: [LoopBoundSplit] Handle the case in which exiting block is loop header.
Oct 4 2021, 3:07 AM · Restricted Project
jaykang10 committed rG4288b6520a8e: [LoopBoundSplit] Use SCEVAddRecExpr instead of SCEV for AddRecSCEV (NFC) (authored by jaykang10).
[LoopBoundSplit] Use SCEVAddRecExpr instead of SCEV for AddRecSCEV (NFC)
Oct 4 2021, 2:18 AM
jaykang10 added inline comments to D110060: [LoopBoundSplit] Handle the case in which exiting block is loop header.
Oct 4 2021, 2:15 AM · Restricted Project
jaykang10 added a comment to D109682: [LoopBoundSplit] Check the condition of the first iteration in pre-loop using isLoopEntryGuardedByCond.

https://reviews.llvm.org/D110060 The patch is split into two patches. As one of them, this patch is merged.

Oct 4 2021, 2:15 AM · Restricted Project

Oct 1 2021

jaykang10 updated the diff for D110841: [AArch64] Remove redundant ORRWrs which is generated by zero-extend.

Fixed a bug

  • replaceRegWith changes MI's defintion register. Keep it for SSA form until deleting MI.
Oct 1 2021, 3:47 AM · Restricted Project