User Details
- User Since
- Jul 15 2015, 3:50 PM
Tue, Jun 28
Mon, Jun 27
This optimization doesn't depend on D125588; it was just the test case. In the original test case several operands come directly from COPY instructions, and the wrong latency is computed for them. In the current version I added several instructions so that no operands of the SUB/ADD come from COPY instructions.
Fri, Jun 24
I think there are more similar useful patterns, but this is the only pattern I encountered. We can add other patterns when we have tests for them.
Add MIR tests.
Wed, Jun 22
The cost model improvement has been split out; this patch has been updated to contain only the new pattern.
Tue, Jun 21
It sounds to me like the MADD is decoded into two uops, a mul and an add. The mul can then start executing as soon as its two multiplier operands are available. Is this correct?
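If that decomposition is right, the accumulator operand is only needed when the add uop issues, so the mul latency can hide behind a late accumulator. Here is a minimal toy model of that effect in Python; the latency numbers are illustrative assumptions, not real Neoverse N1 figures.

```python
# Toy model of a MADD decoded into a mul uop followed by an add uop.
# Latencies below are assumed for illustration only.
MUL_LAT = 3  # assumed latency of the mul uop
ADD_LAT = 1  # assumed latency of the add uop

def madd_ready(a_ready, b_ready, acc_ready):
    """Cycle at which madd(a*b + acc) produces its result.

    The mul issues once both multiplier operands are ready; the add
    issues once the mul result and the accumulator are both ready.
    """
    mul_done = max(a_ready, b_ready) + MUL_LAT
    return max(mul_done, acc_ready) + ADD_LAT

print(madd_ready(0, 0, 0))  # 4: all operands ready at cycle 0
print(madd_ready(0, 0, 5))  # 6: late accumulator; the mul overlaps the wait
```

The second case shows why a chain of MADDs accumulating into one register behaves better than naive "mul latency + add latency" accounting would suggest.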
Fri, Jun 17
Mon, Jun 13
ping
Mon, Jun 6
The test was run on a Neoverse N1 machine. I didn't specify any -mcpu, so I guess it defaults to -mcpu=generic.
May 31 2022
I tried SPEC2006 on an AArch64 machine, but due to some unknown problem I couldn't run 464.h264ref correctly; even GCC-compiled code generates the same output.
May 23 2022
I tested SPEC2006 int on my Skylake desktop.
May 16 2022
@rampitec, any follow up on this patch?
May 13 2022
May 12 2022
I did the MachineCombiner cost model improvement because it impacts my AArch64 test cases. But as you said, it's reasonable to send it as a separate patch; I will do that.
May 11 2022
May 10 2022
Any comments from other reviewers?
May 6 2022
ping
May 2 2022
https://github.com/llvm/llvm-project/issues/55237 is filed.
We can move discussion to there.
Apr 28 2022
I added another GVNHoist pass after the loop optimizations; the two "or" instructions are then replaced by one "or" instruction hoisted into PreHeader2, and finally the instructions in LoopExit2 get vectorized by SLPVectorizer.
Apr 27 2022
Apr 26 2022
If I specify -rotation-max-header-size=2, loop rotation is disabled for Loop2, and the second LICM hoists the "or" instruction into PreHeader2, which dominates the following "or" instruction. GVN and SLPVectorizer then work as before, and I get vectorized instructions for LoopBody2.
Apr 14 2022
Apr 13 2022
Apr 12 2022
Now I understand how this patch caused the missing vectorization in our code. In my previous comment I analyzed how the different GVN results caused different SLPVectorizer behavior; this time let's focus on how this patch generates the different GVN results.
Apr 4 2022
I'm thinking of the following change, so we can remove the PHI instruction without introducing any extra instructions on any path.
BB1:
  br %cond, label %BB2, label %BB3
BB2:
  ...
  %161 = or i64 %24, 1
  ...
  br label BBX
BB3:
  ...
  %179 = or i64 %24, 1
  ...
  br label BBX
BBX:  // our interesting bb
  %245 = phi i64 [ %161, %BB2 ], [ %179, %BB3 ]
  ...
  %296 = mul nsw i64 %245, %5, !dbg !2155
  %297 = getelementptr inbounds float, float* %4, i64 %296, !dbg !2156
  ...

==>

BB1:
  ...
  %245 = or i64 %24, 1  // New instruction is inserted here
  br %cond, label %BB2, label %BB3
BB2:
  ...
  br label BBX
BB3:
  ...  // Use of %245.
  br label BBX
BBX:  // our interesting bb
  // PHI instruction is deleted.
  ...
  %296 = mul nsw i64 %245, %5, !dbg !2155
  %297 = getelementptr inbounds float, float* %4, i64 %296, !dbg !2156
  ...

That kinda sounds like something for GVNHoist, which I think is currently still disabled due to some miscompilations?
There are two ways to view this:
- If all of the incoming values of the PHI are fully identical instructions with fully identical operands, then we don't need to PHI together their operands, and we can replace the PHI with said instruction.
- The one-user check is there to ensure that the instruction count does not increase, so in principle, if we need to PHI together the operands, as many of the instructions must have a single user as there are PHIs we need to create.
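The first condition can be sketched in a few lines. This is a minimal Python illustration using a toy Instr type (hypothetical, not LLVM's actual API): the PHI is replaceable when every incoming value is the same opcode applied to the same operands.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Instr:
    """Toy stand-in for an instruction: an opcode plus an operand tuple."""
    opcode: str
    operands: tuple

def can_replace_phi(incoming):
    """True if all incoming values are fully identical instructions with
    fully identical operands, so the PHI can be replaced by one of them."""
    first = incoming[0]
    return all(iv.opcode == first.opcode and iv.operands == first.operands
               for iv in incoming)

# Models %161 = or i64 %24, 1 (in BB2) and %179 = or i64 %24, 1 (in BB3):
a = Instr("or", ("%24", 1))
b = Instr("or", ("%24", 1))
print(can_replace_phi([a, b]))                         # True: hoist one 'or'
print(can_replace_phi([a, Instr("or", ("%24", 2))]))   # False: operands differ
```

The second bullet (the one-user check) would be an additional guard on top of this, only needed when the operands themselves must be PHI'd together.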
Apr 1 2022
I still failed to reproduce it in plain mode.
Mar 25 2022
Could you also share the input source file to reproduce the difference? I think this will be needed to investigate the difference.
Before this patch I had the following code snippet:
      │ 80:   mov    -0x58(%rsp),%rdx
      │       mov    -0x60(%rsp),%r9
      │       mov    -0x40(%rsp),%r15
 0.63 │ 8f:   mulps  %xmm8,%xmm9
 0.44 │       movups (%rdx,%r9,4),%xmm1
 0.36 │       addps  %xmm9,%xmm1
 0.25 │       movups %xmm1,(%rdx,%r9,4)
 0.75 │       mulps  %xmm8,%xmm3
 0.41 │       movups (%rdx,%r10,4),%xmm1
 0.86 │       addps  %xmm3,%xmm1
 0.70 │       movups %xmm1,(%rdx,%r10,4)
 0.43 │       add    $0x8,%r9
 0.28 │       add    -0x48(%rsp),%rbx
 0.07 │       cmp    %r15,%r9
 0.14 │     ↓ jge    3e5
       ...
After this patch, it is changed to
 0.33 │ 80:   mov    %r12,%rdx
 0.32 │       or     $0x1,%rdx
 0.32 │       mov    %r12,%rdi
 0.39 │       or     $0x2,%rdi
 0.38 │       mov    %r12,%rcx
 0.30 │       or     $0x3,%rcx
 0.37 │       mov    %r12,%rbp
 0.39 │       or     $0x4,%rbp
 0.31 │       mov    %r12,%r10
 0.27 │       or     $0x5,%r10
 0.31 │       mov    %r12,%r9
 0.37 │       or     $0x6,%r9
 0.29 │       mov    %r12,%r8
 0.35 │       or     $0x7,%r8
 0.34 │ b1:   mulss  %xmm8,%xmm13
 0.39 │       addss  (%r11,%r12,4),%xmm13
 0.33 │       movss  %xmm13,(%r11,%r12,4)
 0.33 │       mulss  %xmm8,%xmm12
 0.41 │       addss  (%r11,%rdx,4),%xmm12
 0.38 │       movss  %xmm12,(%r11,%rdx,4)
 0.31 │       mulss  %xmm8,%xmm3
 0.39 │       addss  (%r11,%rdi,4),%xmm3
 0.35 │       movss  %xmm3,(%r11,%rdi,4)
 0.41 │       mulss  %xmm8,%xmm4
 0.31 │       addss  (%r11,%rcx,4),%xmm4
 0.31 │       movss  %xmm4,(%r11,%rcx,4)
 0.34 │       mulss  %xmm8,%xmm5
 0.41 │       addss  (%r11,%rbp,4),%xmm5
 0.34 │       movss  %xmm5,(%r11,%rbp,4)
 0.32 │       mulss  %xmm8,%xmm6
 0.34 │       addss  (%r11,%r10,4),%xmm6
 0.38 │       movss  %xmm6,(%r11,%r10,4)
 0.35 │       mulss  %xmm8,%xmm7
 0.43 │       addss  (%r11,%r9,4),%xmm7
 0.38 │       movss  %xmm7,(%r11,%r9,4)
 0.41 │       mulss  %xmm8,%xmm1
 0.36 │       addss  (%r11,%r8,4),%xmm1
 0.32 │       movss  %xmm1,(%r11,%r8,4)
 0.39 │       add    $0x8,%r12
 0.39 │       add    -0x18(%rsp),%rbx
 0.02 │       cmp    -0x60(%rsp),%r12
 0.31 │     ↓ jge    510
       ...
@wsmoses, this patch caused an Eigen regression in our code; it seems some code that was previously vectorized is now scalarized.
Could you revert it?
Mar 22 2022
ping
Mar 14 2022
ping
Mar 7 2022
ping
Feb 24 2022
Reformat.
Feb 16 2022
Feb 15 2022
Feb 4 2022
Dec 29 2021
ping
Dec 21 2021
Dec 20 2021
Dec 9 2021
I'm trying to understand the problem in this pass.
It looks like both @thopre's and @lebedev.ri's problems are loop related. Are there any other regressions that are not loop related?
Dec 8 2021
Driven by the compile-time feedback, another thing for me to consider (probably after a preliminary convergence on whether the pass should add analysis info or do the transform) is to prune the existing solution so it's faster on the benchmark.
Dec 6 2021
Finally, a complaint has come in.
Nov 29 2021
Nov 22 2021
ping
Nov 15 2021
Nov 12 2021
ping
Nov 9 2021
Rebase.
Nov 4 2021
Oct 28 2021
Oct 26 2021
Thanks to @pengfei's question, I did find another case where DistanceMap is not maintained correctly when unfolding a memory operand.
Oct 25 2021
Oct 21 2021
Oct 20 2021
Oct 19 2021
Add a MIR test.
Oct 18 2021
I encountered this problem while working on a different optimization. It causes a real problem in my work, but I can't find a test case for the current code base.
Oct 14 2021
Oct 11 2021
Oct 8 2021
ping
Oct 1 2021
I ran SPEC2006 on a Skylake desktop; the result is 38.2 vs. 38.3, so no difference.
I also checked the overall impact on the impacted test cases.
Sep 29 2021
rebase.
Any other comments?
Sep 27 2021
Sep 23 2021
Sep 22 2021
Sep 21 2021
Sep 20 2021
Sep 17 2021
Rebase.
Sep 10 2021
ping
Sep 3 2021
Sep 2 2021
Sep 1 2021
ping
Aug 25 2021
Jul 28 2021
Jul 19 2021
Jul 16 2021
Jul 15 2021
I successfully bootstrapped a stage2 clang, and stage2 check-all passes without regressions.
Jul 8 2021
ping
Jul 1 2021
ping.
Thanks for the fix!
The changes to CreateInsertElement look good to me, but I'm not confident reviewing the other parts of the patch.
Jun 30 2021
Alive does not agree with your analysis. Also, it is not SLP that merges undefs into poison; Builder.CreateInsertElement does this magic. Also, https://llvm.org/docs/LangRef.html#poison-values says that phis do not depend on their operands, so the phi is not poisoned.
Thanks for the pointer. From the LLVM IR dump, the poison value was generated after the SLP pass; we may dig into it further.
The problem is exactly in the merging of the two undef values into poison.
Jun 29 2021
Compile the following test case with