Page MenuHomePhabricator

[X86] Individually simplify both operands of PMULDQ/PMULUDQ using the other entry point of SimplifyDemandedBits that allows the one use check of the root node to be suppressed.
AbandonedPublic

Authored by craig.topper on Dec 22 2018, 9:57 PM.

Details

Summary

This entry point takes special care to avoid completely replacing the root node if it has multiple uses. Instead it will just use UpdateNodeOperands to only update the PMULDQ/PMULUDQ node. Not sure if that can create additional instructions in some cases.

This allows masking and sign_extend_inreg opcodes to be removed from the input of these operations.

Fixes PR40142.

Diff Detail

Event Timeline

craig.topper created this revision.Dec 22 2018, 9:57 PM

Its really annoying that we have to do this but I can't see any other way.

lib/Target/X86/X86ISelLowering.cpp
41275

Add a comment explaining what we're we having to do here.

41278

Is it worth moving this code inside SimplifyDemandedBitsForTargetNode?

Wrap this in a for loop to reduce code?

for {int OpIdx  = 0; OpIdx != 2; ++OpIdx) {
    TargetLowering::TargetLoweringOpt TLO(DAG, !DCI.isBeforeLegalize(),
                                          !DCI.isBeforeLegalizeOps());
    if (TLI.SimplifyDemandedBits(N, OpIdx, DemandedMask, DCI, TLO)) {
      DCI.AddToWorklist(N);
      return SDValue(N, 0);
    }
}
craig.topper marked an inline comment as done.

Use for loop. Add comment.

I am a little suspicious about these changes to the test results.

  1. If these removals are all of redundant instructions that were added previously, I worry about how well people are checking the test result output. A large chunk of the code we are removing is from this: https://reviews.llvm.org/rL347181. As you can see, there are major changes to the test results, which, unless I am mistaken, passed without questioning whether these changes are a regression or an improvement. If nobody checks test results for regressions, why have tests at all?
  2. I also worry that they might not return the correct result. I am especially concerned about the number of removed i64 vector multiplies.
  3. Otherwise, if these all return the correct result, nice job.

I am a little suspicious about these changes to the test results.

  1. If these removals are all of redundant instructions that were added previously, I worry about how well people are checking the test result output. A large chunk of the code we are removing is from this: https://reviews.llvm.org/rL347181. As you can see, there are major changes to the test results, which, unless I am mistaken, passed without questioning whether these changes are a regression or an improvement. If nobody checks test results for regressions, why have tests at all?

I think you'e referring to pmul.ll? rL347181 just replaced psrad $31 with pcmpgt. It changed the register allocation quite a bit which made the diff look large. But it didn't change the number of instructions other than copies and xors.

  1. I also worry that they might not return the correct result. I am especially concerned about the number of removed i64 vector multiplies.

I'll take another look.

  1. Otherwise, if these all return the correct result, nice job.
lib/Target/X86/X86ISelLowering.cpp
41278

I don't think that will work given the current implementation AssumeSingleUse. The code that does the replacement assumes that a node with multiple uses is the root node of the simplification.

RKSimon accepted this revision.Dec 23 2018, 2:22 PM

LGTM - thanks

This revision is now accepted and ready to land.Dec 23 2018, 2:22 PM

Looking at the vector-reduce-mul.ll changes, I don't think this patch is valid. If the first node has multiple uses we can't propagate the input demanded bits to the next operation down.

craig.topper planned changes to this revision.Dec 23 2018, 3:22 PM

Looking at the vector-reduce-mul.ll changes, I don't think this patch is valid. If the first node has multiple uses we can't propagate the input demanded bits to the next operation down.

Yeah, I checked it. It's definitely broken.

Although, @RKSimon, that is a perfect example of what I was saying, not checking tests. You almost accepted a patch that broke multivector 64-bit multiplication completely.

I transpiled the assembly from the test results and ran a diff, and sure enough, broken.

    .text
    .globl test_v4i64
// long long test_v4i64(i64x2 xmm0, i64x2 xmm1)
test_v4i64:
#ifndef NEW
    movdqa %xmm0, %xmm2
    psrlq $32, %xmm2
    pmuludq %xmm1, %xmm2
    movdqa %xmm1, %xmm3
    psrlq $32, %xmm3
    pmuludq %xmm0, %xmm3
    paddq %xmm2, %xmm3
    psllq $32, %xmm3
#endif
    pmuludq %xmm1, %xmm0
#ifndef NEW
    paddq %xmm3, %xmm0
#endif
    pshufd $78, %xmm0, %xmm1 # xmm1 = xmm0[2,3,0,1]
    movdqa %xmm0, %xmm2
    psrlq $32, %xmm2
    pmuludq %xmm1, %xmm2
    movdqa %xmm0, %xmm3
    psrldq $12, %xmm3
    pmuludq %xmm0, %xmm3
    paddq %xmm2, %xmm3
    psllq $32, %xmm3
    pmuludq %xmm1, %xmm0
    paddq %xmm3, %xmm0
    movq %xmm0, %rax
    retq
#include <stdio.h>

typedef long long i64x2 __attribute__((vector_size(16)));
extern long long test_v4i64(i64x2 val1, i64x2 val2); // GCC passes a pointer if I emulate an i64x4

int main(void) {
    i64x2 test = { 1234567812345678LL, 2345678923456789LL },
          test2 = { 1314151617181910LL, 9694959695969798LL };
     printf ("%lld\n", test_v4i64(test, test2));
}

Compiled with GCC to avoid any LLVM-specific issues.

-UNEW: -7524135448842347520
-DNEW: 1393403030700022784

The really concerning thing is that this patch does the exact same thing that's done by simplifyI24 in AMDGPUISelLowering.cpp. I assume Simon like myself assumed that established infrastructure like that was doing the right thing and didn't question it too much.

The really concerning thing is that this patch does the exact same thing that's done by simplifyI24 in AMDGPUISelLowering.cpp. I assume Simon like myself assumed that established infrastructure like that was doing the right thing and didn't question it too much.

Ok. Well, unfortunately, that too is probably broken.

I would test it myself, but I only have integrated on a Sandy Bridge laptop.

The good thing is that we are catching it here. It could've been a lot worse. If this got into x86 world, it would go from a visual glitch to someone getting a bill for $4,294,967,295 because of an arithmetic bug in the compiler.

Sorry if I came off sounding like a jerk, by the way.

I understand that there are a lot of patches that come in, and checking the thousands of tests is time consuming, and I don't expect every single test to be read line for line.

I was just a little shocked when I saw this patch, with hundreds of lines removed from the test results, pass review without question, even after I had already expressed my concerns with it.

I'm sorry about that, I focussed on checking the earlier (and by looks of it easier) test cases - SimplifyDemandedBits patches tend to cause a lot of changes with a lot of interesting cases where it can throw away a lot of code. So I was expecting some changes like that and didn't look at them more critically.

The splitting version of SimplifyDemandedBits now scares me - I think the issue is that the TLO.Old.hasOneUse() code doesn't take into account that it might have gone through several other ops that had multiple uses but I haven't checked this yet.

The AssumeSingleUse code will only allow the root to have multiple uses. The issue is that if the root has multiple uses, the most you should be allowed to do is decide if the root node is necessary for the instruction that is using it and replace it with one of it operands in this use. No new nodes should be created. You cannot recurse into SimplifyDemandedBits any further. You can only use computeKnownBits to make the decision.

This is analogous to the code in InstCombines’s SimplifyMultipleUseDemandedBits.

craig.topper abandoned this revision.Dec 24 2018, 6:58 PM

Abandoning in favor of the implementation from rL350059