This is an archive of the discontinued LLVM Phabricator instance.

[TargetLowering] Add ISD::OR + ISD::XOR handling to SimplifyDemandedVectorElts
ClosedPublic

Authored by RKSimon on Dec 12 2018, 7:08 AM.

Diff Detail

Repository
rL LLVM

Event Timeline

RKSimon created this revision.Dec 12 2018, 7:08 AM
andreadb added inline comments.Dec 12 2018, 10:03 AM
test/CodeGen/X86/known-bits-vector.ll
173–177

We lose the ability to fold this entire computation into a constant.

After the AND+OR sequence, element 2 and 3 of %xmm0 are known at compile time (i.e. those are both value 65535).
The vector permute can therefore be folded away. And we can perform the int2fp conversion at compile time. Effectively folding away the entire computation into a load from constant pool.

To be fair, we could even shrink the constant pool entry by using a vbroadcastss instead of a vmovaps (on AVX).

RKSimon added inline comments.Dec 13 2018, 2:21 AM
test/CodeGen/X86/known-bits-vector.ll
173–177

The problem here is that SimplifyDemandedElts runs first and updates the and/or to:

%1 = and <4 x i32> %a0, <i32 undef, i32 undef, i32 255, i32 4085>
%2 = or <4 x i32> %1, <i32 65535, i32 65535, i32 65535, i32 65535> // broadcasts are preserved

Then SimplifyDemandedBits runs, doesn't know about which vector elements are needed, and so can't fold to:

uitofp <4 x i32><i32 65535, i32 65535, i32 65535, i32 65535> to <4 x float>

for constant folding as it used to do.

I may be able to attempt to constant fold more aggressively in SimplifyDemandedElts, but failing that the best option going forward for this kind of regression would be to merge SimplifyDemandedBits and SimplifyDemandedElts into a single pass, matching what ComputeNumBits does with scalar/vectors.

Also, the original purpose of this test wasn't to constant fold but to recognise that the uitofp could be simplified to sitofp (so x86 could use cvtdq2ps).

andreadb added inline comments.Dec 13 2018, 5:43 AM
test/CodeGen/X86/known-bits-vector.ll
173–177

If fixing this regression is not simple, then can raise a bug for it and work on it later. What do you think?

test/CodeGen/X86/packss.ll
259–263

Unrelated to this patch.

This may be hard to catch...

On AVX2 and AVX, we could probably simplify it to this (didn't verify that the shuffle mask is correct):

vpslld    $31, %ymm0, %ymm0
vpsrad  $31, %ymm0, %ymm0
vpshufd $1, %ymm0, %ymm0
vextractf128 $1, %ymm0, %xmm1
vpackssdw %xmm1, %xmm0, %xmm0

That would require quite a lot of knowledge about both demanded bits and demanded elts. Also, it requires that we sink the bitcast in the shift operands, and then we shuffle elements after.

This may be something worthy to investigate in future..

RKSimon marked an inline comment as done.Dec 13 2018, 6:30 AM
RKSimon added inline comments.
test/CodeGen/X86/packss.ll
259–263

SimplifyDemandedVectorElts doesn't handle shifts yet - its on the list, but as you can see just OR/XOR support causes a lot a diffs!

andreadb added inline comments.Dec 13 2018, 6:44 AM
test/CodeGen/X86/packss.ll
259–263

Yeah I noticed :-).

Anwyay, I trust your judgment on this.
My understanding is that your plan is to keep working on improving this area. If that's the case, then I am happy if you file a bug to track the progress on fixing that particular regression, so that this change can be committed.

RKSimon marked an inline comment as not done.Dec 14 2018, 12:25 AM

@jonpa Are you OK with the systemz knownbits.ll change please?

test/CodeGen/X86/known-bits-vector.ll
173–177

I've raised https://bugs.llvm.org/show_bug.cgi?id=40000 which should cover this kind of issue.

test/CodeGen/X86/packss.ll
259–263

Again https://bugs.llvm.org/show_bug.cgi?id=40000 might help here (only demand the signbit from some elements)

I leave the review of the SystemZ test to Uli.

This revision is now accepted and ready to land.Dec 14 2018, 3:20 AM

I leave the review of the SystemZ test to Uli.

Yes, this LGTM as well.

This revision was automatically updated to reflect the committed changes.