@jonpa Without the change to knownbits.ll, the TM instruction reappears, contrary to the comment. Is this OK?
Details
Diff Detail
- Repository
- rL LLVM
Event Timeline
test/CodeGen/X86/known-bits-vector.ll | ||
---|---|---|
173–177 | We lose the ability to fold this entire computation into a constant. After the AND+OR sequence, elements 2 and 3 of %xmm0 are known at compile time (i.e. both have the value 65535). To be fair, we could even shrink the constant pool entry by using a vbroadcastss instead of a vmovaps (on AVX). |
test/CodeGen/X86/known-bits-vector.ll | ||
---|---|---|
173–177 | The problem here is that SimplifyDemandedElts runs first and updates the and/or to: `%1 = and <4 x i32> %a0, <i32 undef, i32 undef, i32 255, i32 4085>` and `%2 = or <4 x i32> %1, <i32 65535, i32 65535, i32 65535, i32 65535>` (broadcasts are preserved). SimplifyDemandedBits then runs without knowing which vector elements are needed, so it can't fold to `uitofp <4 x i32> <i32 65535, i32 65535, i32 65535, i32 65535> to <4 x float>` for constant folding as it used to. I may be able to constant fold more aggressively in SimplifyDemandedElts, but failing that, the best option going forward for this kind of regression would be to merge SimplifyDemandedBits and SimplifyDemandedElts into a single pass, matching what ComputeNumBits does with scalars/vectors. Also, the original purpose of this test wasn't to constant fold but to recognise that the uitofp could be simplified to sitofp (so x86 could use cvtdq2ps). |
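The ordering problem described above can be illustrated with a toy known-bits model. This is only a sketch, not LLVM's implementation: the `Known` class, the helper names, and the treatment of an undef lane as "nothing known" are assumptions made for illustration. It shows that an analysis that intersects over all four lanes cannot prove the value is constant, while one restricted to the demanded lanes 2 and 3 can.

```python
from dataclasses import dataclass

MASK32 = 0xFFFFFFFF


@dataclass(frozen=True)
class Known:
    """Toy known-bits record for one 32-bit lane (hypothetical helper)."""
    zero: int  # bits known to be 0
    one: int   # bits known to be 1

    def is_constant(self):
        return (self.zero | self.one) == MASK32


UNKNOWN = Known(0, 0)


def known_const(c):
    # Assumption: None models an undef lane, about which nothing is known.
    if c is None:
        return UNKNOWN
    return Known(~c & MASK32, c & MASK32)


def known_and(a, b):
    return Known(a.zero | b.zero, a.one & b.one)


def known_or(a, b):
    return Known(a.zero & b.zero, a.one | b.one)


def intersect(lanes):
    # Keep only the bits that are known in every lane considered.
    zero, one = MASK32, MASK32
    for k in lanes:
        zero &= k.zero
        one &= k.one
    return Known(zero, one)


# Masks as they look after SimplifyDemandedElts has rewritten lanes 0 and 1.
and_mask = [None, None, 255, 4085]
or_mask = [65535, 65535, 65535, 65535]

# %a0 is an unknown input, so every lane starts with nothing known.
per_lane = [
    known_or(known_and(UNKNOWN, known_const(m)), known_const(o))
    for m, o in zip(and_mask, or_mask)
]


def known_for(demanded):
    return intersect(per_lane[i] for i in demanded)


all_lanes = known_for([0, 1, 2, 3])  # lane-agnostic view
upper = known_for([2, 3])            # only the demanded lanes

print(all_lanes.is_constant(), upper.is_constant(), upper.one)
```

In this model the demanded lanes also have a known-zero sign bit, which is the property that lets uitofp be treated as sitofp.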
test/CodeGen/X86/known-bits-vector.ll | ||
---|---|---|
173–177 | If fixing this regression is not simple, then you can raise a bug for it and work on it later. What do you think? | |
test/CodeGen/X86/packss.ll | ||
---|---|---|
259–263 | Unrelated to this patch. This may be hard to catch... On AVX2 and AVX, we could probably simplify it to this (didn't verify that the shuffle mask is correct): `vpslld $31, %ymm0, %ymm0`; `vpsrad $31, %ymm0, %ymm0`; `vpshufd $1, %ymm0, %ymm0`; `vextractf128 $1, %ymm0, %xmm1`; `vpackssdw %xmm1, %xmm0, %xmm0`. That would require quite a lot of knowledge about both demanded bits and demanded elts. It also requires that we sink the bitcast into the shift operands and shuffle the elements afterwards. This may be worth investigating in the future. |
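As a rough illustration of why the shift pair in the suggested sequence needs only one demanded bit per lane, here is a small scalar model. It is a sketch under stated assumptions: the function names are hypothetical, each function models a single lane rather than a whole register, and it does not reproduce the test in packss.ll.

```python
def to_signed(v, bits):
    """Interpret the low `bits` bits of v as a two's-complement integer."""
    v &= (1 << bits) - 1
    return v - (1 << bits) if v >> (bits - 1) else v


def shl_then_ashr_31(x):
    # Models vpslld $31 followed by vpsrad $31 on one 32-bit lane:
    # bit 0 is moved to the sign position, then smeared across the lane.
    shifted = (x << 31) & 0xFFFFFFFF
    return to_signed(shifted, 32) >> 31  # Python's >> on ints is arithmetic


def packss_i32_to_i16(v):
    # Models the signed saturation vpackssdw applies to each 32-bit lane.
    return max(-32768, min(32767, v))


samples = [0, 1, 4, 5, 0x7FFFFFFF, 0x80000000, 0xFFFFFFFE, 0xFFFFFFFF]
results = [packss_i32_to_i16(shl_then_ashr_31(x)) for x in samples]
print(results)
```

Every result is either 0 or -1 and depends only on bit 0 of the input, so the signed saturation in the pack step never changes the value.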
test/CodeGen/X86/packss.ll | ||
---|---|---|
259–263 | SimplifyDemandedVectorElts doesn't handle shifts yet. It's on the list, but as you can see, just OR/XOR support causes a lot of diffs! |
test/CodeGen/X86/packss.ll | ||
---|---|---|
259–263 | Yeah, I noticed :-). Anyway, I trust your judgment on this. |
@jonpa Are you OK with the SystemZ knownbits.ll change, please?
test/CodeGen/X86/known-bits-vector.ll | ||
---|---|---|
173–177 | I've raised https://bugs.llvm.org/show_bug.cgi?id=40000 which should cover this kind of issue. | |
test/CodeGen/X86/packss.ll | ||
---|---|---|
259–263 | Again, https://bugs.llvm.org/show_bug.cgi?id=40000 might help here (only demand the sign bit from some elements). |
We lose the ability to fold this entire computation into a constant.
After the AND+OR sequence, elements 2 and 3 of %xmm0 are known at compile time (i.e. both have the value 65535).
The vector permute can therefore be folded away, and we can perform the int2fp conversion at compile time, effectively folding the entire computation into a load from the constant pool.
To be fair, we could even shrink the constant pool entry by using a vbroadcastss instead of a vmovaps (on AVX).
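The fold described in this comment can be checked with a small evaluator. This is a sketch with assumptions: the shuffle mask `[2, 2, 3, 3]` and the all-ones AND mask in lanes 0 and 1 are guesses at the shape of the test, not quotes from it, and `evaluate` is a hypothetical helper that simply runs the AND, OR, permute and int-to-float steps on concrete inputs.

```python
import struct

MASK32 = 0xFFFFFFFF

# Assumed operands: lanes 0 and 1 are never read by the shuffle,
# so their AND mask is a stand-in value.
AND_MASK = [MASK32, MASK32, 255, 4085]
OR_MASK = [65535, 65535, 65535, 65535]
SHUFFLE = [2, 2, 3, 3]  # assumed permute that reads only lanes 2 and 3


def evaluate(a0):
    """Run AND, OR, permute and unsigned int-to-float on one input vector."""
    lanes = [((x & m) | o) & MASK32 for x, m, o in zip(a0, AND_MASK, OR_MASK)]
    permuted = [lanes[i] for i in SHUFFLE]
    return [float(v) for v in permuted]


inputs = [
    [0, 0, 0, 0],
    [MASK32, MASK32, MASK32, MASK32],
    [0x12345678, 0x9ABCDEF0, 0x0F0F0F0F, 0xF0F0F0F0],
]
outputs = [evaluate(a0) for a0 in inputs]

# Every lane holds the same value, so a single 4-byte scalar is enough
# for a broadcast, versus 16 bytes for a full 4 x float vector constant.
scalar_entry = struct.pack("<f", 65535.0)
vector_entry = struct.pack("<4f", *outputs[0])
print(outputs[0], len(scalar_entry), len(vector_entry))
```

Because the result does not depend on the input at all, the whole sequence can be replaced by a constant, and the uniform lanes are what make the broadcast form possible.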