Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
llvm/test/CodeGen/Mips/cconv/vector.ll | ||
---|---|---|
695–756 | There are some regressions in this file but also some improvements. I haven't worked out what's going on yet. | |
llvm/test/CodeGen/Thumb2/mve-vecreduce-mla.ll | ||
276–283 | Regression here and in other cases that are now using muls instead of umull/umlal. | |
llvm/test/CodeGen/X86/vector-fshl-128.ll | ||
188–189 ↗ | (On Diff #295270) | Regression. Quite a few tests are now using pxor+punpckhdq instead of pshufd. I wonder if some kind of combine could spot this case and turn it back into pshufd. |
llvm/test/CodeGen/AMDGPU/sdiv64.ll | ||
---|---|---|
502 ↗ | (On Diff #296629) | Is this a regression? It looks like we're lost track that we only need 1 element |
llvm/test/CodeGen/ARM/func-argpassing-endian.ll | ||
106 ↗ | (On Diff #296629) | regression? we're no longer doing a single multiple load |
llvm/test/CodeGen/ARM/vdup.ll | ||
59–69 | poor duplicate/splat detection? |
llvm/test/CodeGen/X86/vector-fshr-128.ll | ||
---|---|---|
188–189 ↗ | (On Diff #296629) | Hi, during combination before legalizeDAG. there are the following Nodes: t63: i32 = extract_vector_elt t62, Constant:i32<2> t64: i32 = extract_vector_elt t62, Constant:i32<3> t66: v4i32 = BUILD_VECTOR t63, t64, undef:i32, undef:i32 SimplifyDemandedBits deduce that t64 is a zero, so t66 will transformed into a pxor and a punpckhqd instead of a single pshufd. Maybe in such a case where a buildvector have elts which are extracted from the same vector, you shouldn't SimplifyDemandedBits for extract_vector_elt. |
llvm/test/CodeGen/X86/buildvec-insertvec.ll | ||
---|---|---|
783 | The value being stored is not undef, it's either -2147483648 or poison, depending on the value of %a0. Anyway -simplifycfg would change the store into a trap + unreachable, but nothing in llc's codegen pipeline does that. | |
llvm/test/CodeGen/X86/vec_setcc.ll | ||
223 | What simplification are you expecting? |
llvm/test/CodeGen/X86/buildvec-insertvec.ll | ||
---|---|---|
783 | I've looked into this more carefully now. Hopefully this answer makes more sense. With my patch %4 = extractelement <4 x i32> zeroinitializer, i32 %2 is simplified into i32 0 based on the known bits of all elements. This happens before we simplify %2, because of the weird way that the DAG combiner runs top-down. When we visit %2 we simplify it to -2147483648. After that, if we visited the original %4 again, we would simplify it to undef; but with my patch we have already simplified %4 to 0 so it's too late. I'm not sure what to do about this -- other than change DAGCombine to run bottom-up ;-) |
llvm/test/CodeGen/X86/buildvec-insertvec.ll | ||
---|---|---|
783 | Do we need to tweak the out-of-range handling to ISD::EXTRACT_VECTOR_ELT indices? Either just for constant indices or we use computeKnownBits to work out if the minimum value always the exceeds the vector element count? |
Rebase. There are two failing tests that I have not updated yet:
Failed Tests (2):
LLVM :: CodeGen/AMDGPU/cttz_zero_undef.ll LLVM :: CodeGen/AMDGPU/scratch-simple.ll
Maybe in such a case where a buildvector have elts which are extracted from the same vector, you shouldn't SimplifyDemandedBits for extract_vector_elt.
Yes, I am coming round to this idea: don't simplify EXTRACT_VECTOR_ELT to a constant if it is used by BUILD_VECTOR, because of the risk of breaking things like shuffle patterns.
llvm/test/CodeGen/X86/nontemporal-3.ll | ||
---|---|---|
2–8 | Please can you add a common CHECK prefix: ; RUN: llc < %s -mtriple=x86_64-unknown-unknown | FileCheck %s --check-prefixes=CHECK,SSE ; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+sse4a | FileCheck %s --check-prefixes=CHECK,SSE ; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+sse4.1 | FileCheck %s --check-prefixes=CHECK,SSE ; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx | FileCheck %s --check-prefixes=CHECK,AVX ; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx2 | FileCheck %s --check-prefixes=CHECK,AVX ; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx512dq,+avx512vl | FileCheck %s --check-prefixes=CHECK,AVX512 ; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx512bw,+avx512vl | FileCheck %s --check-prefixes=CHECK,AVX512 | |
llvm/test/CodeGen/X86/vec_setcc.ll | ||
223 | The 0'th index should be able to extract from the source of the _EXTEND_VECTOR_INREG using SimplifyMultipleUseDemandedBits |
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp | ||
---|---|---|
19804 | How come this isn't picking up many of these test cases? Is it being run too late? |
llvm/test/CodeGen/X86/vec_setcc.ll | ||
---|---|---|
223 | Well %eax is extracted directly from the result of the pcmpeqw, which is the source of the sign_extend_vector_inreg. |
There are some outstanding lit test problems:
CodeGen/AMDGPU/bug-v4f64-subvector.ll: timeout
CodeGen/AMDGPU/scratch-simple.ll: fails
CodeGen/Hexagon/autohvx/hfnosplat_cp.ll: fails
CodeGen/Thumb2/mve-sext-masked-load.ll: fails machine verification for the -early-live-intervals RUN line
TBH I had rather gone off this patch, because it seemed like it would interfere with tests where we want to recombine a bunch of extracts back into some kind of permute instruction. (If you see a bunch of extracts from the same source then you might try to do this, but if one of the extracts has been folded to a constant then it's much harder to spot.) But I have rebased it anyway.
@deadalnix This patch has a number of outstanding problems, and might not be worth it - but can you tell if it helps D127115 at all?
So it definitively affects the codegen, but it's not clear if this is better or worse.
Yeah that seems about right
@foad its up to you if you want to persevere, we already have SimplifyDemandedBits support for cases where all uses of the vector are extracts - it looks like we now know why it was never extended beyond that...
This clearly needs work. The original motivation was to be able to remove SIFoldOperands::tryFoldCndMask which is a MIR optimisation that removes a v_cndmask (select) instruction if the values being selected are the same. We sometimes generate these when a 64-bit select is lowered to a pair of 32-bit selects, and the high or low halves of the 64-bit values were the same. I was hoping to fold this away during selection so we wouldn't have to do it later in SIFoldOperands.
How come this isn't picking up many of these test cases? Is it being run too late?