This is an archive of the discontinued LLVM Phabricator instance.

[DAGCombiner] Call SimplifyDemandedBits to simplify EXTRACT_VECTOR_ELT
Changes PlannedPublic

Authored by foad on Sep 30 2020, 7:12 AM.

Diff Detail

Event Timeline

foad created this revision.Sep 30 2020, 7:12 AM
Herald added a project: Restricted Project. · View Herald TranscriptSep 30 2020, 7:12 AM
foad requested review of this revision.Sep 30 2020, 7:12 AM
foad added inline comments.Sep 30 2020, 7:19 AM
llvm/test/CodeGen/Mips/cconv/vector.ll
697–758

There are some regressions in this file but also some improvements. I haven't worked out what's going on yet.

llvm/test/CodeGen/Thumb2/mve-vecreduce-mla.ll
249–259

Regression here and in other cases that are now using muls instead of umull/umlal.

llvm/test/CodeGen/X86/vector-fshl-128.ll
178–179

Regression. Quite a few tests are now using pxor+punpckhdq instead of pshufd. I wonder if some kind of combine could spot this case and turn it back into pshufd.

foad added inline comments.Sep 30 2020, 7:34 AM
llvm/test/CodeGen/ARM/dagcombine-anyexttozeroext.ll
46–47

Regression.

llvm/test/CodeGen/ARM/vdup.ll
59–69

Regression in lots of cases in this file.

foad updated this revision to Diff 295595.Oct 1 2020, 9:15 AM

Rebase on D88570.

foad updated this revision to Diff 296629.Oct 7 2020, 2:57 AM

Rebase.

RKSimon added inline comments.Oct 7 2020, 3:44 AM
llvm/test/CodeGen/AMDGPU/sdiv64.ll
502

Is this a regression? It looks like we're lost track that we only need 1 element

llvm/test/CodeGen/ARM/func-argpassing-endian.ll
106 ↗(On Diff #296629)

regression? we're no longer doing a single multiple load

llvm/test/CodeGen/ARM/vdup.ll
59–69

poor duplicate/splat detection?

@foad Any update on this?

yubing added a subscriber: yubing.Dec 4 2020, 4:32 AM
yubing added inline comments.Dec 6 2020, 4:19 AM
llvm/test/CodeGen/X86/vector-fshr-128.ll
179–180

Hi, during combination before legalizeDAG. there are the following Nodes:

  t63: i32 = extract_vector_elt t62, Constant:i32<2>
  t64: i32 = extract_vector_elt t62, Constant:i32<3>
t66: v4i32 = BUILD_VECTOR t63, t64, undef:i32, undef:i32

SimplifyDemandedBits deduce that t64 is a zero, so t66 will transformed into a pxor and a punpckhqd instead of a single pshufd.

Maybe in such a case where a buildvector have elts which are extracted from the same vector, you shouldn't SimplifyDemandedBits for extract_vector_elt.

foad updated this revision to Diff 316897.Jan 15 2021, 3:53 AM

Rebase.

RKSimon added inline comments.Jan 15 2021, 5:54 AM
llvm/test/CodeGen/X86/buildvec-insertvec.ll
783

this should simplify to "store i32 undef, store i32* undef" and be removed - can you check why it isn't please?

llvm/test/CodeGen/X86/vec_setcc.ll
223

why didn't this simplify?

foad added inline comments.Jan 15 2021, 9:53 AM
llvm/test/CodeGen/X86/buildvec-insertvec.ll
783

The value being stored is not undef, it's either -2147483648 or poison, depending on the value of %a0.

Anyway -simplifycfg would change the store into a trap + unreachable, but nothing in llc's codegen pipeline does that.

llvm/test/CodeGen/X86/vec_setcc.ll
223

What simplification are you expecting?

foad added inline comments.Jan 18 2021, 3:40 AM
llvm/test/CodeGen/X86/buildvec-insertvec.ll
783

I've looked into this more carefully now. Hopefully this answer makes more sense.

With my patch %4 = extractelement <4 x i32> zeroinitializer, i32 %2 is simplified into i32 0 based on the known bits of all elements. This happens before we simplify %2, because of the weird way that the DAG combiner runs top-down.

When we visit %2 we simplify it to -2147483648. After that, if we visited the original %4 again, we would simplify it to undef; but with my patch we have already simplified %4 to 0 so it's too late.

I'm not sure what to do about this -- other than change DAGCombine to run bottom-up ;-)

RKSimon added inline comments.Jan 23 2021, 10:19 AM
llvm/test/CodeGen/X86/buildvec-insertvec.ll
783

Do we need to tweak the out-of-range handling to ISD::EXTRACT_VECTOR_ELT indices? Either just for constant indices or we use computeKnownBits to work out if the minimum value always the exceeds the vector element count?

@foad Please can you rebase this?

foad updated this revision to Diff 332981.Mar 24 2021, 7:18 AM

Rebase. There are two failing tests that I have not updated yet:

Failed Tests (2):

LLVM :: CodeGen/AMDGPU/cttz_zero_undef.ll
LLVM :: CodeGen/AMDGPU/scratch-simple.ll
foad added a comment.Mar 24 2021, 7:21 AM

Maybe in such a case where a buildvector have elts which are extracted from the same vector, you shouldn't SimplifyDemandedBits for extract_vector_elt.

Yes, I am coming round to this idea: don't simplify EXTRACT_VECTOR_ELT to a constant if it is used by BUILD_VECTOR, because of the risk of breaking things like shuffle patterns.

RKSimon added inline comments.Mar 24 2021, 7:52 AM
llvm/test/CodeGen/X86/nontemporal-3.ll
2–8

Please can you add a common CHECK prefix:

; RUN: llc < %s -mtriple=x86_64-unknown-unknown | FileCheck %s --check-prefixes=CHECK,SSE
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+sse4a | FileCheck %s --check-prefixes=CHECK,SSE
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+sse4.1 | FileCheck %s --check-prefixes=CHECK,SSE
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx | FileCheck %s --check-prefixes=CHECK,AVX
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx2 | FileCheck %s --check-prefixes=CHECK,AVX
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx512dq,+avx512vl | FileCheck %s --check-prefixes=CHECK,AVX512
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx512bw,+avx512vl | FileCheck %s --check-prefixes=CHECK,AVX512
llvm/test/CodeGen/X86/vec_setcc.ll
223

The 0'th index should be able to extract from the source of the _EXTEND_VECTOR_INREG using SimplifyMultipleUseDemandedBits

foad updated this revision to Diff 333005.Mar 24 2021, 8:24 AM

Add a common CHECK prefix.

RKSimon added inline comments.Mar 29 2021, 3:07 AM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
18650

How come this isn't picking up many of these test cases? Is it being run too late?

foad added inline comments.Mar 29 2021, 3:16 AM
llvm/test/CodeGen/X86/vec_setcc.ll
223

Well %eax is extracted directly from the result of the pcmpeqw, which is the source of the sign_extend_vector_inreg.

@foad Are you intending to take another look at this at all? I'm wondering if could help with some of the regressions in D127115

Herald added a project: Restricted Project. · View Herald TranscriptJun 19 2022, 9:36 AM
foad updated this revision to Diff 438396.Jun 20 2022, 7:47 AM

Rebase.

foad added a comment.Jun 20 2022, 7:49 AM

There are some outstanding lit test problems:

CodeGen/AMDGPU/bug-v4f64-subvector.ll: timeout
CodeGen/AMDGPU/scratch-simple.ll: fails
CodeGen/Hexagon/autohvx/hfnosplat_cp.ll: fails
CodeGen/Thumb2/mve-sext-masked-load.ll: fails machine verification for the -early-live-intervals RUN line

foad added a comment.Jun 20 2022, 7:57 AM

@foad Are you intending to take another look at this at all? I'm wondering if could help with some of the regressions in D127115

TBH I had rather gone off this patch, because it seemed like it would interfere with tests where we want to recombine a bunch of extracts back into some kind of permute instruction. (If you see a bunch of extracts from the same source then you might try to do this, but if one of the extracts has been folded to a constant then it's much harder to spot.) But I have rebased it anyway.

RKSimon added a subscriber: deadalnix.

@deadalnix This patch has a number of outstanding problems, and might not be worth it - but can you tell if it helps D127115 at all?

@deadalnix This patch has a number of outstanding problems, and might not be worth it - but can you tell if it helps D127115 at all?

Testing that now, I'll let you know.

Testing that now, I'll let you know.

So it definitively affects the codegen, but it's not clear if this is better or worse.

RKSimon added a comment.EditedJun 21 2022, 8:08 AM

Testing that now, I'll let you know.

So it definitively affects the codegen, but it's not clear if this is better or worse.

Yeah that seems about right

@foad its up to you if you want to persevere, we already have SimplifyDemandedBits support for cases where all uses of the vector are extracts - it looks like we now know why it was never extended beyond that...

foad planned changes to this revision.Jun 22 2022, 2:52 AM

This clearly needs work. The original motivation was to be able to remove SIFoldOperands::tryFoldCndMask which is a MIR optimisation that removes a v_cndmask (select) instruction if the values being selected are the same. We sometimes generate these when a 64-bit select is lowered to a pair of 32-bit selects, and the high or low halves of the 64-bit values were the same. I was hoping to fold this away during selection so we wouldn't have to do it later in SIFoldOperands.