This is an archive of the discontinued LLVM Phabricator instance.

[VectorCombine] allow peeking through GEPs when creating a vector load
ClosedPublic

Authored by spatel on Dec 14 2020, 8:46 AM.

Details

Summary

This is an enhancement motivated by https://llvm.org/PR16739 (see D92858 for another).

We can look through a GEP to find a base pointer that may be safe to use for a vector load. If so, then we shuffle (shift) the necessary vector element over to index 0.

Alive2 proof based on 1 of the regression tests:
https://alive2.llvm.org/ce/z/yPJLkh

The vector translation is independent of endian (verify by changing to leading 'E' in the datalayout string).

Diff Detail

Event Timeline

spatel created this revision.Dec 14 2020, 8:46 AM
spatel requested review of this revision.Dec 14 2020, 8:46 AM
Herald added a project: Restricted Project. · View Herald TranscriptDec 14 2020, 8:46 AM
spatel added inline comments.Dec 14 2020, 9:03 AM
llvm/test/Transforms/VectorCombine/X86/load.ll
283

Note that the SSE2 cost model is conservatively giving:

{TTI::SK_PermuteSingleSrc, MVT::v8i16, 5}, // 2*pshuflw + 2*pshufhw
                                            // + pshufd/unpck

...so that's why we do not vectorize the tests here.

lebedev.ri added inline comments.Dec 14 2020, 9:04 AM
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
152–154

I strongly suspect that you need to recalculate the Alignment here,
because i don't think the Offset-less pointer is guaranteed to still be Alignment-aligned.

spatel added inline comments.Dec 14 2020, 10:13 AM
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
152–154

That's an interesting question. I think what we're doing here (even without this patch) is comparing the alignment of the original scalar load with the alignment of the pointer of a new vector load. But that's not very meaningful is it? For example, if we are loading an i16 with align 2 then does it matter whether the original pointer is at least align 2 for a load of v8i16?

I haven't looked at alignment requirements much. If there's another related transform that we can use as a template, let me know.

lebedev.ri requested changes to this revision.Dec 14 2020, 10:44 AM
lebedev.ri added subscribers: aqjune, nlopes.

I'm having trouble coming up with an example because there appears to be a preexisting soundness problems, example: (CC @nlopes @aqjune)

define <8 x i16> @t(i8* align 128 dereferenceable(128) %base) {
  %ptr = getelementptr inbounds i8, i8* %base, i64 1
  %p = bitcast i8* %ptr to <8 x i16>*

  %gep = getelementptr inbounds <8 x i16>, <8 x i16>* %p, i64 0, i64 1
  %s = load i16, i16* %gep, align 1
  %r = insertelement <8 x i16> undef, i16 %s, i64 0
  ret <8 x i16> %r
}
/builddirs/llvm-project/build-Clang11-unknown$ /builddirs/llvm-project/build-Clang11-unknown/bin/opt -load /repositories/alive2/build-Clang-release/tv/tv.so -tv -vector-combine -mtriple=x86_64-- -mattr=avx2 -tv -o /dev/null --tv-smt-to=60000 /tmp/D93229.ll 

----------------------------------------
define <8 x i16> @t(* dereferenceable(128) align(128) %base) {
%0:
  %ptr = gep inbounds * dereferenceable(128) align(128) %base, 1 x i64 1
  %p = bitcast * %ptr to *
  %gep = gep inbounds * %p, 16 x i64 0, 2 x i64 1
  %s = load i16, * %gep, align 1
  %r = insertelement <8 x i16> undef, i16 %s, i64 0
  ret <8 x i16> %r
}
=>
define <8 x i16> @t(* dereferenceable(128) align(128) %base) {
%0:
  %ptr = gep inbounds * dereferenceable(128) align(128) %base, 1 x i64 1
  %p = bitcast * %ptr to *
  %gep = gep inbounds * %p, 16 x i64 0, 2 x i64 1
  %1 = bitcast * %gep to *
  %r = load <8 x i16>, * %1, align 1
  ret <8 x i16> %r
}
Transformation doesn't verify!
ERROR: Target is more poisonous than source

Example:
* dereferenceable(128) align(128) %base = pointer(non-local, block_id=1, offset=1664)

Source:
* %ptr = pointer(non-local, block_id=1, offset=1665)
* %p = pointer(non-local, block_id=1, offset=1665)
* %gep = pointer(non-local, block_id=1, offset=1667)
i16 %s = poison
<8 x i16> %r = < poison, any, any, any, any, any, any, any >

SOURCE MEMORY STATE
===================
NON-LOCAL BLOCKS:
Block 0 >       size: 0 align: 1        alloc type: 0
Block 1 >       size: 2048      align: 128      alloc type: 0

Target:
* %ptr = pointer(non-local, block_id=1, offset=1665)
* %p = pointer(non-local, block_id=1, offset=1665)
* %gep = pointer(non-local, block_id=1, offset=1667)
* %1 = pointer(non-local, block_id=1, offset=1667)
<8 x i16> %r = < poison, poison, poison, poison, poison, poison, poison, poison >
Source value: < poison, any, any, any, any, any, any, any >
Target value: < poison, poison, poison, poison, poison, poison, poison, poison >

Alive2: Transform doesn't verify!
This revision now requires changes to proceed.Dec 14 2020, 10:44 AM

I'm having trouble coming up with an example because there appears to be a preexisting soundness problems, example: (CC @nlopes @aqjune)

define <8 x i16> @t(i8* align 128 dereferenceable(128) %base) {
  %ptr = getelementptr inbounds i8, i8* %base, i64 1
  %p = bitcast i8* %ptr to <8 x i16>*

  %gep = getelementptr inbounds <8 x i16>, <8 x i16>* %p, i64 0, i64 1
  %s = load i16, i16* %gep, align 1
  %r = insertelement <8 x i16> undef, i16 %s, i64 0
  ret <8 x i16> %r
}
/builddirs/llvm-project/build-Clang11-unknown$ /builddirs/llvm-project/build-Clang11-unknown/bin/opt -load /repositories/alive2/build-Clang-release/tv/tv.so -tv -vector-combine -mtriple=x86_64-- -mattr=avx2 -tv -o /dev/null --tv-smt-to=60000 /tmp/D93229.ll 

----------------------------------------
define <8 x i16> @t(* dereferenceable(128) align(128) %base) {
%0:
  %ptr = gep inbounds * dereferenceable(128) align(128) %base, 1 x i64 1
  %p = bitcast * %ptr to *
  %gep = gep inbounds * %p, 16 x i64 0, 2 x i64 1
  %s = load i16, * %gep, align 1
  %r = insertelement <8 x i16> undef, i16 %s, i64 0
  ret <8 x i16> %r
}
=>
define <8 x i16> @t(* dereferenceable(128) align(128) %base) {
%0:
  %ptr = gep inbounds * dereferenceable(128) align(128) %base, 1 x i64 1
  %p = bitcast * %ptr to *
  %gep = gep inbounds * %p, 16 x i64 0, 2 x i64 1
  %1 = bitcast * %gep to *
  %r = load <8 x i16>, * %1, align 1
  ret <8 x i16> %r
}
Transformation doesn't verify!
ERROR: Target is more poisonous than source

Example:
* dereferenceable(128) align(128) %base = pointer(non-local, block_id=1, offset=1664)

Source:
* %ptr = pointer(non-local, block_id=1, offset=1665)
* %p = pointer(non-local, block_id=1, offset=1665)
* %gep = pointer(non-local, block_id=1, offset=1667)
i16 %s = poison
<8 x i16> %r = < poison, any, any, any, any, any, any, any >

SOURCE MEMORY STATE
===================
NON-LOCAL BLOCKS:
Block 0 >       size: 0 align: 1        alloc type: 0
Block 1 >       size: 2048      align: 128      alloc type: 0

Target:
* %ptr = pointer(non-local, block_id=1, offset=1665)
* %p = pointer(non-local, block_id=1, offset=1665)
* %gep = pointer(non-local, block_id=1, offset=1667)
* %1 = pointer(non-local, block_id=1, offset=1667)
<8 x i16> %r = < poison, poison, poison, poison, poison, poison, poison, poison >
Source value: < poison, any, any, any, any, any, any, any >
Target value: < poison, poison, poison, poison, poison, poison, poison, poison >

Alive2: Transform doesn't verify!

IIUC, this is a question of allowing poison (from the unused loaded memory elements) to propagate?
So we have to freeze or explicitly make those elements undef again?
https://alive2.llvm.org/ce/z/LKqBVW

I'm having trouble coming up with an example because there appears to be a preexisting soundness problems, example: (CC @nlopes @aqjune)

define <8 x i16> @t(i8* align 128 dereferenceable(128) %base) {
  %ptr = getelementptr inbounds i8, i8* %base, i64 1
  %p = bitcast i8* %ptr to <8 x i16>*

  %gep = getelementptr inbounds <8 x i16>, <8 x i16>* %p, i64 0, i64 1
  %s = load i16, i16* %gep, align 1
  %r = insertelement <8 x i16> undef, i16 %s, i64 0
  ret <8 x i16> %r
}
/builddirs/llvm-project/build-Clang11-unknown$ /builddirs/llvm-project/build-Clang11-unknown/bin/opt -load /repositories/alive2/build-Clang-release/tv/tv.so -tv -vector-combine -mtriple=x86_64-- -mattr=avx2 -tv -o /dev/null --tv-smt-to=60000 /tmp/D93229.ll 

----------------------------------------
define <8 x i16> @t(* dereferenceable(128) align(128) %base) {
%0:
  %ptr = gep inbounds * dereferenceable(128) align(128) %base, 1 x i64 1
  %p = bitcast * %ptr to *
  %gep = gep inbounds * %p, 16 x i64 0, 2 x i64 1
  %s = load i16, * %gep, align 1
  %r = insertelement <8 x i16> undef, i16 %s, i64 0
  ret <8 x i16> %r
}
=>
define <8 x i16> @t(* dereferenceable(128) align(128) %base) {
%0:
  %ptr = gep inbounds * dereferenceable(128) align(128) %base, 1 x i64 1
  %p = bitcast * %ptr to *
  %gep = gep inbounds * %p, 16 x i64 0, 2 x i64 1
  %1 = bitcast * %gep to *
  %r = load <8 x i16>, * %1, align 1
  ret <8 x i16> %r
}
Transformation doesn't verify!
ERROR: Target is more poisonous than source

Example:
* dereferenceable(128) align(128) %base = pointer(non-local, block_id=1, offset=1664)

Source:
* %ptr = pointer(non-local, block_id=1, offset=1665)
* %p = pointer(non-local, block_id=1, offset=1665)
* %gep = pointer(non-local, block_id=1, offset=1667)
i16 %s = poison
<8 x i16> %r = < poison, any, any, any, any, any, any, any >

SOURCE MEMORY STATE
===================
NON-LOCAL BLOCKS:
Block 0 >       size: 0 align: 1        alloc type: 0
Block 1 >       size: 2048      align: 128      alloc type: 0

Target:
* %ptr = pointer(non-local, block_id=1, offset=1665)
* %p = pointer(non-local, block_id=1, offset=1665)
* %gep = pointer(non-local, block_id=1, offset=1667)
* %1 = pointer(non-local, block_id=1, offset=1667)
<8 x i16> %r = < poison, poison, poison, poison, poison, poison, poison, poison >
Source value: < poison, any, any, any, any, any, any, any >
Target value: < poison, poison, poison, poison, poison, poison, poison, poison >

Alive2: Transform doesn't verify!

IIUC, this is a question of allowing poison (from the unused loaded memory elements) to propagate?
So we have to freeze or explicitly make those elements undef again?
https://alive2.llvm.org/ce/z/LKqBVW

That is how i read it, yes. That will be gone in codegen, so no need to cost that extra legality shuffle.

/builddirs/llvm-project/build-Clang11-unknown$ /builddirs/llvm-project/build-Clang11-unknown/bin/opt -load /repositories/alive2/build-Clang-release/tv/tv.so -tv -vector-combine -mtriple=x86_64-- -mattr=avx2 -tv -o /dev/null --tv-smt-to=60000 /tmp/D93229.ll 

----------------------------------------
define <8 x i16> @t(* dereferenceable(128) align(128) %base) {
%0:
  %ptr = gep inbounds * dereferenceable(128) align(128) %base, 1 x i64 1
  %p = bitcast * %ptr to *
  %gep = gep inbounds * %p, 16 x i64 0, 2 x i64 1
  %s = load i16, * %gep, align 1
  %r = insertelement <8 x i16> undef, i16 %s, i64 0
  ret <8 x i16> %r
}
=>
define <8 x i16> @t(* dereferenceable(128) align(128) %base) {
%0:
  %ptr = gep inbounds * dereferenceable(128) align(128) %base, 1 x i64 1
  %p = bitcast * %ptr to *
  %gep = gep inbounds * %p, 16 x i64 0, 2 x i64 1
  %1 = bitcast * %gep to *
  %r = load <8 x i16>, * %1, align 1
  ret <8 x i16> %r
}
Transformation doesn't verify!

Makes sense; it replaces an undef vector with a potential poison vector that might be in memory.
We need to switch to using poison as vector placeholders rather than undef so these problems go away. I guess clang needs a bit of patching (or IRBuilder, or both?).

Yes, we might want to know who's generating insertvalue undef and replace it with insertvalue poison.
The shufflevector pattern works, but using poison as a placeholder will remove latent bugs.
Maybe it is time to use poison?

spatel updated this revision to Diff 311870.Dec 15 2020, 4:59 AM

Patch updated:
Rebased after D93238 (fix poison bug) and added code to update the alignment + extra test to confirm.
There was an existing test with conflicting alignment specifiers that also changes now. Stepping through that, I noticed that getPointerAlignment() does not peek through gep, so there may still be room for improvement.

Yes, we might want to know who's generating insertvalue undef and replace it with insertvalue poison.
The shufflevector pattern works, but using poison as a placeholder will remove latent bugs.
Maybe it is time to use poison?

If we switch to initialize with poison elements, I think we would have to swap undef with poison in regression tests and pattern matching in instsimplify/instcombine at the same time. Otherwise, we will be testing/folding the wrong patterns?

Yes, we might want to know who's generating insertvalue undef and replace it with insertvalue poison.
The shufflevector pattern works, but using poison as a placeholder will remove latent bugs.
Maybe it is time to use poison?

If we switch to initialize with poison elements, I think we would have to swap undef with poison in regression tests and pattern matching in instsimplify/instcombine at the same time. Otherwise, we will be testing/folding the wrong patterns?

Yes, I think so. What do you think about having three patches
(1) Poison placeholder patch (2) InstSimpify/InstCombine patch (3) The regression tests update patch.
and landing those consecutively when all of those are accepted? Patch (1) and (2) should still pass ninja check. It will show which tests are related to what.
I think I can prepare (2) and (3). The changes in regression tests (3) can be syntactically done with a script.
I tried this and it worked: sed -i.backup 's/insertelement <\(.*\)> undef,/insertelement <\1> poison,/g' (filename)
(sorry, I meant insertelement, not insertvalue)

lebedev.ri added inline comments.Dec 15 2020, 11:52 PM
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
152–154

I think we don't need to ever ask isSafeToLoadUnconditionally() about alignment
(i.e. just pass 0/1, whatever is more permissive),
because we already know the alignment of the load.

We simply need to recalculate the alignment after chopping off the offset
(see commonAlignment(Align A, uint64_t Offset))

Yes, we might want to know who's generating insertvalue undef and replace it with insertvalue poison.
The shufflevector pattern works, but using poison as a placeholder will remove latent bugs.
Maybe it is time to use poison?

If we switch to initialize with poison elements, I think we would have to swap undef with poison in regression tests and pattern matching in instsimplify/instcombine at the same time. Otherwise, we will be testing/folding the wrong patterns?

Yes, I think so. What do you think about having three patches
(1) Poison placeholder patch (2) InstSimpify/InstCombine patch (3) The regression tests update patch.
and landing those consecutively when all of those are accepted? Patch (1) and (2) should still pass ninja check. It will show which tests are related to what.
I think I can prepare (2) and (3). The changes in regression tests (3) can be syntactically done with a script.
I tried this and it worked: sed -i.backup 's/insertelement <\(.*\)> undef,/insertelement <\1> poison,/g' (filename)

We can use an alternate plan to make incremental progress, but it will cause redundancy while we transition:

  1. Replicate all of the regression tests with the poison constant instead of undef. If we add some unique TODO text marker on all of those new tests, we can then grep to make sure everything that we expect to get updated is actually updated in later steps.
  2. Update instcombine/simplify/vectorizer folds to match poison patterns (create a m_UndefOrPoison() pattern matcher?)
  3. Run Alive2 through all of the updated regression tests to verify.
  4. Update codegen to deal with poison constant?
  5. Update folds from step 2 to create poison rather than undef (if that's what they were doing)?
  6. Change IRBuilder or other instruction creators to create poison from the start.
spatel updated this revision to Diff 312292.Dec 16 2020, 1:29 PM

Patch updated:
D93397 / D93406 improved the basic alignment logic, so that is adjusted here to include the effect of gep offset.
In the last test diff, note that the final alignment is not the same as either of the starting alignments.

lebedev.ri added inline comments.Dec 16 2020, 1:38 PM
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
163

So we had SrcPtr, and split it into a base pointer SrcPtr', and an offset Offset.
But i think it's SrcPtr = SrcPtr' + Offset,
so shouldn't the offset be negative here,
because the alignment is known for SrcPtr, not SrcPtr'?

We can use an alternate plan to make incremental progress, but it will cause redundancy while we transition:

  1. Replicate all of the regression tests with the poison constant instead of undef. If we add some unique TODO text marker on all of those new tests, we can then grep to make sure everything that we expect to get updated is actually updated in later steps.
  2. Update instcombine/simplify/vectorizer folds to match poison patterns (create a m_UndefOrPoison() pattern matcher?)

m_Undef already matches poison because PoisonValue is a subclass of UndefValue :)

  1. Run Alive2 through all of the updated regression tests to verify.
  2. Update codegen to deal with poison constant?
  3. Update folds from step 2 to create poison rather than undef (if that's what they were doing)?
  4. Change IRBuilder or other instruction creators to create poison from the start.

I think we can split the goal and first work on replacing the placeholder value that is used by insertelement only.
In this case, actually InstSimplify/InstCombine change isn't necessary as well. It can be done only when suboptimal assembly is being generated.
What do you think?

spatel marked 3 inline comments as done.Dec 17 2020, 5:53 AM
spatel added inline comments.
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
163

Good point!
Intuitively, I couldn't see how it would make a difference if we negated the offset, so I looked at the implementation of MinAlign():
https://github.com/llvm/llvm-project/blob/4c8276cdc120c24410dcd62a9986f04e7327fc2f/llvm/include/llvm/Support/MathExtras.h#L673

The optimized IR for that is:

define i64 @MinAlign(i64 %A, i64 %B) local_unnamed_addr #0 {
entry:
  %or = or i64 %B, %A
  %add = sub i64 0, %or
  %and = and i64 %or, %add
  ret i64 %and
}

Double-check to make sure I didn't mess this up:
https://alive2.llvm.org/ce/z/wRSjPD

So it's logically equivalent either way? At the least, I'll add a code comment.

Memory is confusing, pointing out some more potential issues.

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
158–160

Is there a test coverage for edge case[s]?
Doesn't this check only ensure that we can load a single byte, not the entire element?

162

Don't we need to ensure that the byte offset is a multiple of element size?

163

STGM

215–216

Assert that the division is exact?

lebedev.ri added inline comments.Dec 17 2020, 6:34 AM
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
215–216

And also assert that Mask[0] < MinVecNumElts.

spatel marked 5 inline comments as done.Dec 17 2020, 10:01 AM
spatel added inline comments.
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
162

Yes, good catch. We have potentially looked through casts at this point, so anything is possible. I'll add test(s).

spatel updated this revision to Diff 312561.Dec 17 2020, 10:56 AM
spatel marked an inline comment as done.

Patch updated:

  1. Added constraints on address offset with respect to vector element size.
  2. Added negative tests for those.
  3. Added asserts to make sure address assumptions are valid.
  4. Added code comment about offset math (negation) logic.

We can use an alternate plan to make incremental progress, but it will cause redundancy while we transition:

  1. Replicate all of the regression tests with the poison constant instead of undef. If we add some unique TODO text marker on all of those new tests, we can then grep to make sure everything that we expect to get updated is actually updated in later steps.
  2. Update instcombine/simplify/vectorizer folds to match poison patterns (create a m_UndefOrPoison() pattern matcher?)

m_Undef already matches poison because PoisonValue is a subclass of UndefValue :)

  1. Run Alive2 through all of the updated regression tests to verify.
  2. Update codegen to deal with poison constant?
  3. Update folds from step 2 to create poison rather than undef (if that's what they were doing)?
  4. Change IRBuilder or other instruction creators to create poison from the start.

I think we can split the goal and first work on replacing the placeholder value that is used by insertelement only.
In this case, actually InstSimplify/InstCombine change isn't necessary as well. It can be done only when suboptimal assembly is being generated.
What do you think?

Ok - can you create that patch? (We should move this conversation to llvm-dev or another review to reduce confusion.)

lebedev.ri added inline comments.Dec 17 2020, 12:09 PM
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
155–156

I think this should be done later, after cheaper checks.

168–170

We want to load a single element of X bytes,
by instead loading Y such elements at once.
We've dissected the pointer into a base pointer, and a byte offset.
We know that the byte offset is a multiple of element size.
We need to ensure that if we load Y elements from the base pointer,
we still load the element we are after.
I think approaching that check from byte count is highly confusing.
(at least i already spent too much time trying to check this)
Let's do a much more obvious, element count based check.

spatel updated this revision to Diff 312611.Dec 17 2020, 1:57 PM
spatel marked 3 inline comments as done.

Patch updated:
Adjusted order/spelling of offset checks as suggested. Also added a test (gep012) just below the offset cut-off, so we have positive and negative tests at the boundary condition.

lebedev.ri accepted this revision.Dec 17 2020, 2:11 PM

Ok, this looks about right to me.

I'm not sure if we'll want to lift the restrictions that

(1) the offset must be a multiple of element size

and (2) that we try to load from the base pointer, which only works if the offset is small enough.

I've checked, and alive2 claims that this is endianness-agnostic.

Thanks.

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
145–179

And now that we already calculate NumEltsInOffset,
shall we just use/record NumEltsInOffset instead of recalculating it from OffsetInBits later?
(It them might use a better name, something like ElementIndex maybe)

215–216

(If we no longer recompute them here, the asserts can obviously go away to)

This revision is now accepted and ready to land.Dec 17 2020, 2:11 PM

We can use an alternate plan to make incremental progress, but it will cause redundancy while we transition:

  1. Replicate all of the regression tests with the poison constant instead of undef. If we add some unique TODO text marker on all of those new tests, we can then grep to make sure everything that we expect to get updated is actually updated in later steps.
  2. Update instcombine/simplify/vectorizer folds to match poison patterns (create a m_UndefOrPoison() pattern matcher?)

m_Undef already matches poison because PoisonValue is a subclass of UndefValue :)

  1. Run Alive2 through all of the updated regression tests to verify.
  2. Update codegen to deal with poison constant?
  3. Update folds from step 2 to create poison rather than undef (if that's what they were doing)?
  4. Change IRBuilder or other instruction creators to create poison from the start.

I think we can split the goal and first work on replacing the placeholder value that is used by insertelement only.
In this case, actually InstSimplify/InstCombine change isn't necessary as well. It can be done only when suboptimal assembly is being generated.
What do you think?

Ok - can you create that patch? (We should move this conversation to llvm-dev or another review to reduce confusion.)

Sure. I'll make a patch this weekend and send a mail to llvm-dev.

spatel marked 2 inline comments as done.Dec 18 2020, 5:47 AM
spatel added inline comments.
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
145–179

Yes - no need to carry the offset as both an index and a bit value. Keeping both makes it harder to read. Thanks for the thorough review!

This revision was automatically updated to reflect the committed changes.
spatel marked an inline comment as done.

I made a patch here: D93586.
It touches InstCombinerImpl::SimplifyDemandedVectorElts only to keep the size of diff manageable.
Sent a mail to llvm-dev too