This is an archive of the discontinued LLVM Phabricator instance.


Differential D93229

[VectorCombine] allow peeking through GEPs when creating a vector load
Closed · Public

Authored by spatel on Dec 14 2020, 8:46 AM.

Details

Reviewers

RKSimon
lebedev.ri
xbolva00

Commits

rG47aaa99c0e1e: [VectorCombine] allow peeking through GEPs when creating a vector load

Summary

This is an enhancement motivated by https://llvm.org/PR16739 (see D92858 for another).

We can look through a GEP to find a base pointer that may be safe to use for a vector load. If so, then we shuffle (shift) the necessary vector element over to index 0.

Alive2 proof based on 1 of the regression tests:
https://alive2.llvm.org/ce/z/yPJLkh

The vector translation is independent of endian (verify by changing to leading 'E' in the datalayout string).
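As a rough illustration of the idea in the summary, the following standalone sketch models the transform on a plain byte buffer rather than on LLVM IR. The names `Vec8x16`, `loadScalarAtOffset`, and `loadVectorAndShift` are hypothetical and do not appear in the patch. It only demonstrates the claim on the host's own byte order: lane k of a vector loaded from the base address holds the same bytes as a scalar loaded at base + k * element size.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for the IR type <8 x i16>.
using Vec8x16 = std::array<std::uint16_t, 8>;

// Original pattern: load a single i16 from Base + ByteOffset.
inline std::uint16_t loadScalarAtOffset(const unsigned char *Base,
                                        std::size_t ByteOffset) {
  std::uint16_t S;
  std::memcpy(&S, Base + ByteOffset, sizeof(S));
  return S;
}

// Transformed pattern: load the whole vector from Base, then move the
// requested element down to lane 0. The other lanes are "don't care"
// in the real transform; this model simply zeroes them.
inline Vec8x16 loadVectorAndShift(const unsigned char *Base,
                                  std::size_t ByteOffset) {
  assert(ByteOffset % sizeof(std::uint16_t) == 0 &&
         "offset must be a whole number of elements");
  const std::size_t EltIndex = ByteOffset / sizeof(std::uint16_t);
  assert(EltIndex < 8 && "element must lie inside the widened load");

  Vec8x16 Wide;
  std::memcpy(Wide.data(), Base, Wide.size() * sizeof(std::uint16_t));

  Vec8x16 Result{};
  Result[0] = Wide[EltIndex];
  return Result;
}
```

For every even byte offset inside a 16-byte buffer, lane 0 of `loadVectorAndShift` should equal the value returned by `loadScalarAtOffset`.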

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

spatel created this revision. Dec 14 2020, 8:46 AM

Herald added subscribers: hiraditya, mcrosier. Dec 14 2020, 8:46 AM

spatel requested review of this revision. Dec 14 2020, 8:46 AM

Herald added a project: Restricted Project. Dec 14 2020, 8:46 AM

spatel added inline comments. Dec 14 2020, 9:03 AM

llvm/test/Transforms/VectorCombine/X86/load.ll
283	Note that the SSE2 cost model is conservatively giving:

{TTI::SK_PermuteSingleSrc, MVT::v8i16, 5}, // 2pshuflw + 2pshufhw
                                           // + pshufd/unpck

...so that's why we do not vectorize the tests here.

lebedev.ri added inline comments. Dec 14 2020, 9:04 AM

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
152–154	I strongly suspect that you need to recalculate the `Alignment` here, because i don't think the `Offset`-less pointer is guaranteed to still be `Alignment`-aligned.

spatel added inline comments. Dec 14 2020, 10:13 AM

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
152–154	That's an interesting question. I think what we're doing here (even without this patch) is comparing the alignment of the original scalar load with the alignment of the pointer of a new vector load. But that's not very meaningful, is it? For example, if we are loading an i16 with `align 2`, then does it matter whether the original pointer is at least `align 2` for a load of v8i16?

I haven't looked at alignment requirements much. If there's another related transform that we can use as a template, let me know.

I'm having trouble coming up with an example because there appears to be a preexisting soundness problem, example: (CC @nlopes @aqjune)

define <8 x i16> @t(i8* align 128 dereferenceable(128) %base) {
  %ptr = getelementptr inbounds i8, i8* %base, i64 1
  %p = bitcast i8* %ptr to <8 x i16>*

  %gep = getelementptr inbounds <8 x i16>, <8 x i16>* %p, i64 0, i64 1
  %s = load i16, i16* %gep, align 1
  %r = insertelement <8 x i16> undef, i16 %s, i64 0
  ret <8 x i16> %r
}

/builddirs/llvm-project/build-Clang11-unknown$ /builddirs/llvm-project/build-Clang11-unknown/bin/opt -load /repositories/alive2/build-Clang-release/tv/tv.so -tv -vector-combine -mtriple=x86_64-- -mattr=avx2 -tv -o /dev/null --tv-smt-to=60000 /tmp/D93229.ll 

----------------------------------------
define <8 x i16> @t(* dereferenceable(128) align(128) %base) {
%0:
  %ptr = gep inbounds * dereferenceable(128) align(128) %base, 1 x i64 1
  %p = bitcast * %ptr to *
  %gep = gep inbounds * %p, 16 x i64 0, 2 x i64 1
  %s = load i16, * %gep, align 1
  %r = insertelement <8 x i16> undef, i16 %s, i64 0
  ret <8 x i16> %r
}
=>
define <8 x i16> @t(* dereferenceable(128) align(128) %base) {
%0:
  %ptr = gep inbounds * dereferenceable(128) align(128) %base, 1 x i64 1
  %p = bitcast * %ptr to *
  %gep = gep inbounds * %p, 16 x i64 0, 2 x i64 1
  %1 = bitcast * %gep to *
  %r = load <8 x i16>, * %1, align 1
  ret <8 x i16> %r
}
Transformation doesn't verify!
ERROR: Target is more poisonous than source

Example:
* dereferenceable(128) align(128) %base = pointer(non-local, block_id=1, offset=1664)

Source:
* %ptr = pointer(non-local, block_id=1, offset=1665)
* %p = pointer(non-local, block_id=1, offset=1665)
* %gep = pointer(non-local, block_id=1, offset=1667)
i16 %s = poison
<8 x i16> %r = < poison, any, any, any, any, any, any, any >

SOURCE MEMORY STATE
===================
NON-LOCAL BLOCKS:
Block 0 >       size: 0 align: 1        alloc type: 0
Block 1 >       size: 2048      align: 128      alloc type: 0

Target:
* %ptr = pointer(non-local, block_id=1, offset=1665)
* %p = pointer(non-local, block_id=1, offset=1665)
* %gep = pointer(non-local, block_id=1, offset=1667)
* %1 = pointer(non-local, block_id=1, offset=1667)
<8 x i16> %r = < poison, poison, poison, poison, poison, poison, poison, poison >
Source value: < poison, any, any, any, any, any, any, any >
Target value: < poison, poison, poison, poison, poison, poison, poison, poison >

Alive2: Transform doesn't verify!

This revision now requires changes to proceed. Dec 14 2020, 10:44 AM

In D93229#2452695, @lebedev.ri wrote:

I'm having trouble coming up with an example because there appears to be a preexisting soundness problem, example: (CC @nlopes @aqjune)


IIUC, this is a question of allowing poison (from the unused loaded memory elements) to propagate?
So we have to freeze or explicitly make those elements undef again?
https://alive2.llvm.org/ce/z/LKqBVW

In D93229#2452751, @spatel wrote:

In D93229#2452695, @lebedev.ri wrote:

I'm having trouble coming up with an example because there appears to be a preexisting soundness problem, example: (CC @nlopes @aqjune)


That is how i read it, yes. That will be gone in codegen, so no need to cost that extra legality shuffle.

In D93229#2452695, @lebedev.ri wrote:

Transformation doesn't verify!

Makes sense; it replaces an undef vector with a potential poison vector that might be in memory.
We need to switch to using poison as vector placeholders rather than undef so these problems go away. I guess clang needs a bit of patching (or IRBuilder, or both?).
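The Alive2 complaint ("Target is more poisonous than source") can be pictured with a toy per-lane refinement check. This is a deliberately simplified model, not Alive2's semantics: the `Lane` enum and both functions are hypothetical, and value equality of defined lanes is not modeled. The only rule it captures is that an undef lane may become any defined value but may not become poison.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Toy abstraction of what a vector lane can hold.
enum class Lane { Poison, Undef, Defined };

// May a source lane legally be replaced by the target lane?
//   poison  -> anything
//   undef   -> undef or a defined value, but never poison
//   defined -> defined only (equality of values is not modeled here)
inline bool laneRefines(Lane Src, Lane Tgt) {
  switch (Src) {
  case Lane::Poison:
    return true;
  case Lane::Undef:
    return Tgt != Lane::Poison;
  case Lane::Defined:
    return Tgt == Lane::Defined;
  }
  return false;
}

// A vector transform is acceptable only if every lane refines.
template <std::size_t N>
bool vectorRefines(const std::array<Lane, N> &Src,
                   const std::array<Lane, N> &Tgt) {
  for (std::size_t I = 0; I < N; ++I)
    if (!laneRefines(Src[I], Tgt[I]))
      return false;
  return true;
}
```

In the counterexample above, the source is `< poison, any, ... >` (lanes 1 to 7 come from the undef placeholder) while the widened load yields poison in every lane, so the check fails on lanes 1 to 7. Resetting those lanes to undef after the load, as the follow-up shuffle does, makes the check pass in this model.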

spatel mentioned this in D93238: [VectorCombine] make load transform poison-safe. Dec 14 2020, 12:21 PM

spatel mentioned this in rGd399f870b5a9: [VectorCombine] make load transform poison-safe. Dec 14 2020, 2:42 PM

Yes, we might want to know who's generating insertvalue undef and replace it with insertvalue poison.
The shufflevector pattern works, but using poison as a placeholder will remove latent bugs.
Maybe it is time to use poison?

Patch updated:
Rebased after D93238 (fix poison bug) and added code to update the alignment + extra test to confirm.
There was an existing test with conflicting alignment specifiers that also changes now. Stepping through that, I noticed that getPointerAlignment() does not peek through gep, so there may still be room for improvement.

In D93229#2453538, @aqjune wrote:

Yes, we might want to know who's generating insertvalue undef and replace it with insertvalue poison.
The shufflevector pattern works, but using poison as a placeholder will remove latent bugs.
Maybe it is time to use poison?

If we switch to initialize with poison elements, I think we would have to swap undef with poison in regression tests and pattern matching in instsimplify/instcombine at the same time. Otherwise, we will be testing/folding the wrong patterns?

In D93229#2454698, @spatel wrote:


If we switch to initialize with poison elements, I think we would have to swap undef with poison in regression tests and pattern matching in instsimplify/instcombine at the same time. Otherwise, we will be testing/folding the wrong patterns?

Yes, I think so. What do you think about having three patches
(1) Poison placeholder patch (2) InstSimplify/InstCombine patch (3) The regression tests update patch.
and landing those consecutively when all of those are accepted? Patch (1) and (2) should still pass ninja check. It will show which tests are related to what.
I think I can prepare (2) and (3). The changes in regression tests (3) can be syntactically done with a script.
I tried this and it worked: sed -i.backup 's/insertelement <\(.*\)> undef,/insertelement <\1> poison,/g' (filename)
(sorry, I meant insertelement, not insertvalue)
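For readers who prefer to see the rewrite rule outside of sed, here is a rough C++ equivalent of the substitution quoted above, using `std::regex`. The function name `undefToPoisonPlaceholder` is made up for this sketch. Like the sed expression, it uses a greedy `.*`, so a line containing more than one `insertelement` would need a closer look.

```cpp
#include <regex>
#include <string>

// Rewrites "insertelement <TYPE> undef," to "insertelement <TYPE> poison,"
// on a single line of textual IR. Lines without the pattern are returned
// unchanged.
inline std::string undefToPoisonPlaceholder(const std::string &Line) {
  static const std::regex Pattern(R"(insertelement <(.*)> undef,)");
  return std::regex_replace(Line, Pattern, "insertelement <$1> poison,");
}
```

Applied to `%r = insertelement <8 x i16> undef, i16 %s, i64 0`, this replaces only the placeholder operand and leaves the rest of the line as it was.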

lebedev.ri added inline comments. Dec 15 2020, 11:52 PM

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
152–154	I think we don't need to ever ask `isSafeToLoadUnconditionally()` about alignment (i.e. just pass `0`/`1`, whatever is more permissive), because we already know the alignment of the load. We simply need to recalculate the alignment after chopping off the offset (see `commonAlignment(Align A, uint64_t Offset)`)

In D93229#2456676, @aqjune wrote:


Yes, I think so. What do you think about having three patches
(1) Poison placeholder patch (2) InstSimplify/InstCombine patch (3) The regression tests update patch.
and landing those consecutively when all of those are accepted? Patch (1) and (2) should still pass ninja check. It will show which tests are related to what.
I think I can prepare (2) and (3). The changes in regression tests (3) can be syntactically done with a script.
I tried this and it worked: sed -i.backup 's/insertelement <\(.*\)> undef,/insertelement <\1> poison,/g' (filename)

We can use an alternate plan to make incremental progress, but it will cause redundancy while we transition:

1. Replicate all of the regression tests with the poison constant instead of undef. If we add some unique TODO text marker on all of those new tests, we can then grep to make sure everything that we expect to get updated is actually updated in later steps.
2. Update instcombine/simplify/vectorizer folds to match poison patterns (create a m_UndefOrPoison() pattern matcher?)
3. Run Alive2 through all of the updated regression tests to verify.
4. Update codegen to deal with poison constant?
5. Update folds from step 2 to create poison rather than undef (if that's what they were doing)?
6. Change IRBuilder or other instruction creators to create poison from the start.

spatel mentioned this in D93397: [VectorCombine] loosen alignment constraint for load transform. Dec 16 2020, 7:25 AM

spatel mentioned this in rGaaaf0ec72b06: [VectorCombine] loosen alignment constraint for load transform. Dec 16 2020, 9:27 AM

spatel mentioned this in D93406: [VectorCombine] optimize alignment for load transform. Dec 16 2020, 10:08 AM

spatel mentioned this in rG38ebc1a13dc8: [VectorCombine] optimize alignment for load transform. Dec 16 2020, 12:26 PM

Patch updated:
D93397 / D93406 improved the basic alignment logic, so that is adjusted here to include the effect of gep offset.
In the last test diff, note that the final alignment is not the same as either of the starting alignments.
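A small worked example may help show how the final alignment can differ from both inputs. The sketch below assumes the widened load takes the larger of (a) the base pointer's known alignment and (b) the scalar load's alignment adjusted for the byte offset; that combination is inferred from this discussion and is not quoted from the patch. Both function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Largest power of two dividing both A and Offset (an offset of 0 leaves
// A unchanged). Computed as the lowest set bit of (A | Offset).
inline std::uint64_t commonAlignmentSketch(std::uint64_t A,
                                           std::uint64_t Offset) {
  const std::uint64_t Or = A | Offset;
  return Or & (~Or + 1);
}

// Assumed combination: take the better of the base pointer's alignment
// and the offset-adjusted alignment of the original scalar load.
inline std::uint64_t widenedLoadAlignSketch(std::uint64_t ScalarLoadAlign,
                                            std::uint64_t ByteOffset,
                                            std::uint64_t BasePtrAlign) {
  return std::max(BasePtrAlign,
                  commonAlignmentSketch(ScalarLoadAlign, ByteOffset));
}
```

With a scalar load of `align 16` at byte offset 8 from a base pointer known to be `align 4`, the sketch yields 8, which matches neither starting value.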

lebedev.ri added inline comments. Dec 16 2020, 1:38 PM

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
163	So we had `SrcPtr`, and split it into a base pointer `SrcPtr'`, and an offset `Offset`. But i think it's `SrcPtr = SrcPtr' + Offset`, so shouldn't the offset be negative here, because the alignment is known for `SrcPtr`, not `SrcPtr'`?

In D93229#2457696, @spatel wrote:

We can use an alternate plan to make incremental progress, but it will cause redundancy while we transition:

Replicate all of the regression tests with the poison constant instead of undef. If we add some unique TODO text marker on all of those new tests, we can then grep to make sure everything that we expect to get updated is actually updated in later steps.

Update instcombine/simplify/vectorizer folds to match poison patterns (create a m_UndefOrPoison() pattern matcher?)

m_Undef already matches poison because PoisonValue is a subclass of UndefValue :)

Run Alive2 through all of the updated regression tests to verify.

Update codegen to deal with poison constant?

Update folds from step 2 to create poison rather than undef (if that's what they were doing)?

Change IRBuilder or other instruction creators to create poison from the start.

I think we can split the goal and first work on replacing the placeholder value that is used by insertelement only.
In this case, actually InstSimplify/InstCombine change isn't necessary as well. It can be done only when suboptimal assembly is being generated.
What do you think?
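The remark that m_Undef already matches poison follows from ordinary subclass checks. The toy hierarchy below reuses the LLVM class names inside a hypothetical `toy` namespace purely for illustration; it is not LLVM's implementation, and it uses `dynamic_cast` in place of LLVM's own type-query machinery.

```cpp
namespace toy {

// Minimal stand-ins for the class relationship described in the comment.
struct Value {
  virtual ~Value() = default;
};
struct UndefValue : Value {};
struct PoisonValue : UndefValue {}; // poison modeled as a kind of undef
struct ConstantInt : Value {};

// A matcher that asks "is this an UndefValue?" also accepts any subclass,
// so a PoisonValue matches without further changes.
inline bool matchesUndef(const Value &V) {
  return dynamic_cast<const UndefValue *>(&V) != nullptr;
}

} // namespace toy
```

Under this model, a fold written against the undef pattern keeps firing once the placeholder becomes poison, which is why a separate combined matcher would be redundant.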

spatel marked 3 inline comments as done. Dec 17 2020, 5:53 AM

spatel added inline comments.

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
163	Good point! Intuitively, I couldn't see how it would make a difference if we negated the offset, so I looked at the implementation of MinAlign():
https://github.com/llvm/llvm-project/blob/4c8276cdc120c24410dcd62a9986f04e7327fc2f/llvm/include/llvm/Support/MathExtras.h#L673

The optimized IR for that is:

define i64 @MinAlign(i64 %A, i64 %B) local_unnamed_addr #0 {
entry:
  %or = or i64 %B, %A
  %add = sub i64 0, %or
  %and = and i64 %or, %add
  ret i64 %and
}

Double-check to make sure I didn't mess this up:
https://alive2.llvm.org/ce/z/wRSjPD

So it's logically equivalent either way? At the least, I'll add a code comment.
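The equivalence can also be checked by hand with a standalone transcription of the IR quoted in this comment. The function name `minAlignSketch` is hypothetical. The computation isolates the lowest set bit of `A | B`, and negating a value in two's complement preserves its lowest set bit, so the sign of the offset should not matter.

```cpp
#include <cassert>
#include <cstdint>

// Direct transcription of the optimized IR:
//   %or  = or i64 %B, %A
//   %add = sub i64 0, %or
//   %and = and i64 %or, %add
// The result is the lowest set bit of (A | B).
inline std::uint64_t minAlignSketch(std::uint64_t A, std::uint64_t B) {
  const std::uint64_t Or = B | A;
  const std::uint64_t Neg = std::uint64_t{0} - Or;
  return Or & Neg;
}
```

For example, with an alignment of 16, both an offset of 4 and an offset of -4 (as an unsigned 64-bit value) give 4.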

Memory is confusing, pointing out some more potential issues.

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
158–160	Is there a test coverage for edge case[s]? Doesn't this check only ensure that we can load a single byte, not the entire element?
162	Don't we need to ensure that the byte offset is a multiple of element size?
163	SGTM
215–216	Assert that the division is exact?

lebedev.ri added inline comments. Dec 17 2020, 6:34 AM

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
215–216	And also assert that `Mask[0] < MinVecNumElts`.

spatel marked 5 inline comments as done. Dec 17 2020, 10:01 AM

spatel added inline comments.

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
162	Yes, good catch. We have potentially looked through casts at this point, so anything is possible. I'll add test(s).

Patch updated:

Added constraints on address offset with respect to vector element size.
Added negative tests for those.
Added asserts to make sure address assumptions are valid.
Added code comment about offset math (negation) logic.

In D93229#2459265, @aqjune wrote:


I think we can split the goal and first work on replacing the placeholder value that is used by insertelement only.
In this case, actually InstSimplify/InstCombine change isn't necessary as well. It can be done only when suboptimal assembly is being generated.
What do you think?

Ok - can you create that patch? (We should move this conversation to llvm-dev or another review to reduce confusion.)

lebedev.ri added inline comments. Dec 17 2020, 12:09 PM

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
155–156	I think this should be done later, after cheaper checks.
168–170	We want to load a single element of X bytes, by instead loading Y such elements at once. We've dissected the pointer into a base pointer, and a byte offset. We know that the byte offset is a multiple of element size. We need to ensure that if we load Y elements from the base pointer, we still load the element we are after. I think approaching that check from byte count is highly confusing. (at least i already spent too much time trying to check this) Let's do a much more obvious, element count based check.

Patch updated:
Adjusted order/spelling of offset checks as suggested. Also added a test (gep012) just below the offset cut-off, so we have positive and negative tests at the boundary condition.
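The element-count-based check requested in the review can be sketched as a small standalone function. This is an illustration under stated assumptions, not the code from the patch: the name `offsetToElementIndex` and its signature are invented here, and sizes are passed in as plain integers instead of being queried from LLVM types.

```cpp
#include <cassert>
#include <cstdint>

// Returns true and sets EltIndex when a scalar of ScalarSizeInBits located
// OffsetInBytes past the base pointer is one whole element of a vector of
// MinVectorSizeInBits loaded from that base pointer.
inline bool offsetToElementIndex(std::int64_t OffsetInBytes,
                                 std::uint64_t ScalarSizeInBits,
                                 std::uint64_t MinVectorSizeInBits,
                                 unsigned &EltIndex) {
  // Type-based constraints: byte-sized scalar that evenly divides the vector.
  if (ScalarSizeInBits == 0 || ScalarSizeInBits % 8 != 0 ||
      MinVectorSizeInBits % ScalarSizeInBits != 0)
    return false;

  // We can only shuffle an element down from a higher lane.
  if (OffsetInBytes < 0)
    return false;

  // The byte offset must be a whole number of elements.
  const std::uint64_t ScalarSizeInBytes = ScalarSizeInBits / 8;
  const std::uint64_t Offset = static_cast<std::uint64_t>(OffsetInBytes);
  if (Offset % ScalarSizeInBytes != 0)
    return false;

  // The element must still be covered by the vector loaded from the base.
  const std::uint64_t Index = Offset / ScalarSizeInBytes;
  const std::uint64_t MinVecNumElts = MinVectorSizeInBits / ScalarSizeInBits;
  if (Index >= MinVecNumElts)
    return false;

  EltIndex = static_cast<unsigned>(Index);
  return true;
}
```

For an i16 element and a 128-bit vector, byte offset 14 maps to lane 7 (the last lane that still fits), while byte offset 16 is rejected; this mirrors the positive and negative boundary tests described above.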

Ok, this looks about right to me.

I'm not sure if we'll want to lift the restrictions that

(1) the offset must be a multiple of element size

and (2) that we try to load from the base pointer, which only works if the offset is small enough.

I've checked, and alive2 claims that this is endianness-agnostic.

Thanks.

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
145–179	And now that we already calculate `NumEltsInOffset`, shall we just use/record `NumEltsInOffset` instead of recalculating it from `OffsetInBits` later? (It then might use a better name, something like `ElementIndex` maybe)
215–216	(If we no longer recompute them here, the asserts can obviously go away too)

This revision is now accepted and ready to land. Dec 17 2020, 2:11 PM

In D93229#2461069, @spatel wrote:


Ok - can you create that patch? (We should move this conversation to llvm-dev or another review to reduce confusion.)

Sure. I'll make a patch this weekend and send a mail to llvm-dev.

spatel marked 2 inline comments as done. Dec 18 2020, 5:47 AM

spatel added inline comments.

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
145–179	Yes - no need to carry the offset as both an index and a bit value. Keeping both makes it harder to read. Thanks for the thorough review!

Closed by commit rG47aaa99c0e1e: [VectorCombine] allow peeking through GEPs when creating a vector load (authored by spatel). Dec 18 2020, 6:25 AM

This revision was automatically updated to reflect the committed changes.

spatel marked an inline comment as done.

spatel added a commit: rG47aaa99c0e1e: [VectorCombine] allow peeking through GEPs when creating a vector load.

I made a patch here: D93586.
It touches InstCombinerImpl::SimplifyDemandedVectorElts only to keep the size of diff manageable.
Sent a mail to llvm-dev too

frasercrmck mentioned this in D121787: [VectorCombine] Insert addrspacecast when crossing address space boundaries. Mar 16 2022, 3:30 AM

frasercrmck mentioned this in rG2e44b7872bc6: [VectorCombine] Insert addrspacecast when crossing address space boundaries. Mar 24 2022, 12:19 PM

Revision Contents

Path                                               Size
llvm/lib/Transforms/Vectorize/VectorCombine.cpp    54 lines
llvm/test/Transforms/VectorCombine/X86/load.ll     73 lines

Diff 312783

llvm/lib/Transforms/Vectorize/VectorCombine.cpp


 static void replaceValue(Value &Old, Value &New) {
   Old.replaceAllUsesWith(&New);
   New.takeName(&Old);
 }
 bool VectorCombine::vectorizeLoadInsert(Instruction &I) {
   // Match insert into fixed vector of scalar value.
+  // TODO: Handle non-zero insert index.
   auto *Ty = dyn_cast<FixedVectorType>(I.getType());
   Value *Scalar;
   if (!Ty || !match(&I, m_InsertElt(m_Undef(), m_Value(Scalar), m_ZeroInt())) ||
       !Scalar->hasOneUse())
     return false;
   // Optionally match an extract from another vector.
   Value *X;
   bool HasExtract = match(Scalar, m_ExtractElt(m_Value(X), m_ZeroInt()));
   if (!HasExtract)
     X = Scalar;
   // Match source value as load of scalar or vector.
   // Do not vectorize scalar load (widening) if atomic/volatile or under
   // asan/hwasan/memtag/tsan. The widened load may load data from dirty regions
   // or create data races non-existent in the source.
   auto *Load = dyn_cast<LoadInst>(X);
   if (!Load || !Load->isSimple() || !Load->hasOneUse() ||
       Load->getFunction()->hasFnAttribute(Attribute::SanitizeMemTag) ||
       mustSuppressSpeculation(*Load))
     return false;
-  // TODO: Extend this to match GEP with constant offsets.
   const DataLayout &DL = I.getModule()->getDataLayout();
   Value *SrcPtr = Load->getPointerOperand()->stripPointerCasts();
   assert(isa<PointerType>(SrcPtr->getType()) && "Expected a pointer type");
   // If original AS != Load's AS, we can't bitcast the original pointer and have
   // to use Load's operand instead. Ideally we would want to strip pointer casts
   // without changing AS, but there's no API to do that ATM.
   unsigned AS = Load->getPointerAddressSpace();
   if (AS != SrcPtr->getType()->getPointerAddressSpace())
     SrcPtr = Load->getPointerOperand();
+  // We are potentially transforming byte-sized (8-bit) memory accesses, so make
+  // sure we have all of our type-based constraints in place for this target.
   Type *ScalarTy = Scalar->getType();
   uint64_t ScalarSize = ScalarTy->getPrimitiveSizeInBits();
   unsigned MinVectorSize = TTI.getMinVectorRegisterBitWidth();
-  if (!ScalarSize || !MinVectorSize || MinVectorSize % ScalarSize != 0)
+  if (!ScalarSize || !MinVectorSize || MinVectorSize % ScalarSize != 0 ||
+      ScalarSize % 8 != 0)
     return false;
   // Check safety of replacing the scalar load with a larger vector load.
   // We use minimal alignment (maximum flexibility) because we only care about
   // the dereferenceable region. When calculating cost and creating a new op,
   // we may use a larger value based on alignment attributes.
   unsigned MinVecNumElts = MinVectorSize / ScalarSize;
   auto *MinVecTy = VectorType::get(ScalarTy, MinVecNumElts, false);
+  unsigned OffsetEltIndex = 0;
+  Align Alignment = Load->getAlign();
+  if (!isSafeToLoadUnconditionally(SrcPtr, MinVecTy, Align(1), DL, Load, &DT)) {
+    // It is not safe to load directly from the pointer, but we can still peek
+    // through gep offsets and check if it safe to load from a base address with
+    // updated alignment. If it is, we can shuffle the element(s) into place
+    // after loading.
+    unsigned OffsetBitWidth = DL.getIndexTypeSizeInBits(SrcPtr->getType());
+    APInt Offset(OffsetBitWidth, 0);
+    SrcPtr = SrcPtr->stripAndAccumulateInBoundsConstantOffsets(DL, Offset);

lebedev.riUnsubmitted

Done

I strongly suspect that you need to recalculate the Alignment here,
because i don't think the Offset-less pointer is guaranteed to still be Alignment-aligned.

lebedev.ri: I strongly suspect that you need to recalculate the `Alignment` here, because i don't think the…

spatel (author):

That's an interesting question. I think what we're doing here (even without this patch) is comparing the alignment of the original scalar load with the alignment of the pointer of a new vector load. But that's not very meaningful, is it? For example, if we are loading an i16 with align 2, then does it matter whether the original pointer is at least align 2 for a load of v8i16?

I haven't looked at alignment requirements much. If there's another related transform that we can use as a template, let me know.

lebedev.ri:

I think we never need to ask `isSafeToLoadUnconditionally()` about alignment
(i.e. just pass 0/1, whichever is more permissive),
because we already know the alignment of the load.

We simply need to recalculate the alignment after chopping off the offset
(see `commonAlignment(Align A, uint64_t Offset)`).
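The recalculation suggested above can be sketched in standalone form. This is only an illustration, assuming `commonAlignment(Align, uint64_t)` reduces to the `MinAlign()` bit trick (the largest power of two that divides both values). The names `minAlign` and `commonAlignmentSketch` are hypothetical and use plain integers instead of the LLVM `Align` type.

```cpp
#include <cassert>
#include <cstdint>

// Standalone stand-in for the MinAlign() bit trick: the lowest set bit of
// (A | B), i.e. the largest power of two dividing both A and B.
constexpr uint64_t minAlign(uint64_t A, uint64_t B) {
  return (A | B) & (1 + ~(A | B));
}

// Hypothetical helper (not the LLVM API): given a pointer known to be
// A-aligned, return the alignment that can still be assumed for a pointer
// that differs from it by Offset bytes.
uint64_t commonAlignmentSketch(uint64_t A, uint64_t Offset) {
  assert(A != 0 && (A & (A - 1)) == 0 && "alignment must be a power of two");
  return minAlign(A, Offset);
}
```

For example, a load with `align 8` at byte offset 2 yields 2, and at byte offset 4 yields 4. That is consistent with the `align 2` and `align 4` vector loads in the AVX2 checks of load.ll once the result is combined with the base pointer's own alignment.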

// We want to shuffle the result down from a high element of a vector, so

lebedev.ri:

I think this should be done later, after cheaper checks.

// the offset must be positive.

if (Offset.isNegative())

return false;

lebedev.ri:

Is there test coverage for the edge case[s]?
Doesn't this check only ensure that we can load a single byte, not the entire element?

// The offset must be a multiple of the scalar element to shuffle cleanly

// in the element's size.

lebedev.ri:

Don't we need to ensure that the byte offset is a multiple of element size?

spatel (author):

Yes, good catch. We have potentially looked through casts at this point, so anything is possible. I'll add test(s).

uint64_t ScalarSizeInBytes = ScalarSize / 8;

lebedev.ri:

So we had `SrcPtr`, and split it into a base pointer `SrcPtr'` and an offset `Offset`.
But I think it's `SrcPtr = SrcPtr' + Offset`,
so shouldn't the offset be negative here,
because the alignment is known for `SrcPtr`, not `SrcPtr'`?

spatel (author):

Good point!
Intuitively, I couldn't see how it would make a difference if we negated the offset, so I looked at the implementation of `MinAlign()`:
https://github.com/llvm/llvm-project/blob/4c8276cdc120c24410dcd62a9986f04e7327fc2f/llvm/include/llvm/Support/MathExtras.h#L673

The optimized IR for that is:

define i64 @MinAlign(i64 %A, i64 %B) local_unnamed_addr #0 {
entry:
  %or = or i64 %B, %A
  %add = sub i64 0, %or
  %and = and i64 %or, %add
  ret i64 %and
}

Double-check to make sure I didn't mess this up:
https://alive2.llvm.org/ce/z/wRSjPD

So it's logically equivalent either way? At the least, I'll add a code comment.
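The equivalence claimed above can also be spot-checked outside of Alive2. The following is a minimal sketch, assuming `MinAlign()` is the `(A | B) & -(A | B)` computation shown in the IR; `minAlign` and `negationInvariant` are illustrative names. It works because a value and its two's complement negation share the same lowest set bit.

```cpp
#include <cstdint>

// Same computation as the optimized IR above: isolate the lowest set bit of
// (A | B) by AND-ing it with its two's complement negation.
constexpr uint64_t minAlign(uint64_t A, uint64_t B) {
  return (A | B) & (0 - (A | B));
}

// True if negating the offset leaves the computed alignment unchanged.
constexpr bool negationInvariant(uint64_t A, uint64_t Offset) {
  return minAlign(A, Offset) == minAlign(A, 0 - Offset);
}

// Compile-time spot checks for a few alignment/offset pairs.
static_assert(negationInvariant(16, 2), "offset 2");
static_assert(negationInvariant(8, 12), "offset 12");
static_assert(negationInvariant(64, 0), "offset 0");
```

This is a spot check on sample values, not a proof; the Alive2 link in the comment covers the general case.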

lebedev.ri:

SGTM

if (Offset.urem(ScalarSizeInBytes) != 0)

return false;

// If we load MinVecNumElts, will our target element still be loaded?

OffsetEltIndex = Offset.udiv(ScalarSizeInBytes).getZExtValue();

if (OffsetEltIndex >= MinVecNumElts)

return false;

lebedev.ri:

Suggested change:

return false;
- // The offset must be within a vector-length to allow shuffling into place.
- if (!Offset.ult(MinVectorSize / 8))
+ unsigned NumElementsInOffset = Offset.udiv(ScalarSize / 8).getZExtValue();
+ // If we load MinVecNumElts elements, will our target elt still be loaded?
+ if (NumElementsInOffset >= MinVecNumElts)
return false;
// Update alignment with offset value. Note that the offset could be negated

We want to load a single element of X bytes,
by instead loading Y such elements at once.
We've dissected the pointer into a base pointer and a byte offset.
We know that the byte offset is a multiple of element size.
We need to ensure that if we load Y elements from the base pointer,
we still load the element we are after.
I think approaching that check from byte count is highly confusing
(at least I already spent too much time trying to check this).
Let's do a much more obvious, element-count-based check.
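The element-count check described above can be sketched as a standalone function. This is an illustration under stated assumptions, not the patch itself: `offsetToElementIndex` is a hypothetical name, plain integers replace `APInt`, and the 128-bit minimum vector size in the usage note is inferred from the `<8 x i16>` and `<4 x i32>` loads in the tests.

```cpp
#include <cstdint>
#include <optional>

// Returns the index of the wanted element within a vector loaded from the
// base pointer, or std::nullopt if the load+shuffle form cannot be used.
std::optional<unsigned> offsetToElementIndex(int64_t OffsetInBytes,
                                             unsigned ScalarSizeInBits,
                                             unsigned MinVectorSizeInBits) {
  // Only byte-sized scalars are handled in this sketch.
  if (ScalarSizeInBits == 0 || ScalarSizeInBits % 8 != 0)
    return std::nullopt;
  // We shuffle down from a higher element, so the offset must not be negative.
  if (OffsetInBytes < 0)
    return std::nullopt;
  const uint64_t ScalarSizeInBytes = ScalarSizeInBits / 8;
  const uint64_t Offset = static_cast<uint64_t>(OffsetInBytes);
  // The byte offset must be a whole number of elements.
  if (Offset % ScalarSizeInBytes != 0)
    return std::nullopt;
  // If we load MinVecNumElts elements, is the target element among them?
  const uint64_t MinVecNumElts = MinVectorSizeInBits / ScalarSizeInBits;
  const uint64_t EltIndex = Offset / ScalarSizeInBytes;
  if (EltIndex >= MinVecNumElts)
    return std::nullopt;
  return static_cast<unsigned>(EltIndex);
}
```

With a 128-bit vector, an i16 at byte offset 2 maps to element 1 and an i32 at byte offset 12 maps to element 3, matching the shuffle masks in load.ll, while i32 at byte offsets 1 and 13 are rejected.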

if (!isSafeToLoadUnconditionally(SrcPtr, MinVecTy, Align(1), DL, Load, &DT))
return false;

// Update alignment with offset value. Note that the offset could be negated

// to more accurately represent "(new) SrcPtr - Offset = (old) SrcPtr", but

// negation does not change the result of the alignment calculation.

Alignment = commonAlignment(Alignment, Offset.getZExtValue());

}

lebedev.ri:

And now that we already calculate `NumEltsInOffset`,
shall we just use/record `NumEltsInOffset` instead of recalculating it from `OffsetInBits` later?
(It then might use a better name, something like `ElementIndex` maybe.)

spatel (author):

Yes - no need to carry the offset as both an index and a bit value. Keeping both makes it harder to read. Thanks for the thorough review!

// Original pattern: insertelt undef, load [free casts of] PtrOp, 0
// Use the greater of the alignment on the load or its source pointer.
- Align Alignment = std::max(SrcPtr->getPointerAlignment(DL), Load->getAlign());
+ Alignment = std::max(SrcPtr->getPointerAlignment(DL), Alignment);
Type *LoadTy = Load->getType();
int OldCost = TTI.getMemoryOpCost(Instruction::Load, LoadTy, Alignment, AS);
APInt DemandedElts = APInt::getOneBitSet(MinVecNumElts, 0);
OldCost += TTI.getScalarizationOverhead(MinVecTy, DemandedElts,
/* Insert */ true, HasExtract);
// New pattern: load VecPtr
int NewCost = TTI.getMemoryOpCost(Instruction::Load, MinVecTy, Alignment, AS);

// Optionally, we are shuffling the loaded vector element(s) into place.

if (OffsetEltIndex)

NewCost += TTI.getShuffleCost(TTI::SK_PermuteSingleSrc, MinVecTy);

// We can aggressively convert to the vector form because the backend can
// invert this transform if it does not result in a performance win.
if (OldCost < NewCost)
return false;
// It is safe and potentially profitable to load a vector directly:
// inselt undef, load Scalar, 0 --> load VecPtr
IRBuilder<> Builder(Load);
Value *CastedPtr = Builder.CreateBitCast(SrcPtr, MinVecTy->getPointerTo(AS));
Value *VecLd = Builder.CreateAlignedLoad(MinVecTy, CastedPtr, Alignment);
// Set everything but element 0 to undef to prevent poison from propagating
// from the extra loaded memory. This will also optionally shrink/grow the
// vector from the loaded size to the output size.
- // We assume this operation has no cost in codegen.
+ // We assume this operation has no cost in codegen if there was no offset.
// Note that we could use freeze to avoid poison problems, but then we might
// still need a shuffle to change the vector size.
unsigned OutputNumElts = Ty->getNumElements();
SmallVector<int, 16> Mask(OutputNumElts, UndefMaskElem);
- Mask[0] = 0;
+ assert(OffsetEltIndex < MinVecNumElts && "Address offset too big");
+ Mask[0] = OffsetEltIndex;

lebedev.ri:

Assert that the division is exact?

lebedev.ri:

And also assert that `Mask[0] < MinVecNumElts`.

lebedev.ri:

(If we no longer recompute them here, the asserts can obviously go away too.)

VecLd = Builder.CreateShuffleVector(VecLd, Mask);
replaceValue(I, *VecLd);
++NumVecLoad;
return true;
}
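The mask construction at the end of the function can be illustrated in isolation. This is a minimal sketch, assuming `UndefMaskElem` is -1; `buildLane0Mask` and `kUndefMaskElem` are hypothetical names, and `std::vector` stands in for `SmallVector`.

```cpp
#include <cassert>
#include <vector>

// -1 plays the role of UndefMaskElem in this standalone sketch.
constexpr int kUndefMaskElem = -1;

// Build a shuffle mask that moves element OffsetEltIndex of the loaded
// vector into lane 0 and leaves every other output lane undefined.
std::vector<int> buildLane0Mask(unsigned OutputNumElts,
                                unsigned OffsetEltIndex,
                                unsigned MinVecNumElts) {
  assert(OutputNumElts > 0 && "output vector must have at least one element");
  assert(OffsetEltIndex < MinVecNumElts && "Address offset too big");
  std::vector<int> Mask(OutputNumElts, kUndefMaskElem);
  Mask[0] = static_cast<int>(OffsetEltIndex);
  return Mask;
}
```

For an `<8 x i16>` result with the target at element 1, this gives `<1, undef, undef, ...>`, which is the shape of the `shufflevector` masks in the AVX2 checks.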

/// Determine which, if any, of the inputs should be replaced by a shuffle

[... 614 lines not shown ...]

llvm/test/Transforms/VectorCombine/X86/load.ll

; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
- ; RUN: opt < %s -vector-combine -S -mtriple=x86_64-- -mattr=sse2 | FileCheck %s
- ; RUN: opt < %s -vector-combine -S -mtriple=x86_64-- -mattr=avx2 | FileCheck %s
+ ; RUN: opt < %s -vector-combine -S -mtriple=x86_64-- -mattr=sse2 | FileCheck %s --check-prefixes=CHECK,SSE2
+ ; RUN: opt < %s -vector-combine -S -mtriple=x86_64-- -mattr=avx2 | FileCheck %s --check-prefixes=CHECK,AVX2

target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"

define float @matching_fp_scalar(float* align 16 dereferenceable(16) %p) {
; CHECK-LABEL: @matching_fp_scalar(
; CHECK-NEXT: [[R:%.*]] = load float, float* [[P:%.*]], align 16
; CHECK-NEXT: ret float [[R]]
;
[... 252 lines not shown ...]
; CHECK-NEXT: ret <8 x i16> [[R]]
;
%gep = getelementptr inbounds <8 x i16>, <8 x i16>* %p, i64 0, i64 1
%s = load i16, i16* %gep, align 2
%r = insertelement <8 x i16> undef, i16 %s, i64 0
ret <8 x i16> %r
}

- ; Negative test - can't safely load the offset vector, but could load+shuffle.
+ ; Can't safely load the offset vector, but can load+shuffle if it is profitable.

define <8 x i16> @gep01_load_i16_insert_v8i16_deref(<8 x i16>* align 16 dereferenceable(17) %p) {
- ; CHECK-LABEL: @gep01_load_i16_insert_v8i16_deref(
- ; CHECK-NEXT: [[GEP:%.*]] = getelementptr inbounds <8 x i16>, <8 x i16>* [[P:%.*]], i64 0, i64 1
- ; CHECK-NEXT: [[S:%.*]] = load i16, i16* [[GEP]], align 2
- ; CHECK-NEXT: [[R:%.*]] = insertelement <8 x i16> undef, i16 [[S]], i64 0
- ; CHECK-NEXT: ret <8 x i16> [[R]]
+ ; SSE2-LABEL: @gep01_load_i16_insert_v8i16_deref(
+ ; SSE2-NEXT: [[GEP:%.*]] = getelementptr inbounds <8 x i16>, <8 x i16>* [[P:%.*]], i64 0, i64 1
+ ; SSE2-NEXT: [[S:%.*]] = load i16, i16* [[GEP]], align 2
+ ; SSE2-NEXT: [[R:%.*]] = insertelement <8 x i16> undef, i16 [[S]], i64 0
+ ; SSE2-NEXT: ret <8 x i16> [[R]]
+ ;
+ ; AVX2-LABEL: @gep01_load_i16_insert_v8i16_deref(
+ ; AVX2-NEXT: [[TMP1:%.*]] = load <8 x i16>, <8 x i16>* [[P:%.*]], align 16
+ ; AVX2-NEXT: [[R:%.*]] = shufflevector <8 x i16> [[TMP1]], <8 x i16> undef, <8 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>

spatel (author):

Note that the SSE2 cost model is conservatively giving:
{TTI::SK_PermuteSingleSrc, MVT::v8i16, 5}, // 2*pshuflw + 2*pshufhw
                                           // + pshufd/unpck
...so that's why we do not vectorize the tests here.

+ ; AVX2-NEXT: ret <8 x i16> [[R]]
;
%gep = getelementptr inbounds <8 x i16>, <8 x i16>* %p, i64 0, i64 1
%s = load i16, i16* %gep, align 2
%r = insertelement <8 x i16> undef, i16 %s, i64 0
ret <8 x i16> %r
}
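The SSE2/AVX2 difference in the checks above follows from the cost comparison in the pass. Below is a toy sketch of that comparison. Only the shuffle cost of 5 comes from the review comment; the struct `LoadCosts`, the function `shouldVectorize`, and all other cost values used with them are hypothetical placeholders, not numbers from any real cost model.

```cpp
// Hypothetical cost inputs; in the pass these come from TTI queries.
struct LoadCosts {
  int ScalarLoad;     // cost of the original scalar load
  int InsertOverhead; // cost of inserting the scalar into lane 0
  int VectorLoad;     // cost of the replacement vector load
  int Shuffle;        // cost of a single-source permute shuffle
};

// Mirrors the structure of the decision in the diff: the shuffle cost is
// only added when the element has to be moved from a nonzero index, and the
// transform is skipped only if the old form is strictly cheaper.
bool shouldVectorize(const LoadCosts &C, unsigned OffsetEltIndex) {
  const int OldCost = C.ScalarLoad + C.InsertOverhead;
  int NewCost = C.VectorLoad;
  if (OffsetEltIndex != 0)
    NewCost += C.Shuffle;
  return !(OldCost < NewCost);
}
```

With placeholder costs of 1 for the loads and the insert, a shuffle cost of 5 blocks the transform for a nonzero element index, while a shuffle cost of 1 allows it. Ties go to the vector form, in line with the "aggressively convert" comment in the source.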

- ; TODO: Verify that alignment of the new load is not over-specified.
+ ; Verify that alignment of the new load is not over-specified.

define <8 x i16> @gep01_load_i16_insert_v8i16_deref_minalign(<8 x i16>* align 2 dereferenceable(16) %p) {
- ; CHECK-LABEL: @gep01_load_i16_insert_v8i16_deref_minalign(
- ; CHECK-NEXT: [[GEP:%.*]] = getelementptr inbounds <8 x i16>, <8 x i16>* [[P:%.*]], i64 0, i64 1
- ; CHECK-NEXT: [[S:%.*]] = load i16, i16* [[GEP]], align 8
- ; CHECK-NEXT: [[R:%.*]] = insertelement <8 x i16> undef, i16 [[S]], i64 0
- ; CHECK-NEXT: ret <8 x i16> [[R]]
+ ; SSE2-LABEL: @gep01_load_i16_insert_v8i16_deref_minalign(
+ ; SSE2-NEXT: [[GEP:%.*]] = getelementptr inbounds <8 x i16>, <8 x i16>* [[P:%.*]], i64 0, i64 1
+ ; SSE2-NEXT: [[S:%.*]] = load i16, i16* [[GEP]], align 8
+ ; SSE2-NEXT: [[R:%.*]] = insertelement <8 x i16> undef, i16 [[S]], i64 0
+ ; SSE2-NEXT: ret <8 x i16> [[R]]
+ ;
+ ; AVX2-LABEL: @gep01_load_i16_insert_v8i16_deref_minalign(
+ ; AVX2-NEXT: [[TMP1:%.*]] = load <8 x i16>, <8 x i16>* [[P:%.*]], align 2
+ ; AVX2-NEXT: [[R:%.*]] = shufflevector <8 x i16> [[TMP1]], <8 x i16> undef, <8 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
+ ; AVX2-NEXT: ret <8 x i16> [[R]]
;
%gep = getelementptr inbounds <8 x i16>, <8 x i16>* %p, i64 0, i64 1
%s = load i16, i16* %gep, align 8
%r = insertelement <8 x i16> undef, i16 %s, i64 0
ret <8 x i16> %r
}

+ ; Negative test - if we are shuffling a load from the base pointer, the address offset
+ ; must be a multiple of element size.
+ ; TODO: Could bitcast around this limitation.

define <4 x i32> @gep01_bitcast_load_i32_insert_v4i32(<16 x i8>* align 1 dereferenceable(16) %p) {
; CHECK-LABEL: @gep01_bitcast_load_i32_insert_v4i32(
; CHECK-NEXT: [[GEP:%.*]] = getelementptr inbounds <16 x i8>, <16 x i8>* [[P:%.*]], i64 0, i64 1
; CHECK-NEXT: [[B:%.*]] = bitcast i8* [[GEP]] to i32*
; CHECK-NEXT: [[S:%.*]] = load i32, i32* [[B]], align 1
; CHECK-NEXT: [[R:%.*]] = insertelement <4 x i32> undef, i32 [[S]], i64 0
; CHECK-NEXT: ret <4 x i32> [[R]]
;
%gep = getelementptr inbounds <16 x i8>, <16 x i8>* %p, i64 0, i64 1
%b = bitcast i8* %gep to i32*
%s = load i32, i32* %b, align 1
%r = insertelement <4 x i32> undef, i32 %s, i64 0
ret <4 x i32> %r
}

define <4 x i32> @gep012_bitcast_load_i32_insert_v4i32(<16 x i8>* align 1 dereferenceable(20) %p) {
; CHECK-LABEL: @gep012_bitcast_load_i32_insert_v4i32(
- ; CHECK-NEXT: [[GEP:%.*]] = getelementptr inbounds <16 x i8>, <16 x i8>* [[P:%.*]], i64 0, i64 12
- ; CHECK-NEXT: [[B:%.*]] = bitcast i8* [[GEP]] to i32*
- ; CHECK-NEXT: [[S:%.*]] = load i32, i32* [[B]], align 1
- ; CHECK-NEXT: [[R:%.*]] = insertelement <4 x i32> undef, i32 [[S]], i64 0
+ ; CHECK-NEXT: [[TMP1:%.*]] = bitcast <16 x i8>* [[P:%.*]] to <4 x i32>*
+ ; CHECK-NEXT: [[TMP2:%.*]] = load <4 x i32>, <4 x i32>* [[TMP1]], align 1
+ ; CHECK-NEXT: [[R:%.*]] = shufflevector <4 x i32> [[TMP2]], <4 x i32> undef, <4 x i32> <i32 3, i32 undef, i32 undef, i32 undef>
; CHECK-NEXT: ret <4 x i32> [[R]]
;
%gep = getelementptr inbounds <16 x i8>, <16 x i8>* %p, i64 0, i64 12
%b = bitcast i8* %gep to i32*
%s = load i32, i32* %b, align 1
%r = insertelement <4 x i32> undef, i32 %s, i64 0
ret <4 x i32> %r
}

+ ; Negative test - if we are shuffling a load from the base pointer, the address offset
+ ; must be a multiple of element size and the offset must be low enough to fit in the vector
+ ; (bitcasting would not help this case).

define <4 x i32> @gep013_bitcast_load_i32_insert_v4i32(<16 x i8>* align 1 dereferenceable(20) %p) {
; CHECK-LABEL: @gep013_bitcast_load_i32_insert_v4i32(
; CHECK-NEXT: [[GEP:%.*]] = getelementptr inbounds <16 x i8>, <16 x i8>* [[P:%.*]], i64 0, i64 13
; CHECK-NEXT: [[B:%.*]] = bitcast i8* [[GEP]] to i32*
; CHECK-NEXT: [[S:%.*]] = load i32, i32* [[B]], align 1
; CHECK-NEXT: [[R:%.*]] = insertelement <4 x i32> undef, i32 [[S]], i64 0
; CHECK-NEXT: ret <4 x i32> [[R]]
;
[... 263 lines not shown ...]
;
%l = load <1 x i32>, <1 x i32>* %p, align 4
store <1 x i32> %l, <1 x i32>* %store_ptr
%s = extractelement <1 x i32> %l, i32 0
%r = insertelement <8 x i32> undef, i32 %s, i32 0
ret <8 x i32> %r
}

- ; TODO: Can't safely load the offset vector, but can load+shuffle if it is profitable.
+ ; Can't safely load the offset vector, but can load+shuffle if it is profitable.

define <8 x i16> @gep1_load_v2i16_extract_insert_v8i16(<2 x i16>* align 1 dereferenceable(16) %p) {
- ; CHECK-LABEL: @gep1_load_v2i16_extract_insert_v8i16(
- ; CHECK-NEXT: [[GEP:%.*]] = getelementptr inbounds <2 x i16>, <2 x i16>* [[P:%.*]], i64 1
- ; CHECK-NEXT: [[L:%.*]] = load <2 x i16>, <2 x i16>* [[GEP]], align 8
- ; CHECK-NEXT: [[S:%.*]] = extractelement <2 x i16> [[L]], i32 0
- ; CHECK-NEXT: [[R:%.*]] = insertelement <8 x i16> undef, i16 [[S]], i64 0
- ; CHECK-NEXT: ret <8 x i16> [[R]]
+ ; SSE2-LABEL: @gep1_load_v2i16_extract_insert_v8i16(
+ ; SSE2-NEXT: [[GEP:%.*]] = getelementptr inbounds <2 x i16>, <2 x i16>* [[P:%.*]], i64 1
+ ; SSE2-NEXT: [[L:%.*]] = load <2 x i16>, <2 x i16>* [[GEP]], align 8
+ ; SSE2-NEXT: [[S:%.*]] = extractelement <2 x i16> [[L]], i32 0
+ ; SSE2-NEXT: [[R:%.*]] = insertelement <8 x i16> undef, i16 [[S]], i64 0
+ ; SSE2-NEXT: ret <8 x i16> [[R]]
+ ;
+ ; AVX2-LABEL: @gep1_load_v2i16_extract_insert_v8i16(
+ ; AVX2-NEXT: [[TMP1:%.*]] = bitcast <2 x i16>* [[P:%.*]] to <8 x i16>*
+ ; AVX2-NEXT: [[TMP2:%.*]] = load <8 x i16>, <8 x i16>* [[TMP1]], align 4
+ ; AVX2-NEXT: [[R:%.*]] = shufflevector <8 x i16> [[TMP2]], <8 x i16> undef, <8 x i32> <i32 2, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
+ ; AVX2-NEXT: ret <8 x i16> [[R]]
;
%gep = getelementptr inbounds <2 x i16>, <2 x i16>* %p, i64 1
%l = load <2 x i16>, <2 x i16>* %gep, align 8
%s = extractelement <2 x i16> %l, i32 0
%r = insertelement <8 x i16> undef, i16 %s, i64 0
ret <8 x i16> %r
}


Revision Contents

Diff 312783
