This is an archive of the discontinued LLVM Phabricator instance.

Paths

llvm/include/llvm/Analysis/LoopAccessAnalysis.h
llvm/include/llvm/Transforms/Vectorize/SLPVectorizer.h
llvm/lib/Analysis/LoopAccessAnalysis.cpp
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
llvm/test/Transforms/SLPVectorizer/AArch64/loadi8.ll
llvm/test/Transforms/SLPVectorizer/AArch64/memory-runtime-checks-in-loops.ll
llvm/test/Transforms/SLPVectorizer/AArch64/memory-runtime-checks.ll
llvm/test/Transforms/SLPVectorizer/X86/memory-runtime-checks.ll

Differential D102834

[SLP] Implement initial memory versioning.
Accepted · Public

Authored by fhahn on May 20 2021, 2:46 AM.

Details

Reviewers

spatel
RKSimon
dtemirbulatov
anton-afanasyev
ABataev
SjoerdMeijer

Summary

This patch is just an initial sketch to get a discussion going on how to
best support generating runtime checks for may-aliasing memory accesses.

The key question to start with is where to best collect and generate
runtime checks. Currently codegen for a block is eager; for each block,
if we find a vectorizable tree, we vectorize it and then analyze the
rest of the block. This makes it hard to collect *all* runtime checks
for a block before changing the code.

Perhaps for now we need to limit the checks to the first vectorizable
tree in a block? This is what the patch tries to do.

There are a couple of mechanical/technical issues that need to be
addressed, but I think the questions above are key to answer/address
first.

Other than that, the patch does not yet consider the cost of cloning
the block and the runtime checks. It also does not introduce phis
for values used outside the cloned block, so it will generate invalid IR
in those cases for now.
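
As a rough source-level illustration of what versioning a block means, here is a minimal sketch. It is not code from the patch: the names `rangesOverlap` and `addOneToFour` and the exact overlap test are assumptions made for the example. A runtime check guards a copy of the block in which the accesses can be treated as non-aliasing, and the original scalar block stays as the fallback.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if the byte ranges [A, A+N) and [B, B+N) overlap.
static bool rangesOverlap(const std::uint8_t *A, const std::uint8_t *B,
                          std::size_t N) {
  auto AddrA = reinterpret_cast<std::uintptr_t>(A);
  auto AddrB = reinterpret_cast<std::uintptr_t>(B);
  return AddrA < AddrB + N && AddrB < AddrA + N;
}

// Adds 1 to four bytes of Src and writes the results to Dst.
// Returns true if the versioned (no-alias) path was taken.
bool addOneToFour(std::uint8_t *Dst, const std::uint8_t *Src) {
  assert(Dst && Src);
  if (!rangesOverlap(Dst, Src, 4)) {
    // Versioned block: the check proved the ranges are disjoint, so all
    // loads may be moved above all stores (one wide load, one wide store).
    std::uint8_t Tmp[4];
    for (int I = 0; I < 4; ++I)
      Tmp[I] = Src[I];
    for (int I = 0; I < 4; ++I)
      Dst[I] = static_cast<std::uint8_t>(Tmp[I] + 1);
    return true;
  }
  // Original scalar block: keeps the interleaved load/store order.
  for (int I = 0; I < 4; ++I)
    Dst[I] = static_cast<std::uint8_t>(Src[I] + 1);
  return false;
}
```

With overlapping arguments the scalar path has to run, because each store can feed a later load; that dependence is what prevents the vectorizer from merging the accesses without a check.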

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

fhahn created this revision. May 20 2021, 2:46 AM

Herald added a subscriber: hiraditya. May 20 2021, 2:46 AM

fhahn requested review of this revision. May 20 2021, 2:46 AM

Herald added a project: Restricted Project. May 20 2021, 2:46 AM

Reviewers, this initial version is mostly intended as a first sketch to discuss how to best decide for which bundles/trees to collect checks and where to generate them. Any high-level thoughts/comments would be greatly appreciated.

ChuanqiXu added a subscriber: ChuanqiXu. May 20 2021, 2:50 AM

fhahn mentioned this in D102748: [LoopUnroll] Don't unroll before vectorisation. May 20 2021, 2:52 AM

Harbormaster completed remote builds in B105386: Diff 346674. May 20 2021, 4:27 AM

How about emitting proper aliasing metadata in the generated basicblocks? It should solve many problems with the SLP without extra changes and may help some other passes too. I mean, memaccesses in entry.slpversioned: should be noaliased and in entry.scalar: maybe aliased.

In D102834#2771002, @ABataev wrote:

How about emitting proper aliasing metadata in the generated basicblocks? It should solve many problems with the SLP without extra changes and may help some other passes too. I mean, memaccesses in entry.slpversioned: should be noaliased and in entry.scalar: maybe aliased.

That sounds like a good idea! The question is how to best decide for which accesses to generate RT checks. At the moment, I think that's only possible for the first vectorizable tree per BB we generate code for. Is it easy to get the first and last instruction in a vectorizable tree? Then we could just clone this range.

Yes, interesting. LoopVersioningLICM came to my mind, except that works on loops of course...

In D102834#2771013, @fhahn wrote:

...

That sounds like a good idea! The question is how to best decide for which accesses to generate RT checks. At the moment, I think that's only possible for the first vectorizable tree per BB we generate code for. Is it easy to get the first and last instruction in a vectorizable tree? Then we could just clone this range.

You can easily get the first instruction - VectorizableTree.front(). Stores can only be at the front of the vectorizable tree. As to the last instruction, I think we need to scan all tree entries and find all gather nodes with memory accesses which may alias with stores. The main problem is that versioning is better performed before we try to build the tree; otherwise we may decide to gather some instructions instead of trying to vectorize them because of possible aliasing. And it may not be profitable.

Matt added a subscriber: Matt. May 20 2021, 8:08 AM

In D102834#2771053, @ABataev wrote:

...

The main problem is that versioning is better performed before we try to build the tree; otherwise we may decide to gather some instructions instead of trying to vectorize them because of possible aliasing. And it may not be profitable.

This is the main challenge I think. We could split the block at the first tree that requires runtime checks, but only generate runtime checks for all trees once we have processed all trees for the block. If runtime checks are not profitable overall, we can remove the cloned block and re-process the block without versioning. May be a little expensive though.

Hi Florian, thanks for putting this up for review and starting the discussion. If you don't mind me asking, how high is this on your todo-list? The reason I am asking is that this will help x264, where we are behind a lot (and more generally it solves this long-outstanding issue that we have). Fixing x264 is high on our list, which is why I put up D102748 to start this discussion. If, for example, you don't see time to work on this, we could consider looking at it.

In D102834#2771053, @ABataev wrote:

You can easily get the first instruction - VectorizableTree.front(). Stores can only be at the front of the vectorizable tree. As to the last instruction, I think we need to scan all tree entries and find all gather nodes with memory accesses which may alias with stores. The main problem is that versioning is better performed before we try to build the tree; otherwise we may decide to gather some instructions instead of trying to vectorize them because of possible aliasing. And it may not be profitable.

I updated the patch to only collect possible bounds for versioning first and queue basic blocks for which we found bounds for re-processing. When re-processing those blocks, we create a versioned block with the appropriate !noalias metadata and re-run vectorization (for now just seeded by stores). As is, this should be correct, but may not be optimal, because either:
a) there were issues preventing vectorization other than aliasing, which may cause us to not vectorize anything in the versioned block
b) the runtime checks are too expensive and offset the gain from additional vectorization.

We should be able to solve both by comparing the cost of the versioned & non-versioned BB after vectorization. If versioning is not profitable, we can remove the conditional branch again. What do you think of this direction?
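
A minimal sketch of that comparison, assuming abstract cost units and a hypothetical `BlockCosts` summary (neither the struct nor `keepVersionedBlock` is taken from the patch; the patch may weigh the costs differently):

```cpp
#include <cassert>

// Hypothetical cost summary for one basic block, in abstract cost units.
struct BlockCosts {
  int Scalar;        // cost of the original, non-versioned block
  int Vectorized;    // cost of the versioned block after re-running SLP
  int RuntimeChecks; // cost of the overlap checks guarding the versioned block
};

// Keep the versioned block only if it is cheaper than the original block
// once the runtime checks are included; otherwise the branch is removed.
bool keepVersionedBlock(const BlockCosts &C) {
  assert(C.Scalar >= 0 && C.Vectorized >= 0 && C.RuntimeChecks >= 0);
  return C.Vectorized + C.RuntimeChecks < C.Scalar;
}
```

Case a) shows up as `Vectorized` being no lower than `Scalar`, and case b) as `RuntimeChecks` eating up the difference; both make the function return false.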

In D102834#2793483, @SjoerdMeijer wrote:

Hi Florian, thanks for putting this up for review and starting the discussion. If you don't mind me asking, how high is this on your todo-list? The reason I am asking is that this will help x264, where we are behind a lot (and more generally it solves this long-outstanding issue that we have). Fixing x264 is high on our list, which is why I put up D102748 to start this discussion. If, for example, you don't see time to work on this, we could consider looking at it.

It's fairly high on my todo list, as we have quite a number of such cases reported by our internal users. But I'm not planning to rush and I think we should also start collecting micro-benchmarks for the missed opportunities, so we can thoroughly evaluate the impact and do not introduce regressions in the future.

Harbormaster completed remote builds in B107962: Diff 350256. Jun 7 2021, 5:36 AM

In D102834#2802402, @fhahn wrote:

...

I updated the patch to only collect possible bounds for versioning first and queue basic blocks for which we found bounds for re-processing. When re-processing those blocks, we create a versioned block with the appropriate !noalias metadata and re-run vectorization (for now just seeded by stores). As is, this should be correct, but may not be optimal, because either:
a) there were issues preventing vectorization other than aliasing, which may cause us to not vectorize anything in the versioned block
b) the runtime checks are too expensive and offset the gain from additional vectorization.

We should be able to solve both by comparing the cost of the versioned & non-versioned BB after vectorization. If versioning is not profitable, we can remove the conditional branch again. What do you think of this direction?

Yeah, looks like this is the only way for now. Should be much easier to do with VPlan. Also, I would look at the instruction scheduling. Most probably, we need to schedule the instructions more aggressively to improve locality and avoid regressions.

In D102834#2802402, @fhahn wrote:

...

It's fairly high on my todo list, as we have quite a number of such cases reported by our internal users. But I'm not planning to rush and I think we should also start collecting micro-benchmarks for the missed opportunities, so we can thoroughly evaluate the impact and do not introduce regressions in the future.

Nice one, excellent. What are your ideas on the micro-benchmarks? Is that something you'll be using internally, or were you e.g. thinking about adding things to the test suite. Just asking because I would be happy to contribute to that.

In D102834#2802512, @SjoerdMeijer wrote:

...

Nice one, excellent. What are your ideas on the micro-benchmarks? Is that something you'll be using internally, or were you e.g. thinking about adding things to the test suite. Just asking because I would be happy to contribute to that.

I was thinking about adding something like D101844, just with SLP examples.

Fix crash caused by not updating the DT's DFSNumbering and a case where we were checking for aliasing between pointers of the same underlying object.

Harbormaster completed remote builds in B108006: Diff 350319. Jun 7 2021, 9:00 AM

Update to remove deleted instructions early, so we do not pick up instructions to be deleted during SCEV expansion; also flip branch targets.

Harbormaster completed remote builds in B108261: Diff 350679. Jun 8 2021, 11:46 AM

fhahn mentioned this in D104126: [MicroBenchmarks] Add initial SLP vectorization benchmarks. Jun 11 2021, 9:03 AM

Finished off the main functionality. The update includes:

a) computing & comparing the cost of the versioned block to the original block,
b) more tests,
c) preserving LI & DT if changes to the CFG are made.

We now also undo the CFG changes, if the versioning is not profitable. Currently for MultiSource, SPEC2000 & SPEC2006 with -O3 on X86/AVX2+AVX512, versioning is tried and deemed beneficial 123 times and deemed unprofitable 224 times. Hopefully we can reduce the failure rate in the future.

Note that the numbers include 2 improvements to SCEV to catch additional cases which I'll share soon. There's also some additional refactoring that needs doing.

I also put up D104126 to add a first set of micro-benchmarks that require SLP versioning. Those are some of my motivating cases and fairly simple, so any additions on that front would also be greatly appreciated.

Harbormaster completed remote builds in B108850: Diff 351502. Jun 11 2021, 11:03 AM

I would add an option to control whether to allow it or not.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1017	What about `EXPENSIVE_CHECKS` and `verifyFunction`?

Adjust preserved passes in the legacy PM, update pipeline tests.

The SCEV improvements to increase the number of memory ops vectorized are available now as D104319 & D104634. I'll start with some more cleanup/refactoring and adding the feature behind a flag as suggested soon.

Herald added subscribers: ormris, nikic. Jun 21 2021, 11:34 AM

Harbormaster completed remote builds in B110252: Diff 353436. Jun 21 2021, 11:35 AM

ormris removed a subscriber: ormris. Jun 23 2021, 9:44 AM

fhahn mentioned this in D104634: [SCEV] Generalize MatchBinaryAddToConst to support non-add expressions. Jun 24 2021, 2:00 AM

fhahn mentioned this in D105481: [LAA] Remove RuntimeCheckingPtrGroup::RtCheck member (NFC). Jul 6 2021, 7:11 AM

fhahn added a parent revision: D105481: [LAA] Remove RuntimeCheckingPtrGroup::RtCheck member (NFC). Jul 6 2021, 7:12 AM

Rebased code on top of D105481, which allows re-using the existing code to collect and generate runtime check code. Also added a flag to enable memory versioning (off by default for now).

fhahn retitled this revision from [SLPVectorizer] WIP Implement initial memory versioning (WIP!) to [SLPVectorizer] Implement initial memory versioning. Jul 6 2021, 7:13 AM

Harbormaster completed remote builds in B112612: Diff 356712. Jul 6 2021, 7:14 AM

ABataev added inline comments. Jul 6 2021, 8:08 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
792–798	Make it a static function?
815	Why do we need to insert something similar twice, here and below?
10543	Better to turn `300` into an option value.
10547	I think this can be simplified. This may consume a lot of time. Better to implement some kind of a simple traversal here and check if there are memaccesses in the operands, check if they may alias, and gather these aliases.
llvm/test/Transforms/SLPVectorizer/AArch64/loadi8.ll
150–152	What's changed here? Just code growth?

SjoerdMeijer added inline comments. Jul 6 2021, 8:25 AM

llvm/test/Transforms/SLPVectorizer/AArch64/loadi8.ll
90	This comment is out of date now.
150–152	Was wondering the same. To me it looks like we are emitting the runtime checks, but still not vectorising, which we probably want to avoid.

fhahn added inline comments. Jul 8 2021, 7:52 AM

llvm/test/Transforms/SLPVectorizer/AArch64/loadi8.ll
150–152	This is what happens if versioning was attempted, but did not enable SLP vectorisation for the block. Probably a case where some memory operations were excluded for some reason. I'll investigate. The versioned block has been deleted; the dead runtime checks still remain, but those should be cleaned up by some later pass.

Just a bit of a heads-up that I took this patch (and its dependencies) and ran some numbers for x264 SPEC, where I have seen quite a few missed opportunities caused by the inability to emit runtime alias checks. I don't see any change in performance though, which is slightly unexpected (I was hoping for some already). But it might be that the next thing is blocking SLP vectorisation for AArch64, which is what I am looking at, and that might be cost-model issues. This is the case for the 2 examples I am currently looking at, but I will do some more analysis. And we of course need this patch as an enabler.

In D102834#2866603, @SjoerdMeijer wrote:

Just a bit of a heads-up that I took this patch (and its dependencies) and ran some numbers for x264 SPEC, where I have seen quite a few missed opportunities caused by the inability to emit runtime alias checks. I don't see any change in performance though, which is slightly unexpected (I was hoping for some already). But it might be that the next thing is blocking SLP vectorisation for AArch64, which is what I am looking at, and that might be cost-model issues. This is the case for the 2 examples I am currently looking at, but I will do some more analysis. And we of course need this patch as an enabler.

I also tried taking this patch and running the numbers for x264 SPEC. I'm also not seeing an improvement in performance. What loops are you specifically looking at? I see that the loop you referenced in https://reviews.llvm.org/D102748 is still not vectorized with this patch.

fhahn mentioned this in rG6d753b0751b1: [LAA] Remove RuntimeCheckingPtrGroup::RtCheck member (NFC). Jul 26 2021, 9:40 AM

In D102834#2866603, @SjoerdMeijer wrote:

Just a bit of a heads-up that I took this patch (and its dependencies) and ran some numbers for x264 SPEC, where I have seen quite a few missed opportunities caused by the inability to emit runtime alias checks. I don't see any change in performance though, which is slightly unexpected (I was hoping for some already). But it might be that the next thing is blocking SLP vectorisation for AArch64, which is what I am looking at, and that might be cost-model issues. This is the case for the 2 examples I am currently looking at, but I will do some more analysis. And we of course need this patch as an enabler.

The case from x264 is in the function @f_alias in ../AArch64/loadi8.ll, right?

With versioning, the SLP vectorizer generates the following vector block:

entry.slpversioned:                               ; preds = %entry.slpmemcheck
  %scale = getelementptr inbounds %struct.weight_t, %struct.weight_t* %w, i64 0, i32 0
  %0 = load i32, i32* %scale, align 16
  %offset = getelementptr inbounds %struct.weight_t, %struct.weight_t* %w, i64 0, i32 1
  %1 = load i32, i32* %offset, align 4
  %arrayidx.1 = getelementptr inbounds i8, i8* %src, i64 1
  %arrayidx2.1 = getelementptr inbounds i8, i8* %dst, i64 1
  %arrayidx.2 = getelementptr inbounds i8, i8* %src, i64 2
  %arrayidx2.2 = getelementptr inbounds i8, i8* %dst, i64 2
  %arrayidx.3 = getelementptr inbounds i8, i8* %src, i64 3
  %2 = bitcast i8* %src to <4 x i8>*
  %3 = load <4 x i8>, <4 x i8>* %2, align 1, !alias.scope !0, !noalias !3
  %4 = zext <4 x i8> %3 to <4 x i32>
  %5 = insertelement <4 x i32> poison, i32 %0, i32 0
  %6 = insertelement <4 x i32> %5, i32 %0, i32 1
  %7 = insertelement <4 x i32> %6, i32 %0, i32 2
  %8 = insertelement <4 x i32> %7, i32 %0, i32 3
  %9 = mul nsw <4 x i32> %8, %4
  %10 = insertelement <4 x i32> poison, i32 %1, i32 0
  %11 = insertelement <4 x i32> %10, i32 %1, i32 1
  %12 = insertelement <4 x i32> %11, i32 %1, i32 2
  %13 = insertelement <4 x i32> %12, i32 %1, i32 3
  %14 = add nsw <4 x i32> %9, %13
  %15 = icmp ult <4 x i32> %14, <i32 256, i32 256, i32 256, i32 256>
  %16 = icmp sgt <4 x i32> %14, zeroinitializer
  %17 = sext <4 x i1> %16 to <4 x i32>
  %18 = select <4 x i1> %15, <4 x i32> %14, <4 x i32> %17
  %19 = trunc <4 x i32> %18 to <4 x i8>
  %arrayidx2.3 = getelementptr inbounds i8, i8* %dst, i64 3
  %20 = bitcast i8* %dst to <4 x i8>*
  store <4 x i8> %19, <4 x i8>* %20, align 1, !alias.scope !3, !noalias !0
  br label %entry.merge

The problem there is that currently the cost of the vector block is compared to the cost of the original scalar block. In the case at hand the vectorized IR is not optimal and the cost gets over-estimated, causing the versioned block to be dropped as unprofitable.

We can tackle that in different ways: a) have the SLP vectorizer generate more optimal vector IR (e.g. a shuffle instead of a chain of inserts) or b) post-process the IR a bit before computing the cost. Not sure how much work a) would be. @ABataev any ideas?
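
For readers less used to IR, the per-lane computation in the entry.slpversioned block shown earlier corresponds roughly to the scalar code below. This is a reading of that IR, not the x264 source: the names `clipToU8` and `weightFour` are made up for the sketch, and `Scale` and `Offset` are passed as parameters here rather than loaded from the %w struct.

```cpp
#include <cassert>
#include <cstdint>

// Mirrors the icmp ult / icmp sgt / sext / select / trunc sequence:
// values in [0, 255] pass through, larger values become 255 and
// negative values become 0.
static std::uint8_t clipToU8(std::int32_t X) {
  if (static_cast<std::uint32_t>(X) < 256u)
    return static_cast<std::uint8_t>(X);
  return X > 0 ? 255 : 0;
}

// Four lanes of: zext, mul by Scale, add Offset, clip, trunc, store.
void weightFour(std::uint8_t *Dst, const std::uint8_t *Src,
                std::int32_t Scale, std::int32_t Offset) {
  assert(Dst && Src);
  for (int I = 0; I < 4; ++I)
    Dst[I] = clipToU8(Scale * Src[I] + Offset);
}
```

The two insertelement chains in the IR are the splats of `Scale` and `Offset` across the four lanes, which is the part option a) would turn into a single insert plus shuffle.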

In D102834#2914099, @fhahn wrote:
...

We can tackle that in different ways: a) have the SLP vectorizer generate more optimal vector IR (e.g. a shuffle instead of a chain of inserts) or b) post-process the IR a bit before computing the cost. Not sure how much work a) would be. @ABataev any ideas?

We can easily generate shuffles for splats/reused scalars, will implement this ASAP.

fhahn mentioned this in rT9eda02822fb3: [MicroBenchmarks] Add initial SLP vectorization benchmarks. Jul 30 2021, 2:02 AM

fhahn added a parent revision: D107104: [SLP]Improve splats vectorization. Jul 30 2021, 7:20 AM

In D102834#2914099, @fhahn wrote:

...

The case from x264 is in the function @f_alias in ../AArch64/loadi8.ll, right?

Sorry for the late reply, but just wanted to confirm this: yes, that @f_alias in ../AArch64/loadi8.ll is a reproducer from x264.

junparser added a subscriber: junparser. Aug 2 2021, 1:04 AM

Some major restructuring of the code for the basic block handling and removed some restrictions. The code now keeps track of the first and last instruction that uses a tracked object and only duplicates the instructions in between. The blocks are split in a way that allows us to roll back the changes to the original CFG. The runtime checks are now only generated if profitable. This has the drawback that we have to estimate the cost a bit differently. But it has the advantage that we only need to change the function if we vectorize.
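
A minimal sketch of the range tracking described above, using a toy instruction list instead of LLVM's `BasicBlock` (the `ToyInst` and `ToyRange` types and the function name are hypothetical, not taken from the patch):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for an instruction: only records whether it accesses one of
// the objects tracked for runtime checks.
struct ToyInst {
  bool UsesTrackedObject;
};

// Inclusive index range of the instructions that would be duplicated.
struct ToyRange {
  bool Valid;
  std::size_t First;
  std::size_t Last;
};

// Finds the first and last instruction using a tracked object; only the
// instructions in between would be cloned into the versioned block.
ToyRange rangeToDuplicate(const std::vector<ToyInst> &Block) {
  ToyRange R = {false, 0, 0};
  for (std::size_t I = 0; I < Block.size(); ++I) {
    if (!Block[I].UsesTrackedObject)
      continue;
    if (!R.Valid) {
      R.Valid = true;
      R.First = I;
    }
    R.Last = I;
  }
  assert(!R.Valid || R.First <= R.Last);
  return R;
}
```

Instructions before `First` and after `Last` stay in the unsplit parts of the block, which keeps the cloned region, and therefore the cost of a rollback, small.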

The compile-time impact looks reasonable, worst geomean change is +0.09 for NewPM-ReleaseLTO-g http://llvm-compile-time-tracker.com/compare.php?from=7c61447052351b722c4d5aa4254cca9054a0be79&to=831b8bc299ef904a9980e59f59eab936d93bc9e1&stat=instructions .

We probably also can add additional heuristics to avoid attempting versioning in a few more cases.

The patch still needs additional comments, but I think the overall structure should be a good start now.

In D102834#2919138, @SjoerdMeijer wrote:

Sorry for the late reply, but just wanted to confirm this: yes, that @f_alias in ../AArch64/loadi8.ll is a reproducer from x264.

Great thanks. The code in the test should be transformed now. If you point me to the C code, I can check if it is transformed as expected now.

Harbormaster completed remote builds in B119123: Diff 365819. Aug 11 2021, 12:38 PM

In D102834#2940056, @fhahn wrote:

...

Great thanks. The code in the test should be transformed now. If you point me to the C code, I can check if it is transformed as expected now.

I believe that was function mc_weight_w20().

And like I mentioned earlier, I expect this change to make quite some difference for x264. For example, I hope it will trigger on quant_4x4() too, although additional tricks may be required for successful SLP vectorisation of that example (cost-model and/or other things).

Rebased and improved runtime check generation: 1) do not generate checks between 2 read-only groups and 2) skip overflow checks because we know that the last element is dereferenced (Alive2 agrees but I will double check if that is intentional).
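
Point 1) can be sketched as follows, with a toy `ToyPtrGroup` type standing in for the pointer groups (hypothetical names; this is not the LoopAccessAnalysis API): two groups that are only read can never conflict, so only pairs with at least one written group need an overlap check.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Toy stand-in for a group of pointers sharing one runtime bound.
struct ToyPtrGroup {
  bool IsWritten; // true if any access through the group is a store
};

// Returns the index pairs of groups that need a runtime overlap check.
std::vector<std::pair<std::size_t, std::size_t>>
groupPairsToCheck(const std::vector<ToyPtrGroup> &Groups) {
  std::vector<std::pair<std::size_t, std::size_t>> Pairs;
  for (std::size_t I = 0; I < Groups.size(); ++I)
    for (std::size_t J = I + 1; J < Groups.size(); ++J) {
      // Two read-only groups cannot conflict, so no check is needed.
      if (!Groups[I].IsWritten && !Groups[J].IsWritten)
        continue;
      Pairs.emplace_back(I, J);
    }
  assert(Pairs.size() <= Groups.size() * Groups.size());
  return Pairs;
}
```

For two read-only groups and one written group this yields two checks instead of three.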

In D102834#2941662, @SjoerdMeijer wrote:

...

I believe that was function mc_weight_w20().

Hm interesting. The latest version should vectorize mc_weight_w20 on AArch64. But there's no measurable speedup from that unfortunately on the hardware I have access to. I also cannot measure any speedup if I add restrict to the mc_weight_w20 arguments, which should cause SLP vectorization without runtime checks. Is it possible I am missing something?

And like I mentioned earlier, I expect this change to make quite some difference for x264. For example, I hope it will trigger on quant_4x4() too, although additional tricks may be required for successful SLP vectorisation of that example (cost-model and/or other things).

I am seeing 5-10% speedups when vectorizing quant_4x4. After runtime check generation, there's still another issue though (conditional load that's not hoisted out).

Harbormaster completed remote builds in B120195: Diff 367309. Aug 18 2021, 2:24 PM

mnadeem added a subscriber: mnadeem. Aug 18 2021, 5:54 PM

Anything else to be done? I see some commented code lines.

I am seeing 5-10% speedups when vectorizing quant_4x4. After runtime check generation, there's still another issue though (conditional load that's not hoisted out).

Hi @fhahn, I tried cherry-picking this patch and I'm not seeing quant_4x4 getting vectorized. Maybe I'm missing something here. The commandline invocation I'm using is
clang --target=aarch64-linux-gnu -mllvm -slp-memory-versioning -mllvm -debug-only=SLP -O3 -c quant_4x4.c -S -o quant_4x4.S

Rebased, added back verifyFunction call, removed commented-out code. Also precommitted a bunch of tests which exposed crashes in earlier versions of the patch.

In D102834#2967556, @steplong wrote:

I am seeing 5-10% speedups when vectorizing quant_4x4. After runtime check generation, there's still another issue though (conditional load that's not hoisted out).

Hi @fhahn, I tried cherry-picking this patch and I'm not seeing quant_4x4 getting vectorized. Maybe I'm missing something here. The commandline invocation I'm using is
clang --target=aarch64-linux-gnu -mllvm -slp-memory-versioning -mllvm -debug-only=SLP -O3 -c quant_4x4.c -S -o quant_4x4.S

I think I measured this by modifying the source (hoisting the loads) and running the loop vectorizer. Even with runtime checks, we need to improve if-conversion for SLP as mentioned in https://lists.llvm.org/pipermail/llvm-dev/2021-September/152740.html. Also, the current patch does not support vectorizing reduction values that are returned, but that should be a relatively easy extension.

Harbormaster completed remote builds in B124230: Diff 372992. Sep 16 2021, 11:23 AM

a.elovikov added a subscriber: a.elovikov. Sep 17 2021, 11:01 AM

dnsampaio added a subscriber: dnsampaio. Sep 27 2021, 3:13 PM

@ABataev any suggestions on how to move forward with the patch?

In D102834#3029943, @fhahn wrote:

@ABataev any suggestions on how to move forward with the patch?

I am not @ABataev :-) but I see you're defaulting slp-memory-versioning to false and was wondering about this. That looks like a safe strategy. But why is that exactly, as opposed to e.g. just enabling this by default? Is that because it's easier to test things before flipping the switch? How/when do you plan to flip this switch?

Reverse ping. :-)

Any plans to pick this up, or is there a way I could help with this? I have taken this patch, run some tests and collected numbers, and can confirm the previous observations, so that's good. I could e.g. do an in-depth code review this week, which is something I haven't done yet, if someone else doesn't get there first...

In D102834#3029951, @SjoerdMeijer wrote:

In D102834#3029943, @fhahn wrote:

@ABataev any suggestions on how to move forward with the patch?

I am not @ABataev :-) but I see you're defaulting slp-memory-versioning to false and was wondering about this. That looks like a safe strategy. But why is that exactly, as opposed to e.g. just enabling this by default? Is that because it's easier to test things before flipping the switch? How/when do you plan to flip this switch?

My only concern was compile-time impact. Overall the impact does not look too bad, and the cases where compile-time increases seem to correspond to code changes (= additional vectorization).

http://llvm-compile-time-tracker.com/compare.php?from=da7e033ee020a1708b98f3e4f159eef904a42600&to=cf03006253252dbe8a209bcb6d513f85bb5c036a&stat=instructions

In D102834#3069620, @SjoerdMeijer wrote:

Reverse ping. :-)

Any plans to pick this up, or is there a way I could help with this? I have taken this patch, run some tests and collected numbers, and can confirm the previous observations, so that's good. I could e.g. do an in-depth code review this week, which is something I haven't done yet, if someone else doesn't get there first...

There were 2 more crashes I uncovered during my testing which should now be addressed. I also landed the outstanding tests, so it should be easier to apply the patch.

Harbormaster completed remote builds in B129996: Diff 381334.Oct 21 2021, 11:11 AM

I have now read the code for the first time, and here's a first round of comments.
I definitely need to read this again, which is what I will do soon...

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
190	This is a nitpick, bikeshedding names... but I was wondering about the terminology here (i.e. memory versioning). We emit runtime memory checks, and do versioning of a block. So was wondering if this should be BlockVersioning.
9488	Remove this?
10110–10463	Somewhere a high-level idea and description of the algorithm would be nice, I guess that would be here.
10117	Typo? delete -> deleted.
10137	Unused?
10139	That wasn't obvious to me, why is Start dereferenced?
10140	Do we want to link to alive (not sure how persistent the data is and if it's common), or just expand the example here?
10167	Could `std::next` return null (or end)?
10205	Don't think I have seen a block name with the .tail suffix in the tests.
10241	typo: versiond -> versioned.
10244	I am not that familiar with this, but do we expect this in one of the tests?
10329	Perhaps create a local helper function to handle this case.
10330	Re: "or possible", I would guess it's possible but not profitable?
10357	And another helper for this case too.

Relaying a message from @dmgreen about the emitted runtime checks: could they tank performance in some cases? For example, in (nested) loops? Or are they trivially hoisted if they are emitted in a loop nest? Do we have any idea or data about this?

Hi @fhahn, I wanted to check if this is still a priority for you? It is for us, this will enable a lot of things in x264, so am keen to progress this one way or another. We could continue this work, although it looks like we are nearly there...

ABataev added inline comments.Dec 1 2021, 5:08 AM

llvm/include/llvm/Transforms/Vectorize/SLPVectorizer.h
58–59	Convert this to enum? + add comments
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
797	Comments? Also, put it to a namespace
798–802	Default initializers and defaulted constructor?
811	`isValidElementType`?
815	`isValidElementType`?
10251–10257	Move out of the loop. Also, similar lambda is used in another place, worth creating a static function instead of lambdas.
10328	Do you consider the cost of all new check instructions as `1` here?

In D102834#3163803, @SjoerdMeijer wrote:

Hi @fhahn, I wanted to check if this is still a priority for you? It is for us, this will enable a lot of things in x264, so am keen to progress this one way or another. We could continue this work, although it looks like we are nearly there...

Unfortunately I missed that I didn't update the patch after the latest comments! I'll update it today.

Thanks!

Address latest sets of comments, thanks! I hope I didn't miss anything.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
190	My intention with the naming here was to be clear about why/what we version; the versioning is based on the memory accesses, we will create one version with the original memory access and a second version where memory accesses to different underlying objects won't alias. I'm not sure about tying this directly to the scope we support at the moment (single block). We might want to extend the scope in the future. For example, consider a loop which contains a block to version, where each access also depends on an induction variable. We might want to try to push the checks outside the loop, by widening them to encompass the whole iteration space, so we only need to check once.
797	put into an anonymous namespace and added comments
798–802	removed the `AccessInfo()` constructor, added default values instead
811	updated, thanks
9488	it's gone now
10110–10463	I added an outline at the top of the function.
10137	I think this is out of date, AS and WrittenObjs should be used.
10139	We get the `Start` expression from the pointer operand of either a store or a load, so it should always be dereferenced by those instructions.
10140	I'll remove it.
10167	I don't think so; both `LastTracked` and `FirstTracked` should point to loads or stores, so there should always be a next instruction (the terminator). If they would be 0, there shouldn't be any bounds (MemBounds should be empty).
10205	This split is mainly there to make it easier to split off only the region from the first tracked to the last tracked instruction. It is later folded back into the merge block.
10244	The domain? It is in the metadata nodes created, but the tests only check the `!noalias` attachments, not the definitions of the metadata at the moment.
10251–10257	updated to use `getLoadStorePointerOperand` from `Instructions.h`.
10328	Yes, it assumes each instruction costs `1`.
10329	I moved it to a function.
10330	I guess saying not beneficial is enough. When we reach here, it is always possible. Originally the `possible` bit was intended to refer to the case where we version but nothing gets vectorized because SLP fails due to other reasons. That's not really clear though, I'll remove it.
10357	I think moving this to a separate function would require a lot of arguments and the code here is quite compact.
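
Editorial illustration of the reply on line 190 above: a minimal source-level sketch of what "versioning based on the memory accesses" means. It is not taken from the patch. The names `rangesDisjoint` and `addOne4` are hypothetical, and the real transform works on LLVM IR blocks with SCEV-based bounds rather than on C++ source. The point is only that a runtime check selects between a copy where accesses to different underlying objects are known not to alias and the original code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical runtime check: true if [a, a+n) and [b, b+n) do not overlap.
// Assumes the address arithmetic does not wrap around.
static bool rangesDisjoint(const std::int32_t *a, const std::int32_t *b,
                           std::size_t n) {
  assert(a != nullptr && b != nullptr);
  auto loA = reinterpret_cast<std::uintptr_t>(a);
  auto loB = reinterpret_cast<std::uintptr_t>(b);
  std::uintptr_t bytes = n * sizeof(std::int32_t);
  return loA + bytes <= loB || loB + bytes <= loA;
}

// Computes dst[i] = src[i] + 1 for i in 0..3 with the semantics of the
// original in-order scalar code, even if dst and src overlap.
void addOne4(std::int32_t *dst, const std::int32_t *src) {
  if (rangesDisjoint(dst, src, 4)) {
    // "No-alias" version: all loads can be moved before all stores, which is
    // the shape a vectorizer needs (one wide load, one add, one wide store).
    std::int32_t v0 = src[0], v1 = src[1], v2 = src[2], v3 = src[3];
    dst[0] = v0 + 1;
    dst[1] = v1 + 1;
    dst[2] = v2 + 1;
    dst[3] = v3 + 1;
  } else {
    // Fallback version: the original interleaved loads and stores.
    dst[0] = src[0] + 1;
    dst[1] = src[1] + 1;
    dst[2] = src[2] + 1;
    dst[3] = src[3] + 1;
  }
}
```

With overlapping arguments such as `addOne4(buf + 1, buf)`, the check fails and the fallback preserves the store-to-load dependence between consecutive statements. With disjoint arrays the first version runs.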

Harbormaster completed remote builds in B137655: Diff 392057.Dec 6 2021, 7:27 AM

In D102834#3173543, @fhahn wrote:

Address latest sets of comments, thanks! I hope I didn't miss anything.

Thanks for that.

Not directly related to code changes, but more general question: could we see negative performance impact of these runtime checks? For example, if they are emitted in loops? Or are they trivially hoisted out of loops? Any thoughts on these kind of things?

In D102834#3173674, @SjoerdMeijer wrote:

In D102834#3173543, @fhahn wrote:

Address latest sets of comments, thanks! I hope I didn't miss anything.

Thanks for that.

Not directly related to code changes, but more general question: could we see negative performance impact of these runtime checks? For example, if they are emitted in loops? Or are they trivially hoisted out of loops? Any thoughts on these kind of things?

Yes, there could be negative performance impact, for at least the following reasons: 1) cost-model bugs overestimating the vector savings and 2) the vector path never being taken because the runtime checks fail. Cases of 1) will be rather easy to fix, while there won't be too much we can do about 2) unfortunately, unless we have profile info. Note that we have the same issue with LV.
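
Editorial illustration of point 2): a small sketch of why a check that always fails is a pure loss. The function name and all cost parameters are hypothetical abstract units, not values computed by LLVM. The model simply assumes the check is always paid and that one of the two versions then runs.

```cpp
#include <cassert>

// Expected cost of one execution of a versioned block, assuming the runtime
// check is always executed and passes with probability PassProbability.
// All inputs are hypothetical abstract cost units.
static double expectedVersionedCost(double CheckCost, double VectorCost,
                                    double ScalarCost,
                                    double PassProbability) {
  assert(PassProbability >= 0.0 && PassProbability <= 1.0);
  return CheckCost + PassProbability * VectorCost +
         (1.0 - PassProbability) * ScalarCost;
}
```

Under this model, if the check never passes the block costs `CheckCost + ScalarCost`, which is strictly worse than the unversioned `ScalarCost`. This is where profile information about how often the check passes would help.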

In D102834#3174010, @fhahn wrote:

In D102834#3173674, @SjoerdMeijer wrote:

In D102834#3173543, @fhahn wrote:

Address latest sets of comments, thanks! I hope I didn't miss anything.

Thanks for that.

Not directly related to code changes, but more general question: could we see negative performance impact of these runtime checks? For example, if they are emitted in loops? Or are they trivially hoisted out of loops? Any thoughts on these kind of things?

Yes, there could be negative performance impact, for at least the following reasons: 1) cost-model bugs overestimating the vector savings and 2) the vector path never being taken because the runtime checks fail. Cases of 1) will be rather easy to fix, while there won't be too much we can do about 2) unfortunately, unless we have profile info. Note that we have the same issue with LV.

Yeah okay, sure, fair enough. Just wanted to double check, and you haven't seen any performance disasters in Spec or the test suite?
I am thinking if we should not just start with this enabled:

static cl::opt<bool> EnableMemoryVersioning(
    "slp-memory-versioning", cl::init(false), cl::Hidden,

If we get perf regression reports, this can be kept in tree but with this flag off, then at least we directly know what to fix.

Reverse ping :-)
Any thoughts on my previous comment?

fhahn retitled this revision from [SLPVectorizer] Implement initial memory versioning. to [SLP] Implement initial memory versioning..Dec 16 2021, 1:50 AM

In D102834#3175761, @SjoerdMeijer wrote:
Yeah okay, sure, fair enough. Just wanted to double check, and you haven't seen any performance disasters in Spec or the test suite?
I am thinking if we should not just start with this enabled:
static cl::opt<bool> EnableMemoryVersioning(
    "slp-memory-versioning", cl::init(false), cl::Hidden,
If we get perf regression reports, this can be kept in tree but with this flag off, then at least we directly know what to fix.

I agree it would be good to start with this enabled by default! My main concern at the moment is compile-time impact. I need to re-measure, but from last time I remember that there were a few CTMark cases that had noticeable increases.

I think the rate at which versioning actually leads to vectorizing the versioned block is ~15% at the moment. For it to be enabled by default, it would be good to increase this rate first. The reason the rate is currently so low is that we consider a block for versioning as soon as we find an aliasing access and effectively stop analyzing further. So even if there are other problems preventing vectorization, we still try to version it.

In D102834#3197066, @fhahn wrote:
In D102834#3175761, @SjoerdMeijer wrote:
Yeah okay, sure, fair enough. Just wanted to double check, and you haven't seen any performance disasters in Spec or the test suite?
I am thinking if we should not just start with this enabled:
static cl::opt<bool> EnableMemoryVersioning(
    "slp-memory-versioning", cl::init(false), cl::Hidden,
If we get perf regression reports, this can be kept in tree but with this flag off, then at least we directly know what to fix.
I agree it would be good to start with this enabled by default! My main concern at the moment is compile-time impact. I need to re-measure, but from last time I remember that there were a few CTMark cases that had noticeable increases.

Ok, that would indeed be good to double-check.

I think the rate at which versioning actually leads to vectorizing the versioned block is ~15% at the moment. For it to be enabled by default, it would be good to increase this rate first. The reason the rate is currently so low is that we consider a block for versioning as soon as we find an aliasing access and effectively stop analyzing further. So even if there are other problems preventing vectorization, we still try to version it.

Ok, I see. I suspect there will be a bit of a long tail of work anyway. But as this is such a crucial enabler for any further work in this area, I would be inclined to start with this on by default and "see what happens". If it manages to be enabled by default, that would be so much more ideal...

I am going to look over this one more time, but think I am going to LGTM this soon (requesting another lgtm too).

This looks generally good to me, but it would be good to get a blessing from @ABataev too.

Ideally I would like to see two things changed before this is committed:

1) Start with this enabled by default, as I also mentioned in previous comments.
2) vectorizeBlockWithVersioning is too big for my liking, see also the inline comment. I think it could do with a refactoring; it was almost a blocker for me, but not entirely. :-)

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
10133	Style: this function is nearly 300 lines, which I think is big for a function, and it is already not the easiest function to read. I would prefer to see this split up and using helper functions.
10521	Make this an internal option, or a `static constexpr`.

This revision is now accepted and ready to land.Dec 23 2021, 6:25 AM

Hey, it's me again. :)
Any plans to pick this up, shall we get this in?

In D102834#3258564, @SjoerdMeijer wrote:

Hey, it's me again. :)
Any plans to pick this up, shall we get this in?

Hi, I am still planning to push this in, but I don't think it should happen before the branch for the next release which will happen soon.

I also would like to wrap up D109368 first, which applies one of the fundamental ideas of this patch (optimistically generated RT checks, undo if not profitable) to LV and has in fact been going on for longer. There were a couple of inefficiencies in some of the generated runtime checks, which meant we did not estimate their cost accurately. Those should all be fixed now, and I hope D109368 can go in soon after the branch. Once it has been in tree for a couple of weeks, we should proceed with this patch here IMO.
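
Editorial illustration of the "optimistically generate runtime checks, undo if not profitable" idea: a minimal sketch of the kind of decision involved. The struct, the function, and the exact comparison are hypothetical and are not the patch's code. It assumes, as stated earlier in the thread, that each check instruction is counted with cost 1, and that a negative SLP cost means the vector code is cheaper than the scalar code.

```cpp
#include <cassert>

// Hypothetical summary of one versioning attempt.
struct VersioningCost {
  int NumCheckInstructions; // each runtime-check instruction is counted as 1
  int SLPCostDelta;         // vector cost minus scalar cost; negative = gain
};

// Returns true if the versioned block should be kept. Threshold plays the
// role of a user-tunable knob (assumed here to behave like -slp-threshold).
static bool keepVersionedBlock(const VersioningCost &C, int Threshold = 0) {
  assert(C.NumCheckInstructions >= 0);
  int Total = C.SLPCostDelta + C.NumCheckInstructions;
  return Total < -Threshold;
}
```

For example, a gain of 8 with 3 check instructions would be kept under this model, while the same gain with 10 check instructions would be undone.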

Okay, thanks, that sounds like a good plan!

Allen added a subscriber: Allen.May 14 2022, 6:07 AM

Herald added a project: Restricted Project. · View Herald TranscriptMay 14 2022, 6:07 AM

Herald added a subscriber: vporpo. · View Herald Transcript

zhaopeng added a subscriber: zhaopeng.Aug 10 2022, 6:12 PM

Herald added a subscriber: nlopes. · View Herald TranscriptAug 10 2022, 6:12 PM

Rebase and intermediate update to switch to using more lightweight checks. This needs a few more constraints in places to account for the more lightweight checks, but should be workable for runtime testing if anybody is interested.

Harbormaster completed remote builds in B182911: Diff 454933.Aug 23 2022, 12:54 PM

khchen added a subscriber: khchen.Oct 22 2022, 10:04 AM

Herald added a subscriber: pcwang-thead. · View Herald TranscriptOct 22 2022, 10:04 AM

This needed a rebase. Not sure whether I missed anything while doing that, but clang runs into an assertion failure when compiling x264.

xbolva00 mentioned this in D156532: [Pipelines] Perform hoisting prior to GVN.Aug 8 2023, 9:00 AM

Revision Contents

Path	Size
llvm/include/llvm/Analysis/LoopAccessAnalysis.h	5 lines
llvm/include/llvm/Transforms/Vectorize/SLPVectorizer.h	23 lines
llvm/lib/Analysis/LoopAccessAnalysis.cpp	2 lines
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp	541 lines
llvm/test/Transforms/SLPVectorizer/AArch64/loadi8.ll	62 lines
llvm/test/Transforms/SLPVectorizer/AArch64/memory-runtime-checks-in-loops.ll	65 lines
llvm/test/Transforms/SLPVectorizer/AArch64/memory-runtime-checks.ll	184 lines
llvm/test/Transforms/SLPVectorizer/X86/memory-runtime-checks.ll	47 lines

Diff 454933

llvm/include/llvm/Analysis/LoopAccessAnalysis.h

(329 unchanged lines not shown)

 class RuntimePointerChecking;
 /// A grouping of pointers. A single memcheck is required between
 /// two groups.
 struct RuntimeCheckingPtrGroup {
   /// Create a new pointer checking group containing a single
   /// pointer, with index \p Index in RtCheck.
   RuntimeCheckingPtrGroup(unsigned Index, RuntimePointerChecking &RtCheck);
+  RuntimeCheckingPtrGroup(unsigned Index, const SCEV *Start, const SCEV *End,
+                          unsigned AS, bool NeedsFreeze)
+      : High(End), Low(Start), AddressSpace(AS), NeedsFreeze(NeedsFreeze) {
+    Members.push_back(Index);
+  }

   /// Tries to add the pointer recorded in RtCheck at index
   /// \p Index to this pointer checking group. We can only add a pointer
   /// to a checking group if we will still be able to get
   /// the upper and lower bounds of the check. Returns true in case
   /// of success, false otherwise.
   bool addPointer(unsigned Index, RuntimePointerChecking &RtCheck);
   bool addPointer(unsigned Index, const SCEV *Start, const SCEV *End,

(491 unchanged lines not shown)

llvm/include/llvm/Transforms/Vectorize/SLPVectorizer.h

(48 unchanged lines not shown)

 /// A private "module" namespace for types and utilities used by this pass.
 /// These are implementation details and should not be used by clients.
 namespace slpvectorizer {

 class BoUpSLP;

 } // end namespace slpvectorizer

+struct SLPVectorizerResult {
+  bool MadeAnyChange;
+  bool MadeCFGChange;
   [Inline comment, not done] ABataev: Convert this to enum? + add comments
+
+  SLPVectorizerResult(bool MadeAnyChange, bool MadeCFGChange)
+      : MadeAnyChange(MadeAnyChange), MadeCFGChange(MadeCFGChange) {}
+};
+
 struct SLPVectorizerPass : public PassInfoMixin<SLPVectorizerPass> {
   using StoreList = SmallVector<StoreInst *, 8>;
   using StoreListMap = MapVector<Value *, StoreList>;
   using GEPList = SmallVector<GetElementPtrInst *, 8>;
   using GEPListMap = MapVector<Value *, GEPList>;
   using InstSetVector = SmallSetVector<Instruction *, 8>;

   ScalarEvolution *SE = nullptr;
   TargetTransformInfo *TTI = nullptr;
   TargetLibraryInfo *TLI = nullptr;
   AAResults *AA = nullptr;
   LoopInfo *LI = nullptr;
   DominatorTree *DT = nullptr;
   AssumptionCache *AC = nullptr;
   DemandedBits *DB = nullptr;
   const DataLayout *DL = nullptr;

 public:
   PreservedAnalyses run(Function &F, FunctionAnalysisManager &AM);

   // Glue for old PM.
-  bool runImpl(Function &F, ScalarEvolution *SE_, TargetTransformInfo *TTI_,
-               TargetLibraryInfo *TLI_, AAResults *AA_, LoopInfo *LI_,
-               DominatorTree *DT_, AssumptionCache *AC_, DemandedBits *DB_,
-               OptimizationRemarkEmitter *ORE_);
+  SLPVectorizerResult runImpl(Function &F, ScalarEvolution *SE_,
+                              TargetTransformInfo *TTI_,
+                              TargetLibraryInfo *TLI_, AAResults *AA_,
+                              LoopInfo *LI_, DominatorTree *DT_,
+                              AssumptionCache *AC_, DemandedBits *DB_,
+                              OptimizationRemarkEmitter *ORE_);

 private:
   /// Collect store and getelementptr instructions and organize them
   /// according to the underlying object of their pointer operands. We sort the
   /// instructions by their underlying objects to reduce the cost of
   /// consecutive access queries.
   ///
   /// TODO: We can further reduce this cost if we flush the chain creation

(44 unchanged lines not shown)

   /// a vectorization chain.
   bool vectorizeChainsInBlock(BasicBlock *BB, slpvectorizer::BoUpSLP &R);

   bool vectorizeStoreChain(ArrayRef<Value *> Chain, slpvectorizer::BoUpSLP &R,
                            unsigned Idx, unsigned MinVF);

   bool vectorizeStores(ArrayRef<StoreInst *> Stores, slpvectorizer::BoUpSLP &R);

+  SLPVectorizerResult
+  vectorizeBlockWithVersioning(BasicBlock *BB,
+                               const SmallPtrSetImpl<Value *> &TrackedObjects,
+                               slpvectorizer::BoUpSLP &R);
+
   /// The store instructions in a basic block organized by base pointer.
   StoreListMap Stores;

   /// The getelementptr instructions in a basic block organized by base pointer.
   GEPListMap GEPs;
 };

 } // end namespace llvm

 #endif // LLVM_TRANSFORMS_VECTORIZE_SLPVECTORIZER_H

llvm/lib/Analysis/LoopAccessAnalysis.cpp

(382 unchanged lines not shown; hunk context: return addPointer()

       RtCheck.Pointers[Index].PointerValue->getType()->getPointerAddressSpace(),
       RtCheck.Pointers[Index].NeedsFreeze, *RtCheck.SE);
 }

 bool RuntimeCheckingPtrGroup::addPointer(unsigned Index, const SCEV *Start,
                                          const SCEV *End, unsigned AS,
                                          bool NeedsFreeze,
                                          ScalarEvolution &SE) {
-  assert(AddressSpace == AS &&
+  assert((Members.empty() || AddressSpace == AS) &&
          "all pointers in a checking group must be in the same address space");

   // Compare the starts and ends with the known minimum and maximum
   // of this set. We need to know how we compare against the min/max
   // of the set in order to be able to emit memchecks.
   const SCEV *Min0 = getMinFromExprs(Start, Low, &SE);
   if (!Min0)
     return false;

(2,329 unchanged lines not shown)

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp

(30 unchanged lines not shown)

 #include "llvm/ADT/SmallString.h"
 #include "llvm/ADT/Statistic.h"
 #include "llvm/ADT/iterator.h"
 #include "llvm/ADT/iterator_range.h"
 #include "llvm/Analysis/AliasAnalysis.h"
 #include "llvm/Analysis/AssumptionCache.h"
 #include "llvm/Analysis/CodeMetrics.h"
 #include "llvm/Analysis/DemandedBits.h"
+#include "llvm/Analysis/DomTreeUpdater.h"
 #include "llvm/Analysis/GlobalsModRef.h"
 #include "llvm/Analysis/IVDescriptors.h"
 #include "llvm/Analysis/LoopAccessAnalysis.h"
 #include "llvm/Analysis/LoopInfo.h"
 #include "llvm/Analysis/MemoryLocation.h"
 #include "llvm/Analysis/OptimizationRemarkEmitter.h"
 #include "llvm/Analysis/ScalarEvolution.h"
 #include "llvm/Analysis/ScalarEvolutionExpressions.h"

(10 unchanged lines not shown)

 #include "llvm/IR/Dominators.h"
 #include "llvm/IR/Function.h"
 #include "llvm/IR/IRBuilder.h"
 #include "llvm/IR/InstrTypes.h"
 #include "llvm/IR/Instruction.h"
 #include "llvm/IR/Instructions.h"
 #include "llvm/IR/IntrinsicInst.h"
 #include "llvm/IR/Intrinsics.h"
+#include "llvm/IR/MDBuilder.h"
 #include "llvm/IR/Module.h"
 #include "llvm/IR/Operator.h"
 #include "llvm/IR/PatternMatch.h"
 #include "llvm/IR/Type.h"
 #include "llvm/IR/Use.h"
 #include "llvm/IR/User.h"
 #include "llvm/IR/Value.h"
 #include "llvm/IR/ValueHandle.h"
 #ifdef EXPENSIVE_CHECKS
 #include "llvm/IR/Verifier.h"
 #endif
 #include "llvm/Pass.h"
 #include "llvm/Support/Casting.h"
 #include "llvm/Support/CommandLine.h"
 #include "llvm/Support/Compiler.h"
 #include "llvm/Support/DOTGraphTraits.h"
 #include "llvm/Support/Debug.h"
 #include "llvm/Support/ErrorHandling.h"
 #include "llvm/Support/GraphWriter.h"
 #include "llvm/Support/InstructionCost.h"
 #include "llvm/Support/KnownBits.h"
 #include "llvm/Support/MathExtras.h"
 #include "llvm/Support/raw_ostream.h"
+#include "llvm/Transforms/Utils/BasicBlockUtils.h"
+#include "llvm/Transforms/Utils/Cloning.h"
 #include "llvm/Transforms/Utils/InjectTLIMappings.h"
 #include "llvm/Transforms/Utils/Local.h"
 #include "llvm/Transforms/Utils/LoopUtils.h"
+#include "llvm/Transforms/Utils/ScalarEvolutionExpander.h"
 #include "llvm/Transforms/Vectorize.h"
 #include <algorithm>
 #include <cassert>
 #include <cstdint>
 #include <iterator>
 #include <memory>
 #include <set>
 #include <string>
 #include <tuple>
 #include <utility>
 #include <vector>

 using namespace llvm;
 using namespace llvm::PatternMatch;
 using namespace slpvectorizer;

 #define SV_NAME "slp-vectorizer"
 #define DEBUG_TYPE "SLP"

 STATISTIC(NumVectorInstructions, "Number of vector instructions generated");
+STATISTIC(NumVersioningSuccessful,
+          "Number of times versioning was tried and beneficial");
+STATISTIC(NumVersioningFailed,
+          "Number of times versioning was tried but was not beneficial");

 cl::opt<bool> RunSLPVectorization("vectorize-slp", cl::init(true), cl::Hidden,
                                   cl::desc("Run the SLP vectorization passes"));

 static cl::opt<int>
     SLPCostThreshold("slp-threshold", cl::init(0), cl::Hidden,
                      cl::desc("Only vectorize if you gain more than this "
                               "number "));

(53 unchanged lines not shown)

 static cl::opt<int> RootLookAheadMaxDepth(
     "slp-max-root-look-ahead-depth", cl::init(2), cl::Hidden,
     cl::desc("The maximum look-ahead depth for searching best rooting option"));

 static cl::opt<bool>
     ViewSLPTree("view-slp-tree", cl::Hidden,
                 cl::desc("Display the SLP trees with Graphviz"));

+static cl::opt<bool> EnableMemoryVersioning(
+    "slp-memory-versioning", cl::init(false), cl::Hidden,
   [Inline comment, done] SjoerdMeijer: This is a nitpick, bikeshedding names... but I was wondering about the terminology here (i.e. memory versioning). We emit runtime memory checks, and do versioning of a block. So was wondering if this should be BlockVersioning.
   [Inline comment, done] fhahn (author): My intention with the naming here was to be clear about why/what we version; the versioning is based on the memory accesses, we will create one version with the original memory access and a second version where memory accesses to different underlying objects won't alias. I'm not sure about tying this directly to the scope we support at the moment (single block). We might want to extend the scope in the future. For example, consider a loop which contains a block to version, where each access also depends on an induction variable. We might want to try to push the checks outside the loop, by widening them to encompass the whole iteration space, so we only need to check once.
+    cl::desc("Enable memory versioning for SLP vectorization."));

 // Limit the number of alias checks. The limit is chosen so that
 // it has no negative effect on the llvm benchmarks.
 static const unsigned AliasedCheckLimit = 10;

 // Another limit for the alias checks: The maximum distance between load/store
 // instructions where alias checks are done.
 // This limit is useful for very large basic blocks.
 static const unsigned MaxMemDepDistance = 160;
[... 583 lines not shown ...]

/// Reorders the list of scalars in accordance with the given \p Mask.		/// Reorders the list of scalars in accordance with the given \p Mask.
static void reorderScalars(SmallVectorImpl<Value *> &Scalars,		static void reorderScalars(SmallVectorImpl<Value *> &Scalars,
ArrayRef<int> Mask) {		ArrayRef<int> Mask) {
assert(!Mask.empty() && "Expected non-empty mask.");		assert(!Mask.empty() && "Expected non-empty mask.");
SmallVector<Value *> Prev(Scalars.size(),		SmallVector<Value *> Prev(Scalars.size(),
UndefValue::get(Scalars.front()->getType()));		UndefValue::get(Scalars.front()->getType()));
Prev.swap(Scalars);		Prev.swap(Scalars);
for (unsigned I = 0, E = Prev.size(); I < E; ++I)		for (unsigned I = 0, E = Prev.size(); I < E; ++I)
if (Mask[I] != UndefMaskElem)		if (Mask[I] != UndefMaskElem)
Scalars[Mask[I]] = Prev[I];		Scalars[Mask[I]] = Prev[I];
}		}
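As a rough standalone sketch of the mask-based reordering done by `reorderScalars`: the element at position `I` moves to position `Mask[I]`, and undefined mask entries are skipped. The example uses plain `std::vector<int>` instead of LLVM types; `reorderByMask`, `kUndefMaskElem` and the `Filler` parameter are hypothetical stand-ins, and the mask is assumed to contain only valid indices or the undef marker.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for LLVM's UndefMaskElem.
constexpr int kUndefMaskElem = -1;

// Returns a copy of Scalars in which the element at position I has been moved
// to position Mask[I]. Slots that no mask entry targets hold Filler.
// Assumes every non-undef mask entry is a valid index into Scalars.
std::vector<int> reorderByMask(const std::vector<int> &Scalars,
                               const std::vector<int> &Mask, int Filler = 0) {
  std::vector<int> Result(Scalars.size(), Filler);
  for (std::size_t I = 0; I < Scalars.size() && I < Mask.size(); ++I)
    if (Mask[I] != kUndefMaskElem)
      Result[static_cast<std::size_t>(Mask[I])] = Scalars[I];
  return Result;
}
```

With `Scalars = {10, 20, 30}` and `Mask = {2, 0, 1}`, the result is `{20, 30, 10}`.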

/// Checks if the provided value does not require scheduling. It does not		/// Checks if the provided value does not require scheduling. It does not
		ABataev (Done): Comments? Also, put it into a namespace.
		fhahn (Author, Done): Put into an anonymous namespace and added comments.
/// require scheduling if this is not an instruction or it is an instruction		/// require scheduling if this is not an instruction or it is an instruction
		ABataev (Done): Make it a static function?
/// that does not read/write memory and all operands are either not instructions		/// that does not read/write memory and all operands are either not instructions
/// or phi nodes or instructions from different blocks.		/// or phi nodes or instructions from different blocks.
static bool areAllOperandsNonInsts(Value *V) {		static bool areAllOperandsNonInsts(Value *V) {
auto *I = dyn_cast<Instruction>(V);		auto *I = dyn_cast<Instruction>(V);
		ABataev (Done): Default initializers and defaulted constructor?
		fhahn (Author, Done): Removed the `AccessInfo()` constructor, added default values instead.
if (!I)		if (!I)
return true;		return true;
return !mayHaveNonDefUseDependency(*I) &&		return !mayHaveNonDefUseDependency(*I) &&
all_of(I->operands(), [I](Value *V) {		all_of(I->operands(), [I](Value *V) {
auto *IO = dyn_cast<Instruction>(V);		auto *IO = dyn_cast<Instruction>(V);
if (!IO)		if (!IO)
return true;		return true;
return isa<PHINode>(IO) || IO->getParent() != I->getParent();		return isa<PHINode>(IO) || IO->getParent() != I->getParent();
});		});
		ABataev (Done): `isValidElementType`?
		fhahn (Author, Done): Updated, thanks.
}		}

/// Checks if the provided value does not require scheduling. It does not		/// Checks if the provided value does not require scheduling. It does not
/// require scheduling if this is not an instruction or it is an instruction		/// require scheduling if this is not an instruction or it is an instruction
		ABataev (Done): Why do we need to insert something similar twice, here and below?
		ABataev (Done): `isValidElementType`?
/// that does not read/write memory and all users are phi nodes or instructions		/// that does not read/write memory and all users are phi nodes or instructions
/// from the different blocks.		/// from the different blocks.
static bool isUsedOutsideBlock(Value *V) {		static bool isUsedOutsideBlock(Value *V) {
auto *I = dyn_cast<Instruction>(V);		auto *I = dyn_cast<Instruction>(V);
if (!I)		if (!I)
return true;		return true;
// Limits the number of uses to save compile time.		// Limits the number of uses to save compile time.
constexpr int UsesLimit = 8;		constexpr int UsesLimit = 8;
[... 17 lines not shown ...]
/// It is so if all either instructions have operands that do not require		/// It is so if all either instructions have operands that do not require
/// scheduling or their users do not require scheduling since they are phis or		/// scheduling or their users do not require scheduling since they are phis or
/// in other basic blocks.		/// in other basic blocks.
static bool doesNotNeedToSchedule(ArrayRef<Value *> VL) {		static bool doesNotNeedToSchedule(ArrayRef<Value *> VL) {
return !VL.empty() &&		return !VL.empty() &&
(all_of(VL, isUsedOutsideBlock) || all_of(VL, areAllOperandsNonInsts));		(all_of(VL, isUsedOutsideBlock) || all_of(VL, areAllOperandsNonInsts));
}		}

		namespace {
		/// Models a memory access to an underlying object with SCEV pointer expression
		/// and access type.
		struct AccessInfo {
		Value *UnderlyingObj;
		const SCEV *PtrSCEV;
		Type *AccessTy;

		AccessInfo(Value *UnderlyingObj = nullptr, const SCEV *PtrSCEV = nullptr,
		Type *AccessTy = nullptr)
		: UnderlyingObj(UnderlyingObj), PtrSCEV(PtrSCEV), AccessTy(AccessTy) {}

		/// Returns the AccessInfo for \p I. If \p I isn't a memory instruction or the
		/// pointer cannot be converted to a SCEV, return an empty object.
		static AccessInfo get(Instruction &I, ScalarEvolution &SE,
		DominatorTree &DT) {
		BasicBlock *BB = I.getParent();
		auto GetPtrAndAccessTy = [](Instruction *I) -> std::pair<Value *, Type *> {
		if (auto *L = dyn_cast<LoadInst>(I)) {
		if (isValidElementType(L->getType()))
		return {L->getPointerOperand(), L->getType()};
		}
		if (auto *S = dyn_cast<StoreInst>(I))
		if (isValidElementType(S->getValueOperand()->getType()))
		return {S->getPointerOperand(), S->getValueOperand()->getType()};
		return {nullptr, nullptr};
		};
		Value *Ptr;
		Type *AccessTy;
		std::tie(Ptr, AccessTy) = GetPtrAndAccessTy(&I);
		if (!Ptr)
		return {};
		Value *Obj = getUnderlyingObject(Ptr);
		if (!Obj)
		return {};
		auto *Start = SE.getSCEV(Ptr);

		PHINode *PN = dyn_cast<PHINode>(Obj);
		if (!SE.properlyDominates(Start, BB) &&
		!(PN && DT.dominates(PN->getParent(), BB)))
		return {};
		return {Obj, Start, AccessTy};
		}
		};
		} // anonymous namespace

namespace slpvectorizer {		namespace slpvectorizer {

/// Bottom Up SLP Vectorizer.		/// Bottom Up SLP Vectorizer.
class BoUpSLP {		class BoUpSLP {
struct TreeEntry;		struct TreeEntry;
struct ScheduleData;		struct ScheduleData;

public:		public:
		/// Set of objects we need to generate runtime checks for.
		SmallPtrSet<Value *, 8> TrackedObjects;

		SmallSet<std::pair<Value *, Value *>, 8> DepObjs;

		/// Cache for alias results.
		/// TODO: consider moving this to the AliasAnalysis itself.
		using AliasCacheKey = std::pair<Instruction *, Instruction *>;
		DenseMap<AliasCacheKey, Optional<bool>> AliasCache;

		bool CollectMemAccess = false;

using ValueList = SmallVector<Value *, 8>;		using ValueList = SmallVector<Value *, 8>;
using InstrList = SmallVector<Instruction *, 16>;		using InstrList = SmallVector<Instruction *, 16>;
using ValueSet = SmallPtrSet<Value *, 16>;		using ValueSet = SmallPtrSet<Value *, 16>;
using StoreList = SmallVector<StoreInst *, 8>;		using StoreList = SmallVector<StoreInst *, 8>;
using ExtraValueToDebugLocsMap =		using ExtraValueToDebugLocsMap =
MapVector<Value *, SmallVector<Instruction *, 2>>;		MapVector<Value *, SmallVector<Instruction *, 2>>;
using OrdersType = SmallVector<unsigned, 4>;		using OrdersType = SmallVector<unsigned, 4>;

[... 86 lines not shown ...]	public:

/// Gets reordering data for the given tree entry. If the entry is vectorized		/// Gets reordering data for the given tree entry. If the entry is vectorized
/// - just return ReorderIndices, otherwise check if the scalars can be		/// - just return ReorderIndices, otherwise check if the scalars can be
/// reordered and return the most optimal order.		/// reordered and return the most optimal order.
/// \param TopToBottom If true, include the order of vectorized stores and		/// \param TopToBottom If true, include the order of vectorized stores and
/// insertelement nodes, otherwise skip them.		/// insertelement nodes, otherwise skip them.
Optional<OrdersType> getReorderingData(const TreeEntry &TE, bool TopToBottom);		Optional<OrdersType> getReorderingData(const TreeEntry &TE, bool TopToBottom);

/// Reorders the current graph to the most profitable order starting from the		/// Reorders the current graph to the most profitable order starting from the
		ABataev (Not Done): What about `EXPENSIVE_CHECKS` and `verifyFunction`?
/// root node to the leaf nodes. The best order is chosen only from the nodes		/// root node to the leaf nodes. The best order is chosen only from the nodes
/// of the same size (vectorization factor). Smaller nodes are considered		/// of the same size (vectorization factor). Smaller nodes are considered
/// parts of subgraph with smaller VF and they are reordered independently. We		/// parts of subgraph with smaller VF and they are reordered independently. We
/// can make it because we still need to extend smaller nodes to the wider VF		/// can make it because we still need to extend smaller nodes to the wider VF
/// and we can merge reordering shuffles with the widening shuffles.		/// and we can merge reordering shuffles with the widening shuffles.
void reorderTopToBottom();		void reorderTopToBottom();

/// Reorders the current graph to the most profitable order starting from		/// Reorders the current graph to the most profitable order starting from
/// leaves to the root. It allows to rotate small subgraphs and reduce the		/// leaves to the root. It allows to rotate small subgraphs and reduce the
/// number of reshuffles if the leaf nodes use the same order. In this case we		/// number of reshuffles if the leaf nodes use the same order. In this case we
/// can merge the orders and just shuffle user node instead of shuffling its		/// can merge the orders and just shuffle user node instead of shuffling its
/// operands. Plus, even the leaf nodes have different orders, it allows to		/// operands. Plus, even the leaf nodes have different orders, it allows to
/// sink reordering in the graph closer to the root node and merge it later		/// sink reordering in the graph closer to the root node and merge it later
/// during analysis.		/// during analysis.
void reorderBottomToTop(bool IgnoreReorder = false);		void reorderBottomToTop(bool IgnoreReorder = false);

		void removeDeletedInstructions() {
		for (auto *I : DeletedInstructions) {
		I->dropAllReferences();
		}
		for (auto *I : DeletedInstructions) {
		assert(I->use_empty() && "trying to erase instruction with users.");
		I->eraseFromParent();
		}
		DeletedInstructions.clear();
		}
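`removeDeletedInstructions` works in two passes: it first drops the references held by every instruction scheduled for deletion, and only then erases them, so that instructions in the set which use each other no longer count as users when the assertion runs. A minimal standalone sketch of that pattern follows; `Node`, `addOperand` and `eraseAll` are hypothetical names and the use counting is deliberately simplified compared to LLVM's use lists.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified stand-in for an instruction with operands and a
// use count.
struct Node {
  std::vector<Node *> Operands;
  int NumUses = 0;

  void addOperand(Node *Op) {
    Operands.push_back(Op);
    ++Op->NumUses;
  }

  // Stop using all operands, like Instruction::dropAllReferences().
  void dropAllReferences() {
    for (Node *Op : Operands)
      --Op->NumUses;
    Operands.clear();
  }
};

// Erases every node in Deleted and returns how many were erased.
std::size_t eraseAll(std::vector<Node *> &Deleted) {
  // Pass 1: drop references so nodes in the set stop using each other.
  for (Node *N : Deleted)
    N->dropAllReferences();
  // Pass 2: every node must now be unused and can be erased safely.
  std::size_t Erased = 0;
  for (Node *N : Deleted) {
    assert(N->NumUses == 0 && "trying to erase a node that still has users");
    delete N;
    ++Erased;
  }
  Deleted.clear();
  return Erased;
}
```

Erasing in a single pass would trip the assertion whenever two nodes in the set reference each other, which is why the references are dropped first.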

/// \return The vector element size in bits to use when vectorizing the		/// \return The vector element size in bits to use when vectorizing the
/// expression tree ending at \p V. If V is a store, the size is the width of		/// expression tree ending at \p V. If V is a store, the size is the width of
/// the stored value. Otherwise, the size is the width of the largest loaded		/// the stored value. Otherwise, the size is the width of the largest loaded
/// value reaching V. This method is used by the vectorizer to calculate		/// value reaching V. This method is used by the vectorizer to calculate
/// vectorization factors.		/// vectorization factors.
unsigned getVectorElementSize(Value *V);		unsigned getVectorElementSize(Value *V);

/// Compute the minimum type sizes required to represent the entries in a		/// Compute the minimum type sizes required to represent the entries in a
[... 1,671 lines not shown ...]	bool isAliased(const MemoryLocation &Loc1, Instruction *Inst1,
bool aliased = true;		bool aliased = true;
if (Loc1.Ptr && isSimple(Inst1))		if (Loc1.Ptr && isSimple(Inst1))
aliased = isModOrRefSet(BatchAA.getModRefInfo(Inst2, Loc1));		aliased = isModOrRefSet(BatchAA.getModRefInfo(Inst2, Loc1));
// Store the result in the cache.		// Store the result in the cache.
result = aliased;		result = aliased;
return aliased;		return aliased;
}		}

using AliasCacheKey = std::pair<Instruction *, Instruction *>;

/// Cache for alias results.
/// TODO: consider moving this to the AliasAnalysis itself.
DenseMap<AliasCacheKey, Optional<bool>> AliasCache;

// Cache for pointerMayBeCaptured calls inside AA. This is preserved		// Cache for pointerMayBeCaptured calls inside AA. This is preserved
// globally through SLP because we don't perform any action which		// globally through SLP because we don't perform any action which
// invalidates capture results.		// invalidates capture results.
BatchAAResults BatchAA;		BatchAAResults BatchAA;

/// Temporary store for deleted instructions. Instructions will be deleted		/// Temporary store for deleted instructions. Instructions will be deleted
/// eventually when the BoUpSLP is destructed. The deferral is required to		/// eventually when the BoUpSLP is destructed. The deferral is required to
/// ensure that there are no incorrect collisions in the AliasCache, which		/// ensure that there are no incorrect collisions in the AliasCache, which
[... 684 lines not shown ...]	for (auto *I : DeletedInstructions) {
for (Use &U : I->operands()) {		for (Use &U : I->operands()) {
auto *Op = dyn_cast<Instruction>(U.get());		auto *Op = dyn_cast<Instruction>(U.get());
if (Op && !DeletedInstructions.count(Op) && Op->hasOneUser() &&		if (Op && !DeletedInstructions.count(Op) && Op->hasOneUser() &&
wouldInstructionBeTriviallyDead(Op, TLI))		wouldInstructionBeTriviallyDead(Op, TLI))
DeadInsts.emplace_back(Op);		DeadInsts.emplace_back(Op);
}		}
I->dropAllReferences();		I->dropAllReferences();
}		}
for (auto *I : DeletedInstructions) {
assert(I->use_empty() &&
"trying to erase instruction with users.");
I->eraseFromParent();
}

// Cleanup any dead scalar code feeding the vectorized instructions		// Cleanup any dead scalar code feeding the vectorized instructions
RecursivelyDeleteTriviallyDeadInstructions(DeadInsts, TLI);		RecursivelyDeleteTriviallyDeadInstructions(DeadInsts, TLI);

		removeDeletedInstructions();
#ifdef EXPENSIVE_CHECKS		#ifdef EXPENSIVE_CHECKS
// If we could guarantee that this call is not extremely slow, we could		// If we could guarantee that this call is not extremely slow, we could
// remove the ifdef limitation (see PR47712).		// remove the ifdef limitation (see PR47712).
assert(!verifyFunction(*F, &dbgs()));		assert(!verifyFunction(*F, &dbgs()));
#endif		#endif
}		}

/// Reorders the given \p Reuses mask according to the given \p Mask. \p Reuses		/// Reorders the given \p Reuses mask according to the given \p Mask. \p Reuses
[... 6,035 lines not shown ...]	for (ScheduleData *BundleMember = SD; BundleMember;
}		}
}		}
}		}

auto makeControlDependent = [&](Instruction *I) {		auto makeControlDependent = [&](Instruction *I) {
auto *DepDest = getScheduleData(I);		auto *DepDest = getScheduleData(I);
assert(DepDest && "must be in schedule window");		assert(DepDest && "must be in schedule window");
DepDest->ControlDependencies.push_back(BundleMember);		DepDest->ControlDependencies.push_back(BundleMember);
BundleMember->Dependencies++;		BundleMember->Dependencies++;
		SjoerdMeijer (Done): Remove this?
		fhahn (Author, Done): It's gone now.
ScheduleData *DestBundle = DepDest->FirstInBundle;		ScheduleData *DestBundle = DepDest->FirstInBundle;
if (!DestBundle->IsScheduled)		if (!DestBundle->IsScheduled)
BundleMember->incrementUnscheduledDeps(1);		BundleMember->incrementUnscheduledDeps(1);
if (!DestBundle->hasValidDependencies())		if (!DestBundle->hasValidDependencies())
WorkList.push_back(DestBundle);		WorkList.push_back(DestBundle);
};		};

// Any instruction which isn't safe to speculate at the beginning of the		// Any instruction which isn't safe to speculate at the beginning of the
[... 80 lines not shown ...]	for (ScheduleData *BundleMember = SD; BundleMember;
(numAliased >= AliasedCheckLimit ||		(numAliased >= AliasedCheckLimit ||
SLP->isAliased(SrcLoc, SrcInst, DepDest->Inst)))) {		SLP->isAliased(SrcLoc, SrcInst, DepDest->Inst)))) {

// We increment the counter only if the locations are aliased		// We increment the counter only if the locations are aliased
// (instead of counting all alias checks). This gives a better		// (instead of counting all alias checks). This gives a better
// balance between reduced runtime and accurate dependencies.		// balance between reduced runtime and accurate dependencies.
numAliased++;		numAliased++;

		ScheduleData *DestBundle = DepDest->FirstInBundle;
		// If this bundle is not scheduled and no versioned code has been
		// generated yet, try to collect the bounds of the accesses to
		// generate runtime checks.
		if (!DestBundle->IsScheduled && SLP->CollectMemAccess) {
		auto *Src = getLoadStorePointerOperand(SrcInst);
		auto *Dst = getLoadStorePointerOperand(DepDest->Inst);

		if (SrcInst->getParent() == DepDest->Inst->getParent() && Src &&
		Dst) {
		auto SrcObjAndPtr = AccessInfo::get(*SrcInst, *SLP->SE, *SLP->DT);
		auto DstObjAndPtr =
		AccessInfo::get(*DepDest->Inst, *SLP->SE, *SLP->DT);
		if (!SrcObjAndPtr.UnderlyingObj || !DstObjAndPtr.UnderlyingObj ||
		SrcObjAndPtr.UnderlyingObj == DstObjAndPtr.UnderlyingObj)
		SLP->TrackedObjects.clear();
		else {
		SLP->TrackedObjects.insert(SrcObjAndPtr.UnderlyingObj);
		SLP->TrackedObjects.insert(DstObjAndPtr.UnderlyingObj);

		Value *A = SrcObjAndPtr.UnderlyingObj;
		Value *B = DstObjAndPtr.UnderlyingObj;
		if (A > B)
		std::swap(A, B);
		SLP->DepObjs.insert({A, B});
		}
		}
		}
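The `if (A > B) std::swap(A, B)` step above stores each pair of underlying objects in a canonical order, so that (A, B) and (B, A) count as the same dependence in `DepObjs`. A minimal standalone sketch of that deduplication, using `std::set` rather than LLVM's `SmallSet`; `ObjPair` and `makeCanonicalPair` are hypothetical names, and addresses are compared as integers to keep the ordering well defined.

```cpp
#include <cstdint>
#include <set>
#include <utility>

// Hypothetical key type: a pair of object addresses with first <= second.
using ObjPair = std::pair<std::uintptr_t, std::uintptr_t>;

// Orders the two addresses so that the smaller one always comes first.
ObjPair makeCanonicalPair(const void *A, const void *B) {
  auto X = reinterpret_cast<std::uintptr_t>(A);
  auto Y = reinterpret_cast<std::uintptr_t>(B);
  if (X > Y)
    std::swap(X, Y);
  return {X, Y};
}
```

Inserting `makeCanonicalPair(&P, &Q)` and `makeCanonicalPair(&Q, &P)` into a `std::set<ObjPair>` leaves a single entry. Ordering by address is enough for deduplication, but which object ends up first depends on where the objects happen to live in memory.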

DepDest->MemoryDependencies.push_back(BundleMember);		DepDest->MemoryDependencies.push_back(BundleMember);
BundleMember->Dependencies++;		BundleMember->Dependencies++;
ScheduleData *DestBundle = DepDest->FirstInBundle;
if (!DestBundle->IsScheduled) {		if (!DestBundle->IsScheduled) {
BundleMember->incrementUnscheduledDeps(1);		BundleMember->incrementUnscheduledDeps(1);
}		}
if (!DestBundle->hasValidDependencies()) {		if (!DestBundle->hasValidDependencies()) {
WorkList.push_back(DestBundle);		WorkList.push_back(DestBundle);
}		}
}		}

[... 429 lines not shown ...]	bool runOnFunction(Function &F) override {
auto *TLI = TLIP ? &TLIP->getTLI(F) : nullptr;		auto *TLI = TLIP ? &TLIP->getTLI(F) : nullptr;
auto *AA = &getAnalysis<AAResultsWrapperPass>().getAAResults();		auto *AA = &getAnalysis<AAResultsWrapperPass>().getAAResults();
auto *LI = &getAnalysis<LoopInfoWrapperPass>().getLoopInfo();		auto *LI = &getAnalysis<LoopInfoWrapperPass>().getLoopInfo();
auto *DT = &getAnalysis<DominatorTreeWrapperPass>().getDomTree();		auto *DT = &getAnalysis<DominatorTreeWrapperPass>().getDomTree();
auto *AC = &getAnalysis<AssumptionCacheTracker>().getAssumptionCache(F);		auto *AC = &getAnalysis<AssumptionCacheTracker>().getAssumptionCache(F);
auto *DB = &getAnalysis<DemandedBitsWrapperPass>().getDemandedBits();		auto *DB = &getAnalysis<DemandedBitsWrapperPass>().getDemandedBits();
auto *ORE = &getAnalysis<OptimizationRemarkEmitterWrapperPass>().getORE();		auto *ORE = &getAnalysis<OptimizationRemarkEmitterWrapperPass>().getORE();

return Impl.runImpl(F, SE, TTI, TLI, AA, LI, DT, AC, DB, ORE);		return Impl.runImpl(F, SE, TTI, TLI, AA, LI, DT, AC, DB, ORE).MadeAnyChange;
}		}

void getAnalysisUsage(AnalysisUsage &AU) const override {		void getAnalysisUsage(AnalysisUsage &AU) const override {
FunctionPass::getAnalysisUsage(AU);		FunctionPass::getAnalysisUsage(AU);
AU.addRequired<AssumptionCacheTracker>();		AU.addRequired<AssumptionCacheTracker>();
AU.addRequired<ScalarEvolutionWrapperPass>();		AU.addRequired<ScalarEvolutionWrapperPass>();
AU.addRequired<AAResultsWrapperPass>();		AU.addRequired<AAResultsWrapperPass>();
AU.addRequired<TargetTransformInfoWrapperPass>();		AU.addRequired<TargetTransformInfoWrapperPass>();
AU.addRequired<LoopInfoWrapperPass>();		AU.addRequired<LoopInfoWrapperPass>();
AU.addRequired<DominatorTreeWrapperPass>();		AU.addRequired<DominatorTreeWrapperPass>();
AU.addRequired<DemandedBitsWrapperPass>();		AU.addRequired<DemandedBitsWrapperPass>();
AU.addRequired<OptimizationRemarkEmitterWrapperPass>();		AU.addRequired<OptimizationRemarkEmitterWrapperPass>();
AU.addRequired<InjectTLIMappingsLegacy>();		AU.addRequired<InjectTLIMappingsLegacy>();
AU.addPreserved<LoopInfoWrapperPass>();		AU.addPreserved<LoopInfoWrapperPass>();
AU.addPreserved<DominatorTreeWrapperPass>();		AU.addPreserved<DominatorTreeWrapperPass>();
AU.addPreserved<AAResultsWrapperPass>();
AU.addPreserved<GlobalsAAWrapperPass>();		AU.addPreserved<GlobalsAAWrapperPass>();
		if (!EnableMemoryVersioning) {
		AU.addPreserved<AAResultsWrapperPass>();
AU.setPreservesCFG();		AU.setPreservesCFG();
}		}
		}
};		};

} // end anonymous namespace		} // end anonymous namespace

PreservedAnalyses SLPVectorizerPass::run(Function &F, FunctionAnalysisManager &AM) {		PreservedAnalyses SLPVectorizerPass::run(Function &F, FunctionAnalysisManager &AM) {
auto *SE = &AM.getResult<ScalarEvolutionAnalysis>(F);		auto *SE = &AM.getResult<ScalarEvolutionAnalysis>(F);
auto *TTI = &AM.getResult<TargetIRAnalysis>(F);		auto *TTI = &AM.getResult<TargetIRAnalysis>(F);
auto *TLI = AM.getCachedResult<TargetLibraryAnalysis>(F);		auto *TLI = AM.getCachedResult<TargetLibraryAnalysis>(F);
auto *AA = &AM.getResult<AAManager>(F);		auto *AA = &AM.getResult<AAManager>(F);
auto *LI = &AM.getResult<LoopAnalysis>(F);		auto *LI = &AM.getResult<LoopAnalysis>(F);
auto *DT = &AM.getResult<DominatorTreeAnalysis>(F);		auto *DT = &AM.getResult<DominatorTreeAnalysis>(F);
auto *AC = &AM.getResult<AssumptionAnalysis>(F);		auto *AC = &AM.getResult<AssumptionAnalysis>(F);
auto *DB = &AM.getResult<DemandedBitsAnalysis>(F);		auto *DB = &AM.getResult<DemandedBitsAnalysis>(F);
auto *ORE = &AM.getResult<OptimizationRemarkEmitterAnalysis>(F);		auto *ORE = &AM.getResult<OptimizationRemarkEmitterAnalysis>(F);

bool Changed = runImpl(F, SE, TTI, TLI, AA, LI, DT, AC, DB, ORE);		auto Result = runImpl(F, SE, TTI, TLI, AA, LI, DT, AC, DB, ORE);
if (!Changed)		if (!Result.MadeAnyChange)
return PreservedAnalyses::all();		return PreservedAnalyses::all();

PreservedAnalyses PA;		PreservedAnalyses PA;
		if (!Result.MadeCFGChange)
PA.preserveSet<CFGAnalyses>();		PA.preserveSet<CFGAnalyses>();
		PA.preserve<LoopAnalysis>();
		PA.preserve<DominatorTreeAnalysis>();
return PA;		return PA;
}		}

bool SLPVectorizerPass::runImpl(Function &F, ScalarEvolution *SE_,		/// Restore the original CFG by removing \p VectorBB and folding \p CheckBB, \p
TargetTransformInfo *TTI_,		/// ScalarBB, \p MergeBB and \p Tail into a single block, like in the original
TargetLibraryInfo *TLI_, AAResults *AA_,		/// IR.
LoopInfo *LI_, DominatorTree *DT_,		static void undoVersionedBlocks(BasicBlock *CheckBB, BasicBlock *ScalarBB,
AssumptionCache *AC_, DemandedBits *DB_,		DomTreeUpdater &DTU, LoopInfo *LI,
OptimizationRemarkEmitter *ORE_) {		BasicBlock *VectorBB, StringRef OriginalBBName,
		BasicBlock *MergeBB, BasicBlock *Tail) {
		CheckBB->setName(OriginalBBName);
		SjoerdMeijer (Done): Typo? delete -> deleted.
		CheckBB->getTerminator()->eraseFromParent();
		;
		{
		IRBuilder<> Builder(CheckBB);
		Builder.CreateBr(ScalarBB);
		}
		DTU.applyUpdates({{DominatorTree::Delete, CheckBB, VectorBB}});
		LI->removeBlock(VectorBB);
		VectorBB->getTerminator()->eraseFromParent();
		;
		{
		IRBuilder<> Builder(VectorBB);
		Builder.CreateUnreachable();
		}
		DTU.applyUpdates({{DominatorTree::Delete, VectorBB, MergeBB}});
		DTU.deleteBB(VectorBB);
		SjoerdMeijer (Not Done): Style: this function is nearly 300 lines, which I think is big for a function, and it is already not the easiest function to read. I would prefer to see this split up into helper functions.
		MergeBlockIntoPredecessor(MergeBB, &DTU, LI);
		if (Tail)
		MergeBlockIntoPredecessor(Tail, &DTU, LI);
		MergeBlockIntoPredecessor(ScalarBB, &DTU, LI);
		SjoerdMeijer (Done): Unused?
		fhahn (Author, Done): I think this is out of date, AS and WrittenObjs should be used.
		NumVersioningFailed++;
		}
		SjoerdMeijer (Done): That wasn't obvious to me, why is Start dereferenced?
		fhahn (Author, Done): We get the `Start` expression from the pointer operand of either a store or a load, so it should always be dereferenced by those instructions.

		SjoerdMeijer (Done): Do we want to link to alive (not sure how persistent the data is and if it's common), or just expand the example here?
		fhahn (Author, Done): I'll remove it.
		SLPVectorizerResult SLPVectorizerPass::vectorizeBlockWithVersioning(
		BasicBlock *BB, const SmallPtrSetImpl<Value *> &TrackedObjects,
		slpvectorizer::BoUpSLP &R) {
		// Try to vectorize BB with versioning.
		//
		// First, collect all memory bounds for accesses in the block.
		//
		// Next, split off the region between the first and last tracked memory
		// access.
		//
		// Then, duplicate the split off region, one will remain scalar and one will
		// be annotated with noalias metadata.
		//
		// Then introduce placeholder blocks for the memory runtime checks (branch to
		// either scalar or versioned blocks) and a merge block joining the control
		// flow from scalar and versioned blocks.
		//
		// Then, add noalias metadata for memory accessed in the versioned block and
		// run SLP vectorization on the versioned block.
		//
		// Now compare the cost of the scalar block against the cost of the vector
		// block + the cost of the runtime checks. If the vector cost is less than the
		// scalar cost, generate runtime checks in the check block. Otherwise remove
		// all temporary blocks and restore the original IR.

		bool Changed = false;
		bool CFGChanged = false;
		SjoerdMeijer (Done): Could `std::next` return null (or end)?
		fhahn (Author, Done): I don't think so; both `LastTracked` and `FirstTracked` should point to loads or stores, so there should always be a next instruction (the terminator). If they were null, there shouldn't be any bounds (MemBounds should be empty).
		R.AliasCache.clear();

		// First, clean up deleted instructions, so they are not re-used during SCEV
		// expansion.
		R.optimizeGatherSequence();
		R.removeDeletedInstructions();

		auto &DL = BB->getModule()->getDataLayout();
		// Collect up-to-date memory bounds for tracked objects. Also collect the
		// first and last memory instruction using a tracked object.
		MapVector<Value *, RuntimeCheckingPtrGroup> MemBounds;
		SmallPtrSet<Value *, 4> WrittenObjs;
		// First instruction that accesses an object we collect bounds for.
		Instruction *FirstTrackedInst = nullptr;
		// Last instruction that accesses an object we collect bounds for.
		Instruction *LastTrackedInst = nullptr;

		DenseMap<Value *, unsigned> ObjOrder;
		unsigned Order = 0;
		for (Instruction &I : *BB) {
		auto ObjAndStart = AccessInfo::get(I, *SE, *DT);
		if (!ObjAndStart.UnderlyingObj)
		continue;
		auto *Obj = ObjAndStart.UnderlyingObj;
		const auto *Start = ObjAndStart.PtrSCEV;

		if (I.mayWriteToMemory())
		WrittenObjs.insert(Obj);

		unsigned AS = Obj->getType()->getPointerAddressSpace();

		// We know that the Start is dereferenced, hence adding one should not
		// overflow:
		Type *IdxTy = DL.getIndexType(Obj->getType());
		const SCEV *EltSizeSCEV =
		SE->getStoreSizeOfExpr(IdxTy, ObjAndStart.AccessTy);
		auto *End = SE->getAddExpr(Start, EltSizeSCEV);

		SjoerdMeijer (Done): Don't think I have seen a block name with the .tail suffix in the tests.
		fhahn (Author, Done): This split is mainly there to make it easier to split off only the region from the first tracked to the last tracked instruction. It is later folded back into the merge block.
		if (TrackedObjects.find(Obj) != TrackedObjects.end())
		MemBounds.insert({Obj, {0, Start, End, AS, false}});
		auto BoundsIter = MemBounds.find(Obj);
		if (BoundsIter == MemBounds.end())
		continue;
		BoundsIter->second.addPointer(0, Start, End, AS, false, *SE);

		if (ObjOrder.find(Obj) == ObjOrder.end()) {
		ObjOrder[Obj] = Order++;
		}
		if (!FirstTrackedInst)
		FirstTrackedInst = &I;
		LastTrackedInst = &I;
		}
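The loop above widens one [Low, High) range per tracked underlying object as it walks the block, where each access contributes its start address and its start plus the store size of the accessed type. A minimal standalone sketch of that bookkeeping with plain integers in place of SCEV expressions; `Access`, `Bounds` and `collectBounds` are hypothetical names, and object identity is reduced to an integer id.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical model of one memory access: which object it touches, the
// start address, and the size in bytes of the accessed type.
struct Access {
  int Object;
  std::uint64_t Start;
  std::uint64_t Size;
};

// Half-open byte range [Low, High) covering all accesses to one object.
struct Bounds {
  std::uint64_t Low;
  std::uint64_t High;
};

// Merges the accesses into one range per object.
std::map<int, Bounds> collectBounds(const std::vector<Access> &Accesses) {
  std::map<int, Bounds> Result;
  for (const Access &A : Accesses) {
    std::uint64_t End = A.Start + A.Size;
    auto It = Result.find(A.Object);
    if (It == Result.end()) {
      Result.emplace(A.Object, Bounds{A.Start, End});
      continue;
    }
    It->second.Low = std::min(It->second.Low, A.Start);
    It->second.High = std::max(It->second.High, End);
  }
  return Result;
}
```

For two 4-byte accesses to object 0 at addresses 100 and 108, the merged range is [100, 112). In this simplified model, a runtime check would then need at least two such ranges to compare, which mirrors the `MemBounds.size() < 2` bail-out that follows.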

		// Not enough memory access bounds for runtime checks.
		if (MemBounds.size() < 2 || WrittenObjs.empty())
		return {Changed, CFGChanged};

		// Check if all uses between the first and last tracked instruction are inside
		// the region. If that is not the case, PHIs would need to be added when
		// duplicating the block.
		auto AllUsesInside = [FirstTrackedInst, LastTrackedInst](BasicBlock *BB) {
		return all_of(make_range(FirstTrackedInst->getIterator(),
		std::next(LastTrackedInst->getIterator())),
		[LastTrackedInst, BB](Instruction &I) {
		return all_of(I.users(), [LastTrackedInst, BB](User *U) {
		if (auto *UserI = dyn_cast<Instruction>(U))
		return UserI->getParent() == BB &&
		!isa<PHINode>(UserI) &&
		(UserI->comesBefore(LastTrackedInst) ||
		UserI == LastTrackedInst);
		return true;
		});
		});
		};
		SjoerdMeijer (Done): Typo: versiond -> versioned.
		if (!AllUsesInside(BB))
		return {Changed, CFGChanged};

		SjoerdMeijer (Done): I am not that familiar with this, but do we expect this in one of the tests?
		fhahn (Author, Done): The domain? It is in the metadata nodes created, but the tests only check the `!noalias` attachments, not the definitions of the metadata at the moment.
		SmallVector<std::pair<Value *, RuntimeCheckingPtrGroup *>> BoundGroups;
		for (auto &B : MemBounds)
		BoundGroups.emplace_back(B.first, &B.second);

		// Create a RuntimePointerCheck for all groups in BoundGroups.
		SmallVector<PointerDiffInfo> PointerChecks;
		uint64_t MaxDist = 0;

		for (auto &P : R.DepObjs) {
		Value *SrcObj = P.first;
		Value *SinkObj = P.second;
		if (ObjOrder[SrcObj] > ObjOrder[SinkObj])
		std::swap(SrcObj, SinkObj);
		ABataev (Done): Move out of the loop. Also, a similar lambda is used in another place; it is worth creating a static function instead of lambdas.
		fhahn (Author, Done): Updated to use `getLoadStorePointerOperand` from `Instructions.h`.

		auto &SrcGroup = MemBounds.find(SrcObj)->second;
		auto &SinkGroup = MemBounds.find(SinkObj)->second;
		bool SrcWrites = WrittenObjs.contains(SrcObj);
		bool SinkWrites = WrittenObjs.contains(SinkObj);
		if (!SrcWrites && !SinkWrites)
		continue;
		const SCEV *CurDist =
		SE->getUMaxExpr(SE->getMinusSCEV(SrcGroup.High, SrcGroup.Low),
		SE->getMinusSCEV(SinkGroup.High, SinkGroup.Low));
		if (auto *C = dyn_cast<SCEVConstant>(CurDist)) {
		MaxDist = std::max(MaxDist, C->getValue()->getZExtValue());
		IntegerType *IntTy = IntegerType::get(
		BB->getContext(), DL.getPointerSizeInBits(SinkGroup.AddressSpace));
		const SCEV *SinkStartInt = SE->getPtrToIntExpr(SinkGroup.Low, IntTy);
		const SCEV *SrcStartInt = SE->getPtrToIntExpr(SrcGroup.Low, IntTy);
		if (isa<SCEVCouldNotCompute>(SinkStartInt) ||
		isa<SCEVCouldNotCompute>(SrcStartInt)) {
		return {Changed, CFGChanged};
		}

		PointerChecks.emplace_back(SinkStartInt, SrcStartInt, 1, false);
		} else
		return {Changed, CFGChanged};
		}

		// Duplicate BB now and set up block and branches for memory checks.
		std::string OriginalBBName = BB->getName().str();
		IRBuilder<> ChkBuilder(BB->getFirstNonPHI());
		DomTreeUpdater DTU(DT, DomTreeUpdater::UpdateStrategy::Eager);

		BasicBlock *Tail = nullptr;
		if (LastTrackedInst->getNextNode() != BB->getTerminator())
		Tail = SplitBlock(BB, LastTrackedInst->getNextNode(), &DTU, LI, nullptr,
		OriginalBBName + ".tail");
		auto *CheckBB = BB;
		BB = SplitBlock(BB, FirstTrackedInst, &DTU, LI, nullptr,
		OriginalBBName + ".slpmemcheck");
		for (Use &U : make_early_inc_range(BB->uses())) {
		BasicBlock *UserBB = cast<Instruction>(U.getUser())->getParent();
		if (UserBB == CheckBB)
		continue;

		U.set(CheckBB);
		DTU.applyUpdates({{DT->Delete, UserBB, BB}});
		DTU.applyUpdates({{DT->Insert, UserBB, CheckBB}});
		}
		CFGChanged = true;

		auto *MergeBB = BB;
		BasicBlock *ScalarBB =
		splitBlockBefore(BB, BB->getTerminator(), &DTU, LI, nullptr,
		OriginalBBName + ".slpversioned");

		ValueToValueMapTy VMap;
		BasicBlock *VectorBB = CloneBasicBlock(ScalarBB, VMap, "", BB->getParent());
		ScalarBB->setName(OriginalBBName + ".scalar");
		MergeBB->setName(OriginalBBName + ".merge");
		SmallVector<BasicBlock *> Tmp;
		Tmp.push_back(VectorBB);
		remapInstructionsInBlocks(Tmp, VMap);
		auto *Term = CheckBB->getTerminator();
		ChkBuilder.SetInsertPoint(CheckBB->getTerminator());
		ChkBuilder.CreateCondBr(ChkBuilder.getTrue(), ScalarBB, VectorBB);
		Term->eraseFromParent();
		DTU.applyUpdates({{DT->Insert, CheckBB, VectorBB}});
		if (auto *L = LI->getLoopFor(CheckBB))
		L->addBasicBlockToLoop(VectorBB, *LI);
		Changed = true;

		// Add !noalias metadata to memory accesses in the versioned block.
		ABataev (inline comment, Done): Do you consider the cost of all new check instructions as `1` here?
		fhahn (author, Done): Yes, it assumes each instruction costs `1`.
		LLVMContext &Ctx = BB->getContext();
		SjoerdMeijer (inline comment, Done): Perhaps create a local helper function to handle this case.
		fhahn (author, Done): I moved it to a function.
		MDBuilder MDB(Ctx);
		SjoerdMeijer (inline comment, Done): Re: "or possible", I would guess it's possible but not profitable?
		fhahn (author, Done): I guess saying not beneficial is enough. When we reach here, it is always possible. Originally the `possible` bit was intended to refer to the case where we version but nothing gets vectorized because SLP fails due to other reasons. That's not really clear though, I'll remove it.
		MDNode *Domain = MDB.createAnonymousAliasScopeDomain("SLPVerDomain");

		DenseMap<const RuntimeCheckingPtrGroup *, MDNode *> GroupToScope;
		for (const auto &Group : MemBounds)
		GroupToScope[&Group.second] = MDB.createAnonymousAliasScope(Domain);

		for (Instruction &I : *VectorBB) {
		auto *Ptr = getLoadStorePointerOperand(&I);
		if (!Ptr)
		continue;

		auto *PtrSCEV = SE->getSCEV(Ptr);
		Value *Obj = getUnderlyingObject(Ptr);
		if (!Obj) {
		if (auto *GEP = dyn_cast<GetElementPtrInst>(Ptr))
		Obj = GEP->getOperand(0);
		else
		continue;
		}

		auto BoundsIter = MemBounds.find(Obj);
		if (BoundsIter == MemBounds.end())
		continue;
		auto *LowerBound = BoundsIter->second.Low;
		auto *UpperBound = BoundsIter->second.High;
		auto *Scope = GroupToScope.find(&BoundsIter->second)->second;

		SjoerdMeijer (inline comment, Done): And another helper for this case too.
		fhahn (author, Done): I think moving this to a separate function would require a lot of arguments and the code here is quite compact.
		auto *LowerSub = SE->getMinusSCEV(PtrSCEV, LowerBound);
		auto *UpperSub = SE->getMinusSCEV(UpperBound, PtrSCEV);
		if (!isa<SCEVCouldNotCompute>(LowerSub) &&
		!isa<SCEVCouldNotCompute>(UpperSub) &&
		SE->isKnownNonNegative(LowerSub) && SE->isKnownNonNegative(UpperSub)) {
		I.setMetadata(
		LLVMContext::MD_alias_scope,
		MDNode::concatenate(I.getMetadata(LLVMContext::MD_alias_scope),
		MDNode::get(Ctx, Scope)));

		SmallVector<Metadata *, 4> NonAliasing;
		for (auto &KV : GroupToScope) {
		if (KV.first == &BoundsIter->second)
		continue;
		NonAliasing.push_back(KV.second);
		}
		I.setMetadata(LLVMContext::MD_noalias,
		MDNode::concatenate(I.getMetadata(LLVMContext::MD_noalias),
		MDNode::get(Ctx, NonAliasing)));
		}
		}

		DTU.flush();
		DT->updateDFSNumbers();
		collectSeedInstructions(VectorBB);

		// Vectorize trees that end at stores.
		assert(!Stores.empty() && "should have stores when versioning");
		LLVM_DEBUG(dbgs() << "SLP: Found stores for " << Stores.size()
		<< " underlying objects.\n");
		bool AnyVectorized = vectorizeStoreChains(R);
		Changed |= AnyVectorized;

		InstructionCost SLPCost = 0;
		InstructionCost ScalarCost = 0;
		if (AnyVectorized) {
		R.optimizeGatherSequence();
		R.removeDeletedInstructions();
		for (Instruction &I : *ScalarBB)
		ScalarCost += TTI->getInstructionCost(&I, TTI::TCK_RecipThroughput);
		for (Instruction &I : make_early_inc_range(reverse(*VectorBB))) {
		if (isInstructionTriviallyDead(&I, TLI)) {
		I.eraseFromParent();
		continue;
		}
		SLPCost += TTI->getInstructionCost(&I, TTI::TCK_RecipThroughput);
		}

		// Estimate the size of the runtime checks, consisting of computing lower &
		// upper bounds (2), the overlap checks (2) and the AND/OR to combine the
		// checks.
		SLPCost += 5 * PointerChecks.size() + MemBounds.size();
		}

		if (!AnyVectorized || SLPCost >= ScalarCost) {
		// Vectorization not beneficial or possible. Restore original state by
		// removing the introduced blocks.
		R.getORE()->emit([&]() {
		OptimizationRemarkMissed Rem(SV_NAME, "VersioningNotBeneficial",
		&*ScalarBB->begin());
		Rem << "Tried to version block but was not beneficial";
		if (AnyVectorized) {
		Rem << ore::NV("VectorCost", SLPCost)
		<< " >= " << ore::NV("ScalarCost", ScalarCost);
		} else
		Rem << " (nothing vectorized)";
		return Rem;
		});
		Changed = false;
		CFGChanged = false;
		undoVersionedBlocks(CheckBB, ScalarBB, DTU, LI, VectorBB, OriginalBBName,
		MergeBB, Tail);
		} else {
		R.getORE()->emit(
		OptimizationRemark(SV_NAME, "VersioningSuccessful", &*ScalarBB->begin())
		<< "SLP vectorization with versioning is beneficial "
		<< ore::NV("VectorCost", SLPCost) << " < "
		<< ore::NV("ScalarCost", ScalarCost)
		<< ore::NV("AnyVectorized", AnyVectorized));

		ChkBuilder.SetInsertPoint(CheckBB->getTerminator());
		SCEVExpander Exp(*SE, BB->getParent()->getParent()->getDataLayout(),
		"memcheck");
		Value *MemoryOverlap = addDiffRuntimeChecks(
		CheckBB->getTerminator(), PointerChecks, Exp,
		[MaxDist](IRBuilderBase &B, unsigned Bits) {
		return B.getIntN(Bits, MaxDist);
		},
		1);
		/* Value *MemoryOverlap =*/
		/*addRuntimeChecks(CheckBB->getTerminator(), nullptr, PointerChecks, Exp);*/
		assert(MemoryOverlap &&
		"runtime checks required, but no checks generated in IR?");
		cast<BranchInst>(CheckBB->getTerminator())->setCondition(MemoryOverlap);
		NumVersioningSuccessful++;
		}
		DTU.flush();
		DT->updateDFSNumbers();

		return {Changed, CFGChanged};
		}
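
The profitability decision at the end of the versioning function can be summarised in a few lines. The following is a minimal standalone sketch, not code from the patch: the helper names `estimateCheckCost` and `isVersioningBeneficial` are hypothetical, and plain integers stand in for `InstructionCost`. It only mirrors the two expressions visible in the diff, `SLPCost += 5 * PointerChecks.size() + MemBounds.size()` and `!AnyVectorized || SLPCost >= ScalarCost`.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of the patch): estimated size of the runtime
// checks, mirroring `5 * PointerChecks.size() + MemBounds.size()`.
constexpr std::uint64_t estimateCheckCost(std::size_t NumPointerChecks,
                                          std::size_t NumBoundGroups) {
  return 5 * NumPointerChecks + NumBoundGroups;
}

// Hypothetical helper: the versioned block is kept only if something was
// vectorized and the vector block plus its checks is strictly cheaper than
// the scalar block. Otherwise the patch undoes the versioning.
constexpr bool isVersioningBeneficial(bool AnyVectorized,
                                      std::uint64_t VectorBlockCost,
                                      std::uint64_t ScalarBlockCost,
                                      std::size_t NumPointerChecks,
                                      std::size_t NumBoundGroups) {
  return AnyVectorized &&
         VectorBlockCost +
                 estimateCheckCost(NumPointerChecks, NumBoundGroups) <
             ScalarBlockCost;
}
```

With one pointer check and two bound groups the checks are estimated at 7, so a vector block of cost 10 is only kept when the scalar block costs at least 18. A tie counts as not beneficial, matching the `>=` in the diff.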

		SLPVectorizerResult SLPVectorizerPass::runImpl(
		Function &F, ScalarEvolution *SE_, TargetTransformInfo *TTI_,
		TargetLibraryInfo *TLI_, AAResults *AA_, LoopInfo *LI_, DominatorTree *DT_,
		AssumptionCache *AC_, DemandedBits *DB_, OptimizationRemarkEmitter *ORE_) {
		SjoerdMeijer (inline comment, Done): Somewhere a high-level idea and description of the algorithm would be nice, I guess that would be here.
		fhahn (author, Done): I added an outline at the top of the function.
if (!RunSLPVectorization)		if (!RunSLPVectorization)
return false;		return {false, false};
SE = SE_;		SE = SE_;
TTI = TTI_;		TTI = TTI_;
TLI = TLI_;		TLI = TLI_;
AA = AA_;		AA = AA_;
LI = LI_;		LI = LI_;
DT = DT_;		DT = DT_;
AC = AC_;		AC = AC_;
DB = DB_;		DB = DB_;
DL = &F.getParent()->getDataLayout();		DL = &F.getParent()->getDataLayout();

Stores.clear();		Stores.clear();
GEPs.clear();		GEPs.clear();
bool Changed = false;		bool Changed = false;
		bool CFGChanged = false;

// If the target claims to have no vector registers don't attempt		// If the target claims to have no vector registers don't attempt
// vectorization.		// vectorization.
if (!TTI->getNumberOfRegisters(TTI->getRegisterClassForType(true))) {		if (!TTI->getNumberOfRegisters(TTI->getRegisterClassForType(true))) {
LLVM_DEBUG(		LLVM_DEBUG(
dbgs() << "SLP: Didn't find any vector registers for target, abort.\n");		dbgs() << "SLP: Didn't find any vector registers for target, abort.\n");
return false;		return {false, false};
}		}

// Don't vectorize when the attribute NoImplicitFloat is used.		// Don't vectorize when the attribute NoImplicitFloat is used.
if (F.hasFnAttribute(Attribute::NoImplicitFloat))		if (F.hasFnAttribute(Attribute::NoImplicitFloat))
return false;		return {false, false};

LLVM_DEBUG(dbgs() << "SLP: Analyzing blocks in " << F.getName() << ".\n");		LLVM_DEBUG(dbgs() << "SLP: Analyzing blocks in " << F.getName() << ".\n");

// Use the bottom up slp vectorizer to construct chains that start with		// Use the bottom up slp vectorizer to construct chains that start with
// store instructions.		// store instructions.
BoUpSLP R(&F, SE, TTI, TLI, AA, LI, DT, AC, DB, DL, ORE_);		BoUpSLP R(&F, SE, TTI, TLI, AA, LI, DT, AC, DB, DL, ORE_);

// A general note: the vectorizer must use BoUpSLP::eraseInstruction() to		// A general note: the vectorizer must use BoUpSLP::eraseInstruction() to
// delete instructions.		// delete instructions.

// Update DFS numbers now so that we can use them for ordering.		// Update DFS numbers now so that we can use them for ordering.
DT->updateDFSNumbers();		DT->updateDFSNumbers();

		SmallVector<BasicBlock *, 4> BlocksToRetry;
		SmallVector<SmallPtrSet<Value *, 8>, 4> BoundsToUse;
// Scan the blocks in the function in post order.		// Scan the blocks in the function in post order.
for (auto BB : post_order(&F.getEntryBlock())) {		for (auto BB : post_order(&F.getEntryBlock())) {
// Start new block - clear the list of reduction roots.		// Start new block - clear the list of reduction roots.
R.clearReductionData();		R.clearReductionData();
collectSeedInstructions(BB);		collectSeedInstructions(BB);

		bool VectorizedBlock = false;
// Vectorize trees that end at stores.		// Vectorize trees that end at stores.
if (!Stores.empty()) {		if (!Stores.empty()) {
LLVM_DEBUG(dbgs() << "SLP: Found stores for " << Stores.size()		LLVM_DEBUG(dbgs() << "SLP: Found stores for " << Stores.size()
<< " underlying objects.\n");		<< " underlying objects.\n");
Changed |= vectorizeStoreChains(R);		R.TrackedObjects.clear();

		if (EnableMemoryVersioning)
		R.CollectMemAccess = BB->size() <= 300;
		SjoerdMeijer (inline comment, Not Done): Make this an internal option, or a `static constexpr`.

		VectorizedBlock = vectorizeStoreChains(R);

		R.CollectMemAccess = false;
}		}

// Vectorize trees that end at reductions.		// Vectorize trees that end at reductions.
Changed |= vectorizeChainsInBlock(BB, R);		VectorizedBlock |= vectorizeChainsInBlock(BB, R);

// Vectorize the index computations of getelementptr instructions. This		// Vectorize the index computations of getelementptr instructions. This
// is primarily intended to catch gather-like idioms ending at		// is primarily intended to catch gather-like idioms ending at
// non-consecutive loads.		// non-consecutive loads.
if (!GEPs.empty()) {		if (!GEPs.empty()) {
LLVM_DEBUG(dbgs() << "SLP: Found GEPs for " << GEPs.size()		LLVM_DEBUG(dbgs() << "SLP: Found GEPs for " << GEPs.size()
<< " underlying objects.\n");		<< " underlying objects.\n");
Changed |= vectorizeGEPIndices(BB, R);		VectorizedBlock |= vectorizeGEPIndices(BB, R);
}		}

		if (!VectorizedBlock && !R.TrackedObjects.empty()) {
		BlocksToRetry.push_back(BB);
		BoundsToUse.push_back(R.TrackedObjects);
		}
		ABataev (inline comment, Not Done): Better to turn `300` into an option value.
		R.TrackedObjects.clear();
		Changed |= VectorizedBlock;
		}

		ABataev (inline comment, Not Done): I think this can be simplified. This may consume a lot of time. Better to implement some kind of a simple traversal here and check if there are memaccesses in the operands, check if they may alias, and gather these aliases.
		for (unsigned I = 0; I != BlocksToRetry.size(); I++) {
		auto Status =
		vectorizeBlockWithVersioning(BlocksToRetry[I], BoundsToUse[I], R);
		Changed |= Status.MadeAnyChange;
		CFGChanged |= Status.MadeCFGChange;
}		}

if (Changed) {		if (Changed) {
R.optimizeGatherSequence();		R.optimizeGatherSequence();
LLVM_DEBUG(dbgs() << "SLP: vectorized \"" << F.getName() << "\"\n");		LLVM_DEBUG(dbgs() << "SLP: vectorized \"" << F.getName() << "\"\n");
}		}
return Changed;
		return {Changed, CFGChanged};
}		}
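
Both reviewers ask for the hard-coded `300` in `R.CollectMemAccess = BB->size() <= 300;` to become a named constant or an option. The `static constexpr` variant could look roughly like the standalone sketch below. The names `MaxTrackedBlockSize` and `shouldCollectMemAccesses` are hypothetical and do not appear in the patch; an LLVM command-line option would be the other route the reviewers mention and is not shown here.

```cpp
#include <cstddef>

// Hypothetical name for the block-size limit that is hard-coded as 300 in the
// diff above.
static constexpr std::size_t MaxTrackedBlockSize = 300;

// Hypothetical helper: memory accesses are only collected when versioning is
// enabled and the block is small enough, as in the diff.
constexpr bool shouldCollectMemAccesses(bool EnableMemoryVersioning,
                                        std::size_t BlockSize) {
  return EnableMemoryVersioning && BlockSize <= MaxTrackedBlockSize;
}
```

This keeps the behaviour shown in the diff (collection stays off when versioning is disabled) while giving the limit a single, documented definition.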

bool SLPVectorizerPass::vectorizeStoreChain(ArrayRef<Value *> Chain, BoUpSLP &R,		bool SLPVectorizerPass::vectorizeStoreChain(ArrayRef<Value *> Chain, BoUpSLP &R,
unsigned Idx, unsigned MinVF) {		unsigned Idx, unsigned MinVF) {
LLVM_DEBUG(dbgs() << "SLP: Analyzing a store chain of length " << Chain.size()		LLVM_DEBUG(dbgs() << "SLP: Analyzing a store chain of length " << Chain.size()
<< "\n");		<< "\n");
const unsigned Sz = R.getVectorElementSize(Chain[0]);		const unsigned Sz = R.getVectorElementSize(Chain[0]);
unsigned VF = Chain.size();		unsigned VF = Chain.size();

llvm/test/Transforms/SLPVectorizer/AArch64/loadi8.ll

; NOTE: Assertions have been autogenerated by utils/update_test_checks.py		; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
; RUN: opt -S -slp-vectorizer -mtriple=aarch64--linux-gnu -mcpu=generic < %s | FileCheck %s		; RUN: opt -slp-memory-versioning -scoped-noalias-aa -S -slp-vectorizer -mtriple=aarch64--linux-gnu -mcpu=generic -enable-new-pm=false < %s | FileCheck %s

target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"		target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
target triple = "aarch64"		target triple = "aarch64"

%struct.weight_t = type { i32, i32 }		%struct.weight_t = type { i32, i32 }

define void @f_noalias(i8* noalias nocapture %dst, i8* noalias nocapture readonly %src, %struct.weight_t* noalias nocapture readonly %w) {		define void @f_noalias(i8* noalias nocapture %dst, i8* noalias nocapture readonly %src, %struct.weight_t* noalias nocapture readonly %w) {
; CHECK-LABEL: @f_noalias(		; CHECK-LABEL: @f_noalias(
▲ Show 20 Lines • Show All 71 Lines • ▼ Show 20 Lines	entry:
%conv.i.3 = trunc i32 %cond.i.3 to i8		%conv.i.3 = trunc i32 %cond.i.3 to i8
%arrayidx2.3 = getelementptr inbounds i8, i8* %dst, i64 3		%arrayidx2.3 = getelementptr inbounds i8, i8* %dst, i64 3
store i8 %conv.i.3, i8* %arrayidx2.3, align 1		store i8 %conv.i.3, i8* %arrayidx2.3, align 1
ret void		ret void
}		}

; This is the same test as above, except that the pointers don't have 'noalias'.		; This is the same test as above, except that the pointers don't have 'noalias'.
; This currently prevents SLP vectorization, but the SLP vectorizer should		; This currently prevents SLP vectorization, but the SLP vectorizer should
; be taught to emit runtime checks enabling vectorization.		; be taught to emit runtime checks enabling vectorization.
		SjoerdMeijer (inline comment, Not Done): This comment is out of date now.
;		;
define void @f_alias(i8* nocapture %dst, i8* nocapture readonly %src, %struct.weight_t* nocapture readonly %w) {		define void @f_alias(i8* nocapture %dst, i8* nocapture readonly %src, %struct.weight_t* nocapture readonly %w) {
; CHECK-LABEL: @f_alias(		; CHECK-LABEL: @f_alias(
; CHECK-NEXT: entry:		; CHECK-NEXT: entry:
		; CHECK-NEXT: [[DST38:%.*]] = ptrtoint i8* [[DST:%.*]] to i64
		; CHECK-NEXT: [[SRC37:%.*]] = ptrtoint i8* [[SRC:%.*]] to i64
; CHECK-NEXT: [[SCALE:%.*]] = getelementptr inbounds [[STRUCT_WEIGHT_T:%.*]], %struct.weight_t* [[W:%.*]], i64 0, i32 0		; CHECK-NEXT: [[SCALE:%.*]] = getelementptr inbounds [[STRUCT_WEIGHT_T:%.*]], %struct.weight_t* [[W:%.*]], i64 0, i32 0
; CHECK-NEXT: [[TMP0:%.*]] = load i32, i32* [[SCALE]], align 16		; CHECK-NEXT: [[TMP0:%.*]] = load i32, i32* [[SCALE]], align 16
; CHECK-NEXT: [[OFFSET:%.*]] = getelementptr inbounds [[STRUCT_WEIGHT_T]], %struct.weight_t* [[W]], i64 0, i32 1		; CHECK-NEXT: [[OFFSET:%.*]] = getelementptr inbounds [[STRUCT_WEIGHT_T]], %struct.weight_t* [[W]], i64 0, i32 1
; CHECK-NEXT: [[TMP1:%.*]] = load i32, i32* [[OFFSET]], align 4		; CHECK-NEXT: [[TMP1:%.*]] = load i32, i32* [[OFFSET]], align 4
; CHECK-NEXT: [[TMP2:%.*]] = load i8, i8* [[SRC:%.*]], align 1		; CHECK-NEXT: [[TMP2:%.*]] = sub i64 [[SRC37]], [[DST38]]
; CHECK-NEXT: [[CONV:%.*]] = zext i8 [[TMP2]] to i32		; CHECK-NEXT: [[DIFF_CHECK:%.*]] = icmp ult i64 [[TMP2]], 4
		; CHECK-NEXT: br i1 [[DIFF_CHECK]], label [[ENTRY_SCALAR:%.*]], label [[ENTRY_SLPVERSIONED1:%.*]]
		; CHECK: entry.scalar:
		; CHECK-NEXT: [[TMP3:%.*]] = load i8, i8* [[SRC]], align 1
		; CHECK-NEXT: [[CONV:%.*]] = zext i8 [[TMP3]] to i32
; CHECK-NEXT: [[MUL:%.*]] = mul nsw i32 [[TMP0]], [[CONV]]		; CHECK-NEXT: [[MUL:%.*]] = mul nsw i32 [[TMP0]], [[CONV]]
; CHECK-NEXT: [[ADD:%.*]] = add nsw i32 [[MUL]], [[TMP1]]		; CHECK-NEXT: [[ADD:%.*]] = add nsw i32 [[MUL]], [[TMP1]]
; CHECK-NEXT: [[TOBOOL_NOT_I:%.*]] = icmp ult i32 [[ADD]], 256		; CHECK-NEXT: [[TOBOOL_NOT_I:%.*]] = icmp ult i32 [[ADD]], 256
; CHECK-NEXT: [[TMP3:%.*]] = icmp sgt i32 [[ADD]], 0		; CHECK-NEXT: [[TMP4:%.*]] = icmp sgt i32 [[ADD]], 0
; CHECK-NEXT: [[SHR_I:%.*]] = sext i1 [[TMP3]] to i32		; CHECK-NEXT: [[SHR_I:%.*]] = sext i1 [[TMP4]] to i32
; CHECK-NEXT: [[COND_I:%.*]] = select i1 [[TOBOOL_NOT_I]], i32 [[ADD]], i32 [[SHR_I]]		; CHECK-NEXT: [[COND_I:%.*]] = select i1 [[TOBOOL_NOT_I]], i32 [[ADD]], i32 [[SHR_I]]
; CHECK-NEXT: [[CONV_I:%.*]] = trunc i32 [[COND_I]] to i8		; CHECK-NEXT: [[CONV_I:%.*]] = trunc i32 [[COND_I]] to i8
; CHECK-NEXT: store i8 [[CONV_I]], i8* [[DST:%.*]], align 1		; CHECK-NEXT: store i8 [[CONV_I]], i8* [[DST]], align 1
; CHECK-NEXT: [[ARRAYIDX_1:%.*]] = getelementptr inbounds i8, i8* [[SRC]], i64 1		; CHECK-NEXT: [[ARRAYIDX_1:%.*]] = getelementptr inbounds i8, i8* [[SRC]], i64 1
; CHECK-NEXT: [[TMP4:%.*]] = load i8, i8* [[ARRAYIDX_1]], align 1		; CHECK-NEXT: [[TMP5:%.*]] = load i8, i8* [[ARRAYIDX_1]], align 1
; CHECK-NEXT: [[CONV_1:%.*]] = zext i8 [[TMP4]] to i32		; CHECK-NEXT: [[CONV_1:%.*]] = zext i8 [[TMP5]] to i32
; CHECK-NEXT: [[MUL_1:%.*]] = mul nsw i32 [[TMP0]], [[CONV_1]]		; CHECK-NEXT: [[MUL_1:%.*]] = mul nsw i32 [[TMP0]], [[CONV_1]]
; CHECK-NEXT: [[ADD_1:%.*]] = add nsw i32 [[MUL_1]], [[TMP1]]		; CHECK-NEXT: [[ADD_1:%.*]] = add nsw i32 [[MUL_1]], [[TMP1]]
; CHECK-NEXT: [[TOBOOL_NOT_I_1:%.*]] = icmp ult i32 [[ADD_1]], 256		; CHECK-NEXT: [[TOBOOL_NOT_I_1:%.*]] = icmp ult i32 [[ADD_1]], 256
; CHECK-NEXT: [[TMP5:%.*]] = icmp sgt i32 [[ADD_1]], 0		; CHECK-NEXT: [[TMP6:%.*]] = icmp sgt i32 [[ADD_1]], 0
; CHECK-NEXT: [[SHR_I_1:%.*]] = sext i1 [[TMP5]] to i32		; CHECK-NEXT: [[SHR_I_1:%.*]] = sext i1 [[TMP6]] to i32
; CHECK-NEXT: [[COND_I_1:%.*]] = select i1 [[TOBOOL_NOT_I_1]], i32 [[ADD_1]], i32 [[SHR_I_1]]		; CHECK-NEXT: [[COND_I_1:%.*]] = select i1 [[TOBOOL_NOT_I_1]], i32 [[ADD_1]], i32 [[SHR_I_1]]
; CHECK-NEXT: [[CONV_I_1:%.*]] = trunc i32 [[COND_I_1]] to i8		; CHECK-NEXT: [[CONV_I_1:%.*]] = trunc i32 [[COND_I_1]] to i8
; CHECK-NEXT: [[ARRAYIDX2_1:%.*]] = getelementptr inbounds i8, i8* [[DST]], i64 1		; CHECK-NEXT: [[ARRAYIDX2_1:%.*]] = getelementptr inbounds i8, i8* [[DST]], i64 1
; CHECK-NEXT: store i8 [[CONV_I_1]], i8* [[ARRAYIDX2_1]], align 1		; CHECK-NEXT: store i8 [[CONV_I_1]], i8* [[ARRAYIDX2_1]], align 1
; CHECK-NEXT: [[ARRAYIDX_2:%.*]] = getelementptr inbounds i8, i8* [[SRC]], i64 2		; CHECK-NEXT: [[ARRAYIDX_2:%.*]] = getelementptr inbounds i8, i8* [[SRC]], i64 2
; CHECK-NEXT: [[TMP6:%.*]] = load i8, i8* [[ARRAYIDX_2]], align 1		; CHECK-NEXT: [[TMP7:%.*]] = load i8, i8* [[ARRAYIDX_2]], align 1
; CHECK-NEXT: [[CONV_2:%.*]] = zext i8 [[TMP6]] to i32		; CHECK-NEXT: [[CONV_2:%.*]] = zext i8 [[TMP7]] to i32
; CHECK-NEXT: [[MUL_2:%.*]] = mul nsw i32 [[TMP0]], [[CONV_2]]		; CHECK-NEXT: [[MUL_2:%.*]] = mul nsw i32 [[TMP0]], [[CONV_2]]
; CHECK-NEXT: [[ADD_2:%.*]] = add nsw i32 [[MUL_2]], [[TMP1]]		; CHECK-NEXT: [[ADD_2:%.*]] = add nsw i32 [[MUL_2]], [[TMP1]]
; CHECK-NEXT: [[TOBOOL_NOT_I_2:%.*]] = icmp ult i32 [[ADD_2]], 256		; CHECK-NEXT: [[TOBOOL_NOT_I_2:%.*]] = icmp ult i32 [[ADD_2]], 256
; CHECK-NEXT: [[TMP7:%.*]] = icmp sgt i32 [[ADD_2]], 0		; CHECK-NEXT: [[TMP8:%.*]] = icmp sgt i32 [[ADD_2]], 0
; CHECK-NEXT: [[SHR_I_2:%.*]] = sext i1 [[TMP7]] to i32		; CHECK-NEXT: [[SHR_I_2:%.*]] = sext i1 [[TMP8]] to i32
; CHECK-NEXT: [[COND_I_2:%.*]] = select i1 [[TOBOOL_NOT_I_2]], i32 [[ADD_2]], i32 [[SHR_I_2]]		; CHECK-NEXT: [[COND_I_2:%.*]] = select i1 [[TOBOOL_NOT_I_2]], i32 [[ADD_2]], i32 [[SHR_I_2]]
; CHECK-NEXT: [[CONV_I_2:%.*]] = trunc i32 [[COND_I_2]] to i8		; CHECK-NEXT: [[CONV_I_2:%.*]] = trunc i32 [[COND_I_2]] to i8
; CHECK-NEXT: [[ARRAYIDX2_2:%.*]] = getelementptr inbounds i8, i8* [[DST]], i64 2		; CHECK-NEXT: [[ARRAYIDX2_2:%.*]] = getelementptr inbounds i8, i8* [[DST]], i64 2
; CHECK-NEXT: store i8 [[CONV_I_2]], i8* [[ARRAYIDX2_2]], align 1		; CHECK-NEXT: store i8 [[CONV_I_2]], i8* [[ARRAYIDX2_2]], align 1
; CHECK-NEXT: [[ARRAYIDX_3:%.*]] = getelementptr inbounds i8, i8* [[SRC]], i64 3		; CHECK-NEXT: [[ARRAYIDX_3:%.*]] = getelementptr inbounds i8, i8* [[SRC]], i64 3
; CHECK-NEXT: [[TMP8:%.*]] = load i8, i8* [[ARRAYIDX_3]], align 1		; CHECK-NEXT: [[TMP9:%.*]] = load i8, i8* [[ARRAYIDX_3]], align 1
; CHECK-NEXT: [[CONV_3:%.*]] = zext i8 [[TMP8]] to i32		; CHECK-NEXT: [[CONV_3:%.*]] = zext i8 [[TMP9]] to i32
; CHECK-NEXT: [[MUL_3:%.*]] = mul nsw i32 [[TMP0]], [[CONV_3]]		; CHECK-NEXT: [[MUL_3:%.*]] = mul nsw i32 [[TMP0]], [[CONV_3]]
; CHECK-NEXT: [[ADD_3:%.*]] = add nsw i32 [[MUL_3]], [[TMP1]]		; CHECK-NEXT: [[ADD_3:%.*]] = add nsw i32 [[MUL_3]], [[TMP1]]
; CHECK-NEXT: [[TOBOOL_NOT_I_3:%.*]] = icmp ult i32 [[ADD_3]], 256		; CHECK-NEXT: [[TOBOOL_NOT_I_3:%.*]] = icmp ult i32 [[ADD_3]], 256
; CHECK-NEXT: [[TMP9:%.*]] = icmp sgt i32 [[ADD_3]], 0		; CHECK-NEXT: [[TMP10:%.*]] = icmp sgt i32 [[ADD_3]], 0
; CHECK-NEXT: [[SHR_I_3:%.*]] = sext i1 [[TMP9]] to i32		; CHECK-NEXT: [[SHR_I_3:%.*]] = sext i1 [[TMP10]] to i32
; CHECK-NEXT: [[COND_I_3:%.*]] = select i1 [[TOBOOL_NOT_I_3]], i32 [[ADD_3]], i32 [[SHR_I_3]]		; CHECK-NEXT: [[COND_I_3:%.*]] = select i1 [[TOBOOL_NOT_I_3]], i32 [[ADD_3]], i32 [[SHR_I_3]]
; CHECK-NEXT: [[CONV_I_3:%.*]] = trunc i32 [[COND_I_3]] to i8		; CHECK-NEXT: [[CONV_I_3:%.*]] = trunc i32 [[COND_I_3]] to i8
; CHECK-NEXT: [[ARRAYIDX2_3:%.*]] = getelementptr inbounds i8, i8* [[DST]], i64 3		; CHECK-NEXT: [[ARRAYIDX2_3:%.*]] = getelementptr inbounds i8, i8* [[DST]], i64 3
; CHECK-NEXT: store i8 [[CONV_I_3]], i8* [[ARRAYIDX2_3]], align 1		; CHECK-NEXT: store i8 [[CONV_I_3]], i8* [[ARRAYIDX2_3]], align 1
		; CHECK-NEXT: br label [[ENTRY_MERGE:%.*]]
		; CHECK: entry.merge:
		ABataev (inline comment, Not Done): What's changed here? Just the code growth?
		SjoerdMeijer (inline comment, Not Done): Was wondering the same. To me it looks like we are emitting the runtime checks, but still not vectorising, which we probably want to avoid.
		fhahn (author, Done): This is what happens if versioning was attempted, but did not enable SLP vectorisation for the block. Probably a case where some memory operations were excluded for some reason. I'll investigate. The versioned block has been deleted; the dead runtime checks still remain, but those should be cleaned up by some later pass.
; CHECK-NEXT: ret void		; CHECK-NEXT: ret void
		; CHECK: entry.slpversioned1:
		; CHECK-NEXT: [[TMP11:%.*]] = bitcast i8* [[SRC]] to <4 x i8>*
		; CHECK-NEXT: [[TMP12:%.*]] = load <4 x i8>, <4 x i8>* [[TMP11]], align 1, !alias.scope !0, !noalias !3
		; CHECK-NEXT: [[TMP13:%.*]] = zext <4 x i8> [[TMP12]] to <4 x i32>
		; CHECK-NEXT: [[TMP14:%.*]] = insertelement <4 x i32> poison, i32 [[TMP0]], i32 0
		; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <4 x i32> [[TMP14]], <4 x i32> poison, <4 x i32> zeroinitializer
		; CHECK-NEXT: [[TMP15:%.*]] = mul nsw <4 x i32> [[SHUFFLE]], [[TMP13]]
		; CHECK-NEXT: [[TMP16:%.*]] = insertelement <4 x i32> poison, i32 [[TMP1]], i32 0
		; CHECK-NEXT: [[SHUFFLE36:%.*]] = shufflevector <4 x i32> [[TMP16]], <4 x i32> poison, <4 x i32> zeroinitializer
		; CHECK-NEXT: [[TMP17:%.*]] = add nsw <4 x i32> [[TMP15]], [[SHUFFLE36]]
		; CHECK-NEXT: [[TMP18:%.*]] = icmp ult <4 x i32> [[TMP17]], <i32 256, i32 256, i32 256, i32 256>
		; CHECK-NEXT: [[TMP19:%.*]] = icmp sgt <4 x i32> [[TMP17]], zeroinitializer
		; CHECK-NEXT: [[TMP20:%.*]] = sext <4 x i1> [[TMP19]] to <4 x i32>
		; CHECK-NEXT: [[TMP21:%.*]] = select <4 x i1> [[TMP18]], <4 x i32> [[TMP17]], <4 x i32> [[TMP20]]
		; CHECK-NEXT: [[TMP22:%.*]] = trunc <4 x i32> [[TMP21]] to <4 x i8>
		; CHECK-NEXT: [[TMP23:%.*]] = bitcast i8* [[DST]] to <4 x i8>*
		; CHECK-NEXT: store <4 x i8> [[TMP22]], <4 x i8>* [[TMP23]], align 1, !alias.scope !3, !noalias !0
		; CHECK-NEXT: br label [[ENTRY_MERGE]]
;		;
entry:		entry:
%scale = getelementptr inbounds %struct.weight_t, %struct.weight_t* %w, i64 0, i32 0		%scale = getelementptr inbounds %struct.weight_t, %struct.weight_t* %w, i64 0, i32 0
%0 = load i32, i32* %scale, align 16		%0 = load i32, i32* %scale, align 16
%offset = getelementptr inbounds %struct.weight_t, %struct.weight_t* %w, i64 0, i32 1		%offset = getelementptr inbounds %struct.weight_t, %struct.weight_t* %w, i64 0, i32 1
%1 = load i32, i32* %offset, align 4		%1 = load i32, i32* %offset, align 4
%2 = load i8, i8* %src, align 1		%2 = load i8, i8* %src, align 1
%conv = zext i8 %2 to i32		%conv = zext i8 %2 to i32
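
The runtime check expected in `@f_alias` is a single unsigned comparison on the pointer difference: `sub i64 [[SRC37]], [[DST38]]` followed by `icmp ult ..., 4`, with the `true` edge going to `entry.scalar`. A minimal standalone sketch of that arithmetic follows. The function name `takesScalarPath` is hypothetical; it only mirrors the CHECK lines above and makes no claim about how `addDiffRuntimeChecks` is implemented.

```cpp
#include <cstdint>

// Hypothetical helper mirroring the CHECK lines: the scalar block is taken
// when the unsigned difference (Src - Dst) is below MaxDist. Unsigned
// subtraction wraps, so Src < Dst yields a very large difference.
constexpr bool takesScalarPath(std::uint64_t Src, std::uint64_t Dst,
                               std::uint64_t MaxDist) {
  return Src - Dst < MaxDist;
}
```

For the 4-byte case in the test, `Src == Dst + 2` selects the scalar block, while `Src == Dst + 4` selects the versioned vector block. When `Src` is below `Dst` the subtraction wraps around, the comparison is false, and the vector block is selected as well.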

llvm/test/Transforms/SLPVectorizer/AArch64/memory-runtime-checks-in-loops.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt -slp-vectorizer -mtriple=arm64-apple-ios -S %s | FileCheck %s			; RUN: opt -scoped-noalias-aa -slp-vectorizer -slp-memory-versioning -enable-new-pm=false -mtriple=arm64-apple-ios -S %s | FileCheck %s
	; RUN: opt -aa-pipeline='basic-aa,scoped-noalias-aa' -passes=slp-vectorizer -mtriple=arm64-apple-darwin -S %s | FileCheck %s			; RUN: opt -aa-pipeline='basic-aa,scoped-noalias-aa' -slp-memory-versioning -passes=slp-vectorizer -mtriple=arm64-apple-darwin -S %s | FileCheck %s

	define void @loop1(i32* %A, i32* %B, i64 %N) {			define void @loop1(i32* %A, i32* %B, i64 %N) {
	; CHECK-LABEL: @loop1(			; CHECK-LABEL: @loop1(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
				; CHECK-NEXT: [[A29:%.*]] = ptrtoint i32* [[A:%.*]] to i64
				; CHECK-NEXT: [[B28:%.*]] = ptrtoint i32* [[B:%.*]] to i64
	; CHECK-NEXT: br label [[LOOP:%.*]]			; CHECK-NEXT: br label [[LOOP:%.*]]
	; CHECK: loop:			; CHECK: loop:
	; CHECK-NEXT: [[IV:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[IV_NEXT:%.*]], [[LOOP]] ]			; CHECK-NEXT: [[INDVAR:%.*]] = phi i64 [ [[INDVAR_NEXT:%.*]], [[LOOP_TAIL:%.*]] ], [ 0, [[ENTRY:%.*]] ]
	; CHECK-NEXT: [[B_GEP_0:%.*]] = getelementptr inbounds i32, i32* [[B:%.*]], i64 [[IV]]			; CHECK-NEXT: [[IV:%.*]] = phi i64 [ 0, [[ENTRY]] ], [ [[IV_NEXT:%.*]], [[LOOP_TAIL]] ]
				; CHECK-NEXT: [[TMP0:%.*]] = shl i64 [[INDVAR]], 6
				; CHECK-NEXT: [[TMP1:%.*]] = add i64 [[B28]], [[TMP0]]
				; CHECK-NEXT: [[TMP2:%.*]] = add i64 [[A29]], [[TMP0]]
				; CHECK-NEXT: [[B_GEP_0:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 [[IV]]
				; CHECK-NEXT: [[TMP3:%.*]] = sub i64 [[TMP1]], [[TMP2]]
				; CHECK-NEXT: [[DIFF_CHECK:%.*]] = icmp ult i64 [[TMP3]], 16
				; CHECK-NEXT: br i1 [[DIFF_CHECK]], label [[LOOP_SCALAR:%.*]], label [[LOOP_SLPVERSIONED1:%.*]]
				; CHECK: loop.scalar:
	; CHECK-NEXT: [[B_0:%.*]] = load i32, i32* [[B_GEP_0]], align 4			; CHECK-NEXT: [[B_0:%.*]] = load i32, i32* [[B_GEP_0]], align 4
	; CHECK-NEXT: [[A_GEP_0:%.*]] = getelementptr inbounds i32, i32* [[A:%.*]], i64 [[IV]]			; CHECK-NEXT: [[A_GEP_0:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 [[IV]]
	; CHECK-NEXT: [[A_0:%.*]] = load i32, i32* [[A_GEP_0]], align 4			; CHECK-NEXT: [[A_0:%.*]] = load i32, i32* [[A_GEP_0]], align 4
	; CHECK-NEXT: [[ADD_0:%.*]] = add i32 [[A_0]], 20			; CHECK-NEXT: [[ADD_0:%.*]] = add i32 [[A_0]], 20
	; CHECK-NEXT: [[XOR_0:%.*]] = xor i32 [[ADD_0]], [[B_0]]			; CHECK-NEXT: [[XOR_0:%.*]] = xor i32 [[ADD_0]], [[B_0]]
	; CHECK-NEXT: store i32 [[XOR_0]], i32* [[A_GEP_0]], align 4			; CHECK-NEXT: store i32 [[XOR_0]], i32* [[A_GEP_0]], align 4
	; CHECK-NEXT: [[IV_1:%.*]] = or i64 [[IV]], 1			; CHECK-NEXT: [[IV_1:%.*]] = or i64 [[IV]], 1
	; CHECK-NEXT: [[B_GEP_1:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 [[IV_1]]			; CHECK-NEXT: [[B_GEP_1:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 [[IV_1]]
	; CHECK-NEXT: [[B_1:%.*]] = load i32, i32* [[B_GEP_1]], align 4			; CHECK-NEXT: [[B_1:%.*]] = load i32, i32* [[B_GEP_1]], align 4
	; CHECK-NEXT: [[A_GEP_1:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 [[IV_1]]			; CHECK-NEXT: [[A_GEP_1:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 [[IV_1]]
	; CHECK-NEXT: [[IV_3:%.*]] = or i64 [[IV]], 3			; CHECK-NEXT: [[IV_3:%.*]] = or i64 [[IV]], 3
	; CHECK-NEXT: [[B_GEP_3:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 [[IV_3]]			; CHECK-NEXT: [[B_GEP_3:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 [[IV_3]]
	; CHECK-NEXT: [[B_3:%.*]] = load i32, i32* [[B_GEP_3]], align 4			; CHECK-NEXT: [[B_3:%.*]] = load i32, i32* [[B_GEP_3]], align 4
	; CHECK-NEXT: [[A_GEP_3:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 [[IV_3]]			; CHECK-NEXT: [[A_GEP_3:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 [[IV_3]]
	; CHECK-NEXT: [[A_3:%.*]] = load i32, i32* [[A_GEP_3]], align 4			; CHECK-NEXT: [[A_3:%.*]] = load i32, i32* [[A_GEP_3]], align 4
	; CHECK-NEXT: [[ADD_3:%.*]] = add i32 [[A_3]], 20			; CHECK-NEXT: [[ADD_3:%.*]] = add i32 [[A_3]], 20
	; CHECK-NEXT: [[XOR_3:%.*]] = xor i32 [[ADD_3]], [[B_3]]			; CHECK-NEXT: [[XOR_3:%.*]] = xor i32 [[ADD_3]], [[B_3]]
	; CHECK-NEXT: store i32 [[XOR_3]], i32* [[A_GEP_3]], align 4			; CHECK-NEXT: store i32 [[XOR_3]], i32* [[A_GEP_3]], align 4
				; CHECK-NEXT: br label [[LOOP_MERGE:%.*]]
				; CHECK: loop.merge:
				; CHECK-NEXT: br label [[LOOP_TAIL]]
				; CHECK: loop.tail:
	; CHECK-NEXT: [[IV_NEXT]] = add nuw nsw i64 [[IV]], 16			; CHECK-NEXT: [[IV_NEXT]] = add nuw nsw i64 [[IV]], 16
	; CHECK-NEXT: [[COND:%.*]] = icmp ult i64 [[IV_NEXT]], [[N:%.*]]			; CHECK-NEXT: [[COND:%.*]] = icmp ult i64 [[IV_NEXT]], [[N:%.*]]
				; CHECK-NEXT: [[INDVAR_NEXT]] = add i64 [[INDVAR]], 1
	; CHECK-NEXT: br i1 [[COND]], label [[LOOP]], label [[EXIT:%.*]]			; CHECK-NEXT: br i1 [[COND]], label [[LOOP]], label [[EXIT:%.*]]
	; CHECK: exit:			; CHECK: exit:
	; CHECK-NEXT: ret void			; CHECK-NEXT: ret void
				; CHECK: loop.slpversioned1:
				; CHECK-NEXT: [[A_GEP_03:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 [[IV]]
				; CHECK-NEXT: [[TMP4:%.*]] = bitcast i32* [[B_GEP_0]] to <4 x i32>*
				; CHECK-NEXT: [[TMP5:%.*]] = load <4 x i32>, <4 x i32>* [[TMP4]], align 4, !alias.scope !0, !noalias !3
				; CHECK-NEXT: [[TMP6:%.*]] = bitcast i32* [[A_GEP_03]] to <4 x i32>*
				; CHECK-NEXT: [[TMP7:%.*]] = load <4 x i32>, <4 x i32>* [[TMP6]], align 4, !alias.scope !3, !noalias !0
				; CHECK-NEXT: [[TMP8:%.*]] = add <4 x i32> [[TMP7]], <i32 20, i32 20, i32 20, i32 20>
				; CHECK-NEXT: [[TMP9:%.*]] = xor <4 x i32> [[TMP8]], [[TMP5]]
				; CHECK-NEXT: [[TMP10:%.*]] = bitcast i32* [[A_GEP_03]] to <4 x i32>*
				; CHECK-NEXT: store <4 x i32> [[TMP9]], <4 x i32>* [[TMP10]], align 4, !alias.scope !3, !noalias !0
				; CHECK-NEXT: br label [[LOOP_MERGE]]
	;			;
	entry:			entry:
	br label %loop			br label %loop

	loop:			loop:
	%iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]			%iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
	%B.gep.0 = getelementptr inbounds i32, i32* %B, i64 %iv			%B.gep.0 = getelementptr inbounds i32, i32* %B, i64 %iv
	%B.0 = load i32, i32* %B.gep.0, align 4			%B.0 = load i32, i32* %B.gep.0, align 4
	[32 lines collapsed]

	exit:			exit:
	ret void			ret void
	}			}

	define void @loop_iv_update_at_start(float* %src, float* %dst) #0 {			define void @loop_iv_update_at_start(float* %src, float* %dst) #0 {
	; CHECK-LABEL: @loop_iv_update_at_start(			; CHECK-LABEL: @loop_iv_update_at_start(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
				; CHECK-NEXT: [[DST27:%.*]] = ptrtoint float* [[DST:%.*]] to i64
				; CHECK-NEXT: [[SRC26:%.*]] = ptrtoint float* [[SRC:%.*]] to i64
	; CHECK-NEXT: br label [[LOOP:%.*]]			; CHECK-NEXT: br label [[LOOP:%.*]]
	; CHECK: loop:			; CHECK: loop:
	; CHECK-NEXT: [[IV:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[IV_NEXT:%.*]], [[LOOP]] ]			; CHECK-NEXT: [[IV:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[IV_NEXT:%.*]], [[LOOP_MERGE:%.*]] ]
	; CHECK-NEXT: [[IV_NEXT]] = add i32 [[IV]], 1			; CHECK-NEXT: [[IV_NEXT]] = add i32 [[IV]], 1
	; CHECK-NEXT: [[COND:%.*]] = icmp ult i32 [[IV]], 2000			; CHECK-NEXT: [[COND:%.*]] = icmp ult i32 [[IV]], 2000
	; CHECK-NEXT: [[SRC_GEP_0:%.*]] = getelementptr inbounds float, float* [[SRC:%.*]], i64 0			; CHECK-NEXT: [[SRC_GEP_0:%.*]] = getelementptr inbounds float, float* [[SRC]], i64 0
				; CHECK-NEXT: [[TMP0:%.*]] = sub i64 [[SRC26]], [[DST27]]
				; CHECK-NEXT: [[DIFF_CHECK:%.*]] = icmp ult i64 [[TMP0]], 20
				; CHECK-NEXT: br i1 [[DIFF_CHECK]], label [[LOOP_SCALAR:%.*]], label [[LOOP_SLPVERSIONED1:%.*]]
				; CHECK: loop.scalar:
	; CHECK-NEXT: [[SRC_0:%.*]] = load float, float* [[SRC_GEP_0]], align 8			; CHECK-NEXT: [[SRC_0:%.*]] = load float, float* [[SRC_GEP_0]], align 8
	; CHECK-NEXT: [[ADD_0:%.*]] = fadd float [[SRC_0]], 1.000000e+00			; CHECK-NEXT: [[ADD_0:%.*]] = fadd float [[SRC_0]], 1.000000e+00
	; CHECK-NEXT: [[MUL_0:%.*]] = fmul float [[ADD_0]], [[SRC_0]]			; CHECK-NEXT: [[MUL_0:%.*]] = fmul float [[ADD_0]], [[SRC_0]]
	; CHECK-NEXT: [[DST_GEP_0:%.*]] = getelementptr inbounds float, float* [[DST:%.*]], i64 0			; CHECK-NEXT: [[DST_GEP_0:%.*]] = getelementptr inbounds float, float* [[DST]], i64 0
	; CHECK-NEXT: store float [[MUL_0]], float* [[DST_GEP_0]], align 8			; CHECK-NEXT: store float [[MUL_0]], float* [[DST_GEP_0]], align 8
	; CHECK-NEXT: [[SRC_GEP_1:%.*]] = getelementptr inbounds float, float* [[SRC]], i64 1			; CHECK-NEXT: [[SRC_GEP_1:%.*]] = getelementptr inbounds float, float* [[SRC]], i64 1
	; CHECK-NEXT: [[SRC_1:%.*]] = load float, float* [[SRC_GEP_1]], align 8			; CHECK-NEXT: [[SRC_1:%.*]] = load float, float* [[SRC_GEP_1]], align 8
	; CHECK-NEXT: [[ADD_1:%.*]] = fadd float [[SRC_1]], 1.000000e+00			; CHECK-NEXT: [[ADD_1:%.*]] = fadd float [[SRC_1]], 1.000000e+00
	; CHECK-NEXT: [[MUL_1:%.*]] = fmul float [[ADD_1]], [[SRC_1]]			; CHECK-NEXT: [[MUL_1:%.*]] = fmul float [[ADD_1]], [[SRC_1]]
	; CHECK-NEXT: [[DST_GEP_1:%.*]] = getelementptr inbounds float, float* [[DST]], i64 1			; CHECK-NEXT: [[DST_GEP_1:%.*]] = getelementptr inbounds float, float* [[DST]], i64 1
	; CHECK-NEXT: store float [[MUL_1]], float* [[DST_GEP_1]], align 8			; CHECK-NEXT: store float [[MUL_1]], float* [[DST_GEP_1]], align 8
	; CHECK-NEXT: [[SRC_GEP_2:%.*]] = getelementptr inbounds float, float* [[SRC]], i64 2			; CHECK-NEXT: [[SRC_GEP_2:%.*]] = getelementptr inbounds float, float* [[SRC]], i64 2
	[9 lines collapsed]
	; CHECK-NEXT: [[DST_GEP_3:%.*]] = getelementptr inbounds float, float* [[DST]], i64 3			; CHECK-NEXT: [[DST_GEP_3:%.*]] = getelementptr inbounds float, float* [[DST]], i64 3
	; CHECK-NEXT: store float [[MUL_3]], float* [[DST_GEP_3]], align 8			; CHECK-NEXT: store float [[MUL_3]], float* [[DST_GEP_3]], align 8
	; CHECK-NEXT: [[SRC_GEP_4:%.*]] = getelementptr inbounds float, float* [[SRC]], i64 4			; CHECK-NEXT: [[SRC_GEP_4:%.*]] = getelementptr inbounds float, float* [[SRC]], i64 4
	; CHECK-NEXT: [[SRC_4:%.*]] = load float, float* [[SRC_GEP_4]], align 8			; CHECK-NEXT: [[SRC_4:%.*]] = load float, float* [[SRC_GEP_4]], align 8
	; CHECK-NEXT: [[ADD_4:%.*]] = fadd float [[SRC_4]], 1.000000e+00			; CHECK-NEXT: [[ADD_4:%.*]] = fadd float [[SRC_4]], 1.000000e+00
	; CHECK-NEXT: [[MUL_4:%.*]] = fmul float [[ADD_4]], [[SRC_4]]			; CHECK-NEXT: [[MUL_4:%.*]] = fmul float [[ADD_4]], [[SRC_4]]
	; CHECK-NEXT: [[DST_GEP_4:%.*]] = getelementptr inbounds float, float* [[DST]], i64 4			; CHECK-NEXT: [[DST_GEP_4:%.*]] = getelementptr inbounds float, float* [[DST]], i64 4
	; CHECK-NEXT: store float [[MUL_4]], float* [[DST_GEP_4]], align 8			; CHECK-NEXT: store float [[MUL_4]], float* [[DST_GEP_4]], align 8
				; CHECK-NEXT: br label [[LOOP_MERGE]]
				; CHECK: loop.merge:
	; CHECK-NEXT: br i1 [[COND]], label [[LOOP]], label [[EXIT:%.*]]			; CHECK-NEXT: br i1 [[COND]], label [[LOOP]], label [[EXIT:%.*]]
	; CHECK: exit:			; CHECK: exit:
	; CHECK-NEXT: ret void			; CHECK-NEXT: ret void
				; CHECK: loop.slpversioned1:
				; CHECK-NEXT: [[DST_GEP_05:%.*]] = getelementptr inbounds float, float* [[DST]], i64 0
				; CHECK-NEXT: [[TMP1:%.*]] = bitcast float* [[SRC_GEP_0]] to <4 x float>*
				; CHECK-NEXT: [[TMP2:%.*]] = load <4 x float>, <4 x float>* [[TMP1]], align 8, !alias.scope !5, !noalias !8
				; CHECK-NEXT: [[TMP3:%.*]] = fadd <4 x float> [[TMP2]], <float 1.000000e+00, float 1.000000e+00, float 1.000000e+00, float 1.000000e+00>
				; CHECK-NEXT: [[TMP4:%.*]] = fmul <4 x float> [[TMP3]], [[TMP2]]
				; CHECK-NEXT: [[TMP5:%.*]] = bitcast float* [[DST_GEP_05]] to <4 x float>*
				; CHECK-NEXT: store <4 x float> [[TMP4]], <4 x float>* [[TMP5]], align 8, !alias.scope !8, !noalias !5
				; CHECK-NEXT: [[SRC_GEP_421:%.*]] = getelementptr inbounds float, float* [[SRC]], i64 4
				; CHECK-NEXT: [[SRC_422:%.*]] = load float, float* [[SRC_GEP_421]], align 8, !alias.scope !5, !noalias !8
				; CHECK-NEXT: [[ADD_423:%.*]] = fadd float [[SRC_422]], 1.000000e+00
				; CHECK-NEXT: [[MUL_424:%.*]] = fmul float [[ADD_423]], [[SRC_422]]
				; CHECK-NEXT: [[DST_GEP_425:%.*]] = getelementptr inbounds float, float* [[DST]], i64 4
				; CHECK-NEXT: store float [[MUL_424]], float* [[DST_GEP_425]], align 8, !alias.scope !8, !noalias !5
				; CHECK-NEXT: br label [[LOOP_MERGE]]
	;			;
	entry:			entry:
	br label %loop			br label %loop

	loop:			loop:
	%iv = phi i32 [ 0, %entry ], [ %iv.next, %loop ]			%iv = phi i32 [ 0, %entry ], [ %iv.next, %loop ]
	%iv.next = add i32 %iv, 1			%iv.next = add i32 %iv, 1
	%cond = icmp ult i32 %iv, 2000			%cond = icmp ult i32 %iv, 2000
	[129 lines collapsed]

llvm/test/Transforms/SLPVectorizer/AArch64/memory-runtime-checks.ll

; NOTE: Assertions have been autogenerated by utils/update_test_checks.py		; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
; RUN: opt -scoped-noalias-aa -slp-vectorizer -mtriple=arm64-apple-darwin -enable-new-pm=false -S %s | FileCheck %s		; RUN: opt -aa-pipeline='basic-aa,scoped-noalias-aa' -slp-memory-versioning -passes=slp-vectorizer -mtriple=arm64-apple-darwin -S %s | FileCheck %s
; RUN: opt -aa-pipeline='basic-aa,scoped-noalias-aa' -passes=slp-vectorizer -mtriple=arm64-apple-darwin -S %s | FileCheck %s		; RUN: opt -aa-pipeline='basic-aa,scoped-noalias-aa' -slp-memory-versioning=false -passes=slp-vectorizer -mtriple=arm64-apple-darwin -S %s | FileCheck --check-prefix=NOVERSION %s

		; NOVERSION-NOT: slpversioned

define void @needs_versioning_not_profitable(i32* %dst, i32* %src) {		define void @needs_versioning_not_profitable(i32* %dst, i32* %src) {
; CHECK-LABEL: @needs_versioning_not_profitable(		; CHECK-LABEL: @needs_versioning_not_profitable(
; CHECK-NEXT: entry:		; CHECK-NEXT: entry:
; CHECK-NEXT: [[SRC_0:%.*]] = load i32, i32* [[SRC:%.*]], align 4		; CHECK-NEXT: [[SRC_0:%.*]] = load i32, i32* [[SRC:%.*]], align 4
; CHECK-NEXT: [[R_0:%.*]] = ashr i32 [[SRC_0]], 16		; CHECK-NEXT: [[R_0:%.*]] = ashr i32 [[SRC_0]], 16
; CHECK-NEXT: store i32 [[R_0]], i32* [[DST:%.*]], align 4		; CHECK-NEXT: store i32 [[R_0]], i32* [[DST:%.*]], align 4
; CHECK-NEXT: [[SRC_GEP_1:%.*]] = getelementptr inbounds i32, i32* [[SRC]], i64 1		; CHECK-NEXT: [[SRC_GEP_1:%.*]] = getelementptr inbounds i32, i32* [[SRC]], i64 1
[13 lines collapsed]
%dst.gep.1 = getelementptr inbounds i32, i32* %dst, i64 1		%dst.gep.1 = getelementptr inbounds i32, i32* %dst, i64 1
store i32 %r.1, i32* %dst.gep.1, align 4		store i32 %r.1, i32* %dst.gep.1, align 4
ret void		ret void
}		}

define void @needs_versioning_profitable(i32* %dst, i32* %src) {		define void @needs_versioning_profitable(i32* %dst, i32* %src) {
; CHECK-LABEL: @needs_versioning_profitable(		; CHECK-LABEL: @needs_versioning_profitable(
; CHECK-NEXT: entry:		; CHECK-NEXT: entry:
; CHECK-NEXT: [[SRC_0:%.*]] = load i32, i32* [[SRC:%.*]], align 4		; CHECK-NEXT: [[DST17:%.*]] = ptrtoint i32* [[DST:%.*]] to i64
		; CHECK-NEXT: [[SRC16:%.*]] = ptrtoint i32* [[SRC:%.*]] to i64
		; CHECK-NEXT: [[TMP0:%.*]] = sub i64 [[SRC16]], [[DST17]]
		; CHECK-NEXT: [[DIFF_CHECK:%.*]] = icmp ult i64 [[TMP0]], 16
		; CHECK-NEXT: br i1 [[DIFF_CHECK]], label [[ENTRY_SCALAR:%.*]], label [[ENTRY_SLPVERSIONED1:%.*]]
		; CHECK: entry.scalar:
		; CHECK-NEXT: [[SRC_0:%.*]] = load i32, i32* [[SRC]], align 4
; CHECK-NEXT: [[R_0:%.*]] = ashr i32 [[SRC_0]], 16		; CHECK-NEXT: [[R_0:%.*]] = ashr i32 [[SRC_0]], 16
; CHECK-NEXT: store i32 [[R_0]], i32* [[DST:%.*]], align 4		; CHECK-NEXT: store i32 [[R_0]], i32* [[DST]], align 4
; CHECK-NEXT: [[SRC_GEP_1:%.*]] = getelementptr inbounds i32, i32* [[SRC]], i64 1		; CHECK-NEXT: [[SRC_GEP_1:%.*]] = getelementptr inbounds i32, i32* [[SRC]], i64 1
; CHECK-NEXT: [[SRC_1:%.*]] = load i32, i32* [[SRC_GEP_1]], align 4		; CHECK-NEXT: [[SRC_1:%.*]] = load i32, i32* [[SRC_GEP_1]], align 4
; CHECK-NEXT: [[R_1:%.*]] = ashr i32 [[SRC_1]], 16		; CHECK-NEXT: [[R_1:%.*]] = ashr i32 [[SRC_1]], 16
; CHECK-NEXT: [[DST_GEP_1:%.*]] = getelementptr inbounds i32, i32* [[DST]], i64 1		; CHECK-NEXT: [[DST_GEP_1:%.*]] = getelementptr inbounds i32, i32* [[DST]], i64 1
; CHECK-NEXT: store i32 [[R_1]], i32* [[DST_GEP_1]], align 4		; CHECK-NEXT: store i32 [[R_1]], i32* [[DST_GEP_1]], align 4
; CHECK-NEXT: [[SRC_GEP_2:%.*]] = getelementptr inbounds i32, i32* [[SRC]], i64 2		; CHECK-NEXT: [[SRC_GEP_2:%.*]] = getelementptr inbounds i32, i32* [[SRC]], i64 2
; CHECK-NEXT: [[SRC_2:%.*]] = load i32, i32* [[SRC_GEP_2]], align 4		; CHECK-NEXT: [[SRC_2:%.*]] = load i32, i32* [[SRC_GEP_2]], align 4
; CHECK-NEXT: [[R_2:%.*]] = ashr i32 [[SRC_2]], 16		; CHECK-NEXT: [[R_2:%.*]] = ashr i32 [[SRC_2]], 16
; CHECK-NEXT: [[DST_GEP_2:%.*]] = getelementptr inbounds i32, i32* [[DST]], i64 2		; CHECK-NEXT: [[DST_GEP_2:%.*]] = getelementptr inbounds i32, i32* [[DST]], i64 2
; CHECK-NEXT: store i32 [[R_2]], i32* [[DST_GEP_2]], align 4		; CHECK-NEXT: store i32 [[R_2]], i32* [[DST_GEP_2]], align 4
; CHECK-NEXT: [[SRC_GEP_3:%.*]] = getelementptr inbounds i32, i32* [[SRC]], i64 3		; CHECK-NEXT: [[SRC_GEP_3:%.*]] = getelementptr inbounds i32, i32* [[SRC]], i64 3
; CHECK-NEXT: [[SRC_3:%.*]] = load i32, i32* [[SRC_GEP_3]], align 4		; CHECK-NEXT: [[SRC_3:%.*]] = load i32, i32* [[SRC_GEP_3]], align 4
; CHECK-NEXT: [[R_3:%.*]] = ashr i32 [[SRC_3]], 16		; CHECK-NEXT: [[R_3:%.*]] = ashr i32 [[SRC_3]], 16
; CHECK-NEXT: [[DST_GEP_3:%.*]] = getelementptr inbounds i32, i32* [[DST]], i64 3		; CHECK-NEXT: [[DST_GEP_3:%.*]] = getelementptr inbounds i32, i32* [[DST]], i64 3
; CHECK-NEXT: store i32 [[R_3]], i32* [[DST_GEP_3]], align 4		; CHECK-NEXT: store i32 [[R_3]], i32* [[DST_GEP_3]], align 4
		; CHECK-NEXT: br label [[ENTRY_MERGE:%.*]]
		; CHECK: entry.merge:
; CHECK-NEXT: ret void		; CHECK-NEXT: ret void
		; CHECK: entry.slpversioned1:
		; CHECK-NEXT: [[TMP1:%.*]] = bitcast i32* [[SRC]] to <4 x i32>*
		; CHECK-NEXT: [[TMP2:%.*]] = load <4 x i32>, <4 x i32>* [[TMP1]], align 4, !alias.scope !0, !noalias !3
		; CHECK-NEXT: [[TMP3:%.*]] = ashr <4 x i32> [[TMP2]], <i32 16, i32 16, i32 16, i32 16>
		; CHECK-NEXT: [[TMP4:%.*]] = bitcast i32* [[DST]] to <4 x i32>*
		; CHECK-NEXT: store <4 x i32> [[TMP3]], <4 x i32>* [[TMP4]], align 4, !alias.scope !3, !noalias !0
		; CHECK-NEXT: br label [[ENTRY_MERGE]]
;		;
entry:		entry:
%src.0 = load i32, i32* %src, align 4		%src.0 = load i32, i32* %src, align 4
%r.0 = ashr i32 %src.0, 16		%r.0 = ashr i32 %src.0, 16
store i32 %r.0, i32* %dst, align 4		store i32 %r.0, i32* %dst, align 4
%src.gep.1 = getelementptr inbounds i32, i32* %src, i64 1		%src.gep.1 = getelementptr inbounds i32, i32* %src, i64 1
%src.1 = load i32, i32* %src.gep.1, align 4		%src.1 = load i32, i32* %src.gep.1, align 4
%r.1 = ashr i32 %src.1, 16		%r.1 = ashr i32 %src.1, 16
[11 lines collapsed]
store i32 %r.3, i32* %dst.gep.3, align 4		store i32 %r.3, i32* %dst.gep.3, align 4

ret void		ret void
}		}

define void @needs_versioning_profitable_2_sources(i32* %dst, i32* %A, i32* %B) {		define void @needs_versioning_profitable_2_sources(i32* %dst, i32* %A, i32* %B) {
; CHECK-LABEL: @needs_versioning_profitable_2_sources(		; CHECK-LABEL: @needs_versioning_profitable_2_sources(
; CHECK-NEXT: entry:		; CHECK-NEXT: entry:
; CHECK-NEXT: [[A_0:%.*]] = load i32, i32* [[A:%.*]], align 4		; CHECK-NEXT: [[B29:%.*]] = ptrtoint i32* [[B:%.*]] to i64
; CHECK-NEXT: [[B_0:%.*]] = load i32, i32* [[B:%.*]], align 4		; CHECK-NEXT: [[DST28:%.*]] = ptrtoint i32* [[DST:%.*]] to i64
		; CHECK-NEXT: [[A27:%.*]] = ptrtoint i32* [[A:%.*]] to i64
		; CHECK-NEXT: [[TMP0:%.*]] = sub i64 [[A27]], [[DST28]]
		; CHECK-NEXT: [[DIFF_CHECK:%.*]] = icmp ult i64 [[TMP0]], 16
		; CHECK-NEXT: [[TMP1:%.*]] = sub i64 [[B29]], [[DST28]]
		; CHECK-NEXT: [[DIFF_CHECK30:%.*]] = icmp ult i64 [[TMP1]], 16
		; CHECK-NEXT: [[CONFLICT_RDX:%.*]] = or i1 [[DIFF_CHECK]], [[DIFF_CHECK30]]
		; CHECK-NEXT: br i1 [[CONFLICT_RDX]], label [[ENTRY_SCALAR:%.*]], label [[ENTRY_SLPVERSIONED1:%.*]]
		; CHECK: entry.scalar:
		; CHECK-NEXT: [[A_0:%.*]] = load i32, i32* [[A]], align 4
		; CHECK-NEXT: [[B_0:%.*]] = load i32, i32* [[B]], align 4
; CHECK-NEXT: [[R_0:%.*]] = add i32 [[A_0]], [[B_0]]		; CHECK-NEXT: [[R_0:%.*]] = add i32 [[A_0]], [[B_0]]
; CHECK-NEXT: [[MUL_0:%.*]] = mul i32 [[R_0]], 2		; CHECK-NEXT: [[MUL_0:%.*]] = mul i32 [[R_0]], 2
; CHECK-NEXT: store i32 [[MUL_0]], i32* [[DST:%.*]], align 4		; CHECK-NEXT: store i32 [[MUL_0]], i32* [[DST]], align 4
; CHECK-NEXT: [[A_GEP_1:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 1		; CHECK-NEXT: [[A_GEP_1:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 1
; CHECK-NEXT: [[A_1:%.*]] = load i32, i32* [[A_GEP_1]], align 4		; CHECK-NEXT: [[A_1:%.*]] = load i32, i32* [[A_GEP_1]], align 4
; CHECK-NEXT: [[B_GEP_1:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 1		; CHECK-NEXT: [[B_GEP_1:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 1
; CHECK-NEXT: [[B_1:%.*]] = load i32, i32* [[B_GEP_1]], align 4		; CHECK-NEXT: [[B_1:%.*]] = load i32, i32* [[B_GEP_1]], align 4
; CHECK-NEXT: [[R_1:%.*]] = add i32 [[A_1]], [[B_1]]		; CHECK-NEXT: [[R_1:%.*]] = add i32 [[A_1]], [[B_1]]
; CHECK-NEXT: [[MUL_1:%.*]] = mul i32 [[R_1]], 2		; CHECK-NEXT: [[MUL_1:%.*]] = mul i32 [[R_1]], 2
; CHECK-NEXT: [[DST_GEP_1:%.*]] = getelementptr inbounds i32, i32* [[DST]], i64 1		; CHECK-NEXT: [[DST_GEP_1:%.*]] = getelementptr inbounds i32, i32* [[DST]], i64 1
; CHECK-NEXT: store i32 [[MUL_1]], i32* [[DST_GEP_1]], align 4		; CHECK-NEXT: store i32 [[MUL_1]], i32* [[DST_GEP_1]], align 4
; CHECK-NEXT: [[A_GEP_2:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 2		; CHECK-NEXT: [[A_GEP_2:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 2
; CHECK-NEXT: [[A_2:%.*]] = load i32, i32* [[A_GEP_2]], align 4		; CHECK-NEXT: [[A_2:%.*]] = load i32, i32* [[A_GEP_2]], align 4
; CHECK-NEXT: [[B_GEP_2:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 2		; CHECK-NEXT: [[B_GEP_2:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 2
; CHECK-NEXT: [[B_2:%.*]] = load i32, i32* [[B_GEP_2]], align 4		; CHECK-NEXT: [[B_2:%.*]] = load i32, i32* [[B_GEP_2]], align 4
; CHECK-NEXT: [[R_2:%.*]] = add i32 [[A_2]], [[B_2]]		; CHECK-NEXT: [[R_2:%.*]] = add i32 [[A_2]], [[B_2]]
; CHECK-NEXT: [[MUL_2:%.*]] = mul i32 [[R_2]], 2		; CHECK-NEXT: [[MUL_2:%.*]] = mul i32 [[R_2]], 2
; CHECK-NEXT: [[DST_GEP_2:%.*]] = getelementptr inbounds i32, i32* [[DST]], i64 2		; CHECK-NEXT: [[DST_GEP_2:%.*]] = getelementptr inbounds i32, i32* [[DST]], i64 2
; CHECK-NEXT: store i32 [[MUL_2]], i32* [[DST_GEP_2]], align 4		; CHECK-NEXT: store i32 [[MUL_2]], i32* [[DST_GEP_2]], align 4
; CHECK-NEXT: [[A_GEP_3:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 3		; CHECK-NEXT: [[A_GEP_3:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 3
; CHECK-NEXT: [[A_3:%.*]] = load i32, i32* [[A_GEP_3]], align 4		; CHECK-NEXT: [[A_3:%.*]] = load i32, i32* [[A_GEP_3]], align 4
; CHECK-NEXT: [[B_GEP_3:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 3		; CHECK-NEXT: [[B_GEP_3:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 3
; CHECK-NEXT: [[B_3:%.*]] = load i32, i32* [[B_GEP_3]], align 4		; CHECK-NEXT: [[B_3:%.*]] = load i32, i32* [[B_GEP_3]], align 4
; CHECK-NEXT: [[R_3:%.*]] = add i32 [[A_3]], [[B_3]]		; CHECK-NEXT: [[R_3:%.*]] = add i32 [[A_3]], [[B_3]]
; CHECK-NEXT: [[MUL_3:%.*]] = mul i32 [[R_3]], 2		; CHECK-NEXT: [[MUL_3:%.*]] = mul i32 [[R_3]], 2
; CHECK-NEXT: [[DST_GEP_3:%.*]] = getelementptr inbounds i32, i32* [[DST]], i64 3		; CHECK-NEXT: [[DST_GEP_3:%.*]] = getelementptr inbounds i32, i32* [[DST]], i64 3
; CHECK-NEXT: store i32 [[MUL_3]], i32* [[DST_GEP_3]], align 4		; CHECK-NEXT: store i32 [[MUL_3]], i32* [[DST_GEP_3]], align 4
		; CHECK-NEXT: br label [[ENTRY_MERGE:%.*]]
		; CHECK: entry.merge:
; CHECK-NEXT: ret void		; CHECK-NEXT: ret void
		; CHECK: entry.slpversioned1:
		; CHECK-NEXT: [[TMP2:%.*]] = bitcast i32* [[A]] to <4 x i32>*
		; CHECK-NEXT: [[TMP3:%.*]] = load <4 x i32>, <4 x i32>* [[TMP2]], align 4, !alias.scope !5, !noalias !8
		; CHECK-NEXT: [[TMP4:%.*]] = bitcast i32* [[B]] to <4 x i32>*
		; CHECK-NEXT: [[TMP5:%.*]] = load <4 x i32>, <4 x i32>* [[TMP4]], align 4, !alias.scope !11, !noalias !12
		; CHECK-NEXT: [[TMP6:%.*]] = add <4 x i32> [[TMP3]], [[TMP5]]
		; CHECK-NEXT: [[TMP7:%.*]] = mul <4 x i32> [[TMP6]], <i32 2, i32 2, i32 2, i32 2>
		; CHECK-NEXT: [[TMP8:%.*]] = bitcast i32* [[DST]] to <4 x i32>*
		; CHECK-NEXT: store <4 x i32> [[TMP7]], <4 x i32>* [[TMP8]], align 4, !alias.scope !13, !noalias !14
		; CHECK-NEXT: br label [[ENTRY_MERGE]]
;		;
entry:		entry:
%A.0 = load i32, i32* %A, align 4		%A.0 = load i32, i32* %A, align 4
%B.0 = load i32, i32* %B, align 4		%B.0 = load i32, i32* %B, align 4
%r.0 = add i32 %A.0, %B.0		%r.0 = add i32 %A.0, %B.0
%mul.0 = mul i32 %r.0, 2		%mul.0 = mul i32 %r.0, 2
store i32 %mul.0, i32* %dst, align 4		store i32 %mul.0, i32* %dst, align 4
%A.gep.1 = getelementptr inbounds i32, i32* %A, i64 1		%A.gep.1 = getelementptr inbounds i32, i32* %A, i64 1
[26 lines collapsed]

declare void @use(i32)		declare void @use(i32)

declare void @bar()		declare void @bar()

define void @needs_versioning_profitable_split_points(i32* %dst, i32* %src) {		define void @needs_versioning_profitable_split_points(i32* %dst, i32* %src) {
; CHECK-LABEL: @needs_versioning_profitable_split_points(		; CHECK-LABEL: @needs_versioning_profitable_split_points(
; CHECK-NEXT: entry:		; CHECK-NEXT: entry:
		; CHECK-NEXT: [[DST17:%.*]] = ptrtoint i32* [[DST:%.*]] to i64
		; CHECK-NEXT: [[SRC16:%.*]] = ptrtoint i32* [[SRC:%.*]] to i64
; CHECK-NEXT: call void @bar()		; CHECK-NEXT: call void @bar()
; CHECK-NEXT: call void @bar()		; CHECK-NEXT: call void @bar()
; CHECK-NEXT: call void @bar()		; CHECK-NEXT: call void @bar()
; CHECK-NEXT: [[SRC_0:%.*]] = load i32, i32* [[SRC:%.*]], align 4		; CHECK-NEXT: [[TMP0:%.*]] = sub i64 [[SRC16]], [[DST17]]
		; CHECK-NEXT: [[DIFF_CHECK:%.*]] = icmp ult i64 [[TMP0]], 16
		; CHECK-NEXT: br i1 [[DIFF_CHECK]], label [[ENTRY_SCALAR:%.*]], label [[ENTRY_SLPVERSIONED1:%.*]]
		; CHECK: entry.scalar:
		; CHECK-NEXT: [[SRC_0:%.*]] = load i32, i32* [[SRC]], align 4
; CHECK-NEXT: [[R_0:%.*]] = ashr i32 [[SRC_0]], 16		; CHECK-NEXT: [[R_0:%.*]] = ashr i32 [[SRC_0]], 16
; CHECK-NEXT: store i32 [[R_0]], i32* [[DST:%.*]], align 4		; CHECK-NEXT: store i32 [[R_0]], i32* [[DST]], align 4
; CHECK-NEXT: [[SRC_GEP_1:%.*]] = getelementptr inbounds i32, i32* [[SRC]], i64 1		; CHECK-NEXT: [[SRC_GEP_1:%.*]] = getelementptr inbounds i32, i32* [[SRC]], i64 1
; CHECK-NEXT: [[SRC_1:%.*]] = load i32, i32* [[SRC_GEP_1]], align 4		; CHECK-NEXT: [[SRC_1:%.*]] = load i32, i32* [[SRC_GEP_1]], align 4
; CHECK-NEXT: [[R_1:%.*]] = ashr i32 [[SRC_1]], 16		; CHECK-NEXT: [[R_1:%.*]] = ashr i32 [[SRC_1]], 16
; CHECK-NEXT: [[DST_GEP_1:%.*]] = getelementptr inbounds i32, i32* [[DST]], i64 1		; CHECK-NEXT: [[DST_GEP_1:%.*]] = getelementptr inbounds i32, i32* [[DST]], i64 1
; CHECK-NEXT: store i32 [[R_1]], i32* [[DST_GEP_1]], align 4		; CHECK-NEXT: store i32 [[R_1]], i32* [[DST_GEP_1]], align 4
; CHECK-NEXT: [[SRC_GEP_2:%.*]] = getelementptr inbounds i32, i32* [[SRC]], i64 2		; CHECK-NEXT: [[SRC_GEP_2:%.*]] = getelementptr inbounds i32, i32* [[SRC]], i64 2
; CHECK-NEXT: [[SRC_2:%.*]] = load i32, i32* [[SRC_GEP_2]], align 4		; CHECK-NEXT: [[SRC_2:%.*]] = load i32, i32* [[SRC_GEP_2]], align 4
; CHECK-NEXT: [[R_2:%.*]] = ashr i32 [[SRC_2]], 16		; CHECK-NEXT: [[R_2:%.*]] = ashr i32 [[SRC_2]], 16
; CHECK-NEXT: [[DST_GEP_2:%.*]] = getelementptr inbounds i32, i32* [[DST]], i64 2		; CHECK-NEXT: [[DST_GEP_2:%.*]] = getelementptr inbounds i32, i32* [[DST]], i64 2
; CHECK-NEXT: store i32 [[R_2]], i32* [[DST_GEP_2]], align 4		; CHECK-NEXT: store i32 [[R_2]], i32* [[DST_GEP_2]], align 4
; CHECK-NEXT: [[SRC_GEP_3:%.*]] = getelementptr inbounds i32, i32* [[SRC]], i64 3		; CHECK-NEXT: [[SRC_GEP_3:%.*]] = getelementptr inbounds i32, i32* [[SRC]], i64 3
; CHECK-NEXT: [[SRC_3:%.*]] = load i32, i32* [[SRC_GEP_3]], align 4		; CHECK-NEXT: [[SRC_3:%.*]] = load i32, i32* [[SRC_GEP_3]], align 4
; CHECK-NEXT: [[R_3:%.*]] = ashr i32 [[SRC_3]], 16		; CHECK-NEXT: [[R_3:%.*]] = ashr i32 [[SRC_3]], 16
; CHECK-NEXT: [[DST_GEP_3:%.*]] = getelementptr inbounds i32, i32* [[DST]], i64 3		; CHECK-NEXT: [[DST_GEP_3:%.*]] = getelementptr inbounds i32, i32* [[DST]], i64 3
; CHECK-NEXT: store i32 [[R_3]], i32* [[DST_GEP_3]], align 4		; CHECK-NEXT: store i32 [[R_3]], i32* [[DST_GEP_3]], align 4
		; CHECK-NEXT: br label [[ENTRY_MERGE:%.*]]
		; CHECK: entry.merge:
		; CHECK-NEXT: br label [[ENTRY_TAIL:%.*]]
		; CHECK: entry.tail:
; CHECK-NEXT: call void @bar()		; CHECK-NEXT: call void @bar()
; CHECK-NEXT: ret void		; CHECK-NEXT: ret void
		; CHECK: entry.slpversioned1:
		; CHECK-NEXT: [[TMP1:%.*]] = bitcast i32* [[SRC]] to <4 x i32>*
		; CHECK-NEXT: [[TMP2:%.*]] = load <4 x i32>, <4 x i32>* [[TMP1]], align 4, !alias.scope !15, !noalias !18
		; CHECK-NEXT: [[TMP3:%.*]] = ashr <4 x i32> [[TMP2]], <i32 16, i32 16, i32 16, i32 16>
		; CHECK-NEXT: [[TMP4:%.*]] = bitcast i32* [[DST]] to <4 x i32>*
		; CHECK-NEXT: store <4 x i32> [[TMP3]], <4 x i32>* [[TMP4]], align 4, !alias.scope !18, !noalias !15
		; CHECK-NEXT: br label [[ENTRY_MERGE]]
;		;
entry:		entry:
call void @bar()		call void @bar()
call void @bar()		call void @bar()
call void @bar()		call void @bar()

%src.0 = load i32, i32* %src, align 4		%src.0 = load i32, i32* %src, align 4
%r.0 = ashr i32 %src.0, 16		%r.0 = ashr i32 %src.0, 16
[160 lines collapsed]
store i32 %r.0, i32* %dst, align 4		store i32 %r.0, i32* %dst, align 4
store i32 %r.1, i32* %dst.gep.1, align 4		store i32 %r.1, i32* %dst.gep.1, align 4
ret void		ret void
}		}

define void @version_multiple(i32* nocapture %out_block, i32* nocapture readonly %counter) {		define void @version_multiple(i32* nocapture %out_block, i32* nocapture readonly %counter) {
; CHECK-LABEL: @version_multiple(		; CHECK-LABEL: @version_multiple(
; CHECK-NEXT: entry:		; CHECK-NEXT: entry:
; CHECK-NEXT: [[TMP0:%.*]] = load i32, i32* [[COUNTER:%.*]], align 4		; CHECK-NEXT: [[OUT_BLOCK13:%.*]] = ptrtoint i32* [[OUT_BLOCK:%.*]] to i64
; CHECK-NEXT: [[TMP1:%.*]] = load i32, i32* [[OUT_BLOCK:%.*]], align 4		; CHECK-NEXT: [[COUNTER12:%.*]] = ptrtoint i32* [[COUNTER:%.*]] to i64
; CHECK-NEXT: [[XOR:%.*]] = xor i32 [[TMP1]], [[TMP0]]		; CHECK-NEXT: [[TMP0:%.*]] = sub i64 [[COUNTER12]], [[OUT_BLOCK13]]
		; CHECK-NEXT: [[DIFF_CHECK:%.*]] = icmp ult i64 [[TMP0]], 16
		; CHECK-NEXT: br i1 [[DIFF_CHECK]], label [[ENTRY_SCALAR:%.*]], label [[ENTRY_SLPVERSIONED1:%.*]]
		; CHECK: entry.scalar:
		; CHECK-NEXT: [[TMP1:%.*]] = load i32, i32* [[COUNTER]], align 4
		; CHECK-NEXT: [[TMP2:%.*]] = load i32, i32* [[OUT_BLOCK]], align 4
		; CHECK-NEXT: [[XOR:%.*]] = xor i32 [[TMP2]], [[TMP1]]
; CHECK-NEXT: store i32 [[XOR]], i32* [[OUT_BLOCK]], align 4		; CHECK-NEXT: store i32 [[XOR]], i32* [[OUT_BLOCK]], align 4
; CHECK-NEXT: [[ARRAYIDX_1:%.*]] = getelementptr inbounds i32, i32* [[COUNTER]], i64 1		; CHECK-NEXT: [[ARRAYIDX_1:%.*]] = getelementptr inbounds i32, i32* [[COUNTER]], i64 1
; CHECK-NEXT: [[TMP2:%.*]] = load i32, i32* [[ARRAYIDX_1]], align 4		; CHECK-NEXT: [[TMP3:%.*]] = load i32, i32* [[ARRAYIDX_1]], align 4
; CHECK-NEXT: [[ARRAYIDX2_1:%.*]] = getelementptr inbounds i32, i32* [[OUT_BLOCK]], i64 1		; CHECK-NEXT: [[ARRAYIDX2_1:%.*]] = getelementptr inbounds i32, i32* [[OUT_BLOCK]], i64 1
; CHECK-NEXT: [[TMP3:%.*]] = load i32, i32* [[ARRAYIDX2_1]], align 4		; CHECK-NEXT: [[TMP4:%.*]] = load i32, i32* [[ARRAYIDX2_1]], align 4
; CHECK-NEXT: [[XOR_1:%.*]] = xor i32 [[TMP3]], [[TMP2]]		; CHECK-NEXT: [[XOR_1:%.*]] = xor i32 [[TMP4]], [[TMP3]]
; CHECK-NEXT: store i32 [[XOR_1]], i32* [[ARRAYIDX2_1]], align 4		; CHECK-NEXT: store i32 [[XOR_1]], i32* [[ARRAYIDX2_1]], align 4
; CHECK-NEXT: [[ARRAYIDX_2:%.*]] = getelementptr inbounds i32, i32* [[COUNTER]], i64 2		; CHECK-NEXT: [[ARRAYIDX_2:%.*]] = getelementptr inbounds i32, i32* [[COUNTER]], i64 2
; CHECK-NEXT: [[TMP4:%.*]] = load i32, i32* [[ARRAYIDX_2]], align 4		; CHECK-NEXT: [[TMP5:%.*]] = load i32, i32* [[ARRAYIDX_2]], align 4
; CHECK-NEXT: [[ARRAYIDX2_2:%.*]] = getelementptr inbounds i32, i32* [[OUT_BLOCK]], i64 2		; CHECK-NEXT: [[ARRAYIDX2_2:%.*]] = getelementptr inbounds i32, i32* [[OUT_BLOCK]], i64 2
; CHECK-NEXT: [[TMP5:%.*]] = load i32, i32* [[ARRAYIDX2_2]], align 4		; CHECK-NEXT: [[TMP6:%.*]] = load i32, i32* [[ARRAYIDX2_2]], align 4
; CHECK-NEXT: [[XOR_2:%.*]] = xor i32 [[TMP5]], [[TMP4]]		; CHECK-NEXT: [[XOR_2:%.*]] = xor i32 [[TMP6]], [[TMP5]]
; CHECK-NEXT: store i32 [[XOR_2]], i32* [[ARRAYIDX2_2]], align 4		; CHECK-NEXT: store i32 [[XOR_2]], i32* [[ARRAYIDX2_2]], align 4
; CHECK-NEXT: [[ARRAYIDX_3:%.*]] = getelementptr inbounds i32, i32* [[COUNTER]], i64 3		; CHECK-NEXT: [[ARRAYIDX_3:%.*]] = getelementptr inbounds i32, i32* [[COUNTER]], i64 3
; CHECK-NEXT: [[TMP6:%.*]] = load i32, i32* [[ARRAYIDX_3]], align 4		; CHECK-NEXT: [[TMP7:%.*]] = load i32, i32* [[ARRAYIDX_3]], align 4
; CHECK-NEXT: [[ARRAYIDX2_3:%.*]] = getelementptr inbounds i32, i32* [[OUT_BLOCK]], i64 3		; CHECK-NEXT: [[ARRAYIDX2_3:%.*]] = getelementptr inbounds i32, i32* [[OUT_BLOCK]], i64 3
; CHECK-NEXT: [[TMP7:%.*]] = load i32, i32* [[ARRAYIDX2_3]], align 4		; CHECK-NEXT: [[TMP8:%.*]] = load i32, i32* [[ARRAYIDX2_3]], align 4
; CHECK-NEXT: [[XOR_3:%.*]] = xor i32 [[TMP7]], [[TMP6]]		; CHECK-NEXT: [[XOR_3:%.*]] = xor i32 [[TMP8]], [[TMP7]]
; CHECK-NEXT: store i32 [[XOR_3]], i32* [[ARRAYIDX2_3]], align 4		; CHECK-NEXT: store i32 [[XOR_3]], i32* [[ARRAYIDX2_3]], align 4
		; CHECK-NEXT: br label [[ENTRY_MERGE:%.*]]
		; CHECK: entry.merge:
; CHECK-NEXT: ret void		; CHECK-NEXT: ret void
		; CHECK: entry.slpversioned1:
		; CHECK-NEXT: [[TMP9:%.*]] = bitcast i32* [[COUNTER]] to <4 x i32>*
		; CHECK-NEXT: [[TMP10:%.*]] = load <4 x i32>, <4 x i32>* [[TMP9]], align 4, !alias.scope !20, !noalias !23
		; CHECK-NEXT: [[TMP11:%.*]] = bitcast i32* [[OUT_BLOCK]] to <4 x i32>*
		; CHECK-NEXT: [[TMP12:%.*]] = load <4 x i32>, <4 x i32>* [[TMP11]], align 4, !alias.scope !23, !noalias !20
		; CHECK-NEXT: [[TMP13:%.*]] = xor <4 x i32> [[TMP12]], [[TMP10]]
		; CHECK-NEXT: [[TMP14:%.*]] = bitcast i32* [[OUT_BLOCK]] to <4 x i32>*
		; CHECK-NEXT: store <4 x i32> [[TMP13]], <4 x i32>* [[TMP14]], align 4, !alias.scope !23, !noalias !20
		; CHECK-NEXT: br label [[ENTRY_MERGE]]
;		;
entry:		entry:
%0 = load i32, i32* %counter, align 4		%0 = load i32, i32* %counter, align 4
%1 = load i32, i32* %out_block, align 4		%1 = load i32, i32* %out_block, align 4
%xor = xor i32 %1, %0		%xor = xor i32 %1, %0
store i32 %xor, i32* %out_block, align 4		store i32 %xor, i32* %out_block, align 4
%arrayidx.1 = getelementptr inbounds i32, i32* %counter, i64 1		%arrayidx.1 = getelementptr inbounds i32, i32* %counter, i64 1
%2 = load i32, i32* %arrayidx.1, align 4		%2 = load i32, i32* %arrayidx.1, align 4
[222 lines collapsed]
%struct = type { i32, i32, float, float }		%struct = type { i32, i32, float, float }

; Some points we collected as candidates for runtime checks have been removed		; Some points we collected as candidates for runtime checks have been removed
; before generating runtime checks. Make sure versioning is skipped.		; before generating runtime checks. Make sure versioning is skipped.
define void @test_bounds_removed_before_runtime_checks(%struct * %A, i32** %B, i1 %c) {		define void @test_bounds_removed_before_runtime_checks(%struct * %A, i32** %B, i1 %c) {
; CHECK-LABEL: @test_bounds_removed_before_runtime_checks(		; CHECK-LABEL: @test_bounds_removed_before_runtime_checks(
; CHECK-NEXT: entry:		; CHECK-NEXT: entry:
; CHECK-NEXT: [[TMP11:%.*]] = getelementptr inbounds [[STRUCT:%.*]], %struct* [[A:%.*]], i64 0, i32 0		; CHECK-NEXT: [[TMP11:%.*]] = getelementptr inbounds [[STRUCT:%.*]], %struct* [[A:%.*]], i64 0, i32 0
		; CHECK-NEXT: [[TMP12:%.*]] = getelementptr inbounds [[STRUCT]], %struct* [[A]], i64 0, i32 1
; CHECK-NEXT: [[TMP0:%.*]] = bitcast i32* [[TMP11]] to <2 x i32>*		; CHECK-NEXT: [[TMP0:%.*]] = bitcast i32* [[TMP11]] to <2 x i32>*
; CHECK-NEXT: store <2 x i32> <i32 10, i32 300>, <2 x i32>* [[TMP0]], align 8		; CHECK-NEXT: store <2 x i32> <i32 10, i32 300>, <2 x i32>* [[TMP0]], align 8
; CHECK-NEXT: [[TMP13:%.*]] = load i32*, i32** [[B:%.*]], align 8		; CHECK-NEXT: [[TMP13:%.*]] = load i32*, i32** [[B:%.*]], align 8
; CHECK-NEXT: br i1 [[C:%.*]], label [[BB23:%.*]], label [[BB14:%.*]]		; CHECK-NEXT: br i1 [[C:%.*]], label [[BB23:%.*]], label [[BB14:%.*]]
; CHECK: bb14:		; CHECK: bb14:
; CHECK-NEXT: [[TMP15:%.*]] = sext i32 10 to i64		; CHECK-NEXT: [[TMP15:%.*]] = sext i32 10 to i64
; CHECK-NEXT: [[TMP16:%.*]] = add nsw i64 2, [[TMP15]]		; CHECK-NEXT: [[TMP16:%.*]] = add nsw i64 2, [[TMP15]]
; CHECK-NEXT: [[TMP17:%.*]] = getelementptr inbounds i32, i32* [[TMP13]], i64 [[TMP16]]		; CHECK-NEXT: [[TMP17:%.*]] = getelementptr inbounds i32, i32* [[TMP13]], i64 [[TMP16]]
[... 86 lines not shown ...]
define void @no_lcssa_phi(%struct.2* %A, float* %B, i1 %c) {		define void @no_lcssa_phi(%struct.2* %A, float* %B, i1 %c) {
; CHECK-LABEL: @no_lcssa_phi(		; CHECK-LABEL: @no_lcssa_phi(
; CHECK-NEXT: bb:		; CHECK-NEXT: bb:
; CHECK-NEXT: br label [[LOOP:%.*]]		; CHECK-NEXT: br label [[LOOP:%.*]]
; CHECK: loop:		; CHECK: loop:
; CHECK-NEXT: [[PTR_PHI:%.*]] = phi %struct.2* [ [[A:%.*]], [[BB:%.*]] ], [ null, [[LOOP]] ]		; CHECK-NEXT: [[PTR_PHI:%.*]] = phi %struct.2* [ [[A:%.*]], [[BB:%.*]] ], [ null, [[LOOP]] ]
; CHECK-NEXT: br i1 [[C:%.*]], label [[EXIT:%.*]], label [[LOOP]]		; CHECK-NEXT: br i1 [[C:%.*]], label [[EXIT:%.*]], label [[LOOP]]
; CHECK: exit:		; CHECK: exit:
; CHECK-NEXT: [[B_GEP_0:%.*]] = getelementptr inbounds float, float* [[B:%.*]], i64 0		; CHECK-NEXT: [[PTR_PHI_LCSSA:%.*]] = phi %struct.2* [ [[PTR_PHI]], [[LOOP]] ]
		; CHECK-NEXT: [[PTR_PHI_LCSSA22:%.*]] = ptrtoint %struct.2* [[PTR_PHI_LCSSA]] to i64
		; CHECK-NEXT: [[B_GEP_0:%.*]] = getelementptr float, float* [[B:%.*]], i64 0
		; CHECK-NEXT: [[B_GEP_021:%.*]] = ptrtoint float* [[B_GEP_0]] to i64
		; CHECK-NEXT: [[TMP0:%.*]] = sub i64 [[B_GEP_021]], [[PTR_PHI_LCSSA22]]
		; CHECK-NEXT: [[DIFF_CHECK:%.*]] = icmp ult i64 [[TMP0]], 16
		; CHECK-NEXT: br i1 [[DIFF_CHECK]], label [[EXIT_SCALAR:%.*]], label [[EXIT_SLPVERSIONED1:%.*]]
		; CHECK: exit.scalar:
; CHECK-NEXT: [[L_0:%.*]] = load float, float* [[B_GEP_0]], align 8		; CHECK-NEXT: [[L_0:%.*]] = load float, float* [[B_GEP_0]], align 8
; CHECK-NEXT: [[ADD_0:%.*]] = fadd float [[L_0]], 1.000000e+01		; CHECK-NEXT: [[ADD_0:%.*]] = fadd float [[L_0]], 1.000000e+01
; CHECK-NEXT: [[MUL_0:%.*]] = fmul float [[ADD_0]], 3.000000e+01		; CHECK-NEXT: [[MUL_0:%.*]] = fmul float [[ADD_0]], 3.000000e+01
; CHECK-NEXT: [[A_GEP_0:%.*]] = getelementptr inbounds [[STRUCT_2:%.*]], %struct.2* [[PTR_PHI]], i64 0, i32 0, i32 0		; CHECK-NEXT: [[A_GEP_0:%.*]] = getelementptr inbounds [[STRUCT_2:%.*]], %struct.2* [[PTR_PHI_LCSSA]], i64 0, i32 0, i32 0
; CHECK-NEXT: store float [[MUL_0]], float* [[A_GEP_0]], align 8		; CHECK-NEXT: store float [[MUL_0]], float* [[A_GEP_0]], align 8
; CHECK-NEXT: [[B_GEP_1:%.*]] = getelementptr inbounds float, float* [[B]], i64 1		; CHECK-NEXT: [[B_GEP_1:%.*]] = getelementptr inbounds float, float* [[B]], i64 1
; CHECK-NEXT: [[L_1:%.*]] = load float, float* [[B_GEP_1]], align 8		; CHECK-NEXT: [[L_1:%.*]] = load float, float* [[B_GEP_1]], align 8
; CHECK-NEXT: [[ADD_1:%.*]] = fadd float [[L_1]], 1.000000e+01		; CHECK-NEXT: [[ADD_1:%.*]] = fadd float [[L_1]], 1.000000e+01
; CHECK-NEXT: [[MUL_1:%.*]] = fmul float [[ADD_1]], 3.000000e+01		; CHECK-NEXT: [[MUL_1:%.*]] = fmul float [[ADD_1]], 3.000000e+01
; CHECK-NEXT: [[A_GEP_1:%.*]] = getelementptr inbounds [[STRUCT_2]], %struct.2* [[PTR_PHI]], i64 0, i32 0, i32 1		; CHECK-NEXT: [[A_GEP_1:%.*]] = getelementptr inbounds [[STRUCT_2]], %struct.2* [[PTR_PHI_LCSSA]], i64 0, i32 0, i32 1
; CHECK-NEXT: store float [[MUL_1]], float* [[A_GEP_1]], align 8		; CHECK-NEXT: store float [[MUL_1]], float* [[A_GEP_1]], align 8
; CHECK-NEXT: [[B_GEP_2:%.*]] = getelementptr inbounds float, float* [[B]], i64 2		; CHECK-NEXT: [[B_GEP_2:%.*]] = getelementptr inbounds float, float* [[B]], i64 2
; CHECK-NEXT: [[L_2:%.*]] = load float, float* [[B_GEP_2]], align 8		; CHECK-NEXT: [[L_2:%.*]] = load float, float* [[B_GEP_2]], align 8
; CHECK-NEXT: [[ADD_2:%.*]] = fadd float [[L_2]], 1.000000e+01		; CHECK-NEXT: [[ADD_2:%.*]] = fadd float [[L_2]], 1.000000e+01
; CHECK-NEXT: [[MUL_2:%.*]] = fmul float [[ADD_2]], 3.000000e+01		; CHECK-NEXT: [[MUL_2:%.*]] = fmul float [[ADD_2]], 3.000000e+01
; CHECK-NEXT: [[A_GEP_2:%.*]] = getelementptr inbounds [[STRUCT_2]], %struct.2* [[PTR_PHI]], i64 0, i32 0, i32 2		; CHECK-NEXT: [[A_GEP_2:%.*]] = getelementptr inbounds [[STRUCT_2]], %struct.2* [[PTR_PHI_LCSSA]], i64 0, i32 0, i32 2
; CHECK-NEXT: store float [[MUL_2]], float* [[A_GEP_2]], align 8		; CHECK-NEXT: store float [[MUL_2]], float* [[A_GEP_2]], align 8
; CHECK-NEXT: [[B_GEP_3:%.*]] = getelementptr inbounds float, float* [[B]], i64 3		; CHECK-NEXT: [[B_GEP_3:%.*]] = getelementptr inbounds float, float* [[B]], i64 3
; CHECK-NEXT: [[L_3:%.*]] = load float, float* [[B_GEP_3]], align 8		; CHECK-NEXT: [[L_3:%.*]] = load float, float* [[B_GEP_3]], align 8
; CHECK-NEXT: [[ADD_3:%.*]] = fadd float [[L_3]], 1.000000e+01		; CHECK-NEXT: [[ADD_3:%.*]] = fadd float [[L_3]], 1.000000e+01
; CHECK-NEXT: [[MUL_3:%.*]] = fmul float [[ADD_3]], 3.000000e+01		; CHECK-NEXT: [[MUL_3:%.*]] = fmul float [[ADD_3]], 3.000000e+01
; CHECK-NEXT: [[A_GEP_3:%.*]] = getelementptr inbounds [[STRUCT_2]], %struct.2* [[PTR_PHI]], i64 0, i32 0, i32 3		; CHECK-NEXT: [[A_GEP_3:%.*]] = getelementptr inbounds [[STRUCT_2]], %struct.2* [[PTR_PHI_LCSSA]], i64 0, i32 0, i32 3
; CHECK-NEXT: store float [[MUL_3]], float* [[A_GEP_3]], align 8		; CHECK-NEXT: store float [[MUL_3]], float* [[A_GEP_3]], align 8
		; CHECK-NEXT: br label [[EXIT_MERGE:%.*]]
		; CHECK: exit.merge:
; CHECK-NEXT: ret void		; CHECK-NEXT: ret void
		; CHECK: exit.slpversioned1:
		; CHECK-NEXT: [[A_GEP_05:%.*]] = getelementptr inbounds [[STRUCT_2]], %struct.2* [[PTR_PHI_LCSSA]], i64 0, i32 0, i32 0
		; CHECK-NEXT: [[TMP1:%.*]] = bitcast float* [[B_GEP_0]] to <4 x float>*
		; CHECK-NEXT: [[TMP2:%.*]] = load <4 x float>, <4 x float>* [[TMP1]], align 8, !alias.scope !25, !noalias !28
		; CHECK-NEXT: [[TMP3:%.*]] = fadd <4 x float> [[TMP2]], <float 1.000000e+01, float 1.000000e+01, float 1.000000e+01, float 1.000000e+01>
		; CHECK-NEXT: [[TMP4:%.*]] = fmul <4 x float> [[TMP3]], <float 3.000000e+01, float 3.000000e+01, float 3.000000e+01, float 3.000000e+01>
		; CHECK-NEXT: [[TMP5:%.*]] = bitcast float* [[A_GEP_05]] to <4 x float>*
		; CHECK-NEXT: store <4 x float> [[TMP4]], <4 x float>* [[TMP5]], align 8, !alias.scope !28, !noalias !25
		; CHECK-NEXT: br label [[EXIT_MERGE]]
;		;
bb:		bb:
br label %loop		br label %loop

loop:		loop:
%ptr.phi = phi %struct.2* [ %A, %bb ], [ null, %loop ]		%ptr.phi = phi %struct.2* [ %A, %bb ], [ null, %loop ]
br i1 %c, label %exit, label %loop		br i1 %c, label %exit, label %loop

[... 30 lines not shown ...]
; CHECK-LABEL: @lcssa_phi(		; CHECK-LABEL: @lcssa_phi(
; CHECK-NEXT: bb:		; CHECK-NEXT: bb:
; CHECK-NEXT: br label [[LOOP:%.*]]		; CHECK-NEXT: br label [[LOOP:%.*]]
; CHECK: loop:		; CHECK: loop:
; CHECK-NEXT: [[PTR_PHI:%.*]] = phi %struct.2* [ [[A:%.*]], [[BB:%.*]] ], [ null, [[LOOP]] ]		; CHECK-NEXT: [[PTR_PHI:%.*]] = phi %struct.2* [ [[A:%.*]], [[BB:%.*]] ], [ null, [[LOOP]] ]
; CHECK-NEXT: br i1 [[C:%.*]], label [[EXIT:%.*]], label [[LOOP]]		; CHECK-NEXT: br i1 [[C:%.*]], label [[EXIT:%.*]], label [[LOOP]]
; CHECK: exit:		; CHECK: exit:
; CHECK-NEXT: [[PTR_PHI_LCSSA:%.*]] = phi %struct.2* [ [[PTR_PHI]], [[LOOP]] ]		; CHECK-NEXT: [[PTR_PHI_LCSSA:%.*]] = phi %struct.2* [ [[PTR_PHI]], [[LOOP]] ]
; CHECK-NEXT: [[B_GEP_0:%.*]] = getelementptr inbounds float, float* [[B:%.*]], i64 0		; CHECK-NEXT: [[PTR_PHI_LCSSA22:%.*]] = ptrtoint %struct.2* [[PTR_PHI_LCSSA]] to i64
		; CHECK-NEXT: [[B_GEP_0:%.*]] = getelementptr float, float* [[B:%.*]], i64 0
		; CHECK-NEXT: [[B_GEP_021:%.*]] = ptrtoint float* [[B_GEP_0]] to i64
		; CHECK-NEXT: [[TMP0:%.*]] = sub i64 [[B_GEP_021]], [[PTR_PHI_LCSSA22]]
		; CHECK-NEXT: [[DIFF_CHECK:%.*]] = icmp ult i64 [[TMP0]], 16
		; CHECK-NEXT: br i1 [[DIFF_CHECK]], label [[EXIT_SCALAR:%.*]], label [[EXIT_SLPVERSIONED1:%.*]]
		; CHECK: exit.scalar:
; CHECK-NEXT: [[L_0:%.*]] = load float, float* [[B_GEP_0]], align 8		; CHECK-NEXT: [[L_0:%.*]] = load float, float* [[B_GEP_0]], align 8
; CHECK-NEXT: [[ADD_0:%.*]] = fadd float [[L_0]], 1.000000e+01		; CHECK-NEXT: [[ADD_0:%.*]] = fadd float [[L_0]], 1.000000e+01
; CHECK-NEXT: [[MUL_0:%.*]] = fmul float [[ADD_0]], 3.000000e+01		; CHECK-NEXT: [[MUL_0:%.*]] = fmul float [[ADD_0]], 3.000000e+01
; CHECK-NEXT: [[A_GEP_0:%.*]] = getelementptr inbounds [[STRUCT_2:%.*]], %struct.2* [[PTR_PHI_LCSSA]], i64 0, i32 0, i32 0		; CHECK-NEXT: [[A_GEP_0:%.*]] = getelementptr inbounds [[STRUCT_2:%.*]], %struct.2* [[PTR_PHI_LCSSA]], i64 0, i32 0, i32 0
; CHECK-NEXT: store float [[MUL_0]], float* [[A_GEP_0]], align 8		; CHECK-NEXT: store float [[MUL_0]], float* [[A_GEP_0]], align 8
; CHECK-NEXT: [[B_GEP_1:%.*]] = getelementptr inbounds float, float* [[B]], i64 1		; CHECK-NEXT: [[B_GEP_1:%.*]] = getelementptr inbounds float, float* [[B]], i64 1
; CHECK-NEXT: [[L_1:%.*]] = load float, float* [[B_GEP_1]], align 8		; CHECK-NEXT: [[L_1:%.*]] = load float, float* [[B_GEP_1]], align 8
; CHECK-NEXT: [[ADD_1:%.*]] = fadd float [[L_1]], 1.000000e+01		; CHECK-NEXT: [[ADD_1:%.*]] = fadd float [[L_1]], 1.000000e+01
; CHECK-NEXT: [[MUL_1:%.*]] = fmul float [[ADD_1]], 3.000000e+01		; CHECK-NEXT: [[MUL_1:%.*]] = fmul float [[ADD_1]], 3.000000e+01
; CHECK-NEXT: [[A_GEP_1:%.*]] = getelementptr inbounds [[STRUCT_2]], %struct.2* [[PTR_PHI_LCSSA]], i64 0, i32 0, i32 1		; CHECK-NEXT: [[A_GEP_1:%.*]] = getelementptr inbounds [[STRUCT_2]], %struct.2* [[PTR_PHI_LCSSA]], i64 0, i32 0, i32 1
; CHECK-NEXT: store float [[MUL_1]], float* [[A_GEP_1]], align 8		; CHECK-NEXT: store float [[MUL_1]], float* [[A_GEP_1]], align 8
; CHECK-NEXT: [[B_GEP_2:%.*]] = getelementptr inbounds float, float* [[B]], i64 2		; CHECK-NEXT: [[B_GEP_2:%.*]] = getelementptr inbounds float, float* [[B]], i64 2
; CHECK-NEXT: [[L_2:%.*]] = load float, float* [[B_GEP_2]], align 8		; CHECK-NEXT: [[L_2:%.*]] = load float, float* [[B_GEP_2]], align 8
; CHECK-NEXT: [[ADD_2:%.*]] = fadd float [[L_2]], 1.000000e+01		; CHECK-NEXT: [[ADD_2:%.*]] = fadd float [[L_2]], 1.000000e+01
; CHECK-NEXT: [[MUL_2:%.*]] = fmul float [[ADD_2]], 3.000000e+01		; CHECK-NEXT: [[MUL_2:%.*]] = fmul float [[ADD_2]], 3.000000e+01
; CHECK-NEXT: [[A_GEP_2:%.*]] = getelementptr inbounds [[STRUCT_2]], %struct.2* [[PTR_PHI_LCSSA]], i64 0, i32 0, i32 2		; CHECK-NEXT: [[A_GEP_2:%.*]] = getelementptr inbounds [[STRUCT_2]], %struct.2* [[PTR_PHI_LCSSA]], i64 0, i32 0, i32 2
; CHECK-NEXT: store float [[MUL_2]], float* [[A_GEP_2]], align 8		; CHECK-NEXT: store float [[MUL_2]], float* [[A_GEP_2]], align 8
; CHECK-NEXT: [[B_GEP_3:%.*]] = getelementptr inbounds float, float* [[B]], i64 3		; CHECK-NEXT: [[B_GEP_3:%.*]] = getelementptr inbounds float, float* [[B]], i64 3
; CHECK-NEXT: [[L_3:%.*]] = load float, float* [[B_GEP_3]], align 8		; CHECK-NEXT: [[L_3:%.*]] = load float, float* [[B_GEP_3]], align 8
; CHECK-NEXT: [[ADD_3:%.*]] = fadd float [[L_3]], 1.000000e+01		; CHECK-NEXT: [[ADD_3:%.*]] = fadd float [[L_3]], 1.000000e+01
; CHECK-NEXT: [[MUL_3:%.*]] = fmul float [[ADD_3]], 3.000000e+01		; CHECK-NEXT: [[MUL_3:%.*]] = fmul float [[ADD_3]], 3.000000e+01
; CHECK-NEXT: [[A_GEP_3:%.*]] = getelementptr inbounds [[STRUCT_2]], %struct.2* [[PTR_PHI_LCSSA]], i64 0, i32 0, i32 3		; CHECK-NEXT: [[A_GEP_3:%.*]] = getelementptr inbounds [[STRUCT_2]], %struct.2* [[PTR_PHI_LCSSA]], i64 0, i32 0, i32 3
; CHECK-NEXT: store float [[MUL_3]], float* [[A_GEP_3]], align 8		; CHECK-NEXT: store float [[MUL_3]], float* [[A_GEP_3]], align 8
		; CHECK-NEXT: br label [[EXIT_MERGE:%.*]]
		; CHECK: exit.merge:
; CHECK-NEXT: ret void		; CHECK-NEXT: ret void
		; CHECK: exit.slpversioned1:
		; CHECK-NEXT: [[A_GEP_05:%.*]] = getelementptr inbounds [[STRUCT_2]], %struct.2* [[PTR_PHI_LCSSA]], i64 0, i32 0, i32 0
		; CHECK-NEXT: [[TMP1:%.*]] = bitcast float* [[B_GEP_0]] to <4 x float>*
		; CHECK-NEXT: [[TMP2:%.*]] = load <4 x float>, <4 x float>* [[TMP1]], align 8, !alias.scope !30, !noalias !33
		; CHECK-NEXT: [[TMP3:%.*]] = fadd <4 x float> [[TMP2]], <float 1.000000e+01, float 1.000000e+01, float 1.000000e+01, float 1.000000e+01>
		; CHECK-NEXT: [[TMP4:%.*]] = fmul <4 x float> [[TMP3]], <float 3.000000e+01, float 3.000000e+01, float 3.000000e+01, float 3.000000e+01>
		; CHECK-NEXT: [[TMP5:%.*]] = bitcast float* [[A_GEP_05]] to <4 x float>*
		; CHECK-NEXT: store <4 x float> [[TMP4]], <4 x float>* [[TMP5]], align 8, !alias.scope !33, !noalias !30
		; CHECK-NEXT: br label [[EXIT_MERGE]]
;		;
bb:		bb:
br label %loop		br label %loop

loop:		loop:
%ptr.phi = phi %struct.2* [ %A, %bb ], [ null, %loop ]		%ptr.phi = phi %struct.2* [ %A, %bb ], [ null, %loop ]
br i1 %c, label %exit, label %loop		br i1 %c, label %exit, label %loop

[... 363 lines not shown ...]	bb:
ret i32 1		ret i32 1
}		}

; A test case where instructions required to compute the pointer bounds get		; A test case where instructions required to compute the pointer bounds get
; vectorized before versioning. Make sure there is no crash.		; vectorized before versioning. Make sure there is no crash.
define void @crash_instructions_deleted(float* %t, i32* %a, i32** noalias %ptr) {		define void @crash_instructions_deleted(float* %t, i32* %a, i32** noalias %ptr) {
; CHECK-LABEL: @crash_instructions_deleted(		; CHECK-LABEL: @crash_instructions_deleted(
; CHECK-NEXT: bb:		; CHECK-NEXT: bb:
		; CHECK-NEXT: [[T42:%.*]] = ptrtoint float* [[T:%.*]] to i64
; CHECK-NEXT: [[T15:%.*]] = getelementptr inbounds i32, i32* [[A:%.*]], i32 2		; CHECK-NEXT: [[T15:%.*]] = getelementptr inbounds i32, i32* [[A:%.*]], i32 2
		; CHECK-NEXT: [[T16:%.*]] = getelementptr inbounds i32, i32* [[A]], i32 3
; CHECK-NEXT: [[TMP0:%.*]] = bitcast i32* [[T15]] to <2 x i32>*		; CHECK-NEXT: [[TMP0:%.*]] = bitcast i32* [[T15]] to <2 x i32>*
; CHECK-NEXT: store <2 x i32> <i32 0, i32 10>, <2 x i32>* [[TMP0]], align 8		; CHECK-NEXT: store <2 x i32> <i32 0, i32 10>, <2 x i32>* [[TMP0]], align 8
; CHECK-NEXT: [[T17:%.*]] = load i32*, i32** [[PTR:%.*]], align 8		; CHECK-NEXT: [[T17:%.*]] = load i32*, i32** [[PTR:%.*]], align 8
		; CHECK-NEXT: [[T1718:%.*]] = ptrtoint i32* [[T17]] to i64
; CHECK-NEXT: br label [[BB18:%.*]]		; CHECK-NEXT: br label [[BB18:%.*]]
; CHECK: bb18:		; CHECK: bb18:
; CHECK-NEXT: [[T19:%.*]] = sext i32 0 to i64		; CHECK-NEXT: [[T19:%.*]] = sext i32 0 to i64
; CHECK-NEXT: [[T20:%.*]] = add nsw i64 1, [[T19]]		; CHECK-NEXT: [[T20:%.*]] = add nsw i64 1, [[T19]]
; CHECK-NEXT: [[T21:%.*]] = getelementptr inbounds i32, i32* [[T17]], i64 [[T20]]		; CHECK-NEXT: [[T21:%.*]] = getelementptr inbounds i32, i32* [[T17]], i64 [[T20]]
; CHECK-NEXT: [[T22:%.*]] = bitcast i32* [[T21]] to i8*		; CHECK-NEXT: [[T22:%.*]] = bitcast i32* [[T21]] to i8*
; CHECK-NEXT: [[T23:%.*]] = getelementptr inbounds i8, i8* [[T22]], i64 1		; CHECK-NEXT: [[T23:%.*]] = getelementptr inbounds i8, i8* [[T22]], i64 1
; CHECK-NEXT: [[T24:%.*]] = getelementptr inbounds i8, i8* [[T22]], i64 2		; CHECK-NEXT: [[T24:%.*]] = getelementptr inbounds i8, i8* [[T22]], i64 2
; CHECK-NEXT: [[T25:%.*]] = getelementptr inbounds i8, i8* [[T22]], i64 3		; CHECK-NEXT: [[T25:%.*]] = getelementptr inbounds i8, i8* [[T22]], i64 3
		; CHECK-NEXT: [[TMP1:%.*]] = add i64 [[T1718]], 4
		; CHECK-NEXT: [[TMP2:%.*]] = sub i64 [[TMP1]], [[T42]]
		; CHECK-NEXT: [[DIFF_CHECK:%.*]] = icmp ult i64 [[TMP2]], 16
		; CHECK-NEXT: br i1 [[DIFF_CHECK]], label [[BB18_SCALAR:%.*]], label [[BB18_SLPVERSIONED1:%.*]]
		; CHECK: bb18.scalar:
; CHECK-NEXT: [[T26:%.*]] = load i8, i8* [[T22]], align 1		; CHECK-NEXT: [[T26:%.*]] = load i8, i8* [[T22]], align 1
; CHECK-NEXT: [[T27:%.*]] = uitofp i8 [[T26]] to float		; CHECK-NEXT: [[T27:%.*]] = uitofp i8 [[T26]] to float
; CHECK-NEXT: [[T28:%.*]] = fdiv float [[T27]], 2.550000e+02		; CHECK-NEXT: [[T28:%.*]] = fdiv float [[T27]], 2.550000e+02
; CHECK-NEXT: [[T29:%.*]] = getelementptr inbounds float, float* [[T:%.*]], i64 0		; CHECK-NEXT: [[T29:%.*]] = getelementptr inbounds float, float* [[T]], i64 0
; CHECK-NEXT: store float [[T28]], float* [[T29]], align 8		; CHECK-NEXT: store float [[T28]], float* [[T29]], align 8
; CHECK-NEXT: [[T30:%.*]] = load i8, i8* [[T23]], align 1		; CHECK-NEXT: [[T30:%.*]] = load i8, i8* [[T23]], align 1
; CHECK-NEXT: [[T31:%.*]] = uitofp i8 [[T30]] to float		; CHECK-NEXT: [[T31:%.*]] = uitofp i8 [[T30]] to float
; CHECK-NEXT: [[T32:%.*]] = fdiv float [[T31]], 2.550000e+02		; CHECK-NEXT: [[T32:%.*]] = fdiv float [[T31]], 2.550000e+02
; CHECK-NEXT: [[T33:%.*]] = getelementptr inbounds float, float* [[T]], i64 1		; CHECK-NEXT: [[T33:%.*]] = getelementptr inbounds float, float* [[T]], i64 1
; CHECK-NEXT: store float [[T32]], float* [[T33]], align 4		; CHECK-NEXT: store float [[T32]], float* [[T33]], align 4
; CHECK-NEXT: [[T34:%.*]] = load i8, i8* [[T24]], align 1		; CHECK-NEXT: [[T34:%.*]] = load i8, i8* [[T24]], align 1
; CHECK-NEXT: [[T35:%.*]] = uitofp i8 [[T34]] to float		; CHECK-NEXT: [[T35:%.*]] = uitofp i8 [[T34]] to float
; CHECK-NEXT: [[T36:%.*]] = fdiv float [[T35]], 2.550000e+02		; CHECK-NEXT: [[T36:%.*]] = fdiv float [[T35]], 2.550000e+02
; CHECK-NEXT: [[T37:%.*]] = getelementptr inbounds float, float* [[T]], i64 2		; CHECK-NEXT: [[T37:%.*]] = getelementptr inbounds float, float* [[T]], i64 2
; CHECK-NEXT: store float [[T36]], float* [[T37]], align 8		; CHECK-NEXT: store float [[T36]], float* [[T37]], align 8
; CHECK-NEXT: [[T38:%.*]] = load i8, i8* [[T25]], align 1		; CHECK-NEXT: [[T38:%.*]] = load i8, i8* [[T25]], align 1
; CHECK-NEXT: [[T39:%.*]] = uitofp i8 [[T38]] to float		; CHECK-NEXT: [[T39:%.*]] = uitofp i8 [[T38]] to float
; CHECK-NEXT: [[T40:%.*]] = fdiv float [[T39]], 2.550000e+02		; CHECK-NEXT: [[T40:%.*]] = fdiv float [[T39]], 2.550000e+02
; CHECK-NEXT: [[T41:%.*]] = getelementptr inbounds float, float* [[T]], i64 3		; CHECK-NEXT: [[T41:%.*]] = getelementptr inbounds float, float* [[T]], i64 3
; CHECK-NEXT: store float [[T40]], float* [[T41]], align 4		; CHECK-NEXT: store float [[T40]], float* [[T41]], align 4
		; CHECK-NEXT: br label [[BB18_MERGE:%.*]]
		; CHECK: bb18.merge:
; CHECK-NEXT: ret void		; CHECK-NEXT: ret void
		; CHECK: bb18.slpversioned1:
		; CHECK-NEXT: [[T295:%.*]] = getelementptr inbounds float, float* [[T]], i64 0
		; CHECK-NEXT: [[TMP3:%.*]] = bitcast i8* [[T22]] to <4 x i8>*
		; CHECK-NEXT: [[TMP4:%.*]] = load <4 x i8>, <4 x i8>* [[TMP3]], align 1, !alias.scope !35, !noalias !38
		; CHECK-NEXT: [[TMP5:%.*]] = uitofp <4 x i8> [[TMP4]] to <4 x float>
		; CHECK-NEXT: [[TMP6:%.*]] = fdiv <4 x float> [[TMP5]], <float 2.550000e+02, float 2.550000e+02, float 2.550000e+02, float 2.550000e+02>
		; CHECK-NEXT: [[TMP7:%.*]] = bitcast float* [[T295]] to <4 x float>*
		; CHECK-NEXT: store <4 x float> [[TMP6]], <4 x float>* [[TMP7]], align 8, !alias.scope !38, !noalias !35
		; CHECK-NEXT: br label [[BB18_MERGE]]
;		;
bb:		bb:
%t6 = icmp slt i32 10, 0		%t6 = icmp slt i32 10, 0
%t7 = icmp sgt i32 20, 20		%t7 = icmp sgt i32 20, 20
%t9 = select i1 %t7, i32 5, i32 0		%t9 = select i1 %t7, i32 5, i32 0
%t10 = select i1 %t6, i32 0, i32 %t9		%t10 = select i1 %t6, i32 0, i32 %t9
%t11 = icmp slt i32 10, 0		%t11 = icmp slt i32 10, 0
%t12 = icmp sgt i32 20, 20		%t12 = icmp sgt i32 20, 20
[... 108 lines not shown ...]

llvm/test/Transforms/SLPVectorizer/X86/memory-runtime-checks.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt -scoped-noalias-aa -slp-vectorizer -mtriple=x86_64-apple-darwin -enable-new-pm=false -S %s | FileCheck %s			; RUN: opt -aa-pipeline='basic-aa,scoped-noalias-aa' -slp-memory-versioning -passes=slp-vectorizer -mtriple=x86_64-apple-darwin -S %s | FileCheck %s
	; RUN: opt -aa-pipeline='basic-aa,scoped-noalias-aa' -passes=slp-vectorizer -mtriple=x86_64-apple-darwin -S %s | FileCheck %s			; RUN: opt -aa-pipeline='basic-aa,scoped-noalias-aa' -slp-memory-versioning=false -passes=slp-vectorizer -mtriple=x86_64-apple-darwin -S %s | FileCheck --check-prefix=NOVERSION %s

				; NOVERSION-NOT: memcheck

	define void @version_multiple(i32* nocapture %out_block, i32* nocapture readonly %counter) {			define void @version_multiple(i32* nocapture %out_block, i32* nocapture readonly %counter) {
	; CHECK-LABEL: @version_multiple(			; CHECK-LABEL: @version_multiple(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: [[TMP0:%.*]] = load i32, i32* [[COUNTER:%.*]], align 4			; CHECK-NEXT: [[OUT_BLOCK13:%.*]] = ptrtoint i32* [[OUT_BLOCK:%.*]] to i64
	; CHECK-NEXT: [[TMP1:%.*]] = load i32, i32* [[OUT_BLOCK:%.*]], align 4			; CHECK-NEXT: [[COUNTER12:%.*]] = ptrtoint i32* [[COUNTER:%.*]] to i64
	; CHECK-NEXT: [[XOR:%.*]] = xor i32 [[TMP1]], [[TMP0]]			; CHECK-NEXT: [[TMP0:%.*]] = sub i64 [[COUNTER12]], [[OUT_BLOCK13]]
				; CHECK-NEXT: [[DIFF_CHECK:%.*]] = icmp ult i64 [[TMP0]], 16
				; CHECK-NEXT: br i1 [[DIFF_CHECK]], label [[ENTRY_SCALAR:%.*]], label [[ENTRY_SLPVERSIONED1:%.*]]
				; CHECK: entry.scalar:
				; CHECK-NEXT: [[TMP1:%.*]] = load i32, i32* [[COUNTER]], align 4
				; CHECK-NEXT: [[TMP2:%.*]] = load i32, i32* [[OUT_BLOCK]], align 4
				; CHECK-NEXT: [[XOR:%.*]] = xor i32 [[TMP2]], [[TMP1]]
	; CHECK-NEXT: store i32 [[XOR]], i32* [[OUT_BLOCK]], align 4			; CHECK-NEXT: store i32 [[XOR]], i32* [[OUT_BLOCK]], align 4
	; CHECK-NEXT: [[ARRAYIDX_1:%.*]] = getelementptr inbounds i32, i32* [[COUNTER]], i64 1			; CHECK-NEXT: [[ARRAYIDX_1:%.*]] = getelementptr inbounds i32, i32* [[COUNTER]], i64 1
	; CHECK-NEXT: [[TMP2:%.*]] = load i32, i32* [[ARRAYIDX_1]], align 4			; CHECK-NEXT: [[TMP3:%.*]] = load i32, i32* [[ARRAYIDX_1]], align 4
	; CHECK-NEXT: [[ARRAYIDX2_1:%.*]] = getelementptr inbounds i32, i32* [[OUT_BLOCK]], i64 1			; CHECK-NEXT: [[ARRAYIDX2_1:%.*]] = getelementptr inbounds i32, i32* [[OUT_BLOCK]], i64 1
	; CHECK-NEXT: [[TMP3:%.*]] = load i32, i32* [[ARRAYIDX2_1]], align 4			; CHECK-NEXT: [[TMP4:%.*]] = load i32, i32* [[ARRAYIDX2_1]], align 4
	; CHECK-NEXT: [[XOR_1:%.*]] = xor i32 [[TMP3]], [[TMP2]]			; CHECK-NEXT: [[XOR_1:%.*]] = xor i32 [[TMP4]], [[TMP3]]
	; CHECK-NEXT: store i32 [[XOR_1]], i32* [[ARRAYIDX2_1]], align 4			; CHECK-NEXT: store i32 [[XOR_1]], i32* [[ARRAYIDX2_1]], align 4
	; CHECK-NEXT: [[ARRAYIDX_2:%.*]] = getelementptr inbounds i32, i32* [[COUNTER]], i64 2			; CHECK-NEXT: [[ARRAYIDX_2:%.*]] = getelementptr inbounds i32, i32* [[COUNTER]], i64 2
	; CHECK-NEXT: [[TMP4:%.*]] = load i32, i32* [[ARRAYIDX_2]], align 4			; CHECK-NEXT: [[TMP5:%.*]] = load i32, i32* [[ARRAYIDX_2]], align 4
	; CHECK-NEXT: [[ARRAYIDX2_2:%.*]] = getelementptr inbounds i32, i32* [[OUT_BLOCK]], i64 2			; CHECK-NEXT: [[ARRAYIDX2_2:%.*]] = getelementptr inbounds i32, i32* [[OUT_BLOCK]], i64 2
	; CHECK-NEXT: [[TMP5:%.*]] = load i32, i32* [[ARRAYIDX2_2]], align 4			; CHECK-NEXT: [[TMP6:%.*]] = load i32, i32* [[ARRAYIDX2_2]], align 4
	; CHECK-NEXT: [[XOR_2:%.*]] = xor i32 [[TMP5]], [[TMP4]]			; CHECK-NEXT: [[XOR_2:%.*]] = xor i32 [[TMP6]], [[TMP5]]
	; CHECK-NEXT: store i32 [[XOR_2]], i32* [[ARRAYIDX2_2]], align 4			; CHECK-NEXT: store i32 [[XOR_2]], i32* [[ARRAYIDX2_2]], align 4
	; CHECK-NEXT: [[ARRAYIDX_3:%.*]] = getelementptr inbounds i32, i32* [[COUNTER]], i64 3			; CHECK-NEXT: [[ARRAYIDX_3:%.*]] = getelementptr inbounds i32, i32* [[COUNTER]], i64 3
	; CHECK-NEXT: [[TMP6:%.*]] = load i32, i32* [[ARRAYIDX_3]], align 4			; CHECK-NEXT: [[TMP7:%.*]] = load i32, i32* [[ARRAYIDX_3]], align 4
	; CHECK-NEXT: [[ARRAYIDX2_3:%.*]] = getelementptr inbounds i32, i32* [[OUT_BLOCK]], i64 3			; CHECK-NEXT: [[ARRAYIDX2_3:%.*]] = getelementptr inbounds i32, i32* [[OUT_BLOCK]], i64 3
	; CHECK-NEXT: [[TMP7:%.*]] = load i32, i32* [[ARRAYIDX2_3]], align 4			; CHECK-NEXT: [[TMP8:%.*]] = load i32, i32* [[ARRAYIDX2_3]], align 4
	; CHECK-NEXT: [[XOR_3:%.*]] = xor i32 [[TMP7]], [[TMP6]]			; CHECK-NEXT: [[XOR_3:%.*]] = xor i32 [[TMP8]], [[TMP7]]
	; CHECK-NEXT: store i32 [[XOR_3]], i32* [[ARRAYIDX2_3]], align 4			; CHECK-NEXT: store i32 [[XOR_3]], i32* [[ARRAYIDX2_3]], align 4
				; CHECK-NEXT: br label [[ENTRY_MERGE:%.*]]
				; CHECK: entry.merge:
	; CHECK-NEXT: ret void			; CHECK-NEXT: ret void
				; CHECK: entry.slpversioned1:
				; CHECK-NEXT: [[TMP9:%.*]] = bitcast i32* [[COUNTER]] to <4 x i32>*
				; CHECK-NEXT: [[TMP10:%.*]] = load <4 x i32>, <4 x i32>* [[TMP9]], align 4, !alias.scope !0, !noalias !3
				; CHECK-NEXT: [[TMP11:%.*]] = bitcast i32* [[OUT_BLOCK]] to <4 x i32>*
				; CHECK-NEXT: [[TMP12:%.*]] = load <4 x i32>, <4 x i32>* [[TMP11]], align 4, !alias.scope !3, !noalias !0
				; CHECK-NEXT: [[TMP13:%.*]] = xor <4 x i32> [[TMP12]], [[TMP10]]
				; CHECK-NEXT: [[TMP14:%.*]] = bitcast i32* [[OUT_BLOCK]] to <4 x i32>*
				; CHECK-NEXT: store <4 x i32> [[TMP13]], <4 x i32>* [[TMP14]], align 4, !alias.scope !3, !noalias !0
				; CHECK-NEXT: br label [[ENTRY_MERGE]]
	;			;
	entry:			entry:
	%0 = load i32, i32* %counter, align 4			%0 = load i32, i32* %counter, align 4
	%1 = load i32, i32* %out_block, align 4			%1 = load i32, i32* %out_block, align 4
	%xor = xor i32 %1, %0			%xor = xor i32 %1, %0
	store i32 %xor, i32* %out_block, align 4			store i32 %xor, i32* %out_block, align 4
	%arrayidx.1 = getelementptr inbounds i32, i32* %counter, i64 1			%arrayidx.1 = getelementptr inbounds i32, i32* %counter, i64 1
	%2 = load i32, i32* %arrayidx.1, align 4			%2 = load i32, i32* %arrayidx.1, align 4
	[... 298 lines not shown ...]

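The `DIFF_CHECK` sequence in the CHECK lines above (two `ptrtoint` casts, a `sub`, and an `icmp ult` against 16) can be read as a pointer-distance test. Below is a minimal C++ sketch of that arithmetic. The function name `takesScalarPath` and its parameters are illustrative assumptions, not part of the patch, and the sketch only mirrors the shape of the emitted check rather than the pass's actual implementation.

```cpp
#include <cstdint>

// Hypothetical helper (not from the patch): mirrors the IR pattern
//   %a = ptrtoint <counter>, %b = ptrtoint <out_block>
//   %d = sub i64 %a, %b
//   %c = icmp ult i64 %d, <widthBytes>
// Unsigned subtraction wraps, so a negative distance becomes a very large
// value and compares as "not less than" widthBytes.
static bool takesScalarPath(const void *counter, const void *outBlock,
                            std::uint64_t widthBytes) {
  const std::uint64_t diff =
      static_cast<std::uint64_t>(reinterpret_cast<std::uintptr_t>(counter) -
                                 reinterpret_cast<std::uintptr_t>(outBlock));
  return diff < widthBytes;
}
```

With `widthBytes` set to 16 (four `i32` lanes), a true result corresponds to the branch to `entry.scalar` in `@version_multiple`, and a false result to the branch to `entry.slpversioned1`.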
Diff 454933
