This is an archive of the discontinued LLVM Phabricator instance.

[LoopVectorize]Teach Loop Vectorizer about interleaved data accesses
AbandonedPublic

Authored by HaoLiu on Apr 3 2015, 3:42 AM.

Details

Summary

Hi,

Two weeks ago, I posted a rough patch for RFC review titled with "[RFC][PATCH][LoopVectorize] Teach Loop Vectorizer about interleaved data accesses". I received many comments. Thanks a lot for all your help!

According to the comments, I've refactored my patch, mainly about how to transform several interleave accesses to the vectorized version. The solution is to use two new intrinsics: index.load & index.store.

The attached patch mainly achieves:

(1) Teach LoopVectorizer to identify interleave accesses in the Legality phase. 
(2) Teach LoopVectorizer to transform interleave accesses to index.load/index.store with specific interleaved indices.
(3) Add a new simple pass in the AArch64 backend. The pass can match the specific index.load/index.store intrinsics to the ldN/stN intrinsics, so that AArch64 backend can generate ldN/stN instructions.
(4) Add two new intrinsics: index.load, index.store.
(5) Teach the LoopAccessAnalysis to check the memory dependence between strided accesses.

For the correctness, I've tested the patch with LNT, SPEC2000, SPEC2006, EEMBC, GEEKBench on AArch64 target.

For the performance, some specific benchmarks like EEMBC.rgbcmy and EEMBC.rgbyiq are expected to have huge improvements (about 6x and 3x). But two other issues prevent the loop vectorization opportunities:

Too many runtime memory checks
Type promote issue. i8 is promoted to i32, which introduce additional ZEXT and TRUNC (i8 is illegal in AArch64 but <8xi8> and <16xi8> are legal).

Anyway, these issues should be solved in the future.

Ask for code review.

Thanks,
-Hao

Diff Detail

Event Timeline

HaoLiu updated this revision to Diff 23210.Apr 3 2015, 3:42 AM
HaoLiu retitled this revision from to [LoopVectorize]Teach Loop Vectorizer about interleaved data accesses.
HaoLiu updated this object.
HaoLiu edited the test plan for this revision. (Show Details)
HaoLiu added a subscriber: Unknown Object (MLST).
HaoLiu updated this revision to Diff 23211.Apr 3 2015, 3:47 AM
delena edited edge metadata.Apr 3 2015, 2:41 PM

(1) Teach LoopVectorizer to identify interleave accesses in the Legality phase.
Ask LoopVectorizer (Legality->target) whether the specified indices are profitable, give data type and list of constant indices.
Different sequence may be profitable for different targets.

(2) Teach LoopVectorizer to transform interleave accesses to index.load/index.store with specific interleaved indices.

index.load/store should just receive the indices. Not just interleave.

  • Elena

Hi Hao,

Thanks for working on this.

I share Ahmed’s questions and concerns.
Also, as a very high level comment, where is the part of the patch that update the documentation for those intrinsics?
Without that documentation, it is hard, at least at first glance, to review the whole patch.

Thanks,
-Quentin

Hi Ahmed,

The attached patch mainly achieves:

(1) Teach LoopVectorizer to identify interleave accesses in the Legality

phase.

(2) Teach LoopVectorizer to transform interleave accesses to

index.load/index.store with specific interleaved indices.

Very nice!

I'm curious, how does this relate (if at all) to loops like:

for (size_t i = 0; i < n; i+=2L) {
   sum += (a[i  ] + b[i  ]);
   sum += (a[i+1] + b[i+1]);
}

or even:

for (size_t i = 0; i < n; i+=2L) {
   sum += (a[i  ] + b[i  ]) * (a[i+1] + b[i+1]);
}

[Hao Liu]
A very nice case. Unfortunately, this patch cannot handle such case. To handle such case, need to schedule the load order as below:

for (size_t i = 0; i < n; i+=2L) {
   int a0 = a[i], a1 = a[i+1];
   int b0 = b[i], b1 = b[i+1];
   sum += (a0 + b0) * (a1 + b1);
}

The problem is that the identification logic is simple and cannot analyze mixed loads/stores with different base pointers/strides.
I.E. it can identify "A[i], A[i+1], B[i], B[i+1]" or "A[i+1], A[i], B[i+1], B[i]", but cannot identify "A[i], B[i], A[i+1], B[i+1]".
As to support such case needs more work, I'll add a TODO so that we can improve this in the future.

which we currently do a terrible job at (the loop reroller can't help for the
second one).

(3) Add a new simple pass in the AArch64 backend. The pass can match the

specific index.load/index.store intrinsics to the ldN/stN intrinsics, so that
AArch64 backend can generate ldN/stN instructions.

Why not do this in the SelectionDAG?

(4) Add two new intrinsics: index.load, index.store.

This is a pretty big deal, no? The title is unassuming, should this go into a
dedicated thread? I recall people wondering if there's a way to combine these
new load/store intrinsics, and I can't shake that feeling. I don't really have a
proposal though ;)

Also, once there's consensus (which there seems to be), I guess these would
need legalization?

[Hao Liu]
I think that's reasonable. As Renato, Elena and Quentin also has similar concerns. I'll split this patch into two:

(1) One patch for new intrinsics including adding llvm.index.load/llvm.index.store, handling such intrinsics in SelectionDAG, modifying LangRef.
(2) Another specific patch for LoopVectorizer including identifying and generating index.load/index.store.

What do you think?

Thanks,
-Hao

-Ahmed

(5) Teach the LoopAccessAnalysis to check the memory dependence

between strided accesses.

For the correctness, I've tested the patch with LNT, SPEC2000, SPEC2006,

EEMBC, GEEKBench on AArch64 target.

For the performance, some specific benchmarks like EEMBC.rgbcmy and

EEMBC.rgbyiq are expected to have huge improvements (about 6x and 3x). But
two other issues prevent the loop vectorization opportunities:

Too many runtime memory checks
Type promote issue. i8 is promoted to i32, which introduce additional ZEXT

and TRUNC (i8 is illegal in AArch64 but <8xi8> and <16xi8> are legal).

Anyway, these issues should be solved in the future.

Ask for code review.

Thanks,
-Hao

http://reviews.llvm.org/D8820

Files:

include/llvm/Analysis/TargetTransformInfo.h
include/llvm/Analysis/TargetTransformInfoImpl.h
include/llvm/IR/IRBuilder.h
include/llvm/IR/Intrinsics.td
lib/Analysis/LoopAccessAnalysis.cpp
lib/Analysis/TargetTransformInfo.cpp
lib/IR/IRBuilder.cpp
lib/Target/AArch64/AArch64.h
lib/Target/AArch64/AArch64InterleaveAccess.cpp
lib/Target/AArch64/AArch64TargetMachine.cpp
lib/Target/AArch64/AArch64TargetTransformInfo.cpp
lib/Target/AArch64/AArch64TargetTransformInfo.h
lib/Target/AArch64/CMakeLists.txt
lib/Transforms/Vectorize/LoopVectorize.cpp
test/CodeGen/AArch64/interleaved-access-to-ldN-stN.ll
test/Transforms/LoopVectorize/AArch64/arbitrary-induction-step.ll
test/Transforms/LoopVectorize/AArch64/interleaved-access.ll

EMAIL PREFERENCES

http://reviews.llvm.org/settings/panel/emailpreferences/

llvm-commits mailing list
llvm-commits@cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits

Hi Renato,

I think that's reasonable and I'll do that.

-----Original Message-----
From: Renato Golin [mailto:renato.golin@linaro.org]
Sent: 2015年4月5日 3:18
To: reviews+D8820+public+ca9328ed1235678c@reviews.llvm.org
Cc: Hao Liu; Arnold Schwaighofer; Hal Finkel; Tim Northover; Demikhovsky,
Elena; Quentin Colombet; Amara Emerson; LLVM Commits
Subject: Re: [PATCH] [LoopVectorize]Teach Loop Vectorizer about interleaved
data accesses

Hi Hao,

Thanks for working on this.

I share Ahmed’s questions and concerns.
Also, as a very high level comment, where is the part of the patch that update the documentation for those intrinsics?
Without that documentation, it is hard, at least at first glance, to review the whole patch.

Thanks,
-Quentin

HI Quentin,

(1) Teach LoopVectorizer to identify interleave accesses in the Legality phase.
Ask LoopVectorizer (Legality->target) whether the specified indices are profitable, give data type and list of constant indices.
Different sequence may be profitable for different targets.

(2) Teach LoopVectorizer to transform interleave accesses to index.load/index.store with specific interleaved indices.

index.load/store should just receive the indices. Not just interleave.

  • Elena

Hi Elena,

That's reasonable. I think I could split this patch into two:

(1) One patch for new intrinsics including adding llvm.index.load/llvm.index.store, handling such intrinsics in SelectionDAG, modifying LangRef.
(2) Another specific patch for LoopVectorizer including identifying and generating index.load/index.store.

Thanks,
-Hao

koviankevin removed a subscriber: koviankevin.
HaoLiu abandoned this revision.Apr 30 2015, 4:17 AM