
Add new indexed load/store intrinsics.
Abandoned · Public

Authored by HaoLiu on Apr 22 2015, 6:00 AM.

Details

Summary

Hi,

According to the comments in D8820 (Teach Loop Vectorizer about interleaved data accesses), I have split that patch; this is the first part, which adds support for the new intrinsics:

<4 x double> @llvm.indexed.load.v4f64 (double* <ptr>, <4 x i32> <index>, i32 <alignment>)
void @llvm.indexed.store.v4f64 (<4 x double> <value>, double* <ptr>, <4 x i32> <index>, i32 <alignment>)

Such intrinsics can represent interleaved loads/stores, strided loads/stores, and so on.
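As a rough model of the intended semantics (a Python sketch, not the actual implementation; the function names are illustrative), an indexed load gathers one element per lane at base + index[i], and an indexed store scatters the lanes back:

```python
def indexed_load(memory, base, indices):
    """Gather one element per lane: result[i] = memory[base + indices[i]]."""
    return [memory[base + i] for i in indices]

def indexed_store(values, memory, base, indices):
    """Scatter one element per lane: memory[base + indices[i]] = values[i]."""
    for v, i in zip(values, indices):
        memory[base + i] = v

# A stride-2 access pulls one stream out of two interleaved streams
# {A0, B0, A1, B1, ...}:
mem = [1.0, 10.0, 2.0, 20.0, 3.0, 30.0, 4.0, 40.0]
print(indexed_load(mem, 0, [0, 2, 4, 6]))  # the A-stream: [1.0, 2.0, 3.0, 4.0]
```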

I am a bit worried about the name "indexed". There is already an "indexed" load/store naming convention for loads/stores with indexed addressing modes (pre-increment, post-increment, pre-decrement, and so on), and there are masked loads/stores for predicated loads/stores. I could not find a better name for a load/store that takes a vector of indices; if you find this name confusing and have a better one, I can change it.

The implementation is similar to that of the masked load/store intrinsics. This patch mainly does the following:

(1) Add the two new intrinsics and update LangRef.rst.
(2) Add code generation for the new intrinsics, including AArch64 backend codegen for interleaved loads/stores, which are a subset of indexed loads/stores.
(3) Teach CodeGenPrepare to scalarize unsupported indexed loads/stores.

There is no code in the legalization phase, as the AArch64 backend supports no indexed loads/stores other than interleaved ones. Even if I added such code, I could not test it. In any case, CodeGenPrepare handles the unsupported cases.

There are TODOs in CodeGenPrepare: some indexed loads/stores can be transformed into "a vector load + a shufflevector" or "a shufflevector + a vector store".
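For example, an indexed load whose indices are a permutation of 0..n-1 reads exactly one contiguous vector, so it could become a plain vector load plus a shufflevector. A quick Python sketch of that check (illustrative only, not the CodeGenPrepare code):

```python
def can_use_load_plus_shuffle(indices):
    """An indexed load whose indices are a permutation of 0..n-1 touches
    exactly one contiguous n-element vector, so it can be lowered to a
    plain vector load followed by a shufflevector with those indices."""
    return sorted(indices) == list(range(len(indices)))

print(can_use_load_plus_shuffle([0, 2, 1, 3]))  # True: load + shuffle works
print(can_use_load_plus_shuffle([0, 2, 4, 6]))  # False: spans two vectors
```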

Please review.

Thanks,
-Hao

Diff Detail

Event Timeline

HaoLiu updated this revision to Diff 24217.Apr 22 2015, 6:00 AM
HaoLiu retitled this revision from to Add new indexed load/store intrinsics..
HaoLiu updated this object.
HaoLiu edited the test plan for this revision. (Show Details)
HaoLiu added reviewers: delena, rengolin, ab, qcolombet.
HaoLiu edited subscribers, added: Unknown Object (MLST); removed: aemerson.Apr 22 2015, 6:01 AM

If the names indexed.load and indexed.store are not reasonable, how about indexed.gather and indexed.scatter, like the names Elena used in D7665?

Thanks,
-Hao

HaoLiu planned changes to this revision.Apr 22 2015, 6:22 AM
ab edited edge metadata.Apr 22 2015, 8:10 AM

Thanks for splitting the patches, it makes for a nicer read!

I'm concerned about CodeGenPrepare legalization. Isn't it a non-mandatory optimization pass? A quick grep shows it's not even enabled at -O0, what happens then?

-Ahmed

rengolin edited edge metadata.Apr 22 2015, 8:29 AM

If the names indexed.load and indexed.store are not reasonable, how about indexed.gather and indexed.scatter, like the names Elena used in D7665?

Hao,

I was under the impression that you two would use the same intrinsics: even though the hardware implementations differ, the IR concept should be generic enough to represent scatter/gather, masked loads, and indexed loads in a uniform way. Please comment on Elena's patch if you think we should use her intrinsics, but with a more generic name.

I don't want us creating too many load/store intrinsics, as that means poor matching for optimisation passes and makes the IR more complex than it should be.

cheers,
--renato

ab added a comment.Apr 22 2015, 11:45 AM

If the names indexed.load and indexed.store are not reasonable, how about indexed.gather and indexed.scatter, like the names Elena used in D7665?

Hao,

I was under the impression that you two would use the same intrinsics: even though the hardware implementations differ, the IR concept should be generic enough to represent scatter/gather, masked loads, and indexed loads in a uniform way. Please comment on Elena's patch if you think we should use her intrinsics, but with a more generic name.

Again, I agree, and now I think just straight reusing Elena's intrinsic is probably enough.

This:

%i3 = ...
...
%indices = insertelement  ...,  i32 %i3, i32 3
<4 x double> @llvm.indexed.load.v4f64 (double* %base, <4 x i32> %indices, i32 <alignment>)

seems equivalent to:

%i3 = ...
...
%p3 = getelementptr double* %base, i32 %i3
...
%ptrs = insertelement ..., double* %p3, i32 3
<4 x double> @llvm.masked.gather.v4f64 (<4 x double*> %ptrs, i32 <alignment>, <4 x i1> <1, 1, 1, 1>, <4 x double> undef)

The only problem is that the GEPs might get combined with the base somehow. So the SelectionDAGBuilder (where it's useful to have both MLOAD and ILOAD) will have to look at the pointers, determine that they are all some constant distance apart, and use ILOAD instead of MLOAD.
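The stride check Ahmed describes can be sketched in Python (illustrative only, not the actual SelectionDAGBuilder code): given the per-lane offsets recovered from the GEPs, decide whether they are equally spaced:

```python
def constant_stride(offsets):
    """Return the common stride if the per-lane offsets are equally
    spaced, else None. This mirrors, informally, the check a DAG builder
    could do before lowering a masked gather over GEP'd pointers as an
    interleaved/indexed load instead of a generic gather."""
    if len(offsets) < 2:
        return None
    stride = offsets[1] - offsets[0]
    if all(b - a == stride for a, b in zip(offsets, offsets[1:])):
        return stride
    return None

print(constant_stride([0, 2, 4, 6]))  # 2: eligible for an interleaved load
print(constant_stride([0, 3, 4, 6]))  # None: fall back to a generic gather
```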

-Ahmed

Hi Renato and Ahmed,

I agree with your comments.

But I want to change the plan, because I think there may be no need for intrinsics at all. For an interleaved load of <4 x double>:

<4 x double> @llvm.indexed.load.v4f64 (double* <ptr>, <4 x i32> <index>, i32 <alignment>)

I think we can use two plain IR instructions:

<value> = load <4 x double>, <4 x double>* <ptr>
shufflevector <4 x double> <value>, <4 x double> undef, <4 x i32> <0, 2, 1, 3>

Even though it is more complex for a backend to match two IR instructions, it is achievable. The disadvantage of intrinsics, I think, is that they are not easy to optimize.
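The equivalence can be sketched in Python (a small model of shufflevector semantics, not real IR): a full vector load followed by a shuffle with mask <0, 2, 1, 3> yields the same lanes as an indexed load with those indices:

```python
def shufflevector(v1, v2, mask):
    """Model of LLVM's shufflevector: pick lanes out of the
    concatenation of the two input vectors according to the mask."""
    concat = v1 + v2
    return [concat[i] for i in mask]

# Load <4 x double> from memory, then permute with <0, 2, 1, 3>;
# this matches an indexed load with indices <0, 2, 1, 3>.
mem = [1.0, 2.0, 3.0, 4.0]
wide = mem[0:4]              # the plain vector load
undef = [None] * 4           # the undef second operand
print(shufflevector(wide, undef, [0, 2, 1, 3]))  # [1.0, 3.0, 2.0, 4.0]
```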

I want to implement loop vectorization of interleaved memory accesses with vector load/store + shufflevector.

What do you think?

Thanks,
-Hao

HaoLiu abandoned this revision.Apr 22 2015, 7:38 PM
HaoLiu reclaimed this revision.
delena edited edge metadata.Apr 26 2015, 5:05 AM

We are interested in these intrinsics anyway:

<4 x double> @llvm.masked.indexed.load.v4f64 (double* <ptr>, <4 x i32> <index>, i32 <alignment>, <4 x i1> %mask, <4 x double> %passthru)

The presence of indices means that a plain sequential load is not a solution.
If you are not going to implement this interface now, I'll do it later.

  • Elena
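A Python sketch of one reading of the proposed masked, indexed semantics (the behavior is an assumption inferred from the signature above, not a confirmed design): lanes whose mask bit is set gather from memory, while disabled lanes keep the passthru value:

```python
def masked_indexed_load(memory, base, indices, mask, passthru):
    """Assumed semantics of the proposed llvm.masked.indexed.load:
    enabled lanes gather memory[base + indices[i]]; disabled lanes
    return passthru[i] unchanged."""
    return [memory[base + idx] if m else p
            for idx, m, p in zip(indices, mask, passthru)]

mem = [1.0, 2.0, 3.0, 4.0]
print(masked_indexed_load(mem, 0, [0, 2, 1, 3],
                          [True, False, True, False],
                          [9.0, 9.0, 9.0, 9.0]))  # [1.0, 9.0, 2.0, 9.0]
```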

From: Renato Golin [mailto:renato.golin@linaro.org]
Sent: Thursday, April 23, 2015 10:29
To: reviews+D9194+public+a2f9c543cd08afc3@reviews.llvm.org
Cc: Ahmed Bougacha; Hao.Liu@arm.com; llvm-commits@cs.uiuc.edu; qcolombet@apple.com; Demikhovsky, Elena
Subject: Re: [PATCH] Add new indexed load/store intrinsics.


Yes, it's still useful in your situation. I just think the interleaved load/store doesn't need this intrinsic.

Thanks,
-Hao

HaoLiu abandoned this revision.Apr 27 2015, 1:50 AM