This is an archive of the discontinued LLVM Phabricator instance.

Expand interleaved memory access pass to identify certain shuffle_vector and transform it into target specific intrinsics.
Needs Review · Public

Authored by wxz2020 on Feb 28 2020, 3:24 PM.

Details

Reviewers

bogner
dorit
RKSimon
spatel
ctetreau

Summary

E.g. an interleaved load (Factor = 4):

%wide.vec = load <8 x i16>, <8 x i16>* %ptr
%strided.vec = shufflevector <8 x i16> %wide.vec, <8 x i16> undef, <2 x i32> <i32 0, i32 4>

%v1 = uitofp <2 x i16> %strided.vec to <2 x double>

It can be transformed into a tbl1 intrinsic in the AArch64 backend to avoid the high-cost extract/insert sequences.

The change is also reflected in the InterleavedMemoryOpCost calculation, which the loop vectorizer uses
to decide whether to vectorize a loop.

This change gives SPEC2017 538.imagick_r an 11.5% performance boost.

Tested using:
%llvm/build/bin/llvm-lit ../../test/*

There is no regression in these tests.

We also ran the whole SPEC2017 suite: all benchmarks pass and there is no performance regression.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

wxz2020 created this revision. Feb 28 2020, 3:24 PM

Herald added a project: Restricted Project. Feb 28 2020, 3:24 PM

Herald added subscribers: llvm-commits, hiraditya, kristof.beyls.

joelkevinjones added a subscriber: joelkevinjones. Feb 28 2020, 3:25 PM

Harbormaster failed remote builds in B47645: Diff 247374! Feb 28 2020, 3:48 PM

Hello. Looks interesting. We ended up doing something similar with lowering of interleaved access groups to VMOVN instructions in MVE. It went straight through ISel though, not needing to go via the InterleavedAccessPass. I don't immediately see why this case would need to be done differently. It looks like ISel can already generate at least some TBL1 instructions.

Can you add some testcases for this? Both for producing this in the vectorizer/costmodel tests and for the backend codegen of the load+shuffle patterns you expect to see.

Some other initial thoughts:

Can you run clang-format over the patch. That would lower the amount of noise from the lint bot.
VF can be calculated from VecTy and Factor.
Checking the instruction users are a certain kind sounds odd. Can you explain why it's checking that and only generating in those cases?
I was half expecting BaseT::getInterleavedMemoryOpCost to return something like getMemoryOpCost + getShuffleCost, but it seems to use the cost of inserts + extracts.
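
The remark that VF can be calculated from VecTy and Factor can be illustrated with a minimal sketch. It assumes that VecTy is the wide vector type of the interleaved access (for example <8 x i16>), so that its element count is VF * Factor. The helper name `vfFromWideVector` is hypothetical and is not an LLVM API.

```cpp
#include <cassert>

// Hypothetical helper (not an LLVM API): derives the vectorization factor
// from the element count of the wide, interleaved vector type.
// Assumption: the wide vector holds exactly VF * Factor elements.
unsigned vfFromWideVector(unsigned NumWideElts, unsigned Factor) {
  assert(Factor != 0 && "interleave factor must be non-zero");
  assert(NumWideElts % Factor == 0 && "wide vector must hold whole groups");
  return NumWideElts / Factor;
}
```

For the example in the summary, `vfFromWideVector(8, 4)` gives 2, which matches the <2 x i16> result of the shuffle.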

RKSimon added reviewers: RKSimon, spatel. Mar 1 2020, 1:31 PM

This upload changed the clang-format problems.

Harbormaster failed remote builds in B48062: Diff 248210! Mar 4 2020, 10:05 AM

Yet another missing format error.

Harbormaster failed remote builds in B48081: Diff 248253! Mar 4 2020, 11:48 AM

Fix clang-tidy warning and one clang-format error.

lkail added a subscriber: lkail. Mar 7 2020, 2:48 AM

Herald added a subscriber: danielkiss. Mar 7 2020, 2:48 AM

xbolva00 added a subscriber: xbolva00. Mar 7 2020, 5:28 AM

xbolva00 added inline comments.

llvm/lib/CodeGen/InterleavedAccessPass.cpp
457	Missing word?

wxz2020 added inline comments. Mar 9 2020, 8:43 AM

llvm/lib/CodeGen/InterleavedAccessPass.cpp
457	Will remove it. Wrong cut and paste.

Fix a minor comment.

The title of this patch confused me. This is not adding a pass; it is extending existing passes.
I did not see an answer to @dmgreen's questions - maybe the biggest one: why is this transform not done entirely in SelectionDAG?
And why would we want AArch-specific "tbl1" pattern-matching in the generic VectorUtils.h?

On 05/03/2020 00:35, Wei Zhao wrote:

Why not do this work in ISel?

In LLVM IR, we have shufflevector
In LLVM DAG, we have vector_shuffle

In ISel, the normal way to translate an LLVM IR shufflevector into a TBLn instruction is to translate it first to the DAG node vector_shuffle, and then, in the DAG Legalize() step, to lower it to machine-specific instructions such as TBLn.

There is a very strict and deeply rooted convention, or requirement, on how an LLVM IR shufflevector is translated to the DAG node vector_shuffle.

In the following example:
%v3 = shufflevector <8 x i8> %v1, <8 x i8> %v2, <8 x i32>
<i32 1, i32 0, i32 3, i32 2, i32 5, i32 4, i32 7, i32 6> ; yields <8 x i8>
%v3, %v1 and %v2 have to be the same type.

If they are not, ISel won't translate the shufflevector to the DAG node vector_shuffle, and it won't later be lowered into a TBLn instruction.
Instead, ISel (which has at most 9 steps in total) will lower it into a sequence of extract/insert instructions that form a new vector. In the above example, if the types did not match, %v3 would be generated by such a sequence of extract/insert instructions.

Because ISel is coded like this, it is very hard to break this rule and handle irregular (mismatched-type) shufflevectors in the ISel stage.

To my understanding, the type-match requirement on shufflevector dates back to the early days of LLVM, when it was used to generate code for Motorola AltiVec.

Because of all this, we decided to follow the example of Ldn/Stn generation in InterleavedAccessPass and generate the Tbl1 instruction directly from LLVM IR, before entering ISel.

You may be right that IR is the simpler place for this to be. If we could add it to ISel though, that would be the more natural place for it. And would stop us having to look at uses like this. ISel knows about all the types and lanes that are going on at the machine level. A lot of the times that these pre-isel passes have been added feel like a mistake to me. When there is masking involved it is sometimes really required, but dagcombine is good at taking many things into account, optimising many things all at once.

There is ReconstructShuffle and isShuffleMaskLegal, but they might not be applicable with the way that this example goes via a smaller legal vector. At one point it looks like we have:

        t5: v8i16,ch = load<(load 16 from %ir.scevgep238, align 2)> t0, t2, undef:i64
      t16: i32 = extract_vector_elt t5, Constant:i64<0>
      t17: i32 = extract_vector_elt t5, Constant:i64<4>
    t20: v2i32 = BUILD_VECTOR t16, t17
    t22: v2i32 = BUILD_VECTOR Constant:i32<65535>, Constant:i32<65535>
  t24: v2i32 = and t20, t22
t26: v2i64 = zero_extend t24

That going into and out of a v2i32/v4i32 is not going to be efficient. It may make sense to take sequences like this and flatten them to simpler code, potentially using the VTBL like you propose. (It might turn out this example is nothing like your actual code though, or that some cases are not worth optimising like that. The VTBL is a powerful instruction, but comes at a cost. We need to be careful not to overuse it where simpler code would be better.)

I'm not convinced either way, but needing to check the uses like that sounds off to me.

Wei's answer on 3/9/2020

ISel is a cascaded and complex process. Before the actual MC (machine code) selection, it usually goes through the following steps:
#1 Combine(1)
#2 legalize_types()
#3 Combine(lt)
#4 LegalizeVectors()
#5 legalize_types()
#6 Combine(lv)
#7 Legalize()
#8 Combine(2)
#9 Instruction selection begins

#4, #5, and #6 are executed only if there are vector instructions.
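
The step list above can be sketched as a small function that returns the phases in order. This follows the description in this comment only; it is not taken from the SelectionDAG sources, and the names `iselPhases` and `countCombineRuns` are made up for illustration.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Returns the phase names in the order listed in the comment above.
// Assumption (from the comment): steps #4, #5 and #6 only run when the
// DAG contains vector operations.
std::vector<std::string> iselPhases(bool HasVectorOps) {
  std::vector<std::string> Phases = {"Combine(1)", "legalize_types",
                                     "Combine(lt)"};
  if (HasVectorOps) {
    Phases.push_back("LegalizeVectors");
    Phases.push_back("legalize_types");
    Phases.push_back("Combine(lv)");
  }
  Phases.push_back("Legalize");
  Phases.push_back("Combine(2)");
  Phases.push_back("Instruction selection");
  return Phases;
}

// Counts how many of the phases are Combine() runs.
unsigned countCombineRuns(const std::vector<std::string> &Phases) {
  unsigned N = 0;
  for (const std::string &P : Phases)
    if (P.compare(0, 7, "Combine") == 0)
      ++N;
  return N;
}
```

With vector operations present this yields nine phases and four Combine() runs, which is the "at most 4 times" mentioned in the comment.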

Combine() is where DAG nodes are combined, optimized and adjusted to bring them close to the final MC. It is called at most 4 times, and each time it does almost the same thing to the DAG forest.

Currently, in your example above, #1 combine() can combine t16, t17 and t20 into a vector_shuffle DAG node if t5 and t20 are of the same type. The reason for this "same type" requirement is not the TBLn instructions themselves: TBLn has no concept of type, it only knows bytes. "Same type" is required here mainly to determine the output position. Once the types match, both the input and output positions can be worked out, and from them the TBL mask. For the mismatched-type situation, as in the example above, the current ISel leaves it to #7 Legalize() to translate the nodes to extract/insert so that they can be mapped to MC.

Your code sequence above is actually an intermediate result after #5 legalize_types().

t16/t17 should be i16 after #1 Combine(1). The Combine() process works bottom-up: after it visits t20 (BUILD_VECTOR), it then visits t17 and t16.
In #2 legalize_types(), it finds that t16/t17 are scalars; on AArch64 a scalar should be at least 32 bits, so it promotes them to 32-bit.

In #4 and #5, to accommodate v2i16->v2i32 it needs to insert an AND, and for v2i32->v2i64 it needs to insert a zero_extend.
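
To make the effect of the legalized sequence (t5 through t26 in the dump quoted earlier) concrete, here is a small software model of it. It is only a sketch of the data flow; the function name `extractMaskExtend` is hypothetical, and the model says nothing about the machine instructions that are eventually selected.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Software model of the legalized DAG: two extract_vector_elt results
// promoted to i32, a BUILD_VECTOR, an AND with 65535, then a zero_extend
// to v2i64.
std::array<uint64_t, 2> extractMaskExtend(const std::array<uint16_t, 8> &Wide) {
  // t16 / t17: extract lanes 0 and 4; the i16 scalars are promoted to i32.
  uint32_t T16 = Wide[0];
  uint32_t T17 = Wide[4];
  // t20 and t24: build the v2i32 vector and clear everything above the
  // low 16 bits, because the promoted upper bits are not guaranteed to be 0.
  std::array<uint32_t, 2> T24 = {T16 & 65535u, T17 & 65535u};
  // t26: zero-extend each lane to 64 bits.
  return {static_cast<uint64_t>(T24[0]), static_cast<uint64_t>(T24[1])};
}
```

The result is simply lanes 0 and 4 of the loaded vector, each widened to 64 bits, which is the value the shorter TBL-based sequence would also have to produce.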

So if we want to do the work in ISel, it has to be done in #1 combine(1); otherwise, the later passes add more instructions and type promotions, which make the change unprofitable or even impossible.

However, doing it in #1 combine(1) is hard. The DAG combine() pass is a bottom-up process, which makes combining three layers of nodes very difficult.

To combine this DAG (extract_vector_elt/extract_vector_elt/BUILD_VECTOR) we need to look at the BUILD_VECTOR's user instruction.

The two extract_vector_elt nodes give the input elements' information. We need the BUILD_VECTOR's user instruction to figure out its output position. This mirrors the TBLn instruction's mask, which contains both the input vector element position (the index number in each byte) and the place (the byte position) where it should be put in the mask (the output position).

However, when combine(1) visits the BUILD_VECTOR node, its user node has already been processed. You can find the BUILD_VECTOR's input nodes, but you cannot find its output nodes or user instructions. It might also have multiple user instructions. So the only way to solve this is to look at every possible DAG node (any of them could be a user of a BUILD_VECTOR node), check whether its parent node is a BUILD_VECTOR, and check whether its grandparent nodes are extract_vector_elt nodes. In this way, it needs to look at 3 layers of DAG nodes to decide whether they can be combined into one vector_shuffle DAG node (the TBLn equivalent). We cannot say this is not doable, but it is tedious and error-prone.
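
The three-layer lookup described above can be sketched on a toy DAG. The `Node` and `Opcode` types below are simplified stand-ins invented for this illustration; they are not the SelectionDAG classes, and the matcher only shows the shape of the check (user, then BUILD_VECTOR, then extract_vector_elt).

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical, heavily simplified DAG node (not llvm::SDNode).
enum class Opcode { Load, ExtractVectorElt, BuildVector, Other };

struct Node {
  Opcode Op;
  std::vector<const Node *> Operands;
  unsigned Lane; // only meaningful for ExtractVectorElt

  Node(Opcode O, std::vector<const Node *> Ops, unsigned L = 0)
      : Op(O), Operands(std::move(Ops)), Lane(L) {}
};

// Layers 2 and 3: a BUILD_VECTOR whose operands are all extract_vector_elt
// nodes reading from one common source. Fills Lanes on success.
bool matchBuildVectorOfExtracts(const Node *BV, std::vector<unsigned> &Lanes) {
  if (!BV || BV->Op != Opcode::BuildVector || BV->Operands.empty())
    return false;
  const Node *Source = nullptr;
  std::vector<unsigned> Found;
  for (const Node *Elt : BV->Operands) {
    if (!Elt || Elt->Op != Opcode::ExtractVectorElt ||
        Elt->Operands.size() != 1)
      return false;
    if (Source && Source != Elt->Operands[0])
      return false;
    Source = Elt->Operands[0];
    Found.push_back(Elt->Lane);
  }
  Lanes = std::move(Found);
  return true;
}

// Layer 1: start from a candidate user node and look down through its
// operands for the BUILD_VECTOR pattern.
bool matchThreeLayers(const Node &User, std::vector<unsigned> &Lanes) {
  for (const Node *Operand : User.Operands)
    if (matchBuildVectorOfExtracts(Operand, Lanes))
      return true;
  return false;
}
```

The matcher has to be attempted from every candidate user node, which is the cost the comment is pointing at.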

From the design philosophy of DAG lowering/combining, we think it is designed to deal with a local DAG node, or a DAG node together with its parent, not with a three-generation DAG region like this. In LLVM IR, by contrast, this becomes a very easy task: everything is linear there, and the shufflevector IR already has its input mask, so all we need to do is figure out the output position from its user instruction to get the complete information for the TBL byte mask.

END (Wei's answer on 3/9/2020)

Why look at shufflevector's user instruction?

Let's look at another example:
%wide.vec = load <8 x i16>, <8 x i16>* %scevgep238, align 2, !tbaa !8
%strided.vec = shufflevector <8 x i16> %wide.vec, <8 x i16> undef, <2 x i32> <i32 0, i32 4>
%2 = uitofp <2 x i16> %strided.vec to <2 x double>
Basically, it loads 128 bits of data into a V register and extracts the 0th and the 4th 16-bit elements to form a v2i16 vector. Note that %strided.vec has type v2i16 while the input vector has type v8i16; they are not the same. So, based on the above analysis, the current ISel lowers this into a sequence of extract/insert + shifting + masking instructions, whose cost is very high.

The type-match requirement is not baseless: when it is met, it helps to form a "complete" or "128-bit" vector with every byte defined, so that its user instruction can work on it.

In the above example, the result of the shufflevector is a v2i16 vector; it cannot be used directly by its user instruction uitofp without shifting and alignment according to the type <2 x double>.

So, by looking at the user instruction, we can figure out the corresponding Tbl1 instruction's mask. In this case, we know the first element of %strided.vec (v2i16) should have a byte offset of 0, while the 2nd element should have an offset of (2-1)*bitwidth(double). On AArch64, the TBLn instruction has no concept of type; it only knows bytes, and its mask is byte based. In the above example, we can use AArch64's Tbl1 instruction to place the v2i16 vector into a 128-bit register and avoid the high-cost extract/insert (and a few shifting/masking) instructions. This brings a huge performance gain.
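
The mask construction described above can be sketched as follows. This is an illustration under stated assumptions, not the code from the patch: it assumes a little-endian lane layout and a 16-byte register, and it fills unused mask bytes with 0xFF, an out-of-range index for which TBL writes zero. The names `buildTbl1Mask` and `tbl1` are hypothetical; `tbl1` is a plain software model of a single-register table lookup.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

using Bytes16 = std::array<uint8_t, 16>;

// Builds a byte mask that moves the i-th selected source element
// (SrcEltBytes wide) to byte offset i * DstEltBytes of the result.
// Every other result byte gets index 0xFF, which a table lookup with
// zeroing semantics turns into 0.
Bytes16 buildTbl1Mask(const std::vector<unsigned> &SrcIndices,
                      unsigned SrcEltBytes, unsigned DstEltBytes) {
  Bytes16 Mask;
  Mask.fill(0xFF);
  for (unsigned Lane = 0; Lane < SrcIndices.size(); ++Lane) {
    for (unsigned B = 0; B < SrcEltBytes; ++B) {
      unsigned Dst = Lane * DstEltBytes + B;
      unsigned Src = SrcIndices[Lane] * SrcEltBytes + B;
      assert(Dst < 16 && Src < 16 && "byte offset outside 128-bit register");
      Mask[Dst] = static_cast<uint8_t>(Src);
    }
  }
  return Mask;
}

// Software model of a one-register table lookup: each result byte is the
// table byte selected by the mask, or 0 when the index is out of range.
Bytes16 tbl1(const Bytes16 &Table, const Bytes16 &Mask) {
  Bytes16 Out;
  for (unsigned I = 0; I < 16; ++I)
    Out[I] = Mask[I] < 16 ? Table[Mask[I]] : 0;
  return Out;
}
```

For the example (source elements 0 and 4, 2-byte source elements, 8-byte destination lanes), `buildTbl1Mask({0, 4}, 2, 8)` selects source bytes 0 and 1 into result bytes 0 and 1, and source bytes 8 and 9 into result bytes 8 and 9. The second element therefore lands at byte offset 8, which is (2-1)*bitwidth(double) expressed in bytes.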

I see. You are essentially looking for a <2 x double>, so that you know the lanes the result will need to end up in.

Am I right in saying that in that example above this would be the same as (and (load v8i16, 0x000000000000ffff000000000000ffff)). Because the lanes are already in the correct place, they just need to be masked off? In other cases they would need to be shifted too (so a single vtbl may do better).
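
The observation about masking can be checked with a short sketch. It tests whether each selected source element already sits at the byte offset of its destination lane, in which case an AND is enough, and models that AND on a 128-bit value held as two 64-bit halves. Both function names are hypothetical, and a little-endian lane layout is assumed.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// True when every selected source element already starts at the byte
// offset of its destination lane, so no byte movement is needed.
bool lanesAlreadyInPlace(const std::vector<unsigned> &SrcIndices,
                         unsigned SrcEltBytes, unsigned DstEltBytes) {
  for (unsigned Lane = 0; Lane < SrcIndices.size(); ++Lane)
    if (SrcIndices[Lane] * SrcEltBytes != Lane * DstEltBytes)
      return false;
  return true;
}

// Models the AND with 0x000000000000ffff000000000000ffff on a 128-bit
// register stored as a low and a high 64-bit half.
void keepLow16OfEachHalf(uint64_t &Lo, uint64_t &Hi) {
  Lo &= 0xffffULL;
  Hi &= 0xffffULL;
}
```

With indices 0 and 4, 2-byte source elements and 8-byte destination lanes, the check succeeds (0*2 == 0*8 and 4*2 == 1*8). With indices 1 and 5 it fails, which is the case where the bytes would also have to be moved.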

Another point for ISel would be that it knows these types. Not sure if it would be better or just as awkward.

Wei's answer on 3/9/2020

Explained above.

END (Wei's answer on 3/9/2020)

Cost model

Because shufflevector on mismatched-type vectors is usually lowered into a sequence of extract/insert instructions, the LLVM loop vectorizer assigns a very high cost to interleaved memory accesses, based on this extract/insert approach to forming a vector. In the above example, forming a v2i16 vector from a v8i16 vector with VF=2 and UF=4 costs 34 for this interleaved memory access. Such a high cost kills the loop's vectorization, as most other instructions usually cost 0 or 1. Without the Tbl1 instruction to form a new vector at low cost, many loops will not be vectorized.
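
The shape of an extract/insert based cost can be sketched as below. The unit costs are placeholder inputs rather than values from any target, so this sketch does not attempt to reproduce the figure of 34 quoted above. The struct `UnitCosts` and the function `extractInsertLoadCost` are hypothetical names.

```cpp
#include <cassert>
#include <vector>

// Hypothetical per-operation costs; real values come from the target.
struct UnitCosts {
  unsigned WideLoad; // cost of loading the whole interleaved vector
  unsigned Extract;  // cost of one element extract
  unsigned Insert;   // cost of one element insert
};

// Cost of an interleaved load when each requested member is rebuilt
// element by element: one wide load, then one extract and one insert per
// element of every requested member.
unsigned extractInsertLoadCost(unsigned NumWideElts, unsigned Factor,
                               const std::vector<unsigned> &Indices,
                               const UnitCosts &C) {
  assert(Factor != 0 && NumWideElts % Factor == 0 &&
         "wide vector must hold whole interleave groups");
  unsigned NumSubElts = NumWideElts / Factor;
  unsigned NumMembers = static_cast<unsigned>(Indices.size());
  unsigned Cost = C.WideLoad;
  Cost += NumMembers * NumSubElts * C.Extract;
  Cost += NumMembers * NumSubElts * C.Insert;
  return Cost;
}
```

The cost grows with both the number of requested members and the sub-vector length, which is why it quickly dwarfs the cost of the surrounding arithmetic.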

On Marvell's ThunderX2 T99, Tbl1 takes 5 cycles to finish, which is almost the same as an fadd/fmul, while Tbl2/3/4 take more cycles.

Tbl1 definitely costs far fewer cycles than a sequence of extract/insert instructions to form a vector. This is the motivation for our work.

Yep. I was thinking that the VTBL would cost 1, but that doesn't also account for the cost of the load. (The VTBL is also probably 3 instructions if I'm understanding correctly, an adr, a load and the vtbl. Plus a constant pool block. But the adr and load should be pulled up out of the loop, making them free for the vectorizer's cost model. They just end up taking up an extra register).

I think it's worth considering this as at least 2 separate parts, one that gets the backend to lower to the better sequences of instructions (through this or through ISel if it would work), and the other that gets the vectorizer to produce the code we want (by adjusting the cost model).

Both patches need tests to show that they are behaving correctly. Those tests might also shed more light on the best way forward here.

Looks very interesting,
Dave

Wei's answer on 3/9/2020

The current LLVM loop vectorizer cost model assumes that, for interleaved memory access, there is only one way to form an interleaved vector: the extract/insert model, which extracts scalars from a vector and inserts some of them back to form a new vector. The cost model basically mimics this process to calculate the cost. There is nothing wrong with that, but the cost is usually very high, because forming a vector needs a whole sequence of extract/insert operations.

With our work, when certain conditions are met, the interleaved vector can be formed by a Tbl1 instruction. So under those conditions the cost model should return a different, usually much lower, cost to the loop vectorizer, and let the loop vectorizer decide whether vectorizing the loop is profitable.

Making the loop vectorizer's cost model aware of the Tbl1 instruction is one part of the work. Otherwise, the loop vectorizer makes its decision assuming there is no Tbl1 instruction to form an interleaved vector.
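
A Tbl1-aware cost query can be pictured as a simple choice between two costs. This is a sketch of the idea only: the conditions under which Tbl1 applies are reduced to a single boolean, and the function name `interleavedLoadCost` and all numeric inputs are hypothetical.

```cpp
#include <cassert>

// Returns the cost reported to the vectorizer for one interleaved load.
// When the Tbl1 lowering applies, the cost is the wide load plus one table
// lookup; otherwise the full extract/insert cost is reported.
unsigned interleavedLoadCost(bool CanUseTbl1, unsigned WideLoadCost,
                             unsigned Tbl1Cost, unsigned ExtractInsertCost) {
  if (CanUseTbl1)
    return WideLoadCost + Tbl1Cost;
  return ExtractInsertCost;
}
```

The vectorizer still makes the final profitability decision; the only change is which number it is given.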

END (Wei's answer on 3/9/2020)

Wei Zhao
-----Original Message-----
From: Dave Green via Phabricator <reviews@reviews.llvm.org>
Sent: Sunday, March 1, 2020 7:46 AM
To: Wei Zhao <wxz@marvell.com>; mail@justinbogner.com; dorit.nuzman@intel.com
Cc: david.green@arm.com; joelkevinjones@gmail.com; kristof.beyls@arm.com; hiraditya@msn.com; llvm-commits@lists.llvm.org; t.p.northover@gmail.com; mcrosier@codeaurora.org; florian_hahn@apple.com; simon.moll@emea.nec.com; daniel.kiss@arm.com
Subject: [EXT] [PATCH] D75388: Add a pass to identify certain shuffle_vector and transform it into target specific intrinsics.

dmgreen added a comment.

Sanjay Patel: The title of this patch confused me. This is not adding a pass; it is extending existing passes.
WZ: Agree. It is not a new LLVM pass. The work is piggybacked on an existing pass. I will change the title.

Sanjay Patel: I did not see an answer to @dmgreen's questions - maybe the biggest one: why is this transform not done entirely in SelectionDAG?
WZ: The answer is in the above comment.

Sanjay Patel: And why would we want AArch-specific "tbl1" pattern-matching in the generic VectorUtils.h?
WZ: Good point! This is an AArch64-specific optimization. The mask is based on AArch64's Tbl1 instruction mask definition, so it should be in a target-specific place. I will work out a fix and upload it.

Thanks!

wxz2020 retitled this revision from "Add a pass to identify certain shuffle_vector and transform it into target specific intrinsics." to "Expand interleaved memory access pass to identify certain shuffle_vector and transform it into target specific intrinsics." Mar 18 2020, 2:15 PM

I'm still not convinced that this shouldn't be done in ISel. There's nothing cross-block going on, so this is what ISel is designed for. It might make sense not to think of this as trying to convert vector_shuffle to something else, but instead trying to convert what vector_shuffle has turned into into something more optimal. In that one case I looked at, there was something like a (v4i32 (ext (v2i32 (buildvector (v4i32,..))). We get this way because a v2i16 was legalised to a v2i32, but everything around it was a v4i32. Can we "flatten" the ext into a single BUILDVECTOR? I have not had time to see if that is or isn't possible, but it sounds more sensible than very special case pre-isel legalisation for certain shuffle_vector's.

Also, as I said before:

This should be 2 separate patches.
They both need (lots of) tests. :)

Move the creation of the Tbl1 instruction's mask from LLVM generic code to AArch64 target specific code.

Harbormaster failed remote builds in B51045: Diff 253724! Mar 30 2020, 4:57 PM

fix clang-format issue.

Harbormaster failed remote builds in B51132: Diff 253886! Mar 31 2020, 8:16 AM

minor format change

Harbormaster failed remote builds in B51151: Diff 253923! Mar 31 2020, 10:53 AM

fix the warning.

Harbormaster failed remote builds in B51162: Diff 253942! Mar 31 2020, 12:33 PM

Patch the function interface changes in other targets to avoid build failure for ARM, Hexagon, PowerPC, SystemZ and X86.

Herald added a reviewer: ctetreau. Apr 15 2020, 8:58 AM

Herald added subscribers: kbarton, nemanjai.

Harbormaster failed remote builds in B53379: Diff 257744! Apr 15 2020, 9:17 AM

Herald added a subscriber: wuzish. Apr 15 2020, 9:17 AM

RKSimon added inline comments. Apr 15 2020, 9:41 AM

llvm/include/llvm/Analysis/TargetTransformInfo.h
986	Since you're changing the function signature anyway would it make sense to change this to a VectorType *VecTy? There's been chatter on other patches about making this jump in general as many of the TTI calls expect a vector anyhow.

wxz2020 added inline comments. Apr 15 2020, 9:56 AM

llvm/include/llvm/Analysis/TargetTransformInfo.h
986	Can you give me an example about this?

RKSimon added inline comments. Apr 15 2020, 11:41 AM

llvm/include/llvm/Analysis/TargetTransformInfo.h
986	@ctetreau's work on cleaning up the vector getters e.g. D77264 hits this a lot

ctetreau added inline comments. Apr 15 2020, 2:09 PM

llvm/include/llvm/Analysis/TargetTransformInfo.h
986	I'm working on a major cleanup and refactor of VectorType. As part of this, getVectorNumElements, getVectorElementCount, getVectorElementType, and getVectorIsScalable are all being removed. If the VecTy argument to this function is expected to be an instance of VectorType, then it should take a pointer to VectorType. Then, in the body, code can directly call the methods of VectorType without using the base Type asserting getters.
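
The point about taking a VectorType pointer can be illustrated with mock classes. The `Type` and `VectorType` classes below are minimal stand-ins written for this sketch and are not the LLVM classes; only the contrast between the two function signatures matters.

```cpp
#include <cassert>

// Mock stand-ins for illustration only (not llvm::Type / llvm::VectorType).
class Type {
public:
  virtual ~Type() = default;
  virtual bool isVector() const { return false; }
};

class VectorType : public Type {
  unsigned NumElts;

public:
  explicit VectorType(unsigned N) : NumElts(N) {}
  bool isVector() const override { return true; }
  unsigned getNumElements() const { return NumElts; }
};

// Base-pointer version: the body has to assert and cast before it can ask
// for the element count.
unsigned subVectorEltsFromType(const Type *VecTy, unsigned Factor) {
  assert(VecTy && VecTy->isVector() && "expected a vector type");
  return static_cast<const VectorType *>(VecTy)->getNumElements() / Factor;
}

// VectorType-pointer version: the requirement is part of the signature, so
// the body can call the vector methods directly.
unsigned subVectorEltsFromVectorType(const VectorType *VecTy, unsigned Factor) {
  return VecTy->getNumElements() / Factor;
}
```

Both functions return the same value for a vector argument; the second one moves the "must be a vector" check from a run-time assertion to the type system.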

Fix the precheck errors.

Harbormaster failed remote builds in B53447: Diff 257879! Apr 15 2020, 4:02 PM

RKSimon added a subscriber: samparker. Apr 17 2020, 10:37 AM

RKSimon added inline comments.

llvm/include/llvm/Analysis/TargetTransformInfo.h
986	@samparker has examples of the VectorType change in D78357

Merged with the current code.

Harbormaster failed remote builds in B54403: Diff 259577! Apr 23 2020, 9:43 AM

Fix a few format problems.

Harbormaster failed remote builds in B54498: Diff 259724! Apr 23 2020, 5:25 PM

Another missed format issue. Fixed.

steleman added a subscriber: steleman. Jun 8 2020, 5:02 PM

@wxz2020 Please can you provide a range of real test cases of what you're trying to achieve?

As several reviewers have said now, this looks like it should be handled in isel and actual tests would help explain any difficulty you have getting it to work there. The one example you mention in your intro looks pretty trivial to handle in selectiondag tbh.

Herald added a subscriber: bmahjour. Jun 18 2020, 2:05 AM

Sure, I will upload them soon.

bmahjour removed a subscriber: bmahjour. Jun 18 2020, 6:56 AM

Revision Contents

Path                                                      Size
llvm/include/llvm/Analysis/TargetTransformInfo.h          25 lines
llvm/include/llvm/Analysis/TargetTransformInfoImpl.h      3 lines
llvm/include/llvm/CodeGen/BasicTTIImpl.h                  3 lines
llvm/include/llvm/CodeGen/TargetLowering.h                6 lines
llvm/include/llvm/IR/IntrinsicsAArch64.td                 4 lines
llvm/lib/Analysis/TargetTransformInfo.cpp                 14 lines
llvm/lib/CodeGen/InterleavedAccessPass.cpp                29 lines
llvm/lib/CodeGen/InterleavedLoadCombinePass.cpp           4 lines
llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp           13 lines
llvm/lib/Target/AArch64/AArch64ISelLowering.h             9 lines
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp           165 lines
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h      3 lines
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp    40 lines
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp           2 lines

Diff 253942

llvm/include/llvm/Analysis/TargetTransformInfo.h

@@ 976 lines hidden @@ public:
   /// \p VecTy is the vector type of the interleaved access.
   /// \p Factor is the interleave factor
   /// \p Indices is the indices for interleaved load members (as interleaved
   /// load allows gaps)
   /// \p Alignment is the alignment of the memory operation
   /// \p AddressSpace is address space of the pointer.
   /// \p UseMaskForCond indicates if the memory access is predicated.
   /// \p UseMaskForGaps indicates if gaps should be masked.
-  int getInterleavedMemoryOpCost(unsigned Opcode, Type *VecTy, unsigned Factor,
+  int getInterleavedMemoryOpCost(Instruction *I, unsigned VF, unsigned Opcode,
+                                 Type *VecTy, unsigned Factor,
                                  ArrayRef<unsigned> Indices, unsigned Alignment,
                                  unsigned AddressSpace,
                                  bool UseMaskForCond = false,
                                  bool UseMaskForGaps = false) const;

   /// Calculate the cost of performing a vector reduction.
   ///
   /// This is the cost of reducing the vector value of type \p Ty to a scalar
@@ 362 lines hidden @@ virtual int getCmpSelInstrCost(unsigned Opcode, Type *ValTy,
                                  Type *CondTy, const Instruction *I) = 0;
   virtual int getVectorInstrCost(unsigned Opcode, Type *Val,
                                  unsigned Index) = 0;
   virtual int getMemoryOpCost(unsigned Opcode, Type *Src, MaybeAlign Alignment,
                               unsigned AddressSpace, const Instruction *I) = 0;
   virtual int getMaskedMemoryOpCost(unsigned Opcode, Type *Src,
                                     unsigned Alignment,
                                     unsigned AddressSpace) = 0;
   virtual int getGatherScatterOpCost(unsigned Opcode, Type *DataTy, Value *Ptr,
                                      bool VariableMask, unsigned Alignment,
                                      const Instruction *I = nullptr) = 0;
-  virtual int getInterleavedMemoryOpCost(unsigned Opcode, Type *VecTy,
-                                         unsigned Factor,
-                                         ArrayRef<unsigned> Indices,
-                                         unsigned Alignment,
-                                         unsigned AddressSpace,
-                                         bool UseMaskForCond = false,
+  virtual int
+  getInterleavedMemoryOpCost(Instruction *I, unsigned VF, unsigned Opcode,
+                             Type *VecTy, unsigned Factor,
+                             ArrayRef<unsigned> Indices, unsigned Alignment,
+                             unsigned AddressSpace, bool UseMaskForCond = false,
                              bool UseMaskForGaps = false) = 0;
   virtual int getArithmeticReductionCost(unsigned Opcode, Type *Ty,
                                          bool IsPairwiseForm) = 0;
   virtual int getMinMaxReductionCost(Type *Ty, Type *CondTy,
                                      bool IsPairwiseForm, bool IsUnsigned) = 0;
   virtual int getIntrinsicInstrCost(Intrinsic::ID ID, Type *RetTy,
                                     ArrayRef<Type *> Tys, FastMathFlags FMF,
                                     unsigned ScalarizationCostPassed,
                                     const Instruction *I) = 0;
@@ 406 lines hidden @@ int getMaskedMemoryOpCost(unsigned Opcode, Type *Src, unsigned Alignment,
     return Impl.getMaskedMemoryOpCost(Opcode, Src, Alignment, AddressSpace);
   }
   int getGatherScatterOpCost(unsigned Opcode, Type *DataTy, Value *Ptr,
                              bool VariableMask, unsigned Alignment,
                              const Instruction *I = nullptr) override {
     return Impl.getGatherScatterOpCost(Opcode, DataTy, Ptr, VariableMask,
                                        Alignment, I);
   }
-  int getInterleavedMemoryOpCost(unsigned Opcode, Type *VecTy, unsigned Factor,
+  int getInterleavedMemoryOpCost(Instruction *I, unsigned VF, unsigned Opcode,
+                                 Type *VecTy, unsigned Factor,
                                  ArrayRef<unsigned> Indices, unsigned Alignment,
                                  unsigned AddressSpace, bool UseMaskForCond,
                                  bool UseMaskForGaps) override {
-    return Impl.getInterleavedMemoryOpCost(Opcode, VecTy, Factor, Indices,
-                                           Alignment, AddressSpace,
+    return Impl.getInterleavedMemoryOpCost(I, VF, Opcode, VecTy, Factor,
+                                           Indices, Alignment, AddressSpace,
                                            UseMaskForCond, UseMaskForGaps);
   }
   int getArithmeticReductionCost(unsigned Opcode, Type *Ty,
                                  bool IsPairwiseForm) override {
     return Impl.getArithmeticReductionCost(Opcode, Ty, IsPairwiseForm);
   }
   int getMinMaxReductionCost(Type *Ty, Type *CondTy,
                              bool IsPairwiseForm, bool IsUnsigned) override {
@@ 225 lines hidden @@

llvm/include/llvm/Analysis/TargetTransformInfoImpl.h

@@ 480 lines hidden @@ public:
   }

   unsigned getGatherScatterOpCost(unsigned Opcode, Type *DataTy, Value *Ptr,
                                   bool VariableMask, unsigned Alignment,
                                   const Instruction *I = nullptr) {
     return 1;
   }

-  unsigned getInterleavedMemoryOpCost(unsigned Opcode, Type *VecTy,
+  unsigned getInterleavedMemoryOpCost(Instruction *I, unsigned VF,
+                                      unsigned Opcode, Type *VecTy,
                                       unsigned Factor,
                                       ArrayRef<unsigned> Indices,
                                       unsigned Alignment, unsigned AddressSpace,
                                       bool UseMaskForCond = false,
                                       bool UseMaskForGaps = false) {
     return 1;
   }

@@ 458 lines hidden @@

llvm/include/llvm/CodeGen/BasicTTIImpl.h

Show First 20 Lines • Show All 913 Lines • ▼ Show 20 Lines	if (Src->isVectorTy() &&
Cost += getScalarizationOverhead(Src, Opcode != Instruction::Store,		Cost += getScalarizationOverhead(Src, Opcode != Instruction::Store,
Opcode == Instruction::Store);		Opcode == Instruction::Store);
}		}
}		}

return Cost;		return Cost;
}		}

unsigned getInterleavedMemoryOpCost(unsigned Opcode, Type *VecTy,		unsigned getInterleavedMemoryOpCost(Instruction *I, unsigned VF,
		unsigned Opcode, Type *VecTy,
unsigned Factor,		unsigned Factor,
ArrayRef<unsigned> Indices,		ArrayRef<unsigned> Indices,
unsigned Alignment, unsigned AddressSpace,		unsigned Alignment, unsigned AddressSpace,
bool UseMaskForCond = false,		bool UseMaskForCond = false,
bool UseMaskForGaps = false) {		bool UseMaskForGaps = false) {
VectorType *VT = dyn_cast<VectorType>(VecTy);		VectorType *VT = dyn_cast<VectorType>(VecTy);
assert(VT && "Expect a vector type for interleaved memory op");		assert(VT && "Expect a vector type for interleaved memory op");

▲ Show 20 Lines • Show All 842 Lines • Show Last 20 Lines

llvm/include/llvm/CodeGen/TargetLowering.h

Show First 20 Lines • Show All 2,558 Lines • ▼ Show 20 Lines	public:
/// \p SI is the vector store instruction.		/// \p SI is the vector store instruction.
/// \p SVI is the shufflevector to RE-interleave the stored vector.		/// \p SVI is the shufflevector to RE-interleave the stored vector.
/// \p Factor is the interleave factor.		/// \p Factor is the interleave factor.
virtual bool lowerInterleavedStore(StoreInst *SI, ShuffleVectorInst *SVI,		virtual bool lowerInterleavedStore(StoreInst *SI, ShuffleVectorInst *SVI,
unsigned Factor) const {		unsigned Factor) const {
return false;		return false;
}		}

		/// Lower a shufflevector to target specific intrinsics. Return
		/// true on success.
		///
		/// \p SI is the shufflevector instruction to lower.
		virtual bool lowerShuffleVector(ShuffleVectorInst *SI) const { return false; }

/// Return true if zero-extending the specific node Val to type VT2 is free		/// Return true if zero-extending the specific node Val to type VT2 is free
/// (either because it's implicitly zero-extended such as ARM ldrb / ldrh or		/// (either because it's implicitly zero-extended such as ARM ldrb / ldrh or
/// because it's folded such as X86 zero-extending loads).		/// because it's folded such as X86 zero-extending loads).
virtual bool isZExtFree(SDValue Val, EVT VT2) const {		virtual bool isZExtFree(SDValue Val, EVT VT2) const {
return isZExtFree(Val.getValueType(), VT2);		return isZExtFree(Val.getValueType(), VT2);
}		}

/// Return true if an fpext operation is free (for instance, because		/// Return true if an fpext operation is free (for instance, because
▲ Show 20 Lines • Show All 1,836 Lines • Show Last 20 Lines

llvm/include/llvm/IR/IntrinsicsAArch64.td

Show First 20 Lines • Show All 564 Lines • ▼ Show 20 Lines
def int_aarch64_neon_st3 : AdvSIMD_3Vec_Store_Intrinsic;		def int_aarch64_neon_st3 : AdvSIMD_3Vec_Store_Intrinsic;
def int_aarch64_neon_st4 : AdvSIMD_4Vec_Store_Intrinsic;		def int_aarch64_neon_st4 : AdvSIMD_4Vec_Store_Intrinsic;

def int_aarch64_neon_st2lane : AdvSIMD_2Vec_Store_Lane_Intrinsic;		def int_aarch64_neon_st2lane : AdvSIMD_2Vec_Store_Lane_Intrinsic;
def int_aarch64_neon_st3lane : AdvSIMD_3Vec_Store_Lane_Intrinsic;		def int_aarch64_neon_st3lane : AdvSIMD_3Vec_Store_Lane_Intrinsic;
def int_aarch64_neon_st4lane : AdvSIMD_4Vec_Store_Lane_Intrinsic;		def int_aarch64_neon_st4lane : AdvSIMD_4Vec_Store_Lane_Intrinsic;

let TargetPrefix = "aarch64" in { // All intrinsics start with "llvm.aarch64.".		let TargetPrefix = "aarch64" in { // All intrinsics start with "llvm.aarch64.".
		class AdvSIMD_Tbl1_temp_Intrinsic
		: Intrinsic<[llvm_anyvector_ty], [llvm_anyvector_ty, llvm_v16i8_ty],
		[IntrNoMem]>;
class AdvSIMD_Tbl1_Intrinsic		class AdvSIMD_Tbl1_Intrinsic
: Intrinsic<[llvm_anyvector_ty], [llvm_v16i8_ty, LLVMMatchType<0>],		: Intrinsic<[llvm_anyvector_ty], [llvm_v16i8_ty, LLVMMatchType<0>],
[IntrNoMem]>;		[IntrNoMem]>;
class AdvSIMD_Tbl2_Intrinsic		class AdvSIMD_Tbl2_Intrinsic
: Intrinsic<[llvm_anyvector_ty],		: Intrinsic<[llvm_anyvector_ty],
[llvm_v16i8_ty, llvm_v16i8_ty, LLVMMatchType<0>], [IntrNoMem]>;		[llvm_v16i8_ty, llvm_v16i8_ty, LLVMMatchType<0>], [IntrNoMem]>;
class AdvSIMD_Tbl3_Intrinsic		class AdvSIMD_Tbl3_Intrinsic
: Intrinsic<[llvm_anyvector_ty],		: Intrinsic<[llvm_anyvector_ty],
Show All 25 Lines	: Intrinsic<[llvm_anyvector_ty],
[LLVMMatchType<0>, llvm_v16i8_ty, llvm_v16i8_ty,		[LLVMMatchType<0>, llvm_v16i8_ty, llvm_v16i8_ty,
llvm_v16i8_ty, llvm_v16i8_ty, LLVMMatchType<0>],		llvm_v16i8_ty, llvm_v16i8_ty, LLVMMatchType<0>],
[IntrNoMem]>;		[IntrNoMem]>;
}		}
def int_aarch64_neon_tbl1 : AdvSIMD_Tbl1_Intrinsic;		def int_aarch64_neon_tbl1 : AdvSIMD_Tbl1_Intrinsic;
def int_aarch64_neon_tbl2 : AdvSIMD_Tbl2_Intrinsic;		def int_aarch64_neon_tbl2 : AdvSIMD_Tbl2_Intrinsic;
def int_aarch64_neon_tbl3 : AdvSIMD_Tbl3_Intrinsic;		def int_aarch64_neon_tbl3 : AdvSIMD_Tbl3_Intrinsic;
def int_aarch64_neon_tbl4 : AdvSIMD_Tbl4_Intrinsic;		def int_aarch64_neon_tbl4 : AdvSIMD_Tbl4_Intrinsic;
		def int_aarch64_neon_tbl1_temp : AdvSIMD_Tbl1_temp_Intrinsic;

def int_aarch64_neon_tbx1 : AdvSIMD_Tbx1_Intrinsic;		def int_aarch64_neon_tbx1 : AdvSIMD_Tbx1_Intrinsic;
def int_aarch64_neon_tbx2 : AdvSIMD_Tbx2_Intrinsic;		def int_aarch64_neon_tbx2 : AdvSIMD_Tbx2_Intrinsic;
def int_aarch64_neon_tbx3 : AdvSIMD_Tbx3_Intrinsic;		def int_aarch64_neon_tbx3 : AdvSIMD_Tbx3_Intrinsic;
def int_aarch64_neon_tbx4 : AdvSIMD_Tbx4_Intrinsic;		def int_aarch64_neon_tbx4 : AdvSIMD_Tbx4_Intrinsic;

let TargetPrefix = "aarch64" in {		let TargetPrefix = "aarch64" in {
class FPCR_Get_Intrinsic		class FPCR_Get_Intrinsic
▲ Show 20 Lines • Show All 1,645 Lines • Show Last 20 Lines

llvm/lib/Analysis/TargetTransformInfo.cpp

Show First 20 Lines • Show All 677 Lines • ▼ Show 20 Lines	int TargetTransformInfo::getGatherScatterOpCost(unsigned Opcode, Type *DataTy,
const Instruction *I) const {		const Instruction *I) const {
int Cost = TTIImpl->getGatherScatterOpCost(Opcode, DataTy, Ptr, VariableMask,		int Cost = TTIImpl->getGatherScatterOpCost(Opcode, DataTy, Ptr, VariableMask,
Alignment, I);		Alignment, I);
assert(Cost >= 0 && "TTI should not produce negative costs!");		assert(Cost >= 0 && "TTI should not produce negative costs!");
return Cost;		return Cost;
}		}

int TargetTransformInfo::getInterleavedMemoryOpCost(		int TargetTransformInfo::getInterleavedMemoryOpCost(
unsigned Opcode, Type *VecTy, unsigned Factor, ArrayRef<unsigned> Indices,		Instruction *I, unsigned VF, unsigned Opcode, Type *VecTy, unsigned Factor,
unsigned Alignment, unsigned AddressSpace, bool UseMaskForCond,		ArrayRef<unsigned> Indices, unsigned Alignment, unsigned AddressSpace,
bool UseMaskForGaps) const {		bool UseMaskForCond, bool UseMaskForGaps) const {
int Cost = TTIImpl->getInterleavedMemoryOpCost(Opcode, VecTy, Factor, Indices,		int Cost = TTIImpl->getInterleavedMemoryOpCost(
Alignment, AddressSpace,		I, VF, Opcode, VecTy, Factor, Indices, Alignment, AddressSpace,
UseMaskForCond,		UseMaskForCond, UseMaskForGaps);
UseMaskForGaps);
assert(Cost >= 0 && "TTI should not produce negative costs!");		assert(Cost >= 0 && "TTI should not produce negative costs!");
return Cost;		return Cost;
}		}

int TargetTransformInfo::getIntrinsicInstrCost(Intrinsic::ID ID, Type *RetTy,		int TargetTransformInfo::getIntrinsicInstrCost(Intrinsic::ID ID, Type *RetTy,
ArrayRef<Type *> Tys,		ArrayRef<Type *> Tys,
FastMathFlags FMF,		FastMathFlags FMF,
unsigned ScalarizationCostPassed,		unsigned ScalarizationCostPassed,
▲ Show 20 Lines • Show All 712 Lines • Show Last 20 Lines

llvm/lib/CodeGen/InterleavedAccessPass.cpp

Show First 20 Lines • Show All 106 Lines • ▼ Show 20 Lines	private:
/// Transform an interleaved load into target specific intrinsics.		/// Transform an interleaved load into target specific intrinsics.
bool lowerInterleavedLoad(LoadInst *LI,		bool lowerInterleavedLoad(LoadInst *LI,
SmallVector<Instruction *, 32> &DeadInsts);		SmallVector<Instruction *, 32> &DeadInsts);

/// Transform an interleaved store into target specific intrinsics.		/// Transform an interleaved store into target specific intrinsics.
bool lowerInterleavedStore(StoreInst *SI,		bool lowerInterleavedStore(StoreInst *SI,
SmallVector<Instruction *, 32> &DeadInsts);		SmallVector<Instruction *, 32> &DeadInsts);

		/// Transform a shufflevector whose type does not match its user's result
		/// type into target specific intrinsics.
		bool lowerShuffleVector(ShuffleVectorInst *SI,
		SmallVector<Instruction *, 32> &DeadInsts);

/// Returns true if the uses of an interleaved load by the		/// Returns true if the uses of an interleaved load by the
/// extractelement instructions in \p Extracts can be replaced by uses of the		/// extractelement instructions in \p Extracts can be replaced by uses of the
/// shufflevector instructions in \p Shuffles instead. If so, the necessary		/// shufflevector instructions in \p Shuffles instead. If so, the necessary
/// replacements are also performed.		/// replacements are also performed.
bool tryReplaceExtracts(ArrayRef<ExtractElementInst *> Extracts,		bool tryReplaceExtracts(ArrayRef<ExtractElementInst *> Extracts,
ArrayRef<ShuffleVectorInst *> Shuffles);		ArrayRef<ShuffleVectorInst *> Shuffles);
};		};

▲ Show 20 Lines • Show All 314 Lines • ▼ Show 20 Lines	if (!TLI->lowerInterleavedStore(SI, SVI, Factor))
return false;		return false;

// Already have a new target specific interleaved store. Erase the old store.		// Already have a new target specific interleaved store. Erase the old store.
DeadInsts.push_back(SI);		DeadInsts.push_back(SI);
DeadInsts.push_back(SVI);		DeadInsts.push_back(SVI);
return true;		return true;
}		}

		bool InterleavedAccess::lowerShuffleVector(
		ShuffleVectorInst *SI, SmallVector<Instruction *, 32> &DeadInsts) {

		LLVM_DEBUG(dbgs() << "IA: Found a shufflevector: " << *SI << "\n");

		// Try to create target specific intrinsics to replace the shuffle.
		if (!TLI->lowerShuffleVector(SI))
		return false;

		xbolva00: Missing word?
		wxz2020 (author): Will remove it. Wrong cut and paste.
		// Already have a new target specific tbl instruction. Erase the old
		// shufflevector.
		DeadInsts.push_back(SI);

		return true;
		}

bool InterleavedAccess::runOnFunction(Function &F) {		bool InterleavedAccess::runOnFunction(Function &F) {
auto *TPC = getAnalysisIfAvailable<TargetPassConfig>();		auto *TPC = getAnalysisIfAvailable<TargetPassConfig>();
if (!TPC || !LowerInterleavedAccesses)		if (!TPC || !LowerInterleavedAccesses)
return false;		return false;

LLVM_DEBUG(dbgs() << "*** " << getPassName() << ": " << F.getName() << "\n");		LLVM_DEBUG(dbgs() << "*** " << getPassName() << ": " << F.getName() << "\n");

DT = &getAnalysis<DominatorTreeWrapperPass>().getDomTree();		DT = &getAnalysis<DominatorTreeWrapperPass>().getDomTree();
Show All 11 Lines	for (auto &I : instructions(F)) {

if (StoreInst *SI = dyn_cast<StoreInst>(&I))		if (StoreInst *SI = dyn_cast<StoreInst>(&I))
Changed |= lowerInterleavedStore(SI, DeadInsts);		Changed |= lowerInterleavedStore(SI, DeadInsts);
}		}

for (auto I : DeadInsts)		for (auto I : DeadInsts)
I->eraseFromParent();		I->eraseFromParent();

		SmallVector<Instruction *, 32> SFDeadInsts;
		for (auto &I : instructions(F)) {
		if (ShuffleVectorInst *SHI = dyn_cast<ShuffleVectorInst>(&I))
		Changed |= lowerShuffleVector(SHI, SFDeadInsts);
		}

		for (auto *I : SFDeadInsts)
		I->eraseFromParent();

return Changed;		return Changed;
}		}

llvm/lib/CodeGen/InterleavedLoadCombinePass.cpp

Show First 20 Lines • Show All 1,201 Lines • ▼ Show 20 Lines	bool InterleavedLoadCombineImpl::combine(std::list<VectorInfo> &InterleavedLoad,
unsigned ElementsPerSVI =		unsigned ElementsPerSVI =
InterleavedLoad.front().SVI->getType()->getNumElements();		InterleavedLoad.front().SVI->getType()->getNumElements();
VectorType *ILTy = VectorType::get(ETy, Factor * ElementsPerSVI);		VectorType *ILTy = VectorType::get(ETy, Factor * ElementsPerSVI);

SmallVector<unsigned, 4> Indices;		SmallVector<unsigned, 4> Indices;
for (unsigned i = 0; i < Factor; i++)		for (unsigned i = 0; i < Factor; i++)
Indices.push_back(i);		Indices.push_back(i);
InterleavedCost = TTI.getInterleavedMemoryOpCost(		InterleavedCost = TTI.getInterleavedMemoryOpCost(
Instruction::Load, ILTy, Factor, Indices, InsertionPoint->getAlignment(),		nullptr, 0, Instruction::Load, ILTy, Factor, Indices,
InsertionPoint->getPointerAddressSpace());		InsertionPoint->getAlignment(), InsertionPoint->getPointerAddressSpace());

if (InterleavedCost >= InstructionCost) {		if (InterleavedCost >= InstructionCost) {
return false;		return false;
}		}

// Create a pointer cast for the wide load.		// Create a pointer cast for the wide load.
auto CI = Builder.CreatePointerCast(InsertionPoint->getOperand(0),		auto CI = Builder.CreatePointerCast(InsertionPoint->getOperand(0),
ILTy->getPointerTo(),		ILTy->getPointerTo(),
▲ Show 20 Lines • Show All 141 Lines • Show Last 20 Lines

llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp

Show First 20 Lines • Show All 3,632 Lines • ▼ Show 20 Lines	void AArch64DAGToDAGISel::Select(SDNode *Node) {
case ISD::INTRINSIC_WO_CHAIN: {		case ISD::INTRINSIC_WO_CHAIN: {
unsigned IntNo = cast<ConstantSDNode>(Node->getOperand(0))->getZExtValue();		unsigned IntNo = cast<ConstantSDNode>(Node->getOperand(0))->getZExtValue();
switch (IntNo) {		switch (IntNo) {
default:		default:
break;		break;
case Intrinsic::aarch64_tagp:		case Intrinsic::aarch64_tagp:
SelectTagP(Node);		SelectTagP(Node);
return;		return;
		case Intrinsic::aarch64_neon_tbl1_temp: {
		SDLoc Dl(Node);

		SmallVector<SDValue, 2> Ops;
		// the source vector
		Ops.push_back(Node->getOperand(1));
		// the mask
		Ops.push_back(Node->getOperand(2));
		ReplaceNode(Node,
		CurDAG->getMachineNode(AArch64::TBLv16i8One, Dl, VT, Ops));

		return;
		}
case Intrinsic::aarch64_neon_tbl2:		case Intrinsic::aarch64_neon_tbl2:
SelectTable(Node, 2,		SelectTable(Node, 2,
VT == MVT::v8i8 ? AArch64::TBLv8i8Two : AArch64::TBLv16i8Two,		VT == MVT::v8i8 ? AArch64::TBLv8i8Two : AArch64::TBLv16i8Two,
false);		false);
return;		return;
case Intrinsic::aarch64_neon_tbl3:		case Intrinsic::aarch64_neon_tbl3:
SelectTable(Node, 3, VT == MVT::v8i8 ? AArch64::TBLv8i8Three		SelectTable(Node, 3, VT == MVT::v8i8 ? AArch64::TBLv8i8Three
: AArch64::TBLv16i8Three,		: AArch64::TBLv16i8Three,
▲ Show 20 Lines • Show All 1,035 Lines • Show Last 20 Lines

llvm/lib/Target/AArch64/AArch64ISelLowering.h

Show First 20 Lines • Show All 447 Lines • ▼ Show 20 Lines	public:

bool lowerInterleavedLoad(LoadInst *LI,		bool lowerInterleavedLoad(LoadInst *LI,
ArrayRef<ShuffleVectorInst *> Shuffles,		ArrayRef<ShuffleVectorInst *> Shuffles,
ArrayRef<unsigned> Indices,		ArrayRef<unsigned> Indices,
unsigned Factor) const override;		unsigned Factor) const override;
bool lowerInterleavedStore(StoreInst *SI, ShuffleVectorInst *SVI,		bool lowerInterleavedStore(StoreInst *SI, ShuffleVectorInst *SVI,
unsigned Factor) const override;		unsigned Factor) const override;

		bool lowerShuffleVector(ShuffleVectorInst *SI) const override;

bool isLegalAddImmediate(int64_t) const override;		bool isLegalAddImmediate(int64_t) const override;
bool isLegalICmpImmediate(int64_t) const override;		bool isLegalICmpImmediate(int64_t) const override;

bool shouldConsiderGEPOffsetSplit() const override;		bool shouldConsiderGEPOffsetSplit() const override;

EVT getOptimalMemOpType(const MemOp &Op,		EVT getOptimalMemOpType(const MemOp &Op,
const AttributeList &FuncAttributes) const override;		const AttributeList &FuncAttributes) const override;

▲ Show 20 Lines • Show All 393 Lines • ▼ Show 20 Lines	void ReplaceNodeResults(SDNode *N, SmallVectorImpl<SDValue> &Results,
SelectionDAG &DAG) const override;		SelectionDAG &DAG) const override;

bool shouldNormalizeToSelectSequence(LLVMContext &, EVT) const override;		bool shouldNormalizeToSelectSequence(LLVMContext &, EVT) const override;

void finalizeLowering(MachineFunction &MF) const override;		void finalizeLowering(MachineFunction &MF) const override;

bool shouldLocalize(const MachineInstr &MI,		bool shouldLocalize(const MachineInstr &MI,
const TargetTransformInfo *TTI) const override;		const TargetTransformInfo *TTI) const override;

		/// Create a tbl1 mask whose elements default to 0xFF.
		/// An index of 0xFF is out of range, so tbl1 writes 0 to the
		/// corresponding byte of the output vector.
		Constant *createTbl1Mask(IRBuilderBase &Builder,
		SmallVector<int, 16> &InputMask, unsigned NumElts,
		unsigned InputEltSize, unsigned OutputEltSize) const;
};		};

namespace AArch64 {		namespace AArch64 {
FastISel *createFastISel(FunctionLoweringInfo &funcInfo,		FastISel *createFastISel(FunctionLoweringInfo &funcInfo,
const TargetLibraryInfo *libInfo);		const TargetLibraryInfo *libInfo);
} // end namespace AArch64		} // end namespace AArch64

} // end namespace llvm		} // end namespace llvm

#endif		#endif

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp

Show First 20 Lines • Show All 9,565 Lines • ▼ Show 20 Lines	if (StoreCount > 0)
BaseAddr, LaneLen * Factor);		BaseAddr, LaneLen * Factor);

Ops.push_back(Builder.CreateBitCast(BaseAddr, PtrTy));		Ops.push_back(Builder.CreateBitCast(BaseAddr, PtrTy));
Builder.CreateCall(StNFunc, Ops);		Builder.CreateCall(StNFunc, Ops);
}		}
return true;		return true;
}		}

		bool AArch64TargetLowering::lowerShuffleVector(ShuffleVectorInst *SI) const {
		IRBuilder<> Builder(SI);

		// First check the shuffle_vector instruction
		// 1) The first operand has to be 128 bit, byte mask requires the vector
		// size has to be 16*i8. We do not handle small vector shuffle here for
		// TBL1 instruction, for example, v2i16 size is 32
		// 2) The 2nd operand has to be UNDEF for tbl1 instruction
		if (SI->getOperand(0)->getType()->isVectorTy() &&
		SI->getOperand(0)->getType()->getPrimitiveSizeInBits() != 128)
		return false;

		// The 2nd operand has to be UNDEF
		if (Constant *C = dyn_cast<Constant>(SI->getOperand(1)))
		if (!(isa<UndefValue>(C)))
		return false;

		// We only handle shuffle_vector which has only one user instruction here,
		// because multiple user instructions will cause multiple tbl1 instructions
		// generated. we leave it to the next stage implementation
		if (!SI->hasOneUse())
		return false;

		// Now we check the one use instruction, we only handle UItoFP at this stage
		// and a few other instructions. The user instruction list can also be
		// expanded later
		auto UI = SI->user_begin();
		Instruction *I = cast<Instruction>(*UI);

		// we only support the following instructions at this stage
		// it can be expanded
		if (I->getOpcode() != Instruction::UIToFP &&
		I->getOpcode() != Instruction::FAdd &&
		I->getOpcode() != Instruction::FSub &&
		I->getOpcode() != Instruction::FMul &&
		I->getOpcode() != Instruction::Add &&
		I->getOpcode() != Instruction::Sub) {
		LLVM_DEBUG(dbgs() << "Quit: shufflevector's user instruction does not qualify: "
		<< *I << "\n");
		return false;
		}

		// Now we do the type check on the vector.
		// If the type of the input vector to the user instruction is the same as
		// the output of the user instruction, then it is already handled in the
		// later DAG lowering stage, so there is no need to handle it here.
		VectorType *SVTy = SI->getType();
		if (SVTy == I->getType())
		return false;

		// At the point we exclude all the not handled situations, we can work out
		// the intrinsic call
		Type *SVEltTy = SVTy->getVectorElementType();
		unsigned SVNum = SVTy->getVectorNumElements();
		Type *PromotedIntTy;

		// Here we need to decide the tbl1 instruction's result type based on
		// its users (UIToFP) result type
		// As the result type can only be 64-bit or 32-bit float, we can set
		// corresponding integer type to the tbl1's result
		unsigned UIEltSize =
		I->getType()->getVectorElementType()->getScalarSizeInBits();
		if (UIEltSize == 64 && SVNum == 2)
		PromotedIntTy = Type::getInt64Ty(SI->getType()->getContext());
		else if (UIEltSize == 32 && SVNum == 4)
		PromotedIntTy = Type::getInt32Ty(SI->getType()->getContext());
		else
		return false;

		VectorType *VecTy = VectorType::get(PromotedIntTy, SVNum);

		// VecTy is the tbl1 result type, this needs to be worked out
		// Followed by tbl1 input source vector type
		Type *Tys[2] = {VecTy, SI->getOperand(0)->getType()};

		// Get the input Mask
		auto Mask = SI->getShuffleMask();

		// Generate the intrinsic function call
		Function *Tbl1Func = Intrinsic::getDeclaration(
		SI->getModule(), Intrinsic::aarch64_neon_tbl1_temp, Tys);

		// Generate one Tbl1 for each use, could merge if the uses are the same
		// in terms of the input type
		for (auto UI = SI->user_begin(), E = SI->user_end(); UI != E; UI++) {
		Instruction *I = cast<Instruction>(*UI);
		Type *UserTy = I->getType();

		// Two operands: the 1st is the input vector, the 2nd is the mask
		SmallVector<Value *, 2> Ops;

		// This is the vector operand to the Tbl1 intrinsic. Any vector type is OK,
		// but we need to adjust it to match the user result type.
		// It should be safe to arbitrarily change the type here; however, there
		// could be a problem in later passes.
		Ops.push_back(SI->getOperand(0));

		// This is the mask operand to the Tbl1 intrinsic; it has to be of v16i8 type.
		// We work it out from the input mask together with the result type:
		// the input mask is SI->getOperand(2),
		// the result type is that of the user of SI, I->getType().
		unsigned InputEltSize = SVEltTy->getPrimitiveSizeInBits();
		unsigned OutputEltSize =
		UserTy->getVectorElementType()->getPrimitiveSizeInBits();
		Value *Tbl1mask =
		createTbl1Mask(Builder, Mask, SVNum, InputEltSize, OutputEltSize);
		LLVM_DEBUG(dbgs() << "Tbl1 mask: "; Tbl1mask->dump());
		Ops.push_back(Tbl1mask);

		// Make the call for this user
		CallInst *Tbl1 = Builder.CreateCall(Tbl1Func, Ops);
		UI->replaceUsesOfWith(SI, Tbl1);
		}

		// Return true if it is successful
		return true;
		}
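
The result-type selection in lowerShuffleVector above can be restated without any LLVM types. The following is only a sketch under that simplification: `promotedLaneBits` and its parameters are hypothetical names that do not appear in the patch, and the opcode, one-use and undef-operand checks are deliberately left out.

```cpp
#include <optional>

// Hypothetical helper (not part of the patch): returns the integer lane width
// in bits that the tbl1 result would use, or std::nullopt for the shapes on
// which the patch's lowerShuffleVector gives up.
std::optional<unsigned> promotedLaneBits(unsigned sourceVectorBits,
                                         unsigned userEltBits,
                                         unsigned numShuffleElts) {
  // The shuffle source must fill exactly one 128-bit register.
  if (sourceVectorBits != 128)
    return std::nullopt;
  // <2 x 64-bit> and <4 x 32-bit> are the only accepted result shapes.
  if (userEltBits == 64 && numShuffleElts == 2)
    return 64u;
  if (userEltBits == 32 && numShuffleElts == 4)
    return 32u;
  return std::nullopt;
}
```

For the example in the summary (an `<8 x i16>` source, a two-element shuffle and a `uitofp` to `<2 x double>`), the call would be `promotedLaneBits(128, 64, 2)`, which selects 64-bit integer lanes.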

EVT AArch64TargetLowering::getOptimalMemOpType(		EVT AArch64TargetLowering::getOptimalMemOpType(
const MemOp &Op, const AttributeList &FuncAttributes) const {		const MemOp &Op, const AttributeList &FuncAttributes) const {
bool CanImplicitFloat =		bool CanImplicitFloat =
!FuncAttributes.hasFnAttribute(Attribute::NoImplicitFloat);		!FuncAttributes.hasFnAttribute(Attribute::NoImplicitFloat);
bool CanUseNEON = Subtarget->hasNEON() && CanImplicitFloat;		bool CanUseNEON = Subtarget->hasNEON() && CanImplicitFloat;
bool CanUseFP = Subtarget->hasFPARMv8() && CanImplicitFloat;		bool CanUseFP = Subtarget->hasFPARMv8() && CanImplicitFloat;
// Only use AdvSIMD to implement memset of 32-byte and above. It would have		// Only use AdvSIMD to implement memset of 32-byte and above. It would have
▲ Show 20 Lines • Show All 4,360 Lines • ▼ Show 20 Lines	if (MI.getOpcode() == TargetOpcode::G_GLOBAL_VALUE) {
// we don't want localized, as they can get moved into the middle of a		// we don't want localized, as they can get moved into the middle of a
// another call sequence.		// another call sequence.
const GlobalValue &GV = *MI.getOperand(1).getGlobal();		const GlobalValue &GV = *MI.getOperand(1).getGlobal();
if (GV.isThreadLocal() && Subtarget->isTargetMachO())		if (GV.isThreadLocal() && Subtarget->isTargetMachO())
return false;		return false;
}		}
return TargetLoweringBase::shouldLocalize(MI, TTI);		return TargetLoweringBase::shouldLocalize(MI, TTI);
}		}

		Constant *AArch64TargetLowering::createTbl1Mask(IRBuilderBase &Builder,
		SmallVector<int, 16> &InputMask,
		unsigned NumElts,
		unsigned InputEltSize,
		unsigned OutputEltSize) const {

		unsigned InputEltIdx = 0;
		unsigned CurrInputIdx = 0;
		unsigned CurrOffset;
		unsigned OffsetLeft = 0;
		unsigned OffsetRight = InputEltSize;

		SmallVector<Constant *, 16> Mask;
		for (unsigned Idx = 0; Idx < 16; Idx++) {
		// if all the elements are placed in the output vector, then just fill up
		// with out of range index
		if (InputEltIdx >= NumElts)
		Mask.push_back(Builder.getInt8(255));
		else {
		CurrOffset = Idx * 8;
		if (CurrOffset >= OffsetLeft && CurrOffset < OffsetRight) {
		CurrInputIdx = InputMask[InputEltIdx] * InputEltSize / 8 +
		(CurrOffset - OffsetLeft) / 8;
		Mask.push_back(Builder.getInt8(CurrInputIdx));
		}
		// finished one input element, move to the next
		else if (CurrOffset == OffsetRight) {
		InputEltIdx++;
		if (InputEltIdx >= NumElts) {
		Mask.push_back(Builder.getInt8(255));
		continue;
		}
		OffsetLeft = OutputEltSize * InputEltIdx;
		OffsetRight = OffsetLeft + InputEltSize;
		// check this new byte
		if (CurrOffset >= OffsetLeft && CurrOffset < OffsetRight) {
		CurrInputIdx = InputMask[InputEltIdx] * InputEltSize / 8 +
		(CurrOffset - OffsetLeft) / 8;
		Mask.push_back(Builder.getInt8(CurrInputIdx));
		} else
		Mask.push_back(Builder.getInt8(255));
		} else
		Mask.push_back(Builder.getInt8(255));
		}
		}
		return ConstantVector::get(Mask);
		}
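
To make the effect of the mask easier to see, here is a simplified, LLVM-free restatement of the idea together with a software model of a single-register TBL. It is a sketch, not the patch's code: `makeTbl1ByteMask`, `emulateTbl1` and `Bytes16` are hypothetical names, and it assumes little-endian lanes, non-negative shuffle indices, and an input element no wider than the output element.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

using Bytes16 = std::array<std::uint8_t, 16>;

// Build the 16-byte index vector. Assumes little-endian lanes, non-negative
// shuffle indices, and inputEltBits <= outputEltBits (both multiples of 8).
Bytes16 makeTbl1ByteMask(const std::vector<int> &shuffleMask,
                         unsigned inputEltBits, unsigned outputEltBits) {
  Bytes16 mask;
  mask.fill(0xFF); // out-of-range index: TBL writes 0 to that byte
  const std::size_t inBytes = inputEltBits / 8;
  const std::size_t outBytes = outputEltBits / 8;
  for (std::size_t elt = 0; elt < shuffleMask.size(); ++elt) {
    for (std::size_t b = 0; b < inBytes; ++b) {
      const std::size_t dst = elt * outBytes + b;
      if (dst >= mask.size())
        return mask; // the result register is full
      // Byte b of the selected source element lands in byte b of output lane elt.
      mask[dst] = static_cast<std::uint8_t>(
          static_cast<std::size_t>(shuffleMask[elt]) * inBytes + b);
    }
  }
  return mask;
}

// Software model of a single-register TBL: indices >= 16 produce 0.
Bytes16 emulateTbl1(const Bytes16 &table, const Bytes16 &indices) {
  Bytes16 out{};
  for (std::size_t i = 0; i < out.size(); ++i)
    out[i] = indices[i] < table.size() ? table[indices[i]] : 0;
  return out;
}
```

For the summary's example, `makeTbl1ByteMask({0, 4}, 16, 64)` places source bytes 0 and 1 at the bottom of the first 64-bit lane and source bytes 8 and 9 at the bottom of the second, with every other byte left at 0xFF. Applying `emulateTbl1` with that mask therefore yields the two selected i16 values zero-extended to 64 bits, which is the form the `uitofp` user needs.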

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h

Show First 20 Lines • Show All 184 Lines • ▼ Show 20 Lines	if (isa<VectorType>(DataType)) {
unsigned EltSize =		unsigned EltSize =
DataType->getVectorElementType()->getScalarSizeInBits();		DataType->getVectorElementType()->getScalarSizeInBits();
return NumElements > 1 && isPowerOf2_64(NumElements) && EltSize >= 8 &&		return NumElements > 1 && isPowerOf2_64(NumElements) && EltSize >= 8 &&
EltSize <= 128 && isPowerOf2_64(EltSize);		EltSize <= 128 && isPowerOf2_64(EltSize);
}		}
return BaseT::isLegalNTStore(DataType, Alignment);		return BaseT::isLegalNTStore(DataType, Alignment);
}		}

int getInterleavedMemoryOpCost(unsigned Opcode, Type *VecTy, unsigned Factor,		int getInterleavedMemoryOpCost(Instruction *I, unsigned VF, unsigned Opcode,
		Type *VecTy, unsigned Factor,
ArrayRef<unsigned> Indices, unsigned Alignment,		ArrayRef<unsigned> Indices, unsigned Alignment,
unsigned AddressSpace,		unsigned AddressSpace,
bool UseMaskForCond = false,		bool UseMaskForCond = false,
bool UseMaskForGaps = false);		bool UseMaskForGaps = false);

bool		bool
shouldConsiderAddressTypePromotion(const Instruction &I,		shouldConsiderAddressTypePromotion(const Instruction &I,
bool &AllowPromotionWithoutCommonHeader);		bool &AllowPromotionWithoutCommonHeader);
Show All 36 Lines

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp

Show First 20 Lines • Show All 671 Lines • ▼ Show 20 Lines	if (Ty->getVectorNumElements() < ProfitableNumElements) {
// We generate 2 instructions per vector element.		// We generate 2 instructions per vector element.
return NumVectorizableInstsToAmortize * NumVecElts * 2;		return NumVectorizableInstsToAmortize * NumVecElts * 2;
}		}
}		}

return LT.first;		return LT.first;
}		}

int AArch64TTIImpl::getInterleavedMemoryOpCost(unsigned Opcode, Type *VecTy,		int AArch64TTIImpl::getInterleavedMemoryOpCost(
unsigned Factor,		Instruction *I, unsigned VF, unsigned Opcode, Type *VecTy, unsigned Factor,
ArrayRef<unsigned> Indices,		ArrayRef<unsigned> Indices, unsigned Alignment, unsigned AddressSpace,
unsigned Alignment,		bool UseMaskForCond, bool UseMaskForGaps) {
unsigned AddressSpace,
bool UseMaskForCond,
bool UseMaskForGaps) {
assert(Factor >= 2 && "Invalid interleave factor");		assert(Factor >= 2 && "Invalid interleave factor");
assert(isa<VectorType>(VecTy) && "Expect a vector type");		assert(isa<VectorType>(VecTy) && "Expect a vector type");

if (!UseMaskForCond && !UseMaskForGaps &&		if (!UseMaskForCond && !UseMaskForGaps &&
Factor <= TLI->getMaxSupportedInterleaveFactor()) {		Factor <= TLI->getMaxSupportedInterleaveFactor()) {
unsigned NumElts = VecTy->getVectorNumElements();		unsigned NumElts = VecTy->getVectorNumElements();
auto *SubVecTy = VectorType::get(VecTy->getScalarType(), NumElts / Factor);		auto *SubVecTy = VectorType::get(VecTy->getScalarType(), NumElts / Factor);

// ldN/stN only support legal vector types of size 64 or 128 in bits.		// ldN/stN only support legal vector types of size 64 or 128 in bits.
// Accesses having vector types that are a multiple of 128 bits can be		// Accesses having vector types that are a multiple of 128 bits can be
// matched to more than one ldN/stN instruction.		// matched to more than one ldN/stN instruction.
if (NumElts % Factor == 0 &&		if (NumElts % Factor == 0 &&
TLI->isLegalInterleavedAccessType(SubVecTy, DL))		TLI->isLegalInterleavedAccessType(SubVecTy, DL))
return Factor * TLI->getNumInterleavedAccesses(SubVecTy, DL);		return Factor * TLI->getNumInterleavedAccesses(SubVecTy, DL);
}		}

return BaseT::getInterleavedMemoryOpCost(Opcode, VecTy, Factor, Indices,		// Check whether this interleaved memory access can be lowered to a
Alignment, AddressSpace,		// TBL1 instruction later in the InterleavedAccessPass.
		// If so, the cost is the number of TBL1 instructions times the basic
		// cost of a TBL1 instruction, which is set to 1 at this time.
		if (I && VF > 1 && I->hasOneUse()) {
		auto UI = I->user_begin();
		Instruction *UserInstruction = cast<Instruction>(*UI);
		// We currently just support the following instructions, can be expanded
		if (UserInstruction->getOpcode() == Instruction::UIToFP ||
		UserInstruction->getOpcode() == Instruction::FAdd ||
		UserInstruction->getOpcode() == Instruction::FSub ||
		UserInstruction->getOpcode() == Instruction::FMul ||
		UserInstruction->getOpcode() == Instruction::Add ||
		UserInstruction->getOpcode() == Instruction::Sub) {
		UserInstruction->getOpcode() == Instruction::Sub) {
		// the first check to make sure the result can form a 128-bit vector
		// the 2nd check to make sure the input data can fit into 128-bit vector
		// so that we can use tbl1 instruction
		// there will be Group->getFactor() tbl1 generated, each tbl1 costs 1
		if ((UserInstruction->getType()->getScalarSizeInBits() * VF == 128) &&
		(I->getType()->getScalarSizeInBits() * Factor * VF == 128))
		return Factor * 1;
		}
		}

		return BaseT::getInterleavedMemoryOpCost(I, VF, Opcode, VecTy, Factor,
		Indices, Alignment, AddressSpace,
UseMaskForCond, UseMaskForGaps);		UseMaskForCond, UseMaskForGaps);
}		}
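
The two 128-bit size checks in the new TBL1 branch can be read as plain arithmetic. The sketch below restates them as a free function. `tbl1InterleaveCost` and its parameter names are hypothetical and are not LLVM API, and the opcode and one-use tests from the patch are omitted.

```cpp
#include <optional>

// Hypothetical stand-alone restatement of the size test; the names are
// illustrative and not LLVM API.
std::optional<unsigned> tbl1InterleaveCost(unsigned userScalarBits,
                                           unsigned memberScalarBits,
                                           unsigned factor, unsigned vf) {
  // The widened user result must form exactly one 128-bit vector.
  const bool resultFills128 = userScalarBits * vf == 128;
  // The interleaved input (all members, all lanes) must fit one 128-bit vector.
  const bool inputFills128 = memberScalarBits * factor * vf == 128;
  if (vf > 1 && resultFills128 && inputFills128)
    return factor; // one tbl1 per group member, each costed at 1
  return std::nullopt; // caller falls back to the generic cost
}
```

With the summary's example (i16 members, Factor = 4, VF = 2, and a `double` user), both products equal 128, so the sketch returns a cost of 4. Dropping the factor to 2 makes the input only 64 bits wide, and the function reports no TBL1 cost.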

int AArch64TTIImpl::getCostOfKeepingLiveOverCall(ArrayRef<Type *> Tys) {		int AArch64TTIImpl::getCostOfKeepingLiveOverCall(ArrayRef<Type *> Tys) {
int Cost = 0;		int Cost = 0;
for (auto *I : Tys) {		for (auto *I : Tys) {
if (!I->isVectorTy())		if (!I->isVectorTy())
continue;		continue;
▲ Show 20 Lines • Show All 295 Lines • Show Last 20 Lines

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

Show First 20 Lines • Show All 5,924 Lines • ▼ Show 20 Lines	for (unsigned i = 0; i < InterleaveFactor; i++)
if (Group->getMember(i))		if (Group->getMember(i))
Indices.push_back(i);		Indices.push_back(i);
}		}

// Calculate the cost of the whole interleaved group.		// Calculate the cost of the whole interleaved group.
bool UseMaskForGaps =		bool UseMaskForGaps =
Group->requiresScalarEpilogue() && !isScalarEpilogueAllowed();		Group->requiresScalarEpilogue() && !isScalarEpilogueAllowed();
unsigned Cost = TTI.getInterleavedMemoryOpCost(		unsigned Cost = TTI.getInterleavedMemoryOpCost(
I->getOpcode(), WideVecTy, Group->getFactor(), Indices,		I, VF, I->getOpcode(), WideVecTy, Group->getFactor(), Indices,
Group->getAlign().value(), AS, Legal->isMaskRequired(I), UseMaskForGaps);		Group->getAlign().value(), AS, Legal->isMaskRequired(I), UseMaskForGaps);

if (Group->isReverse()) {		if (Group->isReverse()) {
// TODO: Add support for reversed masked interleaved access.		// TODO: Add support for reversed masked interleaved access.
assert(!Legal->isMaskRequired(I) &&		assert(!Legal->isMaskRequired(I) &&
"Reverse masked interleaved access not supported.");		"Reverse masked interleaved access not supported.");
Cost += Group->getNumMembers() *		Cost += Group->getNumMembers() *
TTI.getShuffleCost(TargetTransformInfo::SK_Reverse, VectorTy, 0);		TTI.getShuffleCost(TargetTransformInfo::SK_Reverse, VectorTy, 0);
▲ Show 20 Lines • Show All 2,083 Lines • Show Last 20 Lines

Revision Contents

Diff 253942
