
Expand interleaved memory access pass to identify certain shuffle_vector and transform it into target specific intrinsics.
Needs ReviewPublic

Authored by wxz2020 on Feb 28 2020, 3:24 PM.

Details

Summary

E.g. An interleaved load (Factor = 4):

%wide.vec = load <8 x i16>, <8 x i16>* %ptr
%strided.vec = shufflevector <8 x i16> %wide.vec, <8 x i16> undef, <2 x i32> <i32 0, i32 4>

%v1 = uitofp <2 x i16> %strided.vec to <2 x double>

It can be transformed into a tbl1 intrinsic in the AArch64 backend to avoid the high-cost extract/insert sequences.
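To make the byte-mask idea concrete, here is an illustrative Python sketch (not LLVM code) of TBL1's byte-table semantics: each mask byte selects one byte of the 16-byte source register, and out-of-range indices yield 0, which is what lets a single TBL1 deinterleave the strided elements above. The `tbl1` helper name is our own, not an LLVM or ACLE identifier.

```python
import struct

def tbl1(src: bytes, mask: bytes) -> bytes:
    """Byte-table lookup: out[i] = src[mask[i]] if in range, else 0."""
    return bytes(src[m] if m < len(src) else 0 for m in mask)

# A v8i16 register holding elements 10..17 (little-endian, 2 bytes each).
wide = struct.pack("<8H", 10, 11, 12, 13, 14, 15, 16, 17)

# Byte mask extracting i16 lanes 0 and 4 into the low 4 bytes;
# 0xFF is out of range, so the remaining bytes read as 0.
mask = bytes([0, 1, 8, 9] + [0xFF] * 12)

lanes = struct.unpack("<8H", tbl1(wide, mask))
print(lanes[:2])  # the two strided elements
```

One permute replaces the per-element extract/insert sequence that ISel would otherwise emit for the mismatched-type shuffle.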

The change also updates the InterleavedMemoryOpCost calculation in the loop vectorizer so that its
vectorization decisions account for this lowering.

This change gives SPEC2017 538.imagick_r an 11.5% performance boost.

Tested using:
%llvm/build/bin/llvm-lit ../../test/*

There are no regressions in the tests.

We also ran the whole SPEC2017 suite: all benchmarks pass and there is no performance regression.

Diff Detail

Event Timeline

wxz2020 created this revision.Feb 28 2020, 3:24 PM
dmgreen added a subscriber: dmgreen.Mar 1 2020, 7:45 AM

Hello. Looks interesting. We ended up doing something similar with lowering of interleaved access groups to VMOVN instructions in MVE. It went straight through ISel though, not needing to go via the InterleavedAccessPass. I don't immediately see why this case would need to be done differently. It looks like ISel can already generate at least some TBL1 instructions.

Can you add some testcases for this? Both for producing this in the vectorizer/costmodel tests and for the backend codegen of the load+shuffle patterns you expect to see.

Some other initial thoughts:

  • Can you run clang-format over the patch. That would lower the amount of noise from the lint bot.
  • VF can be calculated from VecTy and Factor.
  • Checking the instruction users are a certain kind sounds odd. Can you explain why it's checking that and only generating in those cases?
  • I was half expecting BaseT::getInterleavedMemoryOpCost to return something like getMemoryOpCost + getShuffleCost, but it seems to use the cost of inserts + extracts.
wxz2020 updated this revision to Diff 248210.Mar 4 2020, 9:17 AM

This upload fixes the clang-format problems.

wxz2020 updated this revision to Diff 248253.Mar 4 2020, 11:01 AM

Fix yet another clang-format error.

wxz2020 updated this revision to Diff 248282.Mar 4 2020, 12:36 PM

Fix clang-tidy warning and one clang-format error.

lkail added a subscriber: lkail.Mar 7 2020, 2:48 AM
xbolva00 added inline comments.
llvm/lib/CodeGen/InterleavedAccessPass.cpp
458

Missing word?

wxz2020 added inline comments.Mar 9 2020, 8:43 AM
llvm/lib/CodeGen/InterleavedAccessPass.cpp
458

Will remove it. Wrong cut and paste.

wxz2020 updated this revision to Diff 249122.Mar 9 2020, 8:56 AM

Fix a minor comment.

The title of this patch confused me. This is not adding a pass; it is extending existing passes.
I did not see an answer to @dmgreen's questions - maybe the biggest one: why is this transform not done entirely in SelectionDAG?
And why would we want AArch-specific "tbl1" pattern-matching in the generic VectorUtils.h?

On 05/03/2020 00:35, Wei Zhao wrote:

  1. Why not do this work in ISel?

    In LLVM IR we have shufflevector; in the LLVM DAG we have vector_shuffle.

    In ISel, the normal process to translate LLVM IR shufflevector to TBLn instruction is to translate it first to DAG node vector_shuffle, and then in the DAG Legalize() to lower it to machine specific instructions like TBLn.

    There is a very strict and deeply rooted convention or requirement on how to translate LLVM IR shufflevector -> DAG node vector_shuffle.

    In the following example, %v3 and %v1, %v2 have to be the same type:

    %v3 = shufflevector <8 x i8> %v1, <8 x i8> %v2, <8 x i32> <i32 1, i32 0, i32 3, i32 2, i32 5, i32 4, i32 7, i32 6> ; yields <8 x i8>

    If not, ISel won't translate it to the DAG node vector_shuffle, and it won't later be lowered into a TBLn instruction. Instead, ISel (which has at most 9 steps in total) will lower it into a sequence of extract/insert instructions to form a new vector; in the above example, %v3 will be generated by a sequence of extract/insert instructions.

    Because ISel is coded like this, it is very hard to break this rule and put the handling of irregular (mismatched type) shufflevector in the ISel stage.

    To my understanding, the type-match requirement on shufflevector dates back to the early days of LLVM, when it was used to generate code for Motorola AltiVec.

    Because of all this, we decided to follow the example of Ldn/Stn generation in InterleavedAccessPass() and directly generate the Tbl1 intrinsic from LLVM IR before entering ISel.

You may be right that IR is the simpler place for this to be. If we could add it to ISel though, that would be the more natural place for it. And would stop us having to look at uses like this. ISel knows about all the types and lanes that are going on at the machine level. A lot of the times that these pre-isel passes have been added feel like a mistake to me. When there is masking involved it is sometimes really required, but dagcombine is good at taking many things into account, optimising many things all at once.

There is ReconstructShuffle and isShuffleMaskLegal, but they might not be applicable with the way that this example goes via a smaller legal vector. At one point it looks like we have:

        t5: v8i16,ch = load<(load 16 from %ir.scevgep238, align 2)> t0, t2, undef:i64
      t16: i32 = extract_vector_elt t5, Constant:i64<0>
      t17: i32 = extract_vector_elt t5, Constant:i64<4>
    t20: v2i32 = BUILD_VECTOR t16, t17
    t22: v2i32 = BUILD_VECTOR Constant:i32<65535>, Constant:i32<65535>
  t24: v2i32 = and t20, t22
t26: v2i64 = zero_extend t24

That going into and out of a v2i32/v4i32 is not going to be efficient. It may make sense to take sequences like this and flatten them to simpler code, potentially using the VTBL like you propose. (It might turn out this example is nothing like your actual code though, or that some cases are not worth optimising like that. The VTBL is a powerful instruction, but comes at a cost. We need to be careful not to overuse it where simpler code would be better.)
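For what it's worth, the lane arithmetic behind that suggested flattening can be checked with a small Python sketch (illustrative only; the function names are ours, not LLVM's). It compares the extract/mask/extend path in the DAG dump above against a direct reinterpretation of the loaded bytes as v2i64 ANDed with 0x000000000000ffff per lane:

```python
import struct

def extract_insert_path(wide_v8i16):
    # Mimic the DAG above: extract lanes 0 and 4, mask to 16 bits, zero-extend.
    t16, t17 = wide_v8i16[0], wide_v8i16[4]
    return [t16 & 0xFFFF, t17 & 0xFFFF]

def flat_and_path(raw16: bytes):
    # The suggested flattening: reinterpret the 16 loaded bytes as v2i64
    # and AND each lane with 0xFFFF (the i16 lanes are already in place).
    lo, hi = struct.unpack("<2Q", raw16)
    return [lo & 0xFFFF, hi & 0xFFFF]

raw = struct.pack("<8H", 100, 1, 2, 3, 200, 5, 6, 7)
wide = struct.unpack("<8H", raw)
print(extract_insert_path(wide), flat_and_path(raw))
```

Both paths produce the same two zero-extended elements, which is why the AND-over-the-whole-register form can replace the extract/build_vector sequence when the lanes need no shifting.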

I'm not convinced either way, but needing to check the uses like that sounds off to me.

  • Wei's answer on 3/9/2020

ISel is a cascaded and complex process. Before the actual MC (machine code) selection, it usually goes through the following steps:
#1 Combine (1)
#2 legalize_types()
#3 Combine (lt)
#4 LegalizeVectors()
#5 Legalize_types()
#6 Combine (lv)
#7 Legalize()
#8 Combine(2)
#9 Instruction selection begins

#4, #5, and #6 are executed only if there are vector instructions.

Combine() is the place where DAG nodes are combined, optimized and adjusted to get them close to the final MC. It is called at most 4 times, and each time it does almost the same thing to the DAG forest.

Currently, in your above example, #1 Combine(1) can combine t16, t17, t20 into a vector_shuffle DAG node if t5 and t20 are of the same type. The reason for this "same type" requirement is not the TBLn instructions, as TBLn has no concept of type; it only knows bytes. "Same type" is required here mainly for the output positions. Once the types match, it can figure out both the input and output positions and thus work out the TBL mask. For the mismatched-type situation, just like in the above example, the current ISel will leave it to #7 Legalize() to translate them to extract/insert so that it can be mapped to MC.

Your above code sequence is actually an intermediate result after #5 legalize_types().

t16/t17 should be i16 after #1 Combine(1). The Combine() process works bottom-up: after it visits t20 (BUILD_VECTOR), it then visits t17 and t16.
In #2 legalize_types(), it finds that t16/t17 are scalars, and on AArch64 a scalar should be at least 32-bit, so it promotes them to 32-bit.

In #4 and #5, to accommodate v2i16->v2i32, it needs to insert an AND, and for v2i32->v2i64 it needs to insert a zero_extend.

So if we want to do the work in ISel, it has to be done in #1 Combine(1); otherwise, the later passes will add more instructions and type promotions, which make the transform unprofitable or even impossible.

However, doing it in #1 Combine(1) is hard. The DAG Combine() pass is a bottom-up process, which makes combining these three layers of nodes very difficult.

To combine this DAG (extract_vector_elt/extract_vector_elt/BUILD_VECTOR) we need to look at the BUILD_VECTOR's user instruction.

The two extract_vector_elt nodes give the input elements' information. We need BUILD_VECTOR's user instruction to figure out its output positions. Just like the TBLn instruction's mask, it contains both the input vector element positions (the index number in each byte) and the places (the byte positions) they should occupy in the mask (the output positions).

However, when Combine(1) visits a BUILD_VECTOR node, its user nodes have already been processed. You can find BUILD_VECTOR's input nodes, but you cannot find its output nodes or user instructions; it might also have multiple users. So the only way to solve this is to look at every possible DAG node (each could be a user of the BUILD_VECTOR node), check whether its parent is a BUILD_VECTOR node, and then look at its grandparent nodes to see if they are extract_vector_elt nodes. This means looking at 3 layers of DAG nodes to decide whether they can be combined into one vector_shuffle DAG node (the TBLn equivalent). We cannot say this is not doable, but it is too tedious and error-prone.

From the design philosophy of DAG lowering/combining, we think it is designed to deal with a local DAG node, or a DAG node with its parent, not a three-generation DAG region. In LLVM IR this becomes a very easy task: everything is linear there, and the shufflevector IR already has its input mask; all we need to do is figure out the output positions from its user instruction so that we have the complete information for the TBL byte mask.

  • END (Wei's answer on 3/9/2020)
  1. Why look at the shufflevector's user instruction?

    Let's look at another example:

    %wide.vec = load <8 x i16>, <8 x i16>* %scevgep238, align 2, !tbaa !8
    %strided.vec = shufflevector <8 x i16> %wide.vec, <8 x i16> undef, <2 x i32> <i32 0, i32 4>
    %2 = uitofp <2 x i16> %strided.vec to <2 x double>

    Basically it loads 128 bits of data into a V register and extracts the 0th and 4th 16-bit elements to form a v2i16 vector. Note that %strided.vec has type v2i16 while the input vector has type v8i16; they are not the same. So, per the above analysis, the current ISel will lower this into a sequence of extract/insert + shifting + masking instructions, whose cost is very high.

    The type-match requirement is not baseless: when it is met, it helps form a "complete" 128-bit vector with every byte defined, so that the user instruction can work on it.

    In the above example, the result of the shufflevector is a v2i16 vector; it cannot be directly used by its user instruction uitofp without shifting and alignment according to the type <2 x double>.

    So by looking at the user instruction, we can figure out the corresponding Tbl1 instruction's mask. In this case, we know the first element of %strided.vec <v2i16> should have a byte offset of 0, while the 2nd element should have an offset of (2-1)*bitwidth(double). On AArch64, the TBLn instruction has no concept of type; it only knows bytes, and the mask is byte-based. In the above example, we can use AArch64's Tbl1 instruction to place the v2i16 vector into a 128-bit register and avoid those high-cost extract/insert (and a few shifting/masking) instructions. This brings a huge performance gain.
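The offset computation just described can be sketched in Python. This is a hypothetical helper for illustration (`tbl_byte_mask` is our own name, not code from the patch): it derives a TBL byte mask from a shufflevector index list plus the byte width of the user's output element type.

```python
def tbl_byte_mask(shuffle_idx, src_elt_bytes, out_elt_bytes, reg_bytes=16):
    """Sketch: build a TBL byte mask from a shufflevector index list.
    Each selected element's bytes land at the byte offset its lane
    occupies in the *user's* output type; every other mask byte is
    out of range (0xFF) and therefore reads as 0."""
    mask = [0xFF] * reg_bytes
    for lane, idx in enumerate(shuffle_idx):
        for b in range(src_elt_bytes):
            mask[lane * out_elt_bytes + b] = idx * src_elt_bytes + b
    return mask

# <2 x i32> <i32 0, i32 4> on a v8i16 source feeding a <2 x double> user:
# 2-byte source elements, 8-byte output lanes.
m = tbl_byte_mask([0, 4], src_elt_bytes=2, out_elt_bytes=8)
print(m)
```

With these inputs, element 0 lands at byte offset 0 and element 4 at byte offset 8, matching the (2-1)*bitwidth(double) offset described above.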

I see. You are essentially looking for a <2 x double>, so that you know the lanes the result will need to end up in.

Am I right in saying that in that example above this would be the same as (and (load v8i16, 0x000000000000ffff000000000000ffff)). Because the lanes are already in the correct place, they just need to be masked off? In other cases they would need to be shifted too (so a single vtbl may do better).

Another point for ISel would be that it knows these types. Not sure if it would be better or just as awkward.

  • Wei's answer on 3/9/2020

Explained above.

  • END (Wei's answer on 3/9/2020)
  1. Cost model. Because a shufflevector on mismatched-type vectors is usually lowered into a sequence of extract/insert instructions, the LLVM loop vectorizer gives interleaved memory accesses a very high cost based on this extract/insert approach to forming a vector. For example, in the above example, forming a v2i16 vector from a v8i16 vector with VF=2 and UF=4 has a cost of 34 for the interleaved memory access. This high cost will kill the loop's vectorization, as most other instructions usually cost 0 or 1. Without the Tbl1 instruction to form a new vector at low cost, many loops will not be vectorized because of this high cost.

    On Marvell's thunder2tx99, Tbl1 takes 5 cycles to complete, which is almost the same as an fadd/fmul, while Tbl2/3/4 take more cycles.

    Tbl1 definitely costs far fewer cycles than a sequence of extract/insert instructions to form a vector. This is the motivation for our work.
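The cost gap can be illustrated with toy numbers. The per-operation values below are invented for illustration and are not LLVM's actual cost tables; the point is only the shape of the comparison: the extract/insert model scales with the number of result elements times the unroll factor, while the TBL model pays once per result register.

```python
def extract_insert_cost(num_elts, extract_cost=2, insert_cost=2):
    # Naive model: one extract plus one insert per result element.
    return num_elts * (extract_cost + insert_cost)

def tbl_cost(num_regs, per_tbl=1):
    # One TBL per result register; the mask load is assumed hoisted
    # out of the loop, so it is free in a per-iteration cost model.
    return num_regs * per_tbl

# Forming four v2i16 results (VF=2, UF=4) from interleaved v8i16 loads:
print(extract_insert_cost(2) * 4, "vs", tbl_cost(4))
```

Even with these made-up constants the extract/insert path is an order of magnitude more expensive, which is the gap the cost-model change is meant to expose to the vectorizer.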

Yep. I was thinking that the VTBL would cost 1, but that doesn't account for the cost of the load. (The VTBL is also probably 3 instructions if I'm understanding correctly: an adr, a load and the vtbl, plus a constant-pool block. But the adr and load should be pulled up out of the loop, making them free for the vectorizer's cost model; they just end up taking an extra register.)

I think it's worth considering this as at least 2 separate parts, one that gets the backend to lower to the better sequences of instructions (through this or through ISel if it would work), and the other that gets the vectorizer to produce the code we want (by adjusting the cost model).

Both patches need tests to show that they are behaving correctly. Those tests might also shed more light on the best way forward here.

Looks very interesting,
Dave

  • Wei's answer on 3/9/2020

The current LLVM loop vectorizer cost model is based on the assumption that there is only one way to form an interleaved vector: the extract/insert model, extracting scalars from a vector and inserting some of them back to form a new vector. The cost model mimics this process to calculate the cost. Nothing is wrong here; it's just that the cost is usually very high, because forming a vector needs a sequence of extracts and inserts.

With our work, when certain conditions are met, the interleaved vector can be formed by Tbl1 instruction. So the cost model should return a different, usually much lower cost, to the loop vectorizer under such conditions. And let the loop vectorizer make the call to see if vectorizing the loop is profitable or not.

Making the loop vectorizer's cost model Tbl1-aware is one part of the work. Otherwise, the loop vectorizer will make decisions assuming there is no Tbl1 instruction to form an interleaved vector.

  • END (Wei's answer on 3/9/2020)

Wei Zhao



Sanjay Patel: The title of this patch confused me. This is not adding a pass; it is extending existing passes.
WZ: Agree. It is not a new LLVM pass; the work is piggybacked on an existing pass. I will change the title.

Sanjay Patel: I did not see an answer to @dmgreen's questions - maybe the biggest one: why is this transform not done entirely in SelectionDAG?
WZ: The answer is in the above comment.

Sanjay Patel: And why would we want AArch-specific "tbl1" pattern-matching in the generic VectorUtils.h?
WZ: Good point! This is an AArch64-specific optimization. The mask is based on AArch64's Tbl1 instruction mask definition, so it should live in a target-specific place. I will work out a fix and upload it.

Thanks!

wxz2020 retitled this revision from Add a pass to identify certain shuffle_vector and transform it into target specific intrinsics. to Expand interleaved memory access pass to identify certain shuffle_vector and transform it into target specific intrinsics..Mar 18 2020, 2:15 PM

I'm still not convinced that this shouldn't be done in ISel. There's nothing cross-block going on, so this is what ISel is designed for. It might make sense not to think of this as trying to convert vector_shuffle to something else, but instead as trying to convert what vector_shuffle has already turned into into something more optimal. In the one case I looked at, there was something like a (v4i32 (ext (v2i32 (buildvector (v4i32, ..))))). We get this way because a v2i16 was legalised to a v2i32, but everything around it was a v4i32. Can we "flatten" the ext into a single BUILDVECTOR? I have not had time to see whether that is possible, but it sounds more sensible than very special-case pre-ISel legalisation for certain shuffle_vectors.

Also, as I said before:

  • This should be 2 separate patches.
  • They both need (lots of) tests. :)
wxz2020 updated this revision to Diff 253724.Mar 30 2020, 3:43 PM

Move the creation of the Tbl1 instruction's mask from LLVM generic code to AArch64 target specific code.

wxz2020 updated this revision to Diff 253886.Mar 31 2020, 7:31 AM

Fix a clang-format issue.

wxz2020 updated this revision to Diff 253923.Mar 31 2020, 10:09 AM

minor format change

wxz2020 updated this revision to Diff 253942.Mar 31 2020, 11:46 AM

fix the warning.

wxz2020 updated this revision to Diff 257744.Apr 15 2020, 8:58 AM

Patch the function interface changes in other targets to avoid build failure for ARM, Hexagon, PowerPC, SystemZ and X86.

RKSimon added inline comments.Apr 15 2020, 9:41 AM
llvm/include/llvm/Analysis/TargetTransformInfo.h
942

Since you're changing the function signature anyway would it make sense to change this to a VectorType *VecTy? There's been chatter on other patches about making this jump in general as many of the TTI calls expect a vector anyhow.

wxz2020 added inline comments.Apr 15 2020, 9:56 AM
llvm/include/llvm/Analysis/TargetTransformInfo.h
942

Can you give me an example about this?

RKSimon added inline comments.Apr 15 2020, 11:41 AM
llvm/include/llvm/Analysis/TargetTransformInfo.h
942

@ctetreau's work on cleaning up the vector getters e.g. D77264 hits this a lot

ctetreau added inline comments.Apr 15 2020, 2:09 PM
llvm/include/llvm/Analysis/TargetTransformInfo.h
942

I'm working on a major cleanup and refactor of VectorType. As part of this, getVectorNumElements, getVectorElementCount, getVectorElementType, and getVectorIsScalable are all being removed.

If the VecTy argument to this function is expected to be an instance of VectorType, then it should take a pointer to VectorType. Then, in the body, code can directly call the methods of VectorType without using the base Type asserting getters.

wxz2020 updated this revision to Diff 257879.Apr 15 2020, 3:47 PM

Fix the precheck errors.

RKSimon added inline comments.
llvm/include/llvm/Analysis/TargetTransformInfo.h
942

@samparker has examples of the VectorType change in D78357

wxz2020 updated this revision to Diff 259577.Apr 23 2020, 8:26 AM

Merged with the current code.

wxz2020 updated this revision to Diff 259724.Apr 23 2020, 3:08 PM

Fix a few format problems.

wxz2020 updated this revision to Diff 259778.Apr 23 2020, 6:49 PM

Another missed clang-format issue; fixed.