This is an archive of the discontinued LLVM Phabricator instance.

Optimize scattered vector insert/extract pattern
Needs ReviewPublic

Authored by hulx2000 on May 15 2015, 4:51 PM.

Details

Summary
This patch transform the following IR:
  %1 = extractelement <8 x i8> %v1, i32 0
  %conv1 = zext i8 %1 to i16
  %2 = extractelement <8 x i8> %v1,  i32 1
  %conv2 = zext i8 %2 to i16
  ...
  store i16 %conv1, i16* %arrayidx1
  store i16 %conv2, i16*  %arrayidx2
Into:
  %1 = zext <8 x i8> %v1 to <8 x i16>
  %2 = extractelement <8 x i16> %1, i32 0
  %3 = extractelement <8 x i16> %1, i32 1
  ...
  store i16 %2, i16* %arrayidx1
  store i16 %3, i16*  %arrayidx2

And transform the following IR:
  %1 = load i8, i8* %arrayidx1
  %conv1 = zext i8 %1 to i16
  %2 = load i8, i8* %arrayidx2
  %conv2 = zext i8 %2 to i16
  ...
  %x0 = insertelement <8 x i16> undef, i16 %conv1, i32 0
  %x1 = insertelement <8 x i16> %x0, i16 %conv2, i32 1
Into:
  %1 = load i8, i8* %arrayidx1
  %2 = load i8, i8* %arrayidx2
  %9 = insertelement <8 x i8> undef, i8 %1, i32 0
  %10 = insertelement <8 x i8> %9, i8 %2, i32 1
  ...
  %17 = zext <8 x i8> %16 to <8 x i16>

As a summary, if vector is N x M, it save N-1 ext instructions.

Diff Detail

Repository
rL LLVM

Event Timeline

hulx2000 retitled this revision from to Optimize scattered vector insert/extract pattern.
hulx2000 updated this object.
hulx2000 edited the test plan for this revision. (Show Details)
hulx2000 set the repository for this revision to rL LLVM.
hulx2000 added a subscriber: Unknown Object (MLST).

I've added a few nits.

Reviewers might be interested to know why you're running the ADCE pass after the SLP pass.

I'll defer to Tim, James, and others to comment on the overall approach.

lib/Transforms/Vectorize/SLPVectorizer.cpp
64

Please remove the cl::ZeroOrMore option. These should not be used with cl::opt.

68

Please remove the cl::ZeroOrMore option. These should not be used with cl::opt.

405

Please do not add white space.

3122

Why is the necessary?

3130

Same. Is this necessary?

3212

80-column violation?

4176

Please add comments for the various cases you're trying to detect and avoid.

4185

Running clang-format might resolve some of the formatting issues.

4201

Don't evaluate .size() every iteration.

4344

Maximize 80-column.

test/Transforms/SLPVectorizer/AArch64/combine-extractelement.ll
6

I assume you want to remove this comment along with the others?

130

Shouldn't we be checking something here?

test/Transforms/SLPVectorizer/AArch64/combine-insertelement.ll
168

Shouldn't we be checking something here?

180

Shouldn't we be checking something here?

mcrosier updated this object.May 18 2015, 6:08 AM
mcrosier added reviewers: t.p.northover, jmolloy.

I need ADCE to clean up some code left by SLPVectorizer, and ADCE is only run once in the whole compiling life time, so adding one pass is not a bad thing.

lib/Transforms/Vectorize/SLPVectorizer.cpp
3122

This is to work around compile warning, similar to existing code

3130

This is to work around compile warning, similar to existing code

3212

will fix that, thx

4176

Comments are before loop

4185

will do that, thanks.

4201

will fix that

4344

will fix that

test/Transforms/SLPVectorizer/AArch64/combine-extractelement.ll
130

This case is for future extension, I can remove that, but it does hurt to keep it here.

test/Transforms/SLPVectorizer/AArch64/combine-insertelement.ll
168

This case is for future extension, I can remove that, but it does hurt to keep it here.

180

This case is for future extension, I can remove that, but it does hurt to keep it here.

hulx2000 updated this object.
hulx2000 updated this object.
nadav edited edge metadata.Jun 2 2015, 12:57 PM

Hi Lawrence,

The SLP vectorizer already supports collecting trees that start at insertElement (see “findBuildVector”), and definitely supports trees that start at stores. It looks like you are adding special handling for these instructions just to work around the cost model, which is the wrong way of implementing vectorization of insert/extract instructions. Did you look into the code that calculates the cost of vector zext/sext?

-Nadav

Hi, Nadav:

Thanks for your comments.

This is a joint patch between I and Ana, Yes I noticed there are some codes for insertElement, however since it doesn't catch our case, I didn't check if it could be expanded.

Sorry for the late reply, I was asked to focus on a release feature, I will go to it and take a detailed look after this feature is done, hopefully this week or next week.

Thanks

Lawrence Hu

Hi, Nadav:

Very sorry to get back to you so late.

I did more investigation on existing code, for the following code example:

%1 = load i32, i32* %arrayidx1

%conv1 = zext i32 %1 to i64
%2 = load i32, i32* %arrayidx2
%conv2 = zext i32 %2 to i64
%x0 = insertelement <2 x i64> undef, i64 %conv1, i32 0
%x1 = insertelement <2 x i64> %x0, i64 %conv2, i32 1
ret <2 x i64> %x1

The existing logic will generate the following IRs (I have to by pass the cost function to get this ), which is not efficient, probably that's why the cost function doesn't allow it:

%1 = load i32, i32* %arrayidx1
%2 = load i32, i32* %arrayidx2
%3 = insertelement <2 x i32> undef, i32 %1, i32 0
%4 = insertelement <2 x i32> %3, i32 %2, i32 1
%5 = zext <2 x i32> %4 to <2 x i64>
%6 = extractelement <2 x i64> %5, i32 0
%x0 = insertelement <2 x i64> undef, i64 %6, i32 0
%7 = extractelement <2 x i64> %5, i32 1
%x1 = insertelement <2 x i64> %x0, i64 %7, i32 1
ret <2 x i64> %x1

However, the following IRs are more much efficient:

%1 = load i32, i32* %arrayidx1
%2 = load i32, i32* %arrayidx2

%3 = insertelement <2 x i32> undef, i32 %1, i32 0

%4 = insertelement <2 x i32> %3, i32 %2, i32 1

%5 = zext <2 x i32> %4 to <2 x i64>

That's what our patches do.

Because our code is for this particular pattern, and it generate much more efficient code, I would think keeping our code is a reasonable choice.

What do you think?

Regards

Lawrence Hu

Forgot to mention, I investigated why the cost function doesn't allow further processing: it is because the loads in my example are not from consecutive memory location, then gather operation is need, when NeedGather is true, the cost function won't allow further vectorization.

Hi Lawrence,

I haven't looked into this patch in details, but I have a couple of suggestions that would help further review:

  1. upload the patch with full context
  2. separate independent parts into different patches (e.g. adding ADCE pass after SLP is totally independent on the new stuff you implemented in SLP)
  3. Please describe the case you're working on in IR terms, not asm. SLP operates on IR level, so it's easier to grasp what transformation you're seeking if we use IR.

Thanks,
Michael

lib/Transforms/Vectorize/SLPVectorizer.cpp
4346

SmallPtrSet could be used here instead.

test/Transforms/SLPVectorizer/AArch64/combine-insertelement.ll
1

The new functionality in SLP should be tested independently on other passes. If you're also interested in outcome of subsequent ADCE, then you might want to add *another* test for ADCE (the output of SLP would be the input for ADCE).

45–47

I believe that wrapped line would be a syntax error.

86–87

Some line is missing here.

165

No reason to add this now - when in future you submit another patch with the extension, you'll be asked to add a testcase.

hulx2000 updated this object.
hulx2000 edited edge metadata.

ping

hulx2000 marked 19 inline comments as done.Sep 23 2015, 3:09 PM

At a high level, this transformation seems overly restrictive, and will need cost-modeling work. A couple of thoughts:

  1. I don't see why you're restricting this to extracts used by stores (or inserts fed by loads); if the goal is to save on [zs]ext instructions, then this seems profitable regardless of how these are used. Moreover, I don't understand why there's a hasOneUse() check on the [zs]ext instructions.
  2. The [zs]ext instructions that you're trying to eliminate might be free, at least in combination with the extract or insert, rendering this a bad idea. Consider the (unfortunately common) case where the target does not actually support a vector extract at all, and so it is lowered by storing the vector on the stack and then doing a scalar load of the requested element. In this case, if the target supports the corresponding scalar extending load, the extension is free. Likewise, for those [zs]ext fed by loads, these might be free if the target supports the corresponding extending load. Worse, the vector [zs]ext you're forming might not be legal at all (this is the most-serious potential problem).
lib/Transforms/Vectorize/SLPVectorizer.cpp
4215

embeeded -> embedded

4223

Why? This does not seem necessary. It seems as though this could be profitable for any Size >= 2*(number of underlying vector ext instructions).

Thanks Michael, just see your comments (not inline comments).

At a high level, this transformation seems overly restrictive, and will need cost-modeling work. A couple of thoughts:

  1. I don't see why you're restricting this to extracts used by stores (or inserts fed by loads); if the goal is to save on [zs]ext instructions, then this seems profitable regardless of how these are used. Moreover, I don't understand why there's a hasOneUse() check on the [zs]ext instructions.
  2. The [zs]ext instructions that you're trying to eliminate might be free, at least in combination with the extract or insert, rendering this a bad idea. Consider the (unfortunately common) case where the target does not actually support a vector extract at all, and so it is lowered by storing the vector on the stack and then doing a scalar load of the requested element. In this case, if the target supports the corresponding scalar extending load, the extension is free. Likewise, for those [zs]ext fed by loads, these might be free if the target supports the corresponding extending load. Worse, the vector [zs]ext you're forming might not be legal at all (this is the most-serious potential problem).

We are already doing these kind of optimizations in SelectionDAG. The SLPVectorizer is not the right place for this kind of transformation.

lib/Transforms/IPO/PassManagerBuilder.cpp
290

The SLP vectorizer should clean after itself. Is it not?

360

The SLP vectorizer should clean after itself. Is it not?

504

Why do we need ADCE here? the SLP vectorizer should clean up after itself. We already have DCE and CSE built into the SLP-vectorizer.

lib/Transforms/Vectorize/SLPVectorizer.cpp
68

Why do we need two flags for insert and extract? Do you feel like this feature is experimental?

Did you run some performance measurements on the llvm test suite? Are you seeing any wins?

3116

This part looks fine.

3203

What does the function return?

3209

Please document the functions below.

4092

What's going on here? Why do you need to zext/sext?

4121

Same comment as above. Why do you need to zext/sext?

4151

Is there a restriction on the placement of the insert_element instructions? Do they need to come from the same basic block?

4206

Please add more comments. I don't understand what's going on here.

hulx2000 marked 3 inline comments as done.Sep 25 2015, 2:40 PM

Just saw comments from Hal and Nadav.

For Hal's comments:

  1. If the original ext is used more than once, then the original ext can't be deleted after my transformation, so it may not gain anything, that's why I check hasOneUse() on it.
  2. I agree, this transformation is designed for AArch64, so I could make it AArch64 specific.

For Navav's comment "We are already doing these kind of optimizations in SelectionDAG. The SLPVectorizer is not the right place for this kind of transformation", do you mean I shouldn't do this (my) transformation in SLPVectorizer? At least for our case, SelectionDAG is unable to catch it, and it caused a performance loss.

For the rest of coding comments, I will address it with another patch update.

Thanks

lib/Transforms/Vectorize/SLPVectorizer.cpp
68

I can remove that two flags.

I did measure our internal benchmark, I did see wins, will run performance measurement on llvm test suite.

Just saw comments from Hal and Nadav.

For Hal's comments:

  1. If the original ext is used more than once, then the original ext can't be deleted after my transformation, so it may not gain anything, that's why I check hasOneUse() on it.

No, you'd replace them all with the corresponding extract. What am I missing?

  1. I agree, this transformation is designed for AArch64, so I could make it AArch64 specific.

For Navav's comment "We are already doing these kind of optimizations in SelectionDAG. The SLPVectorizer is not the right place for this kind of transformation", do you mean I shouldn't do this (my) transformation in SLPVectorizer? At least for our case, SelectionDAG is unable to catch it, and it caused a performance loss.

Why is it not able to catch it? We need to understand that before we move forward with adding handling in the SLP vectorizer for this.

apazos added a subscriber: apazos.Sep 25 2015, 4:13 PM

Hi folks,

Just want to clarify where this issue comes from:

  1. SROA will replace large allocas with vector SSA values.

E.g., alloca "short a [32]" is rewritten as 4 vectors of type <8 x i16> to avoid the load/stores to the stack-allocated variable.
This results in insert/extract instructions being generated in the IR code.

  1. The AArch64 backend is not able to combine scattered loads and stores with the insert/extract instructions to generate scalar/lane-based loads/stores in the presence of extension instructions.

Example 1: When there no extension/truncation of the loaded values we are fine, the backend generates optimized code.
x = ld
y = insert x v1, 1
Generates:
ld1 { v0.b }[1], [x0]

Example 2: But when extension instructions are present:
x = ld
y = ext x
z = insert y v1, 1
Generates:
ldrb w8, [x0]
ins v0.h[1], w8

However this is better code:
ld1 { v0.b }[1], [x0]
ushll v0.8h, v0.8b, #0

You notice it is better code when you have more than one insert instruction:
ldrb w8, [x0]
ldrb w9, [x1]
ins v0.h[1], w8
ins v0.h[5], w9
Better code would be:
ld1 { v0.b }[1], [x0]
ld1 { v0.b }[5], [x1]
ushll v0.8h, v0.8b, #0

The same is true for extract instructions:

umov  w8, v0.b[1]
umov  w9, v0.b[5]
strh   w8, [x0]
strh   w9, [x1]

Better code would be:

ushll v0.8h, v0.8b, #0
st1 { v0.h }[1], [x0]
st1 { v0.h }[5], [x1]

Therefore after SROA we need to detect these patterns in the IR and fix the IR code so the backend can generate the optimized instructions.

This should be done target-independent. Maybe it can be done in Inst Combine, or SLP vectorizer (as in this patch).

Even though it is SROA who is generating the insert/extract instructions, I do not think we should fix it there.

This is the problem Lawrence is trying to solve. Any other suggestion?

Just saw comments from Hal and Nadav.

For Hal's comments:

  1. If the original ext is used more than once, then the original ext can't be deleted after my transformation, so it may not gain anything, that's why I check hasOneUse() on it.

No, you'd replace them all with the corresponding extract. What am I missing?

If any of the original ext is used more than once, then it can't be deleted even though I will insert a vector ext instruction later, that may make this transformation not beneficial, of course a cost model would be better, but I didn't do it because the performance gain of this is not big, adding a complicate cost mode may not justify it.

  1. I agree, this transformation is designed for AArch64, so I could make it AArch64 specific.

For Navav's comment "We are already doing these kind of optimizations in SelectionDAG. The SLPVectorizer is not the right place for this kind of transformation", do you mean I shouldn't do this (my) transformation in SLPVectorizer? At least for our case, SelectionDAG is unable to catch it, and it caused a performance loss.

Why is it not able to catch it? We need to understand that before we move forward with adding handling in the SLP vectorizer for this.

I will have to investigate that, didn't know SelectionDAG can handle this until now.

jmolloy resigned from this revision.Dec 12 2015, 5:25 AM
jmolloy removed a reviewer: jmolloy.

Resigning from this - it's stale.