This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Avoid interleaved SIMD store instructions for Exynos
ClosedPublic

Authored by az on Sep 22 2017, 3:11 PM.

Details

Summary

Interleaved SIMD store instructions can be very costly on certain targets such as Exynos. We can break such an inefficient instruction into multiple cheaper instructions, after checking the latency of the replacement instructions.

For example, the instruction
st2 {v0.4s, v1.4s}, addr
can be replaced by
zip1 v2.4s, v0.4s, v1.4s
zip2 v3.4s, v0.4s, v1.4s
stp q2, q3, addr
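
For illustration, a minimal sketch of source code that typically lowers to such an st2 (using standard arm_neon.h intrinsics; the function name is made up):

#include <arm_neon.h>

// An interleaved store of two 4 x f32 vectors; the natural lowering is
// "st2 {v0.4s, v1.4s}, [x0]", the form this patch rewrites on Exynos.
void store_interleaved(float *p, float32x4x2_t v) {
  vst2q_f32(p, v);
}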

Diff Detail

Event Timeline

az created this revision.Sep 22 2017, 3:11 PM

Hi Abderrazek,

I've added a few comments, but ran out of time looking through the full patch.
Hopefully this is already useful to you.

Thanks!

Kristof

llvm/lib/Target/AArch64/AArch64VectorByElementOpt.cpp
71–74

In other parts, comments state "Certain SIMD instructions with vector element operand are not efficient."
In this enum, FMLA is used for that situation.
It seems that either the name of the enum value should be changed, or the comments be updated?

81–82

Maybe the comment should indicate why a limit must be placed on the maximum number of instructions replaced & why 10 is a good value?

93–110

I guess as is, the comment cannot be fully true anymore?
Either this checks also the "Interleave" patterns, or this optimization does spend some compile time on this pass for a Target that does not need this optimization?

94

The comment states "InstDescRep1", but the argument name in the code is "InstDescRepl"

160

s/InstDescRep1/InstDescRepl/?

kristof.beyls added inline comments.Oct 3 2017, 8:31 AM
llvm/lib/Target/AArch64/AArch64VectorByElementOpt.cpp
159–160

I guess there is a good reason why a vector cannot be used here instead of a pointer and a separate NumRepl argument?

172–173

This probably should be a SmallVector and the constant MaxNumRepl should disappear?

172–175

I think it's clearer if the above 2 loops were fused, making the cryptic array variable SCIdxRepl unnecessary.

175

Similarly, this probably needs to be a SmallVector?

186–191

My guess is that this logic would also be clearer if it would be fused with the previous loop populating SCDescRepl[i].

216–219

As pointed out earlier, a vector/SmallVector rather than an array would result in not needing to manually calculate and specify the length of the array here?

222

If this function used a switch to handle the different enum values, the assert here wouldn't be needed; just an assert or llvm_unreachable in the default case would do. I think that's more idiomatic in the LLVM code base.

520

typo

az updated this revision to Diff 118297.Oct 9 2017, 4:55 PM

Kristof, thank you for the feedback. Very useful comments for making the patch more readable and more maintainable.
I addressed your comments in the new version.

az marked 13 inline comments as done.Oct 9 2017, 4:59 PM

Hi Abderrazek,

Thanks for the new version, looks much better IMHO. I'm afraid I might not find time soon to look at all the details, but I've added a few more comments on style.

Thanks,

Kristof

llvm/lib/Target/AArch64/AArch64VectorByElementOpt.cpp
106–107

duplicated line in comment.

160–166

typos

222

According to the coding standard, sp should probably be SP?

592–654

I think this code might get easier to understand/maintain if the prepareStmtParam call happened after the switch, and in the switch, there would only be functionality where the different case statements actually differ. e.g. something like:

switch (...) {
  case A:
    Zip1 = ZIP1v16i8; Zip2 = ZIP2v16i8; RC = RC64; NumVec = FourVectors; break;
  case B:
    Zip1 = ZIP1v8i8; Zip2 = ZIP2v8i8; RC = RC64; NumVec = FourVectors; break;
  ...
}
prepareStmtParam(Zip1, Zip2, RC, NumVec);

Or maybe it would be even easier if this switch would be pulled apart into 4 switch statements, each setting one of Zip1/Zip2/RC/NumVec variables in the example above?
What do you think?

827–871

This looks like a lot of copy paste.
Couldn't this be rewritten to be a loop over the optimizers in this pass, something like (hand-wavy):

for (auto OptimizationKind : {VectorElem, Interleave}) {
...
}

?
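
A slightly more concrete version of that hand-wavy sketch (the driver shape and helper names are hypothetical, not the patch's actual code):

bool Changed = false;
for (auto OptimizationKind : {VectorElem, Interleave}) {
  for (MachineBasicBlock &MBB : MF)
    for (MachineInstr &MI : MBB)
      // A real implementation needs an iteration order that tolerates MI
      // being replaced by a new instruction sequence.
      Changed |= optimizeInstruction(MI, OptimizationKind);
}
return Changed;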

az updated this revision to Diff 118716.Oct 11 2017, 4:11 PM

Kristof, thanks again for the new feedback. Uploaded a new version.

az marked 3 inline comments as done.Oct 11 2017, 4:22 PM
az added inline comments.
llvm/lib/Target/AArch64/AArch64VectorByElementOpt.cpp
592–654

Went with the first suggestion. The code already has two switch statements to handle common and non-common code. I prefer not to subdivide further.

Hi Abderrazek,

Thanks for the new version, looks much better again.
I have made a few more specific remarks.
But my main question at this point is: did you benchmark this? What are the results of that? On which cores/benchmarks?

Thanks!

Kristof

llvm/lib/Target/AArch64/AArch64VectorByElementOpt.cpp
1

I feel like the name of this pass/file should be changed, as the name no longer covers what it does.
It seems the pass now is becoming about replacing certain SIMD patterns with longer but more efficient SIMD patterns.
AArch64SIMDOptimizer might fit also with some of the file names of other optimizer passes already in the AArch64 target?

74

s/unsupportedSub/UnsupportedSub/.
Do you need the UnsupportedSub value?

80

Do you need the unsupportedAcc value?

159–160

"Return true if replacement is expected to be faster" instead?
Also, I wouldn't duplicate the comments on both the function declarations and definitions - it's hard to keep them consistent.

160–166

"replacement"

221

It seems that "VectElement" isn't relevant in this function name anymore. Maybe it should become "shouldExitEarly" or something similar?

832–835

bool InstRewrite can be declared inside this for loop; no need to declare it higher up.

az updated this revision to Diff 118910.Oct 13 2017, 7:58 AM
az marked 7 inline comments as done.Oct 13 2017, 8:03 AM

But my main question at this point is: did you benchmark this? What are the results of that? On which cores/benchmarks?

As llvm is now getting better at generating interleaved memory access instructions, we are seeing a major degradation in a proprietary benchmark for Exynos because the latest llvm generates st2 instructions in a hot loop. I can refer you to the llvm file lib/Target/AArch64/AArch64SchedM1.td for the Exynos latencies (for example, st2 takes 8 cycles, while zip1, zip2, and stp each take 1 cycle).

Note that the other targets we tested did not in general have this problem. However, I noticed through unit testing that the ARM Cortex-A57 had this problem with st4 only (and not with st2), even though the A57 latencies in AArch64SchedA57.td do not indicate any problem and hence this pass does not affect A57 (somebody from ARM may need to double-check those latencies).

llvm/lib/Target/AArch64/AArch64VectorByElementOpt.cpp
1

I was planning to clean up the naming afterward in a separate patch, but I just did it now after seeing this comment. I changed class names, the pass name, variable names, etc. (as a result, more files with minor changes are included in the review related to changing the pass name). Note that I did not change the file name here but will do so when I commit the code. I chose AArch64SIMDInstrOpt instead of AArch64SIMDOptimizer on the assumption that SIMDOptimizer may confuse people into thinking that we may be doing some vectorization-related work. SIMDInstr conveys the idea that we are optimizing the generated SIMD instructions. Let me know if you have other ideas about this.

74

I used to have an assert statement that checked for invalid enum values at the top of the function (see the original review). Then I moved that error checking into the default case of the switch statement (with llvm_unreachable()) based on my understanding of one of your suggestions. That resulted in a compiler warning during the build about the default case being unnecessary since all values are covered. So, I added the Unsupported value to the enum. In this new review, I changed the names of the extra enum values in case that looks less confusing. While I prefer keeping them, I am perfectly fine with removing them along with the default case and any enum-related error checking.

In D38196#896905, @az wrote:

But my main question at this point is: did you benchmark this? What are the results of that? On which cores/benchmarks?

As llvm is now getting better at generating interleaved memory access instructions, we are seeing a major degradation in a proprietary benchmark for Exynos because the latest llvm generates st2 instructions in a hot loop. I can refer you to the llvm file lib/Target/AArch64/AArch64SchedM1.td for the Exynos latencies (for example, st2 takes 8 cycles, while zip1, zip2, and stp each take 1 cycle).

Thanks. I assume you confirmed there are no performance regressions at all in the test-suite?

Note that the other targets we tested did not in general have this problem. However, I noticed through unit testing that the ARM Cortex-A57 had this problem with st4 only (and not with st2), even though the A57 latencies in AArch64SchedA57.td do not indicate any problem and hence this pass does not affect A57 (somebody from ARM may need to double-check those latencies).

I had a very brief glance at the latencies specified for ST4 in AArch64SchedA57.td. They didn't seem wrong at a glance. There's a chance you saw some other effects? I'll try to find some time to check those instruction latencies for Cortex-A57 more closely.
More importantly, this potentially being an issue for ST4 only and not for ST2 on some cores highlights the need for AArch64SIMDInstrOpt::shouldExitEarly to not bail out early if it finds the optimization isn't useful for ST2; instead, it should also check the other optimizations (e.g. ST4) this pass can do.

Other than the above I don't think I've got other major concerns left-over.
Thanks very much for your prompt replies and actions!

llvm/lib/Target/AArch64/AArch64VectorByElementOpt.cpp
1

Thanks, I think AArch64SIMDInstrOpt is a reasonable name for this.

74

My feeling is that if the "NA/UnsupportedSub" enum value is only used to assert, it doesn't serve a practical purpose and therefore should be removed?

Rereading my previous review comment that made you add a default case: I agree there shouldn't be a default case.
(Furthermore, https://llvm.org/docs/CodingStandards.html#don-t-use-default-labels-in-fully-covered-switches-over-enumerations suggests not using "default" in switch statements when not necessary.)

Wouldn't the compiler warn when trying to set the Subpass enum to a value different from VectorElem/Interleave, and hence there isn't a need for asserts on this enum value?
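
For illustration, a fully-covered switch in that style might look like this (a sketch; the optimizeVectElement name is an assumption):

switch (SP) {
case VectorElem:
  return optimizeVectElement(MI);
case Interleave:
  return optimizeLdStInterleave(MI);
}
// Reached only if a new Subpass value is added without updating the switch;
// the compiler's -Wswitch warning flags the unhandled case, so no default
// label is needed.
llvm_unreachable("Unexpected subpass");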

az marked 2 inline comments as done.Oct 25 2017, 8:05 AM

Thanks. I assume you confirmed there are no performance regressions at all in the test-suite?

Yes, there is no regression in the llvm test-suite.

Note that the other targets we tested did not in general have this problem. However, I noticed through unit testing that the ARM Cortex-A57 had this problem with st4 only (and not with st2), even though the A57 latencies in AArch64SchedA57.td do not indicate any problem and hence this pass does not affect A57 (somebody from ARM may need to double-check those latencies).

I had a very brief glance at the latencies specified for ST4 in AArch64SchedA57.td. They didn't seem wrong at a glance. There's a chance you saw some other effects? I'll try to find some time to check those instruction latencies for Cortex-A57 more closely.

Unless there is something wrong with my experimental setup, I can confirm that I see a problem related to the ST4 latencies. I am now experimenting with timing the following one-line loop that has an st4 instruction in it:
#pragma nounroll
for (i = 0; i < 10000000; i++)
  vst4q_s32(ptr, st);

With my patch, there is no change to the st4 instruction under -mcpu=cortex-a57 because of the advertised latency. If I change my patch (by commenting out the calls to shouldExitEarly and shouldReplaceInstruction) so that we do not check latencies and always perform the replacement, I get a significant improvement.

For the ST2 experiment (using vst2q_s32()), my results are not conclusive: the numbers are close, but there is a chance that st2 has a small problem as well.
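
For reference, a self-contained sketch of that timing experiment (the harness around the loop is an assumption; only the inner loop comes from the discussion above):

#include <arm_neon.h>
#include <stdio.h>
#include <time.h>

int main() {
  int32_t buf[16] = {0};
  int32x4x4_t st = vld4q_s32(buf);   // four 128-bit vectors to store
  clock_t t0 = clock();
  #pragma nounroll
  for (int i = 0; i < 10000000; i++)
    vst4q_s32(buf, st);              // lowers to a single st4 without the patch
  clock_t t1 = clock();
  printf("%.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
  return 0;
}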

More importantly, this potentially being an issue for ST4 only and not for ST2 on some cores highlights the need for AArch64SIMDInstrOpt::shouldExitEarly to not bail out early if it finds the optimization isn't useful for ST2; instead, it should also check the other optimizations (e.g. ST4) this pass can do.

Agreed, but when the original pass was written, there was a concern from another target without this problem that we should have a minimal check before really entering the pass. So, I am doing the smallest check that shows the problem. I have no problem doing both checks, or doing the more expensive ST4 check only, so that it triggers the pass for ARM and Exynos. If you can get back to us soon about the ARM latencies, let's wait until then and we can decide what to have in the early-exit code.

Other than the above I don't think I've got other major concerns left-over.
Thanks very much for your prompt replies and actions!

Just a quick remark/question before I go off-line for the rest of today:

In D38196#906627, @az wrote:

More importantly, this potentially being an issue for ST4 only and not for ST2 on some cores highlights the need for AArch64SIMDInstrOpt::shouldExitEarly to not bail out early if it finds the optimization isn't useful for ST2; instead, it should also check the other optimizations (e.g. ST4) this pass can do.

Agreed, but when the original pass was written, there was a concern from another target without this problem that we should have a minimal check before really entering the pass. So, I am doing the smallest check that shows the problem. I have no problem doing both checks, or doing the more expensive ST4 check only, so that it triggers the pass for ARM and Exynos. If you can get back to us soon about the ARM latencies, let's wait until then and we can decide what to have in the early-exit code.

I thought the "shouldExitEarly" method aims to check if any of the transformations the pass can do can be profitable for the target at all.
My understanding is that this can be checked by just examining the instruction latencies, not having to look at the actual code being compiled at all. I assume that the saving of compile time comes from the fact that "shouldExitEarly" needs to do an amount of work proportional to the number of rewriting rules implemented, not proportional to the amount of code being compiled.
So, for the pass to work as expected, i.e. always run when at least one of the transformations is profitable for the target, I think shouldExitEarly should check all rewriting rules implemented.

Does that make sense, or am I misunderstanding something here?

az added a comment.Oct 25 2017, 10:14 AM

I thought the "shouldExitEarly" method aims to check if any of the transformations the pass can do can be profitable for the target at all.
My understanding is that this can be checked by just examining the instruction latencies, not having to look at the actual code being compiled at all. I assume that the saving of compile time comes from the fact that "shouldExitEarly" needs to do an amount of work proportional to the number of rewriting rules implemented, not proportional to the amount of code being compiled.
So, for the pass to work as expected, i.e. always run when at least one of the transformations is profitable for the target, I think shouldExitEarly should check all rewriting rules implemented.

Does that make sense, or am I misunderstanding something here?

Your understanding of ShouldExitEarly is accurate, and I agree that ideally we should put down all the rewriting rules. However, ShouldExitEarly is not very cheap, as it calls shouldReplaceInstruction. The question is whether we should put some heuristics into ShouldExitEarly so that we do not check all the rules, even though the number of rules is static and does not depend on the code. I currently check one rule and make it a representative of all (st2 in this case), which may be an over-simplification. Maybe I can check one representative of st2 and one of st4 (which both have many variants).

javed.absar added inline comments.Oct 26 2017, 1:58 AM
llvm/lib/Target/AArch64/AArch64VectorByElementOpt.cpp
184

nitpick. Please replace with a range-based for loop (unless there is a strong reason not to)

In D38196#906795, @az wrote:

I thought the "shouldExitEarly" method aims to check if any of the transformations the pass can do can be profitable for the target at all.
My understanding is that this can be checked by just examining the instruction latencies, not having to look at the actual code being compiled at all. I assume that the saving of compile time comes from the fact that "shouldExitEarly" needs to do an amount of work proportional to the number of rewriting rules implemented, not proportional to the amount of code being compiled.
So, for the pass to work as expected, i.e. always run when at least one of the transformations is profitable for the target, I think shouldExitEarly should check all rewriting rules implemented.

Does that make sense, or am I misunderstanding something here?

Your understanding of ShouldExitEarly is accurate, and I agree that ideally we should put down all the rewriting rules. However, ShouldExitEarly is not very cheap, as it calls shouldReplaceInstruction. The question is whether we should put some heuristics into ShouldExitEarly so that we do not check all the rules, even though the number of rules is static and does not depend on the code. I currently check one rule and make it a representative of all (st2 in this case), which may be an over-simplification. Maybe I can check one representative of st2 and one of st4 (which both have many variants).

Hmmmm.... this seems to be a tough one to find the right tradeoffs....
My personal opinions/observations:

  1. Making ShouldExitEarly stop the whole pass even if some of the optimizations would trigger is a hack that makes this pass operate in a non-obvious way and that will bite us in the future.
  2. I guess there's a chance even more peephole-y patterns will be added to this pass in the future, which would make ShouldExitEarly even more expensive to check for all rewriting rules implemented. But also, if that function would take short cuts and heuristics, the pass would operate in more surprising and non-obvious ways.

The above made me wonder if to keep compile-time under control, it wouldn't be better to integrate this in one of the existing frameworks for peepholes rather than in a pass of its own. And then I looked at D21571 where this pass got introduced and saw such approaches were probably tried and rejected in early version of that review.
Right now, my best guess of the easiest way forward is to make ShouldExitEarly compute the profitability of all rewrite rules implemented, and let it cache the results of that analysis for each subtarget it encounters. That way, the cost of this computation should be amortized out over all functions the pass will handle, resulting in not having too much impact on compile time?
Having said that, I don't know of another pass that has a caching mechanism like this. But I also haven't tried looking for such an example. I'm not sure if there's going to be a gotcha in trying to cache those results somehow.
What do you think?

Thanks,

Kristof

az updated this revision to Diff 120912.Oct 30 2017, 4:34 PM

A version that addresses the latest comments and in particular a more complete version of shouldExitEarly().

az marked 2 inline comments as done.Oct 30 2017, 4:36 PM
az added a comment.Oct 30 2017, 5:26 PM

Hmmmm.... this seems to be a tough one to find the right tradeoffs....
My personal opinions/observations:

  1. Making ShouldExitEarly stop the whole pass even if some of the optimizations would trigger is a hack that makes this pass operate in a non-obvious way and that will bite us in the future.
  2. I guess there's a chance even more peephole-y patterns will be added to this pass in the future, which would make ShouldExitEarly even more expensive to check for all rewriting rules implemented. But also, if that function would take short cuts and heuristics, the pass would operate in more surprising and non-obvious ways.

The above made me wonder if to keep compile-time under control, it wouldn't be better to integrate this in one of the existing frameworks for peepholes rather than in a pass of its own. And then I looked at D21571 where this pass got introduced and saw such approaches were probably tried and rejected in early version of that review.
Right now, my best guess of the easiest way forward is to make ShouldExitEarly compute the profitability of all rewrite rules implemented, and let it cache the results of that analysis for each subtarget it encounters. That way, the cost of this computation should be amortized out over all functions the pass will handle, resulting in not having too much impact on compile time?
Having said that, I don't know of another pass that has a caching mechanism like this. But I also haven't tried looking for such an example. I'm not sure if there's going to be a gotcha in trying to cache those results somehow.
What do you think?

Thanks,

Kristof

I put up a new version of the code that has all the rewriting rules in shouldExitEarly(). I left the old one, named it ShouldExitEarlyShort(), and added one rule to it representing ST4, just in case we go with it in the end (I will remove one of them during final review). You are right when you say that more patterns will be added in the future, even for this pass (I have not done st3 and post-increment stores yet). So, the list will at least double.

I have not looked into the caching mechanism you are suggesting yet. Do you think it is acceptable under the LLVM coding standards to have such a global data structure shared between function units? It would help tremendously, even though I think people with a subtarget that does not need this pass will mind all these rewriting rules even if they are checked only once.

az updated this revision to Diff 120994.Oct 31 2017, 9:34 AM

Made the caching mechanism for instruction replacement decisions work across functions (instead of within a function only) by declaring SIMDInstrTable as a global variable.

kristof.beyls added inline comments.Nov 1 2017, 12:49 AM
llvm/lib/Target/AArch64/AArch64VectorByElementOpt.cpp
154–157

Making this a global variable indeed very probably isn't acceptable.
However, I was thinking of making this a member variable of the AArch64SIMDInstrOpt MachineFunctionPass.
Wouldn't that still allow caching/reusing this table between different calls of "runOnMachineFunction"?
I still haven't had time to check if this pattern shows up anywhere else in any other Pass.

az updated this revision to Diff 121389.Nov 2 2017, 3:10 PM
az marked an inline comment as done.
kristof.beyls added inline comments.Nov 3 2017, 8:20 AM
llvm/lib/Target/AArch64/AArch64VectorByElementOpt.cpp
163–170

Is the only reason you need to drop const from shouldReplaceInst and other methods that you're using operator[] here to get to an element that's guaranteed to be in the map?
If so, wouldn't it be better to use syntax "*SIMDInstrTable.find(InstDesc->getOpcode())", so that the const can remain on all those methods?
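
A minimal sketch of that const-friendly lookup (the method name here is hypothetical):

bool AArch64SIMDInstrOpt::isProfitableCached(const MCInstrDesc *InstDesc) const {
  // find() can be called through a const map reference, unlike operator[],
  // so this method and its callers can keep their const qualifiers.
  auto Entry = SIMDInstrTable.find(InstDesc->getOpcode());
  assert(Entry != SIMDInstrTable.end() && "opcode is guaranteed to be cached");
  return Entry->second;
}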

333–353

From a logical point of view, this is a lot of code that repeats information encoded in a different way in the big switch statements in optimizeLdStInterleave.
It'll be easy to update one location in the future and forget the other location.
Couldn't the logic in the switch statements in optimizeLdStInterleave be reused here somehow, so that the information is only encoded once?

kristof.beyls added inline comments.Nov 3 2017, 9:25 AM
llvm/lib/Target/AArch64/AArch64VectorByElementOpt.cpp
71–73

The cache here will also need to take into account the TargetSubtargetInfo as the key into the cache.
Otherwise, wouldn't you get unexpected results for e.g. the following .ll file with a mixture of targeting different subtargets (see "target-cpu" function attributes)?

declare <2 x float> @llvm.fma.v2f32(<2 x float>, <2 x float>, <2 x float>)

define <2 x float> @test_vfma_lane_f32_a57(<2 x float> %a, <2 x float> %b, <2 x float> %v)
"target-cpu"="cortex-a57" {
entry:
  %lane = shufflevector <2 x float> %v, <2 x float> undef, <2 x i32> <i32 1, i32 1>
  %0 = tail call <2 x float> @llvm.fma.v2f32(<2 x float> %lane, <2 x float> %b, <2 x float> %a)
  ret <2 x float> %0
}

define <2 x float> @test_vfma_lane_f32_exynos(<2 x float> %a, <2 x float> %b, <2 x float> %v)
"target-cpu"="exynos-m1" {
entry:
  %lane = shufflevector <2 x float> %v, <2 x float> undef, <2 x i32> <i32 1, i32 1>
  %0 = tail call <2 x float> @llvm.fma.v2f32(<2 x float> %lane, <2 x float> %b, <2 x float> %a)
  ret <2 x float> %0
}

I have no idea how to get a stable key from TargetSubtargetInfo to use in this cache.

At this point, I'd measure the overhead of this pass e.g. using CTMark to make sure that the caching mechanism is actually needed before investing more time in it.

az updated this revision to Diff 121818.Nov 6 2017, 5:06 PM

Adding subtarget as part of key and simplifying some code.

az added inline comments.Nov 6 2017, 5:09 PM
llvm/lib/Target/AArch64/AArch64VectorByElementOpt.cpp
71–73

I am adding subtarget as part of key.

163–170

Yes, I was having trouble setting SIMDInstrTable[...] to true or false (a few lines below the line of code you mention above). Inside the const member functions, the compiler treats the map member variable I added as const as well.

333–353

Good suggestion to simplify that code. There are several ways to improve it, and here is one: store the rewriting rules in a table member variable of the class and use that table in both shouldExitEarly and optimizeLdStInterleave. Note that I have not changed optimizeLdStInterleave yet (I only used the new table in shouldExitEarly). I would like to make sure that this is the kind of change you originally had in mind, and if so, I will use it in optimizeLdStInterleave.

kristof.beyls added inline comments.Nov 7 2017, 6:57 AM
llvm/lib/Target/AArch64/AArch64VectorByElementOpt.cpp
71–73

Thanks! Could you also add a regression test similar to the LLVM-IR I wrote in my original comment about this to make sure this functionality is tested?

163–170

Thanks, that makes sense to me now.

235–246

I like the table-driven approach a lot better - the code is a lot cleaner, easier to read and easier to maintain.
However, wouldn't it be even cleaner to not have the code behave specially based on TwoVectors/FourVectors?
I guess this would mean spelling out the whole sequence of replacement instructions in the table, instead of using the 'for (unsigned j = 0; j < 4; j++)' loop here, and it would make the table larger?
I still guess that may be the right tradeoff to make, as it will make it easier to understand what the patch does from just reading the table.
And I guess it will also make it easier to later add support for other replacement patterns?
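
For illustration, a fully spelled-out rewrite-rule entry might take roughly this shape (the field names beyond OrigOpc/ReplOpc and the exact opcodes are a sketch, not the patch's contents):

struct InstReplInfo {
  unsigned OrigOpc;                // instruction to rewrite, e.g. AArch64::ST2Twov4s
  std::vector<unsigned> ReplOpc;   // complete replacement sequence, spelled out
  const TargetRegisterClass &RC;   // register class for the new virtual registers
};

// e.g. {AArch64::ST2Twov4s,
//       {AArch64::ZIP1v4i32, AArch64::ZIP2v4i32, AArch64::STPQi},
//       AArch64::FPR128RegClass}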

333–353

Yes, this is the kind of change I had in mind.
I didn't try to figure out if everything could be driven from such a table, but what you wrote in shouldExitEarly is roughly what I had in mind (with a follow-on nit-pick about making the code even more data-driven in my other comment on the new code).

az added inline comments.Nov 7 2017, 7:30 AM
llvm/lib/Target/AArch64/AArch64VectorByElementOpt.cpp
333–353

Thank you for the feedback. I will work on the table approach in a day or two. In the meantime, I will refer you to this abandoned patch https://reviews.llvm.org/D27738 where I tried to add a table-driven approach to the original pass. It was seen as adding complexity. The tables in this review are for a different purpose and I think that we should go with them, but I just wanted to let you know about the old attempt.

az updated this revision to Diff 122464.Nov 10 2017, 9:43 AM

Code reuse and simplification due to the use of a table that has instruction replacement info.

az marked an inline comment as done.Nov 10 2017, 9:49 AM
az added inline comments.
llvm/lib/Target/AArch64/AArch64VectorByElementOpt.cpp
71–73

Added your test as is (with a change to function names).

235–246

Removed the TwoVectors/FourVectors fields and enum.

I haven't gone through the whole patch line by line again, but I believe this is almost in a reasonable shape.
Could you do the compile-time benchmarking to ensure this doesn't increase compile time for targets not needing the rewriting?

llvm/lib/Target/AArch64/AArch64VectorByElementOpt.cpp
80

s/representd/represented/

87–92

Maybe "RuleST2" instead of "Rule1" and "RuleST4" instead of "Rule2" is a little bit more descriptive?

234–238

This still seems to be doing a bit of work, even though the information could already be cached from processing an earlier function.
I guess this could be improved by caching per Subpass PS, rather than per MCInstrDesc*?
I'm not sure if this is needed. Could you benchmark the compile-time impact as-is on e.g. CTMark, both when targeting Exynos (i.e. when shouldExitEarly returns false) and for another target (i.e. when shouldExitEarly returns true)?

az marked an inline comment as done.Nov 16 2017, 11:27 AM
az added inline comments.
llvm/lib/Target/AArch64/AArch64VectorByElementOpt.cpp
234–238

Here are some compile-time numbers I got by running CTMark. Every number represents the best of three runs. The first two columns are the ones that you asked for. The last two columns show the results when the whole pass is disabled. Even though there are some trends, such as the A57 compile-time numbers in column 2 being generally lower than the numbers in column 1, the difference is usually so small that it can be considered noise.

Let me know if you have any more comments.

Benchmark          Exynos-M1   Cortex-A57   Exynos-M1 (pass disabled)   Cortex-A57 (pass disabled)
7zip               123.7240    123.1800     122.6360                    122.6640
Bullet             82.6360     82.1200      82.0840                     82.3680
ClamAV             51.3680     51.1360      50.3680                     50.7920
consumer-typeset   34.4440     34.2800      34.3760                     34.3800
kimwitu++          34.3680     34.2480      34.6080                     34.2800
lencod             49.8160     49.7760      49.2880                     49.2560
mafft              23.0160     23.0360      23.1760                     23.0200
SPASS              42.2720     42.1440      41.8800                     41.7720
sqlite3            25.0840     25.1520      24.9920                     25.1200
tramp3d-v4         37.1520     37.4240      37.4560                     37.0240
kristof.beyls added inline comments.Nov 20 2017, 12:56 AM
llvm/lib/Target/AArch64/AArch64VectorByElementOpt.cpp
234–238

Thanks for the data!
I've added some geomean calculations:

Benchmark              Exynos-M1     Cortex-A57    Exynos-M1 (pass disabled)   Cortex-A57 (pass disabled)
7zip                   123.724       123.18        122.636                     122.664
Bullet                 82.636        82.12         82.084                      82.368
ClamAV                 51.368        51.136        50.368                      50.792
consumer-typeset       34.444        34.28         34.376                      34.38
kimwitu++              34.368        34.248        34.608                      34.28
lencod                 49.816        49.776        49.288                      49.256
mafft                  23.016        23.036        23.176                      23.02
SPASS                  42.272        42.144        41.88                       41.772
sqlite3                25.084        25.152        24.992                      25.12
tramp3d-v4             37.152        37.424        37.456                      37.024
GEOMEAN                44.14092425   44.06844708   43.9700723                  43.90935266
COMPARED TO DISABLED   100.39%       100.36%

So, compared to having the pass disabled, both when targeting Exynos-m1 and Cortex-A57, it seems this adds about 0.4% to the compile time.
If I've understood correctly from earlier comments, at the moment, for Cortex-A57, this pass isn't expected to make changes to generated code.
I guess that on this set of benchmarks, even for Exynos-m1, this pass may not make many, if any, changes?
If so, adding 0.4% compile time for not changing the program output seems a bit high to me.
@mcrosier : since you were concerned about compile time of this pass before, what do you think?

az marked 2 inline comments as done.Nov 21 2017, 9:35 AM

llvm/lib/Target/AArch64/AArch64VectorByElementOpt.cpp
234–238

0.4% may be a bit high, but I see it as being within the noise range (for example, on a few benchmarks, the difference between the low and the high runs is sometimes bigger than 0.4%). Having said that about the accuracy of the benchmark, I think that we should reduce the number of entries in the table used by the function shouldExitEarly, which determines whether we should go through the pass or not. In the original patch, I checked only 1 rewrite rule to enter the pass. The current version checks every rewrite rule. But some are redundant if we consider how the hardware in general implements these instructions. For example, for 128-bit st2, we check the rewrite rules and access latency for st2.4s, st2.8h, and st2.16b. I believe that we can check only one instead of all three, because the hardware implements these 3 instructions in a similar way. I reduced the table from 14 entries to just 4 and I am getting these numbers for cortex-a57 with and without the pass:

Benchmark          cortex-a57 (run1 run2 run3, best)        cortex-a57, pass disabled (run1 run2 run3, best)
7zip               122.1760 123.1520 123.2600, 122.17       122.4800 123.2520 123.7160, 122.48
Bullet             83.2120  82.6360  82.6160,  82.61        82.9200  82.6880  82.6880,  82.68
ClamAV             51.3040  51.6800  51.5520,  51.30        50.9000  51.1160  51.1320,  50.90
consumer-typeset   34.5240  34.4960  34.5600,  34.49        34.2480  34.5760  34.7560,  34.24
kimwitu++          34.5200  34.3800  34.4000,  34.38        34.6960  34.3600  34.6480,  34.36
lencod             50.1840  50.0480  50.1040,  50.10        49.9400  50.0160  49.8940,  49.89
mafft              23.1560  23.3640  23.4320,  23.15        23.3800  23.2920  23.3880,  23.29
SPASS              41.8960  42.0240  42.0440,  41.89        42.2440  42.0120  41.8920,  41.89
sqlite3            25.2840  25.1520  25.3520,  25.15        25.2520  25.3760  25.4280,  25.25
tramp3d-v4         39.1200  37.3160  37.4800,  37.31        37.6840  40.0840  37.5480,  37.54
Geomean            44.21158                                 44.12472
COMPARED TO DISABLED: 100.1968%

Let me know if you would like to see the patch for that.

kristof.beyls added inline comments.Nov 27 2017, 5:31 AM
llvm/lib/Target/AArch64/AArch64VectorByElementOpt.cpp
234–238

It's an option to try and reduce the number of instructions checked in the function shouldExitEarly, but I think that should only be considered as a last resort, as it makes the behaviour of the pass potentially "strange" if, on some strange micro-architecture in the future, rewriting rules should be applied differently for e.g. st2.4s, st2.8h and st2.16b. I agree that's unlikely, but not impossible.
I think one more thing to try first is to reduce the amount of work done in shouldExitEarly when the results should already be cached.
The code for shouldExitEarly at the moment is:

bool AArch64SIMDInstrOpt::shouldExitEarly(MachineFunction *MF, Subpass SP) {
  const MCInstrDesc* OriginalMCID;
  SmallVector<const MCInstrDesc*, MaxNumRepl> ReplInstrMCID;

  switch (SP) {
  case VectorElem:
    OriginalMCID = &TII->get(AArch64::FMLAv4i32_indexed);
    ReplInstrMCID.push_back(&TII->get(AArch64::DUPv4i32lane));
    ReplInstrMCID.push_back(&TII->get(AArch64::FMULv4f32));
    if (shouldReplaceInst(MF, OriginalMCID, ReplInstrMCID))
      return false;
    break;
  case Interleave:
    for (auto &I : IRT) {
      OriginalMCID = &TII->get(I.OrigOpc);
      for (auto &Repl : I.ReplOpc)
        ReplInstrMCID.push_back(&TII->get(Repl));
      if (shouldReplaceInst(MF, OriginalMCID, ReplInstrMCID))
        return false;
      ReplInstrMCID.clear();
    }
    break;
  }

  return true;
}

In the current implementation of this patch, even when all results are already cached, the for loop in "case Interleave" is still traversed completely, IIUC, as the caching of results happens inside shouldReplaceInst. I think it would be better to instead, or additionally, cache the results at the "shouldExitEarly" level, i.e. cache whether shouldExitEarly should return "true" or "false", so that it wouldn't have to traverse the for loop if on a previous invocation the answer was already "true". Of course, this needs to take into account the subtarget targeted by the machine function.
My guess is that this could have enough impact on compile time reduction that no other changes would be needed.
What do you think?
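
A minimal sketch of that extra level of caching (keying the cache by CPU name is an assumption; the InterlEarlyExit map member appears in a later diff):

bool AArch64SIMDInstrOpt::shouldExitEarly(MachineFunction *MF, Subpass SP) {
  (void)SP; // only the Interleave rules are shown here, for brevity
  std::string CPU = MF->getSubtarget().getCPU().str();
  auto Cached = InterlEarlyExit.find(CPU);
  if (Cached != InterlEarlyExit.end())
    return Cached->second;          // no rule traversal on later functions

  bool Exit = true;
  for (auto &I : IRT) {             // full traversal, once per subtarget
    SmallVector<const MCInstrDesc *, MaxNumRepl> ReplInstrMCID;
    for (auto &Repl : I.ReplOpc)
      ReplInstrMCID.push_back(&TII->get(Repl));
    if (shouldReplaceInst(MF, &TII->get(I.OrigOpc), ReplInstrMCID)) {
      Exit = false;
      break;
    }
  }
  InterlEarlyExit[CPU] = Exit;
  return Exit;
}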

az updated this revision to Diff 124837.Nov 29 2017, 3:40 PM
az marked an inline comment as done.

Added caching mechanism to save decisions per target of whether to go into the interleaved store replacement pass or not.

az marked 2 inline comments as done.Nov 29 2017, 4:19 PM
az added inline comments.
llvm/lib/Target/AArch64/AArch64VectorByElementOpt.cpp
234–238

It seems that the caching mechanism that saves the per-target decision of whether to enter the instruction replacement pass really improved compile time. Thank you for the idea. I am including one set of runs below where cortex-a57 with the pass is slightly worse in terms of compile time than cortex-a57 without the pass. To verify this, I did a few other sets of runs (not shown here) and the difference is usually very small, including one set of runs where "with the pass" outperforms "without the pass" in terms of compile time. It seems that the overhead of ShouldExitEarly is now really within the noise range.

Benchmark          cortex-a57 (run1 run2 run3, best)        cortex-a57, pass disabled (run1 run2 run3, best)
7zip               122.4480 123.0200 122.7160, 122.4480     123.0440 122.7360 123.5120, 122.7360
Bullet             82.3960  82.2520  83.1720,  82.2520      82.7040  82.2640  83.5840,  82.2640
ClamAV             51.6320  51.0560  50.6320,  50.6320      51.2920  51.3520  51.0040,  51.0040
consumer-typeset   34.1440  34.5600  34.3360,  34.1440      34.7760  33.9560  34.6080,  33.9560
kimwitu++          34.3360  34.4760  34.5440,  34.3360      34.9320  34.6400  34.5960,  34.5960
lencod             49.9000  49.3280  50.3160,  49.9000      49.6480  49.3800  50.0320,  49.3800
mafft              23.5080  23.2760  23.0200,  23.0200      23.3120  23.2120  23.6120,  23.2120
SPASS              42.3160  42.0920  42.3040,  42.0920      41.9560  42.2640  41.8040,  41.8040
sqlite3            25.3840  26.3680  26.2400,  25.3840      25.2080  25.2400  25.1760,  25.1760
tramp3d-v4         37.5680  37.0600  37.2520,  37.0600      37.3760  37.1960  37.1760,  37.1760
Geomean            44.00964                                 43.98918
Compared to disabled: 100.0465%
kristof.beyls accepted this revision.Dec 5 2017, 5:51 AM

Thanks for your patience with my long string of review comments.
With just the few remaining nitpicks addressed, I think this will have gotten to an acceptable state.

llvm/lib/Target/AArch64/AArch64VectorByElementOpt.cpp
74–77

std::unordered_map is probably sufficient and slightly faster than std::map?

238–239

It seems that the !InterlEarlyExit.empty() condition is not needed here?

This revision is now accepted and ready to land.Dec 5 2017, 5:51 AM
az added a comment.Dec 11 2017, 4:51 PM

Thank you for providing so much very useful feedback to improve this patch. In the future (not very soon), I am planning to make this pass more complete by 1) implementing the existing vector element sub-pass in the same way as the interleaved store instructions sub-pass, for consistency reasons (put in all the rewrite rules and not just one, caching, an instruction replacement table, etc.), and 2) addressing the ST3 instruction inefficiency, assuming I can find an efficient combination of instructions to replace ST3. I hope I can put you down as the reviewer.

I would also like to get your feedback on a couple of things:

  1. Is it worth replacing how code generation (i.e. building the new instructions) is implemented in this pass? It is currently hard-coded, but we could replace it with a table-driven approach similar to the one used in the analysis phase (see the IRT table). In particular, I need to find a simple way to express, through a table or other means, how the new instructions are built out of the old one and, more specifically, how the operands of the new instructions relate to those of the old one. This adds complexity to the implementation and understanding of the code generation, but makes adding more rewrite rules quite simple. I tried this before but the patch was not accepted; maybe I should bring it up again now that we have added several new rewrite patterns in this patch and expect to add more for ST3.
  2. I still need your investigation of the st4/st2 latencies on A57. Based on my unit test case (which is posted somewhere in the comments of this patch), st4 is inefficient for cortex-a57 and the replacement instructions in this patch perform better. If you or somebody from ARM can verify that, then the latency information in the A57 .td file is not accurate and we need to update it. My test for st2 is not conclusive as to whether st2 is better than the replacement instructions or not. Note that we have a few benchmarks that are compiled with cortex-a57 and we are not supposed to change that flag. It would be nice if the A57 .td file had accurate information.
az marked an inline comment as done.Dec 11 2017, 4:54 PM
az marked an inline comment as done.Dec 11 2017, 4:57 PM
In D38196#951927, @az wrote:

Thank you for providing so much very useful feedback to improve this patch. In the future (not very soon), I am planning to make this pass more complete by 1) implementing the existing vector element sub-pass in the same way as the interleaved store instructions sub-pass, for consistency reasons (put in all the rewrite rules and not just one, caching, an instruction replacement table, etc.), and 2) addressing the ST3 instruction inefficiency, assuming I can find an efficient combination of instructions to replace ST3. I hope I can put you down as the reviewer.

I would also like to get your feedback on a couple of things:

  1. Is it worth replacing how code generation (i.e. building the new instructions) is implemented in this pass? It is currently hard-coded, but we could replace it with a table-driven approach similar to the one used in the analysis phase (see the IRT table). In particular, I need to find a simple way to express, through a table or other means, how the new instructions are built out of the old one and, more specifically, how the operands of the new instructions relate to those of the old one. This adds complexity to the implementation and understanding of the code generation, but makes adding more rewrite rules quite simple. I tried this before but the patch was not accepted; maybe I should bring it up again now that we have added several new rewrite patterns in this patch and expect to add more for ST3.

I only have some hand-wavy high-level thoughts/opinions on this.
I see this pass, at a high level, to be about rewriting instruction sequences into sequences that are faster, when the info about instruction latency suggests that should be the case.
I would expect that this could also be beneficial for other, non-AArch64, targets. If that proves to be the case, and if there prove to be quite a few rewriting rules, I think the ideal end game is being able to represent potential rewriting rules per instruction set in some kind of TableGen syntax.
That would allow to keep the rewriting rules specified in a concise, easily understandable and maintainable way.

Going for that design right now seems premature to me: it isn't proven yet that there will be many useful such rewriting rules and it hasn't been proven yet that this is also useful for other instruction sets.
What the current implementation has proven, I think, is that it's possible to implement this functionality in a way such that it doesn't negatively impact compile time too much unnecessarily.

Would it be worth to try and take some steps when adding support for the ST3 pattern towards that hypothetical Tablegen end goal?
I'm not sure - it depends on how many more rewrite rules we expect will get added over time.
Do you happen to have any ideas of what other rewriting rules might plausibly be useful?
Or a method of somehow mining the latency information to somehow automatically suggest instruction sequences that might be worthwhile to consider for rewriting in a different way?

  2. I still need your investigation of the st4/st2 latencies on A57. Based on my unit test case (which is posted somewhere in the comments of this patch), st4 is inefficient for cortex-a57 and the replacement instructions in this patch perform better. If you or somebody from ARM can verify that, then the latency information in the A57 .td file is not accurate and we need to update it. My test for st2 is not conclusive as to whether st2 is better than the replacement instructions or not. Note that we have a few benchmarks that are compiled with cortex-a57 and we are not supposed to change that flag. It would be nice if the A57 .td file had accurate information.

As far as I know, the latency information for Cortex-A57 is correct in the .td file. There's a good chance that in your unit test you might be seeing the effect of a micro-architectural limiting factor other than latency, e.g. resource usage, that this pass doesn't take into account when making decisions.