This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
llvm/
-
lib/Target/AArch64/
-
Target/
-
AArch64/
-
AArch64.h
-
AArch64TargetMachine.cpp
33/54
AArch64VectorByElementOpt.cpp
-
test/CodeGen/AArch64/
-
CodeGen/
-
AArch64/
-
arm64-neon-2velem.ll
-
arm64-st1.ll

Differential D38196

[AArch64] Avoid interleaved SIMD store instructions for Exynos
ClosedPublic

Authored by az on Sep 22 2017, 3:11 PM.

Download Raw Diff

Details

Reviewers

rengolin
evandro
t.p.northover
mcrosier
kristof.beyls

Summary

Interleaved SIMD store instructions can be very costly on certain targets such as Exynos. For such instructions, we can break the inefficient instructions into multiple instructions after checking on the latency of the replacement instructions.

For example, the instruction
st2 {v0.4s, v1.4s}, addr
can be replaced by
zip1 v2.4s, v0.4s, v1.4s
zip2 v3.4s, v0.4s, v1.4s
stp q2, q3, addr

Diff Detail

Event Timeline

az created this revision.Sep 22 2017, 3:11 PM

Herald added subscribers: hiraditya, kristof.beyls, javed.absar, aemerson. · View Herald TranscriptSep 22 2017, 3:11 PM

az added a subscriber: llvm-commits.Sep 22 2017, 3:21 PM

ping

Hi Abderrazek,

I've added a few comments, but ran out of time looking through the full patch.
Hopefully this is already useful to you.

Thanks!

Kristof

llvm/lib/Target/AArch64/AArch64VectorByElementOpt.cpp
70–73	In other parts, comments state "Certain SIMD instructions with vector element operand are not efficient." In this enum, FMLA is used for that situation. It seems that either the name of the enum value should be changed, or the comments be updated?
80–81	Maybe the comment should indicate why a limit must be placed on the maximum number of instructions replaced & why 10 is a good value?
153–170	I guess as is, the comment cannot be fully true anymore? Either this checks also the "Interleave" patterns, or this optimization does spend some compile time on this pass for a Target that does not need this optimization?
154	The comment states "InstDescRep1", but the argument name in the code is "InstDescRepl"
217	s/InstDescRep1/InstDescRepl/?

kristof.beyls added inline comments.Oct 3 2017, 8:31 AM

llvm/lib/Target/AArch64/AArch64VectorByElementOpt.cpp
216–217	I guess there is a good reason why a vector cannot be used here instead of a pointer and a separate NumRepl argument?
231–232	This probably should be a SmallVector and the constant MaxNumRepl should disappear?
231–234	I think it's clearer if the above 2 loops would be fused, making the cryptic variable array variable SCIdxRepl unnecessary.
234	Similarly, this probably needs to be a SmallVector?
245–250	My guess is that this logic would also be clearer if it would be fused with the previous loop populating SCDescRepl[i].
275–278	As pointed out earlier, a vector/SmallVector rather than an array would result in not needing to manually calculate and specify the length of the array here?
281	if this function would us a switch to handle different enum values, the assert here wouldn't be needed; having just an assert or llvm_unreachable on the default case. I think that's more idiomatic in the LLVM code base.
477	typo

Kristof, Thank you for the feedback. Very useful comments to make the patch more readable and more maintainable.
I addressed your comments with the new version.

az marked 13 inline comments as done.Oct 9 2017, 4:59 PM

Hi Abderrazek,

Thanks for the new version, looks much better IMHO. I'm afraid I might not find time soon to look at all the details, but I've added a few more comments on style.

Thanks,

Kristof

llvm/lib/Target/AArch64/AArch64VectorByElementOpt.cpp
166–167	duplicated line in comment.
217–223	typos
281	According to the coding standard, sp should probably be SP?
549–611	I think this code might get easier to understand/maintain if the `prepareStmtParam` call happened after the switch, and in the switch, there would only be functionality where the different case statements actually differ. e.g. something like: switch (...) { case A: Zip1 = Zip1v16i8; Zip2=ZIP2v16i8; RC=RC64; NumVec = FourVectors; case B: Zip1 = Zip1v8i8; Zip2=ZIP2v8i8; RC=RC64; NumVec = FourVectors; ... } prepareStmtParam(Zip1, Zip2, RC, NumVec) Or maybe it would be even easier if this switch would be pulled apart into 4 switch statements, each setting one of Zip1/Zip2/RC/NumVec variables in the example above? What do you think?
704–748	This looks like a lot of copy paste. Couldn't this be rewritten to be a loop over the optimizers in this pass, something like (hand-wavy): for (auto OptimizationKind : {VectorElem, Interleave}) { ... } ?

Kristof, thanks again for the new feedback. Uploaded a new version.

az marked 3 inline comments as done.Oct 11 2017, 4:22 PM

az added inline comments.

llvm/lib/Target/AArch64/AArch64VectorByElementOpt.cpp
549–611	Went with the first suggestion. The code already has two switch statements to handle common code and non-common code. Prefer not subdivide more.

Hi Abderrazek,

Thanks for the new version, looks much better again.
I have made a few more point remarks.
But my main question at this point is: did you benchmark this? What are the results of that? On which cores/benchmarks?

Thanks!

Kristof

llvm/lib/Target/AArch64/AArch64VectorByElementOpt.cpp
0–1	I feel like the name of this pass/file should be changed, as the name no longer covers what it does. It seems the pass now is becoming about replacing certain SIMD patterns with longer but more efficient SIMD patterns. AArch64SIMDOptimizer might fit also with some of the file names of other optimizer passes already in the AArch64 target?
73	s/unsupportedSub/UnsupportedSub/. Do you need the UnsupportedSub value?
79	Do you need the unsupportedAcc value?
216–217	"Return true if replacement is expected to be faster" instead? Also, I wouldn't duplicate the comments on both the function declarations and definitions - it's hard to keep them consistent.
217–223	"replacement"
280	It seems that "VectElement" isn't relevant in this function name anymore. Maybe it should become "shouldExitEarly" or something similar?
709–712	bool InstRewrite can be declared inside this for loop; no need to declare it higher up.

az updated this revision to Diff 118910.Oct 13 2017, 7:58 AM

But my main question at this point is: did you benchmark this? What are the results of that? On which cores/benchmarks?

As llvm is now getting better at generating interleaved memory access instructions, we are seeing a major degradation in a proprietary benchmark for Exynos because latest llvm generates st2 instructions in a hot loop. I can refer you to the llvm file lib/Target/AArch64/AArch64SchedM1.td for Exynos latencies (example: st2 takes 8 cycles, while zip1, zip2, and stp each take 1 cycle).

Note that other targets that we tested did not in general have this problem. However, I noticed through unit testing that ARM A57 had this problem with st4 only (and not with st2) even though A57 latencies in AArch64SchedA57.td do not indicate any problem and hence this pass does not affect A57 (somebody from ARM may need to double check on those latency).

llvm/lib/Target/AArch64/AArch64VectorByElementOpt.cpp
0–1	I was planning to clean the naming afterward in a separate patch but I just did it now after seeing this Comment. I Changed classes names, pass name, variable name, etc. (as a result, more files with minor changes are included in the review related to changing the pass name). Note that I did not change the file name in here but will do when I commit the code. I chose AArch64SIMDInstrOpt instead of AArch64SIMDOptimizer based on teh assumption that SIMDOptimizer may be confuse people into thinking that we may be doing some vectorization related work. SIMDInstr may convey the idea that we are doing optimization on the generated SIMD instructions. Let me know if you have other ideas about this.
73	I used to have an assert statement to look for enum value error at the top of the function (see original review). Then, I moved that error checking into the default case of the switch statement (with llvm_unreachable()) based on my understanding of one of your suggestions. That resulted in a compiler warning during the build related to the no need for the default case as all values are covered. So, I added the Unsupported in enum. In this new review, I changed the names of the extra enum value just in case it looks less confusing. While I prefer that, I am perfectly fine with removing them along with the default case and any enum related error checking.

In D38196#896905, @az wrote:

But my main question at this point is: did you benchmark this? What are the results of that? On which cores/benchmarks?

As llvm is now getting better at generating interleaved memory access instructions, we are seeing a major degradation in a proprietary benchmark for Exynos because latest llvm generates st2 instructions in a hot loop. I can refer you to the llvm file lib/Target/AArch64/AArch64SchedM1.td for Exynos latencies (example: st2 takes 8 cycles, while zip1, zip2, and stp each take 1 cycle).

Thanks. I assume you confirmed there are no performance regressions at all in the test-suite?

Note that other targets that we tested did not in general have this problem. However, I noticed through unit testing that ARM A57 had this problem with st4 only (and not with st2) even though A57 latencies in AArch64SchedA57.td do not indicate any problem and hence this pass does not affect A57 (somebody from ARM may need to double check on those latency).

I had a very brief glance and the latencies specified for ST4 in AArch64SchedA57.td. They didn't seem wrong at a glance. There's a chance you saw some other effects? I'll try to find some time to check those instruction latencies for Cortex-A57 more closely.
More importantly this potentially being an issue on cores for ST4 only and not ST2 highlights the need for AArch64SIMDInstrOpt::shouldExitEarly to not bail out early if it finds the optimization isn't useful for ST2, but instead it should also check the other optimizations (e.g. ST4) this pass can do.

Other than the above I don't think I've got other major concerns left-over.
Thanks very much for your prompt replies and actions!

llvm/lib/Target/AArch64/AArch64VectorByElementOpt.cpp
0–1	Thanks, I think AArch64SIMDInstrOpt is a reasonable name for this.
73	My feel is that if the "NA/UnsupportedSub" enum value is only used to assert, it doesn't serve a practical purpose and therefore it should be removed? Rereading my previous review comment that made you add a default case: I agree there shouldn't be a default case (Furthermore, https://llvm.org/docs/CodingStandards.html#don-t-use-default-labels-in-fully-covered-switches-over-enumerations suggests not using "default" in switch statements when not necessary.) Wouldn't the compiler warn when trying to set the Subpass enum to a value different from VectorElem/Interleave, and hence there isn't a need for asserts on this enum value?

Thanks. I assume you confirmed there are no performance regressions at all in the test-suite?

Yes, there is no regression in the llvm test-suite.

Note that other targets that we tested did not in general have this problem. However, I noticed through unit testing that ARM A57 had this problem with st4 only (and not with st2) even though A57 latencies in AArch64SchedA57.td do not indicate any problem and hence this pass does not affect A57 (somebody from ARM may need to double check on those latency).

I had a very brief glance and the latencies specified for ST4 in AArch64SchedA57.td. They didn't seem wrong at a glance. There's a chance you saw some other effects? I'll try to find some time to check those instruction latencies for Cortex-A57 more closely.

Unless there is something wrong with my experimental setup, I am confirming that I see a problem related to the ST4 latencies. I am experimenting now with timing the following one-line loop that has an st4 instruction in it:
#pragma nounroll
for (i=0; i<10000000; i++)

vst4q_s32 (ptr, st);

With my patch, no change to the st4 instruction under -mcpu=cortex-a57 because of the advertised latency. If I change my patch (by commenting calls to shouldExitEarly and shouldReplaceInstruction) so that we do not check latencies and do always the replacement, I get significant improvement.

For the ST2 experiment (using vst2q_s32()), my results are not conclusive given that results are close but there is chance that this has a small problem as well.

More importantly this potentially being an issue on cores for ST4 only and not ST2 highlights the need for AArch64SIMDInstrOpt::shouldExitEarly to not bail out early if it finds the optimization isn't useful for ST2, but instead it should also check the other optimizations (e.g. ST4) this pass can do.

Agree but when the original pass is written, there is a concern from another target without the problem that we should a have a minimal check before really entering the pass. So, I am doing the smallest check that shows the problem. I have no problem doing both checks or doing the more expensive ST4 check only so that it can trigger the pass for ARM and Exynos. In case you are getting back to us soon about the ARM latency, then let's wait until then and we can then decide what to have in early exit code.

Other than the above I don't think I've got other major concerns left-over.
Thanks very much for your prompt replies and actions!

Just a quick remark/question before I go off-line for the rest of today:

In D38196#906627, @az wrote:

More importantly this potentially being an issue on cores for ST4 only and not ST2 highlights the need for AArch64SIMDInstrOpt::shouldExitEarly to not bail out early if it finds the optimization isn't useful for ST2, but instead it should also check the other optimizations (e.g. ST4) this pass can do.

Agree but when the original pass is written, there is a concern from another target without the problem that we should a have a minimal check before really entering the pass. So, I am doing the smallest check that shows the problem. I have no problem doing both checks or doing the more expensive ST4 check only so that it can trigger the pass for ARM and Exynos. In case you are getting back to us soon about the ARM latency, then let's wait until then and we can then decide what to have in early exit code.

I thought the "shouldExitEarly" method aims to check if any of the transformations the pass can do can be profitable for the target at all.
My understanding is that this can be checked by just examining the instruction latencies, not having to look at the actual code being compiled at all. I assume that the saving of compile time comes from the fact that "shouldExitEarly" needs to do an amount of work proportional to the number of rewriting rules implemented, not proportional to the amount of code being compiled.
So, for the pass to work as expected, i.e. always run when for the target at least one of the transformations is profitable, I think shouldExitEarly should check all rewriting rules implemented.

Does that make sense, or am I misunderstanding something here?

I thought the "shouldExitEarly" method aims to check if any of the transformations the pass can do can be profitable for the target at all.
My understanding is that this can be checked by just examining the instruction latencies, not having to look at the actual code being compiled at all. I assume that the saving of compile time comes from the fact that "shouldExitEarly" needs to do an amount of work proportional to the number of rewriting rules implemented, not proportional to the amount of code being compiled.
So, for the pass to work as expected, i.e. always run when for the target at least one of the transformations is profitable, I think shouldExitEarly should check all rewriting rules implemented.

Does that make sense, or am I misunderstanding something here?

Your understanding of ShouldExitEarly is accurate and I agree that ideally we should put down all rewriting rules. However, ShouldExitEarly is not very cheap as it calls shouldReplaceInstruction. The question is should we put some heuristics into ShouldExitEarly so that we do not check for all rules even though the number of rules is static and does not depend on code? I currently check for one rule and make it a representative of all (st2 in this case) and this may be over-simplification. Maybe, I can check for one representative of st2 and one for st4 (which both have many variants).

javed.absar added inline comments.Oct 26 2017, 1:58 AM

llvm/lib/Target/AArch64/AArch64VectorByElementOpt.cpp
243	nitpick. Please replace with range_loop (unless there is a strong reason not to)

In D38196#906795, @az wrote:

I thought the "shouldExitEarly" method aims to check if any of the transformations the pass can do can be profitable for the target at all.
My understanding is that this can be checked by just examining the instruction latencies, not having to look at the actual code being compiled at all. I assume that the saving of compile time comes from the fact that "shouldExitEarly" needs to do an amount of work proportional to the number of rewriting rules implemented, not proportional to the amount of code being compiled.
So, for the pass to work as expected, i.e. always run when for the target at least one of the transformations is profitable, I think shouldExitEarly should check all rewriting rules implemented.

Does that make sense, or am I misunderstanding something here?

Your understanding of ShouldExitEarly is accurate and I agree that ideally we should put down all rewriting rules. However, ShouldExitEarly is not very cheap as it calls shouldReplaceInstruction. The question is should we put some heuristics into ShouldExitEarly so that we do not check for all rules even though the number of rules is static and does not depend on code? I currently check for one rule and make it a representative of all (st2 in this case) and this may be over-simplification. Maybe, I can check for one representative of st2 and one for st4 (which both have many variants).

Hmmmm.... this seems to be a tough one to find the right tradeoffs....
My personal opinions/observations:

Making ShouldExitEarly stop the whole pass even if some of the optimizations would trigger is a hack that makes this pass operate in a non-obvious way and that will bite us in the future.
I guess there's a chance even more peephole-y patterns will be added to this pass in the future, which would make ShouldExitEarly even more expensive to check for all rewriting rules implemented. But also, if that function would take short cuts and heuristics, the pass would operate in more surprising and non-obvious ways.

The above made me wonder if to keep compile-time under control, it wouldn't be better to integrate this in one of the existing frameworks for peepholes rather than in a pass of its own. And then I looked at D21571 where this pass got introduced and saw such approaches were probably tried and rejected in early version of that review.
Right now, my best guess of the easiest way forward is to make ShouldExitEarly compute the profitability of all rewrite rules implemented, and let it cache the results of that analysis for each subtarget it encounters. That way, the cost of this computation should be amortized out over all functions the pass will handle, resulting in not having too much impact on compile time?
Having said that, I don't know of another pass that has a caching mechanism like this. But I also haven't tried looking for such an example. I'm not sure if there's going to be a gotcha in trying to cache those results somehow.
What do you think?

Thanks,

Kristof

A version that addresses the latest comments and in particular a more complete version of shouldExitEarly().

az marked 2 inline comments as done.Oct 30 2017, 4:36 PM

Hmmmm.... this seems to be a tough one to find the right tradeoffs....
My personal opinions/observations:

Making ShouldExitEarly stop the whole pass even if some of the optimizations would trigger is a hack that makes this pass operate in a non-obvious way and that will bite us in the future.

I guess there's a chance even more peephole-y patterns will be added to this pass in the future, which would make ShouldExitEarly even more expensive to check for all rewriting rules implemented. But also, if that function would take short cuts and heuristics, the pass would operate in more surprising and non-obvious ways.

The above made me wonder if to keep compile-time under control, it wouldn't be better to integrate this in one of the existing frameworks for peepholes rather than in a pass of its own. And then I looked at D21571 where this pass got introduced and saw such approaches were probably tried and rejected in early version of that review.
Right now, my best guess of the easiest way forward is to make ShouldExitEarly compute the profitability of all rewrite rules implemented, and let it cache the results of that analysis for each subtarget it encounters. That way, the cost of this computation should be amortized out over all functions the pass will handle, resulting in not having too much impact on compile time?
Having said that, I don't know of another pass that has a caching mechanism like this. But I also haven't tried looking for such an example. I'm not sure if there's going to be a gotcha in trying to cache those results somehow.
What do you think?

Thanks,

Kristof

I put a new version of the code that has all rewriting rules in shouldExitEarly(). I left the old one and name it ShouldExitEarlyShort() and added one rule to it representing ST4 just in case we go with it by the end (will remove one of them during final review). You are are right when you say that more patterns will be added in the future even for this pass (I have not done st3, and post increment store yet). So, the list will at least double.

I have not looked into the caching mechanism you are suggesting yet. Do you think it is acceptable coding standard for llvm to have such global data structure used between function units? It would help tremendously even though I think people with a sub-target that do not have a need for this pass will mind all these rewriting rules even if called one time.

Made caching mechanism for instruction replacement decisions to work across functions (instead of within function only) by declaring SIMDInstrTable as a global variable.

kristof.beyls added inline comments.Nov 1 2017, 12:49 AM

llvm/lib/Target/AArch64/AArch64VectorByElementOpt.cpp
211–214	Making this a global variable indeed very probably isn't acceptable. However, I was thinking to have this be a member variable of the AArch64SIMDInstrOpt MachineFunctionPass. Wouldn't that still allow to cache/reuse this table between different calls of "runOnMachineFunction"? I still haven't had time to check if this pattern shows up anywhere else in any other Pass.

az updated this revision to Diff 121389.Nov 2 2017, 3:10 PM

az marked an inline comment as done.

kristof.beyls added inline comments.Nov 3 2017, 8:20 AM

llvm/lib/Target/AArch64/AArch64VectorByElementOpt.cpp
220–227	Is the only reason you need to drop const from shouldReplaceInst and other methods that you're using operator[] here to get to an element that's guaranteed to be in the map? If so, wouldn't it be better to use syntax "*SIMDInstrTable.find(InstDesc->getOpcode())", so that the const can remain on all those methods?
392–412	From a logical point of view, this is a lot of code that repeats information encoded in a different way in the big switch statements in optimizeLdStInterleave. It'll be easy to update one location in the future and forget the other location. Couldn't the logic in the switch statements in optimizeLdStInterleave be reused here somehow, so that the information is only encoded once?

kristof.beyls added inline comments.Nov 3 2017, 9:25 AM

llvm/lib/Target/AArch64/AArch64VectorByElementOpt.cpp

70–151

The cache here will also need to take into account the TargetSubtargetInfo as the key into the cache.
Otherwise, wouldn't you get unexpected results for e.g. the following .ll file with a mixture of targeting different subtargets (see "target-cpu" function attributes)?

declare <2 x float> @llvm.fma.v2f32(<2 x float>, <2 x float>, <2 x float>)

define <2 x float> @test_vfma_lane_f32_a57(<2 x float> %a, <2 x float> %b, <2 x float> %v)
"target-cpu"="cortex-a57" {
entry:
  %lane = shufflevector <2 x float> %v, <2 x float> undef, <2 x i32> <i32 1, i32 1>
  %0 = tail call <2 x float> @llvm.fma.v2f32(<2 x float> %lane, <2 x float> %b, <2 x float> %a)
  ret <2 x float> %0
}

define <2 x float> @test_vfma_lane_f32_exynos(<2 x float> %a, <2 x float> %b, <2 x float> %v)
"target-cpu"="exynos-m1" {
entry:
  %lane = shufflevector <2 x float> %v, <2 x float> undef, <2 x i32> <i32 1, i32 1>
  %0 = tail call <2 x float> @llvm.fma.v2f32(<2 x float> %lane, <2 x float> %b, <2 x float> %a)
  ret <2 x float> %0
}

I have no idea how to get a stable key from TargetSubtargetInfo to use in this cache.

At this point, I'd measure the overhead of this pass e.g. using CTMark to make sure that the caching mechanism is actually needed before investing more time in it.

Adding subtarget as part of key and simplifying some code.

az added inline comments.Nov 6 2017, 5:09 PM

llvm/lib/Target/AArch64/AArch64VectorByElementOpt.cpp
70–151	I am adding subtarget as part of key.
220–227	Yes, I was having trouble setting SIMDInstrTable[...] to true or false (few lines below the line of code you mention above). It thinks that the map member variable I added to the const class has the constant modifier too.
392–412	Good suggestion to simplify that code. There are several ways to improve it and here is one way: store the re-writing rules in a table member variable of the class and use that table in shouldExitEarly and in optimizeLdStInterleave. Note that I have not changed optimizeLdStInterleave yet (only used new table in shouldExitEarly). I would like to make sure that what I did is the kind of changes you had originally in mind and if so, then I will use it in optimizeLdStInterleave.

kristof.beyls added inline comments.Nov 7 2017, 6:57 AM

llvm/lib/Target/AArch64/AArch64VectorByElementOpt.cpp
70–151	Thanks! Could you also add a regression test similar to the LLVM-IR I wrote in my original comment about this to make sure this functionality is tested?
220–227	Thanks, that makes sense to me now.
294–305	I like the table-driven approach a lot better - the code is a lot cleaner, easier to read and easier to maintain. However, wouldn't it be even cleaner to not let the code behave special based on TwoVectors/FourVectors? I guess this would mean spelling out the whole sequence of replacement instructions in the table, instead of using the 'for (unsigned j=0; j<4, j++)' loop here, and it would make the table larger? I still guess that may be the right tradeoff to make, as it will make it easier to understand what the patch does from just reading the table. And I guess it will also make it easier to later add support for other replacement patterns?
392–412	Yes, this is the kind of change I had in mind. I didn't try to figure out if everything could be driven from such a table, but what you wrote in shouldExitEarly is roughly what I had in mind (with a follow-on nit-pick about making the code even more data-driven in my other comment on the new code).

az added inline comments.Nov 7 2017, 7:30 AM

llvm/lib/Target/AArch64/AArch64VectorByElementOpt.cpp
392–412	Thank you for the feedback. I will work on the table approach in a day or two. In the meantime, I will refer you to this abandoned patch https://reviews.llvm.org/D27738 where I tried to add a table driven approach to the original pass. It was seen as adding complexity. The tables in this review are for different purpose and I think that we should go with it but I just want to let you about the old attempt.

Code reuse and simplification due to the use of a table that has instruction replacement info.

az marked an inline comment as done.Nov 10 2017, 9:49 AM

az added inline comments.

llvm/lib/Target/AArch64/AArch64VectorByElementOpt.cpp
70–151	Added your test as is (with a change to function names).
294–305	Removed the TwoVectors/FourVectors fields and enum.

I haven't gone through the whole patch line by line again, but I believe this is almost in a reasonable shape.
Could you do the compilation time benchmarking to ensure this doesn't increase it for target not needing the rewriting?

llvm/lib/Target/AArch64/AArch64VectorByElementOpt.cpp
79	s/representd/represented/
86–91	Maybe "RuleST2" instead of "Rule1" and "RuleST4" instead of "Rule2" is a little bit more descriptive?
293–297	This still seems to be doing a bit of work, even though the information could already be cached from processing an earlier function. I guess this could be improved by caching per Subpass PS, rather than per MCInstrDesc*? I'm not sure if this is needed. Could you benchmark the compile-time impact as is on e.g. CTMark both when targeting Exynos (i.e. when shouldExitEarly returns false), and for another target (i.e. when shouldExitEarly returns true).

az marked an inline comment as done.Nov 16 2017, 11:27 AM

az added inline comments.

llvm/lib/Target/AArch64/AArch64VectorByElementOpt.cpp

293–297

Here are some compile time numbers I got by running CTMark. Every number represents the best of three runs. The first two columns are the ones that you asked for. The last two columns show the results when the whole pass is disabled. Even though there are some trends such as the A57 compile time numbers in column2 are in general lower than the numbers in column1, the difference is usually small that it can be considered noise.

Let me know if you have any more comments.

Benchmark Exynos-m1 cortex-a57 exynos-m1 (pass disabled) cortex-a57 (pass disabled)

7zip 123.7240 123.1800 122.6360 122.6640

Bullet 82.6360 82.1200 82.0840 82.3680

ClamAV 51.3680 51.1360 50.3680 50.7920

consumer-typeset 34.4440 34.2800 34.3760 34.3800

kimwitu++ 34.3680 34.2480 34.6080 34.2800

lencod 49.8160 49.7760 49.2880 49.2560

mafft 23.0160 23.0360 23.1760 23.0200

SPASS 42.2720 42.1440 41.8800 41.7720

sqlite3 25.0840 25.1520 24.9920 25.1200

tramp3d-v4 37.1520 37.4240 37.4560 37.0240

kristof.beyls added inline comments.Nov 20 2017, 12:56 AM

llvm/lib/Target/AArch64/AArch64VectorByElementOpt.cpp

293–297

Thanks for the data!
I've added some geomean calculations:

Benchmark	Exynos-m1	cortex-a57	exynos-m1(pass disabled)	cortex-a57(pass disabled)
7zip	123.724	123.18	122.636	122.664
Bullet	82.636	82.12	82.084	82.368
ClamAV	51.368	51.136	50.368	50.792
consumer-typeset	34.444	34.28	34.376	34.38
kimwitu++	34.368	34.248	34.608	34.28
lencod	49.816	49.776	49.288	49.256
mafft	23.016	23.036	23.176	23.02
SPASS	42.272	42.144	41.88	41.772
sqlite3	25.084	25.152	24.992	25.12
tramp3d-v4	37.152	37.424	37.456	37.024
GEOMEAN	44.14092425	44.06844708	43.9700723	43.90935266
COMPARED TO DISABLED	100.39%	100.36%

So, compared to having the pass disabled, both when targeting Exynos-m1 and Cortex-A57, it seems this adds about 0.4% to the compile time.
If I've understood correctly from earlier comments, at the moment, for Cortex-A57, this pass isn't expected to make changes to generated code.
I guess that on this set of benchmarks, even for Exynos-m1, this pass may not make many, if any, changes?
If so, adding 0.4% compile time for not changing the program output seems a bit high to me.
@mcrosier : since you were concerned about compile time of this pass before, what do you think?

llvm/lib/Target/AArch64/AArch64VectorByElementOpt.cpp

293–297

0.4% may be a bit high but I am kind of see it within the noise range (for example and on few benchmarks, the difference between the low and the high is sometimes bigger than 0.4%). Having said that about the accuracy of the benchmark, I think that we should reduce the number of entries in the table used by the function shouldExitEarly which determines whether we should go through the pass or not. In the original patch, I checked on 1 rewrite rule only to enter the pass. The current version checks on every rewrite rule. But, some are redundant if we consider how the hardware in general implements these instructions. For example such as 128-bit st2, we check the rewrite rules and access latency for st2.4s, st2.8h, and st2.16h. I believe that we can check for only one instead of all three because the hardware implements these 3 instructions in a similar way. I reduced the table from 14 entries to just 4 and I am getting these numbers for cortex-a57 with and without the pass:

Benchmark cortex-a57 (run1 run2 run3, best) cortex-a57 pass disabled (run1 run2 run3, best)

7zip 122.1760 123.1520 123.2600 122.17 122.4800 123.2520 123.7160 122.48

Bullet: 83.2120 82.6360 82.6160 82.61 82.9200 82.6880 82.6880 82.68

ClamAV: 51.3040 51.6800 51.5520 51.30 50.9000 51.1160 51.1320 50.90

consumer-typeset: 34.5240 34.4960 34.5600 34.49 34.2480 34.5760 34.7560 34.24

kimwitu++: 34.5200 34.3800 34.4000 34.38 34.6960 34.3600 34.6480 34.36

lencod: 50.1840 50.0480 50.1040 50.10 49.9400 50.0160 49.8940 49.89

mafft: 23.1560 23.3640 23.4320 23.15 23.3800 23.2920 23.3880 23.29

9 SPASS: 41.8960 42.0240 42.0440 41.89 42.2440 42.0120 41.8920 41.89

sqlite3: 25.2840 25.1520 25.3520 25.15 25.2520 25.3760 25.4280 25.25

tramp3d-v4: 39.1200 37.3160 37.4800 37.31 37.6840 40.0840 37.5480 37.54

Geomean: 44.21158 44.12472

COMPARED TO DISABLED: 100.1968%

Let me know if you would like to see the patch for that.

kristof.beyls added inline comments.Nov 27 2017, 5:31 AM

llvm/lib/Target/AArch64/AArch64VectorByElementOpt.cpp
293–297	It's an option to try and reduce the number of instruction checked in the function shouldExitEarly, but I think that should only be considered as a last resort, as it makes the behaviour of the pass potentially "strange" if on some strange micro-architecture in the future, rewriting rules should be applied differently for e.g. st2.4s, st2.8h and st2.16h. I agree that's unlikely, but not impossible. I think one more thing to try and do first is to reduce the amount of work done in shouldExitEarly when the results should already by cached. The code for shouldExitEarly at the moment is: bool AArch64SIMDInstrOpt::shouldExitEarly(MachineFunction MF, Subpass SP) { const MCInstrDesc OriginalMCID; SmallVector<const MCInstrDesc*, MaxNumRepl> ReplInstrMCID; switch (SP) { case VectorElem: OriginalMCID = &TII->get(AArch64::FMLAv4i32_indexed); ReplInstrMCID.push_back(&TII->get(AArch64::DUPv4i32lane)); ReplInstrMCID.push_back(&TII->get(AArch64::FMULv4f32)); if (shouldReplaceInst(MF, OriginalMCID, ReplInstrMCID)) return false; break; case Interleave: for (auto &I : IRT) { OriginalMCID = &TII->get(I.OrigOpc); for (auto &Repl : I.ReplOpc) ReplInstrMCID.push_back(&TII->get(Repl)); if (shouldReplaceInst(MF, OriginalMCID, ReplInstrMCID)) return false; ReplInstrMCID.clear(); } break; } return true; } In the current implementation of this patch, even when all results are already cached, this still results in the for loop in "case Interleave" to still be traversed completely, IIUC, as the caching of results happens inside shouldReplaceInst. I think that it would be better to instead of, or additionally, cache the results at the "shouldExitEarly" level, i.e. cache if shouldExitEarly should return "true" or "false", so that it wouldn't have to traverse the for loop if on a previous invocation, the answer was already "true". Of course this need to take into account the subtarget targeted by the machine function. My guess is that this could have enough impact on compile time reduction that no other changes would be needed. What do you think?

Added caching mechanism to save decisions per target of whether to go into the interleaved store replacement pass or not.

az marked 2 inline comments as done.Nov 29 2017, 4:19 PM

az added inline comments.

llvm/lib/Target/AArch64/AArch64VectorByElementOpt.cpp

293–297

It seems that the caching mechanism to save the decision per target of whether to enter the instruction replacement pass really improved compile time. Thank you for the idea. I am including one set of runs below where cortex-a57 with pass is slightly worse in terms of compile time than cortex-a57 without the pass. In order to verify on this, I did few other sets of runs (not shown here) and the difference is usually very small including having a set of runs where with "the pass" outperforms "without the pass" in terms of compile time. It seems that the overhead of ShouldExitEarly is really now within the noise range.

Benchmark cortex-a57 (run1 run2 run3, best) cortex-a57 pass disabled (run1 run2 run3, best)
7zip 122.4480 123.0200 122.7160 122.4480 123.0440 122.7360 123.5120 122.7360
Bullet 82.3960 82.2520 83.1720 82.2520 82.7040 82.2640 83.5840 82.2640
ClamAV 51.6320 51.0560 50.6320 50.6320 51.2920 51.3520 51.0040 51.0040
consumer-typese 34.1440 34.5600 34.3360 34.1440 34.7760 33.9560 34.6080 33.9560
kimwitu++ 34.3360 34.4760 34.5440 34.3360 34.9320 34.6400 34.5960 34.5960
lencod 49.9000 49.3280 50.3160 49.9000 49.6480 49.3800 50.0320 49.3800
mafft 23.5080 23.2760 23.0200 23.0200 23.3120 23.2120 23.6120 23.2120
SPASS 42.3160 42.0920 42.3040 42.0920 41.9560 42.2640 41.8040 41.8040
sqlite3 25.3840 26.3680 26.2400 25.3840 25.2080 25.2400 25.1760 25.1760
tramp3d-v4 37.5680 37.0600 37.2520 37.0600 37.3760 37.1960 37.1760 37.1760
Geomean 44.00964 43.98918
Compared to disabled 100.0465%

Thanks for your patience with my long string of review comments.
With just still addressing the few more nitpicks, I think this will have gotten to an acceptable state.

llvm/lib/Target/AArch64/AArch64VectorByElementOpt.cpp
73–76	std::unordered_map is probably sufficient and slightly faster than std::map?
297–298	It seems that the !InterlEarlyExit.empty() condition is not needed here?

This revision is now accepted and ready to land.Dec 5 2017, 5:51 AM

Thank you for providing several very useful feedback to improve this patch. In the future (not very soon), I am planning to make this pass more complete by 1) implementing the existing vector element sub-pass in the same way as optimize interleaved store instructions sub-pass for consistency reason (put all rewrite rules and not just one, caching, instruction replacement table, etc.) 2) Addressing the ST3 instruction inefficiency assuming I can find an efficient combination of instructions to replace ST3. I hope I can put you as the reviewer.

I would like also to get your feedback on a couple of things:

Is it worth replacing how code generation (i.e. building new instructions) is implemented in this pass? It is currently hard-coded but we can replace it by a table driven approach similar to the one used in the analysis phase (see IRT Table). In particular, I need to find a simple way to express through a table or other means how the new instructions are built out of the old one and more specifically how the operands of new instructions related to old ones. This adds complexity to the implementation and understanding of the code generation but makes adding more rewrite rules quite simple. It tried this before but the patch was not accepted but may be I should bring it again now that we added several new rewrite pattern in this patch and expect to add more for ST3.

I still need your investigation about the latency for st4/st2 on A57. Based on my unit test case (which is posted somewhere in the comments of this patch), st4 is inefficient for cortex-a57 and the replacement instructions in this patch perform better. If you or somebody from ARM can verify on that, then this means that the latency information in A57 .td file is not accurate and we need to update it. My test for st2 is not conclusive related to whether st2 is better than the replacement instructions or not. Note that, we have few benchmarks that are compiled with cortex-a57 and we are not supposed to change that flag. It would be nice if the A57 .td file has accurate information.

Code has been up-streamed in https://reviews.llvm.org/rL320123 and https://reviews.llvm.org/rL320204.

az marked an inline comment as done.Dec 11 2017, 4:57 PM

In D38196#951927, @az wrote:

Thank you for providing several very useful feedback to improve this patch. In the future (not very soon), I am planning to make this pass more complete by 1) implementing the existing vector element sub-pass in the same way as optimize interleaved store instructions sub-pass for consistency reason (put all rewrite rules and not just one, caching, instruction replacement table, etc.) 2) Addressing the ST3 instruction inefficiency assuming I can find an efficient combination of instructions to replace ST3. I hope I can put you as the reviewer.

I would like also to get your feedback on a couple of things:

Is it worth replacing how code generation (i.e. building new instructions) is implemented in this pass? It is currently hard-coded but we can replace it by a table driven approach similar to the one used in the analysis phase (see IRT Table). In particular, I need to find a simple way to express through a table or other means how the new instructions are built out of the old one and more specifically how the operands of new instructions related to old ones. This adds complexity to the implementation and understanding of the code generation but makes adding more rewrite rules quite simple. It tried this before but the patch was not accepted but may be I should bring it again now that we added several new rewrite pattern in this patch and expect to add more for ST3.

I only have some hand-wavy high-level thoughts/opinions on this.
I see this pass, at a high level, to be about rewriting instruction sequences into sequences that are faster, when the info about instruction latency suggests that should be the case.
I would expect that this could also be beneficial for other, non-AArch64, targets. If that proves to be the case; and if there prove to be quite a few rewriting rules, I think the ideal end game is being able to represent potential rewriting rules per instruction set in some kind of TableGen syntax.
That would allow to keep the rewriting rules specified in a concise, easily understandable and maintainable way.

Going for that design right now seems premature to me: it isn't proven yet that there will be many useful such rewriting rules and it hasn't been proven yet that this is also useful for other instruction sets.
What the current implementation has proven, I think, is that it's possible to implement this functionality in a way such that it doesn't negatively impact compile time too much unnecessarily.

Would it be worth to try and take some steps when adding support for the ST3 pattern towards that hypothetical Tablegen end goal?
I'm not sure - it depends on how many more rewrite rules we expect will get added over time.
Do you happen to have any ideas of what other rewriting rules might plausibly be useful?
Or a method of somehow mining the latency information to somehow automatically suggest instruction sequences that might be worthwhile to consider for rewriting in a different way?

I still need your investigation about the latency for st4/st2 on A57. Based on my unit test case (which is posted somewhere in the comments of this patch), st4 is inefficient for cortex-a57 and the replacement instructions in this patch perform better. If you or somebody from ARM can verify on that, then this means that the latency information in A57 .td file is not accurate and we need to update it. My test for st2 is not conclusive related to whether st2 is better than the replacement instructions or not. Note that, we have few benchmarks that are compiled with cortex-a57 and we are not supposed to change that flag. It would be nice if the A57 .td file has accurate information.

As far as I know, the latency information for Cortex-A57 is correct in the .td file. There's a good chance that in your unit test you might be seeing the effect of another micro-architectural limiting factor than latency, e.g. resource usage, that this pass doesn't take into account when making decisions.

Committed in https://reviews.llvm.org/rL320123

Revision Contents

Path

Size

llvm/

lib/

Target/

AArch64/

AArch64.h

4 lines

AArch64TargetMachine.cpp

4 lines

AArch64VectorByElementOpt.cpp

582 lines

test/

CodeGen/

AArch64/

arm64-neon-2velem.ll

21 lines

arm64-st1.ll

107 lines

Diff 124837

llvm/lib/Target/AArch64/AArch64.h

	Show All 33 Lines
	FunctionPass *createAArch64CondBrTuning();			FunctionPass *createAArch64CondBrTuning();
	FunctionPass *createAArch64ConditionalCompares();			FunctionPass *createAArch64ConditionalCompares();
	FunctionPass *createAArch64AdvSIMDScalar();			FunctionPass *createAArch64AdvSIMDScalar();
	FunctionPass *createAArch64ISelDag(AArch64TargetMachine &TM,			FunctionPass *createAArch64ISelDag(AArch64TargetMachine &TM,
	CodeGenOpt::Level OptLevel);			CodeGenOpt::Level OptLevel);
	FunctionPass *createAArch64StorePairSuppressPass();			FunctionPass *createAArch64StorePairSuppressPass();
	FunctionPass *createAArch64ExpandPseudoPass();			FunctionPass *createAArch64ExpandPseudoPass();
	FunctionPass *createAArch64LoadStoreOptimizationPass();			FunctionPass *createAArch64LoadStoreOptimizationPass();
	FunctionPass *createAArch64VectorByElementOptPass();			FunctionPass *createAArch64SIMDInstrOptPass();
	ModulePass *createAArch64PromoteConstantPass();			ModulePass *createAArch64PromoteConstantPass();
	FunctionPass *createAArch64ConditionOptimizerPass();			FunctionPass *createAArch64ConditionOptimizerPass();
	FunctionPass *createAArch64A57FPLoadBalancing();			FunctionPass *createAArch64A57FPLoadBalancing();
	FunctionPass *createAArch64A53Fix835769();			FunctionPass *createAArch64A53Fix835769();
	FunctionPass *createFalkorHWPFFixPass();			FunctionPass *createFalkorHWPFFixPass();
	FunctionPass *createFalkorMarkStridedAccessesPass();			FunctionPass *createFalkorMarkStridedAccessesPass();

	FunctionPass *createAArch64CleanupLocalDynamicTLSPass();			FunctionPass *createAArch64CleanupLocalDynamicTLSPass();

	FunctionPass *createAArch64CollectLOHPass();			FunctionPass *createAArch64CollectLOHPass();
	InstructionSelector *			InstructionSelector *
	createAArch64InstructionSelector(const AArch64TargetMachine &,			createAArch64InstructionSelector(const AArch64TargetMachine &,
	AArch64Subtarget &, AArch64RegisterBankInfo &);			AArch64Subtarget &, AArch64RegisterBankInfo &);

	void initializeAArch64A53Fix835769Pass(PassRegistry&);			void initializeAArch64A53Fix835769Pass(PassRegistry&);
	void initializeAArch64A57FPLoadBalancingPass(PassRegistry&);			void initializeAArch64A57FPLoadBalancingPass(PassRegistry&);
	void initializeAArch64AdvSIMDScalarPass(PassRegistry&);			void initializeAArch64AdvSIMDScalarPass(PassRegistry&);
	void initializeAArch64CollectLOHPass(PassRegistry&);			void initializeAArch64CollectLOHPass(PassRegistry&);
	void initializeAArch64CondBrTuningPass(PassRegistry &);			void initializeAArch64CondBrTuningPass(PassRegistry &);
	void initializeAArch64ConditionalComparesPass(PassRegistry&);			void initializeAArch64ConditionalComparesPass(PassRegistry&);
	void initializeAArch64ConditionOptimizerPass(PassRegistry&);			void initializeAArch64ConditionOptimizerPass(PassRegistry&);
	void initializeAArch64DeadRegisterDefinitionsPass(PassRegistry&);			void initializeAArch64DeadRegisterDefinitionsPass(PassRegistry&);
	void initializeAArch64ExpandPseudoPass(PassRegistry&);			void initializeAArch64ExpandPseudoPass(PassRegistry&);
	void initializeAArch64LoadStoreOptPass(PassRegistry&);			void initializeAArch64LoadStoreOptPass(PassRegistry&);
	void initializeAArch64VectorByElementOptPass(PassRegistry&);			void initializeAArch64SIMDInstrOptPass(PassRegistry&);
	void initializeAArch64PromoteConstantPass(PassRegistry&);			void initializeAArch64PromoteConstantPass(PassRegistry&);
	void initializeAArch64RedundantCopyEliminationPass(PassRegistry&);			void initializeAArch64RedundantCopyEliminationPass(PassRegistry&);
	void initializeAArch64StorePairSuppressPass(PassRegistry&);			void initializeAArch64StorePairSuppressPass(PassRegistry&);
	void initializeFalkorHWPFFixPass(PassRegistry&);			void initializeFalkorHWPFFixPass(PassRegistry&);
	void initializeFalkorMarkStridedAccessesLegacyPass(PassRegistry&);			void initializeFalkorMarkStridedAccessesLegacyPass(PassRegistry&);
	void initializeLDTLSCleanupPass(PassRegistry&);			void initializeLDTLSCleanupPass(PassRegistry&);
	} // end namespace llvm			} // end namespace llvm

	#endif			#endif

llvm/lib/Target/AArch64/AArch64TargetMachine.cpp

Show First 20 Lines • Show All 151 Lines • ▼ Show 20 Lines	extern "C" void LLVMInitializeAArch64Target() {
initializeAArch64A57FPLoadBalancingPass(*PR);		initializeAArch64A57FPLoadBalancingPass(*PR);
initializeAArch64AdvSIMDScalarPass(*PR);		initializeAArch64AdvSIMDScalarPass(*PR);
initializeAArch64CollectLOHPass(*PR);		initializeAArch64CollectLOHPass(*PR);
initializeAArch64ConditionalComparesPass(*PR);		initializeAArch64ConditionalComparesPass(*PR);
initializeAArch64ConditionOptimizerPass(*PR);		initializeAArch64ConditionOptimizerPass(*PR);
initializeAArch64DeadRegisterDefinitionsPass(*PR);		initializeAArch64DeadRegisterDefinitionsPass(*PR);
initializeAArch64ExpandPseudoPass(*PR);		initializeAArch64ExpandPseudoPass(*PR);
initializeAArch64LoadStoreOptPass(*PR);		initializeAArch64LoadStoreOptPass(*PR);
initializeAArch64VectorByElementOptPass(*PR);		initializeAArch64SIMDInstrOptPass(*PR);
initializeAArch64PromoteConstantPass(*PR);		initializeAArch64PromoteConstantPass(*PR);
initializeAArch64RedundantCopyEliminationPass(*PR);		initializeAArch64RedundantCopyEliminationPass(*PR);
initializeAArch64StorePairSuppressPass(*PR);		initializeAArch64StorePairSuppressPass(*PR);
initializeFalkorHWPFFixPass(*PR);		initializeFalkorHWPFFixPass(*PR);
initializeFalkorMarkStridedAccessesLegacyPass(*PR);		initializeFalkorMarkStridedAccessesLegacyPass(*PR);
initializeLDTLSCleanupPass(*PR);		initializeLDTLSCleanupPass(*PR);
}		}

▲ Show 20 Lines • Show All 281 Lines • ▼ Show 20 Lines	bool AArch64PassConfig::addILPOpts() {
if (EnableMCR)		if (EnableMCR)
addPass(&MachineCombinerID);		addPass(&MachineCombinerID);
if (EnableCondBrTuning)		if (EnableCondBrTuning)
addPass(createAArch64CondBrTuning());		addPass(createAArch64CondBrTuning());
if (EnableEarlyIfConversion)		if (EnableEarlyIfConversion)
addPass(&EarlyIfConverterID);		addPass(&EarlyIfConverterID);
if (EnableStPairSuppress)		if (EnableStPairSuppress)
addPass(createAArch64StorePairSuppressPass());		addPass(createAArch64StorePairSuppressPass());
addPass(createAArch64VectorByElementOptPass());		addPass(createAArch64SIMDInstrOptPass());
return true;		return true;
}		}

void AArch64PassConfig::addPreRegAlloc() {		void AArch64PassConfig::addPreRegAlloc() {
// Change dead register definitions to refer to the zero register.		// Change dead register definitions to refer to the zero register.
if (TM->getOptLevel() != CodeGenOpt::None && EnableDeadRegisterElimination)		if (TM->getOptLevel() != CodeGenOpt::None && EnableDeadRegisterElimination)
addPass(createAArch64DeadRegisterDefinitions());		addPass(createAArch64DeadRegisterDefinitions());

▲ Show 20 Lines • Show All 43 Lines • Show Last 20 Lines

llvm/lib/Target/AArch64/AArch64VectorByElementOpt.cpp

//=- AArch64VectorByElementOpt.cpp - AArch64 vector by element inst opt pass =//

// //

kristof.beylsUnsubmitted

Done

I feel like the name of this pass/file should be changed, as the name no longer covers what it does.
It seems the pass now is becoming about replacing certain SIMD patterns with longer but more efficient SIMD patterns.
AArch64SIMDOptimizer might fit also with some of the file names of other optimizer passes already in the AArch64 target?

kristof.beyls: I feel like the name of this pass/file should be changed, as the name no longer covers what it…

azAuthorUnsubmitted

Not Done

I was planning to clean the naming afterward in a separate patch but I just did it now after seeing this Comment. I Changed classes names, pass name, variable name, etc. (as a result, more files with minor changes are included in the review related to changing the pass name). Note that I did not change the file name in here but will do when I commit the code. I chose AArch64SIMDInstrOpt instead of AArch64SIMDOptimizer based on teh assumption that SIMDOptimizer may be confuse people into thinking that we may be doing some vectorization related work. SIMDInstr may convey the idea that we are doing optimization on the generated SIMD instructions. Let me know if you have other ideas about this.

az: I was planning to clean the naming afterward in a separate patch but I just did it now after…

kristof.beylsUnsubmitted

Not Done

Thanks, I think AArch64SIMDInstrOpt is a reasonable name for this.

kristof.beyls: Thanks, I think AArch64SIMDInstrOpt is a reasonable name for this.

// The LLVM Compiler Infrastructure // The LLVM Compiler Infrastructure

// //

// This file is distributed under the University of Illinois Open Source // This file is distributed under the University of Illinois Open Source

// License. See LICENSE.TXT for details. // License. See LICENSE.TXT for details.

// //

//===----------------------------------------------------------------------===// //===----------------------------------------------------------------------===//

// //

// This file contains a pass that performs optimization for vector by element // This file contains a pass that performs optimization on SIMD instructions

// SIMD instructions. // with high latency by splitting them into more efficient series of

// // instructions.

// Certain SIMD instructions with vector element operand are not efficient.

// Rewrite them into SIMD instructions with vector operands. This rewrite

// is driven by the latency of the instructions.

// //

// 1. Rewrite certain SIMD instructions with vector element due to their

// inefficiency on some targets.

// Example: // Example:

// fmla v0.4s, v1.4s, v2.s[1] // fmla v0.4s, v1.4s, v2.s[1]

// is rewritten into // is rewritten into

// dup v3.4s, v2.s[1] // dup v3.4s, v2.s[1]

// fmla v0.4s, v1.4s, v3.4s // fmla v0.4s, v1.4s, v3.4s

// //

// 2. Rewrite Interleaved memory access instructions due to their

// inefficiency on some targets.

// Example:

// st2 {v0.4s, v1.4s}, addr

// is rewritten into

// zip1 v2.4s, v0.4s, v1.4s

// zip2 v3.4s, v0.4s, v1.4s

// stp q2, q3, addr

//===----------------------------------------------------------------------===// //===----------------------------------------------------------------------===//

#include "AArch64InstrInfo.h" #include "AArch64InstrInfo.h"

#include "llvm/ADT/SmallVector.h" #include "llvm/ADT/SmallVector.h"

#include "llvm/ADT/Statistic.h" #include "llvm/ADT/Statistic.h"

#include "llvm/ADT/StringRef.h" #include "llvm/ADT/StringRef.h"

#include "llvm/CodeGen/MachineBasicBlock.h" #include "llvm/CodeGen/MachineBasicBlock.h"

#include "llvm/CodeGen/MachineFunction.h" #include "llvm/CodeGen/MachineFunction.h"

#include "llvm/CodeGen/MachineFunctionPass.h" #include "llvm/CodeGen/MachineFunctionPass.h"

#include "llvm/CodeGen/MachineInstr.h" #include "llvm/CodeGen/MachineInstr.h"

#include "llvm/CodeGen/MachineInstrBuilder.h" #include "llvm/CodeGen/MachineInstrBuilder.h"

#include "llvm/CodeGen/MachineOperand.h" #include "llvm/CodeGen/MachineOperand.h"

#include "llvm/CodeGen/MachineRegisterInfo.h" #include "llvm/CodeGen/MachineRegisterInfo.h"

#include "llvm/CodeGen/TargetSchedule.h" #include "llvm/CodeGen/TargetSchedule.h"

#include "llvm/MC/MCInstrDesc.h" #include "llvm/MC/MCInstrDesc.h"

#include "llvm/MC/MCSchedule.h" #include "llvm/MC/MCSchedule.h"

#include "llvm/Pass.h" #include "llvm/Pass.h"

#include "llvm/Target/TargetInstrInfo.h" #include "llvm/Target/TargetInstrInfo.h"

#include "llvm/Target/TargetSubtargetInfo.h" #include "llvm/Target/TargetSubtargetInfo.h"

#include <map> #include <map>

using namespace llvm; using namespace llvm;

#define DEBUG_TYPE "aarch64-vectorbyelement-opt" #define DEBUG_TYPE "aarch64-simdinstr-opt"

STATISTIC(NumModifiedInstr, STATISTIC(NumModifiedInstr,

"Number of vector by element instructions modified"); "Number of SIMD instructions modified");

#define AARCH64_VECTOR_BY_ELEMENT_OPT_NAME \ #define AARCH64_VECTOR_BY_ELEMENT_OPT_NAME \

"AArch64 vector by element instruction optimization pass" "AArch64 SIMD instructions optimization pass"

namespace { namespace {

struct AArch64VectorByElementOpt : public MachineFunctionPass { struct AArch64SIMDInstrOpt : public MachineFunctionPass {

static char ID; static char ID;

const TargetInstrInfo *TII; const TargetInstrInfo *TII;

MachineRegisterInfo *MRI; MachineRegisterInfo *MRI;

TargetSchedModel SchedModel; TargetSchedModel SchedModel;

AArch64VectorByElementOpt() : MachineFunctionPass(ID) { // The two maps below are used to cache decisions instead of recomputing:

initializeAArch64VectorByElementOptPass(*PassRegistry::getPassRegistry()); // This is used to cache instruction replacement decisions within function

// units and across function units.

std::map<std::pair<unsigned, std::string>, bool> SIMDInstrTable;

kristof.beylsUnsubmitted

Done

In other parts, comments state "Certain SIMD instructions with vector element operand are not efficient."
In this enum, FMLA is used for that situation.
It seems that either the name of the enum value should be changed, or the comments be updated?

kristof.beyls: In other parts, comments state "Certain SIMD instructions with vector element operand are not…

kristof.beylsUnsubmitted

Done

s/unsupportedSub/UnsupportedSub/.
Do you need the UnsupportedSub value?

kristof.beyls: s/unsupportedSub/UnsupportedSub/. Do you need the UnsupportedSub value?

azAuthorUnsubmitted

Not Done

I used to have an assert statement to look for enum value error at the top of the function (see original review). Then, I moved that error checking into the default case of the switch statement (with llvm_unreachable()) based on my understanding of one of your suggestions. That resulted in a compiler warning during the build related to the no need for the default case as all values are covered. So, I added the Unsupported in enum. In this new review, I changed the names of the extra enum value just in case it looks less confusing. While I prefer that, I am perfectly fine with removing them along with the default case and any enum related error checking.

az: I used to have an assert statement to look for enum value error at the top of the function (see…

kristof.beylsUnsubmitted

Done

My feel is that if the "NA/UnsupportedSub" enum value is only used to assert, it doesn't serve a practical purpose and therefore it should be removed?

Rereading my previous review comment that made you add a default case: I agree there shouldn't be a default case
(Furthermore, https://llvm.org/docs/CodingStandards.html#don-t-use-default-labels-in-fully-covered-switches-over-enumerations suggests not using "default" in switch statements when not necessary.)

Wouldn't the compiler warn when trying to set the Subpass enum to a value different from VectorElem/Interleave, and hence there isn't a need for asserts on this enum value?

kristof.beyls: My feel is that if the "NA/UnsupportedSub" enum value is only used to assert, it doesn't serve…

// This is used to cache the decision of whether to leave the Interleave-Store

// instructions replacement pass early or not for a particular target.

std::map<std::string, bool> InterlEarlyExit;

kristof.beylsUnsubmitted

Done

std::unordered_map is probably sufficient and slightly faster than std::map?

kristof.beyls: std::unordered_map is probably sufficient and slightly faster than std::map?

typedef enum {

VectorElem,

kristof.beylsUnsubmitted

Done

Do you need the unsupportedAcc value?

kristof.beyls: Do you need the unsupportedAcc value?

kristof.beylsUnsubmitted

Done

s/representd/represented/

kristof.beyls: s/representd/represented/

Interleave

} Subpass;

kristof.beylsUnsubmitted

Done

Maybe the comment should indicate why a limit must be placed on the maximum number of instructions replaced & why 10 is a good value?

kristof.beyls: Maybe the comment should indicate why a limit must be placed on the maximum number of…

// Instruction represented by OrigOpc is replaced by instructions in ReplOpc.

struct InstReplInfo {

unsigned OrigOpc;

std::vector<unsigned> ReplOpc;

const TargetRegisterClass RC;

};

#define RuleST2(OpcOrg, OpcR0, OpcR1, OpcR2, RC) \

{OpcOrg, {OpcR0, OpcR1, OpcR2}, RC}

kristof.beylsUnsubmitted

Done

Maybe "RuleST2" instead of "Rule1" and "RuleST4" instead of "Rule2" is a little bit more descriptive?

kristof.beyls: Maybe "RuleST2" instead of "Rule1" and "RuleST4" instead of "Rule2" is a little bit more…

#define RuleST4(OpcOrg, OpcR0, OpcR1, OpcR2, OpcR3, OpcR4, OpcR5, OpcR6, \

OpcR7, OpcR8, OpcR9, RC) \

{OpcOrg, {OpcR0, OpcR1, OpcR2, OpcR3, OpcR4, OpcR5, OpcR6, OpcR7, \

OpcR8, OpcR9}, RC}

// The Instruction Replacement Table:

std::vector<InstReplInfo> IRT = {

// ST2 instructions

RuleST2(AArch64::ST2Twov2d, AArch64::ZIP1v2i64, AArch64::ZIP2v2i64,

AArch64::STPQi, AArch64::FPR128RegClass),

RuleST2(AArch64::ST2Twov4s, AArch64::ZIP1v4i32, AArch64::ZIP2v4i32,

AArch64::STPQi, AArch64::FPR128RegClass),

RuleST2(AArch64::ST2Twov2s, AArch64::ZIP1v2i32, AArch64::ZIP2v2i32,

AArch64::STPDi, AArch64::FPR64RegClass),

RuleST2(AArch64::ST2Twov8h, AArch64::ZIP1v8i16, AArch64::ZIP2v8i16,

AArch64::STPQi, AArch64::FPR128RegClass),

RuleST2(AArch64::ST2Twov4h, AArch64::ZIP1v4i16, AArch64::ZIP2v4i16,

AArch64::STPDi, AArch64::FPR64RegClass),

RuleST2(AArch64::ST2Twov16b, AArch64::ZIP1v16i8, AArch64::ZIP2v16i8,

AArch64::STPQi, AArch64::FPR128RegClass),

RuleST2(AArch64::ST2Twov8b, AArch64::ZIP1v8i8, AArch64::ZIP2v8i8,

AArch64::STPDi, AArch64::FPR64RegClass),

// ST4 instructions

RuleST4(AArch64::ST4Fourv2d, AArch64::ZIP1v2i64, AArch64::ZIP2v2i64,

AArch64::ZIP1v2i64, AArch64::ZIP2v2i64, AArch64::ZIP1v2i64,

AArch64::ZIP2v2i64, AArch64::ZIP1v2i64, AArch64::ZIP2v2i64,

AArch64::STPQi, AArch64::STPQi, AArch64::FPR128RegClass),

RuleST4(AArch64::ST4Fourv4s, AArch64::ZIP1v4i32, AArch64::ZIP2v4i32,

AArch64::ZIP1v4i32, AArch64::ZIP2v4i32, AArch64::ZIP1v4i32,

AArch64::ZIP2v4i32, AArch64::ZIP1v4i32, AArch64::ZIP2v4i32,

AArch64::STPQi, AArch64::STPQi, AArch64::FPR128RegClass),

RuleST4(AArch64::ST4Fourv2s, AArch64::ZIP1v2i32, AArch64::ZIP2v2i32,

AArch64::ZIP1v2i32, AArch64::ZIP2v2i32, AArch64::ZIP1v2i32,

AArch64::ZIP2v2i32, AArch64::ZIP1v2i32, AArch64::ZIP2v2i32,

AArch64::STPDi, AArch64::STPDi, AArch64::FPR64RegClass),

RuleST4(AArch64::ST4Fourv8h, AArch64::ZIP1v8i16, AArch64::ZIP2v8i16,

AArch64::ZIP1v8i16, AArch64::ZIP2v8i16, AArch64::ZIP1v8i16,

AArch64::ZIP2v8i16, AArch64::ZIP1v8i16, AArch64::ZIP2v8i16,

AArch64::STPQi, AArch64::STPQi, AArch64::FPR128RegClass),

RuleST4(AArch64::ST4Fourv4h, AArch64::ZIP1v4i16, AArch64::ZIP2v4i16,

AArch64::ZIP1v4i16, AArch64::ZIP2v4i16, AArch64::ZIP1v4i16,

AArch64::ZIP2v4i16, AArch64::ZIP1v4i16, AArch64::ZIP2v4i16,

AArch64::STPDi, AArch64::STPDi, AArch64::FPR64RegClass),

RuleST4(AArch64::ST4Fourv16b, AArch64::ZIP1v16i8, AArch64::ZIP2v16i8,

AArch64::ZIP1v16i8, AArch64::ZIP2v16i8, AArch64::ZIP1v16i8,

AArch64::ZIP2v16i8, AArch64::ZIP1v16i8, AArch64::ZIP2v16i8,

AArch64::STPQi, AArch64::STPQi, AArch64::FPR128RegClass),

RuleST4(AArch64::ST4Fourv8b, AArch64::ZIP1v8i8, AArch64::ZIP2v8i8,

AArch64::ZIP1v8i8, AArch64::ZIP2v8i8, AArch64::ZIP1v8i8,

AArch64::ZIP2v8i8, AArch64::ZIP1v8i8, AArch64::ZIP2v8i8,

AArch64::STPDi, AArch64::STPDi, AArch64::FPR64RegClass)

};

// A costly instruction is replaced in this work by N efficient instructions

// The maximum of N is curently 10 and it is for ST4 case.

static const unsigned MaxNumRepl = 10;

AArch64SIMDInstrOpt() : MachineFunctionPass(ID) {

initializeAArch64SIMDInstrOptPass(*PassRegistry::getPassRegistry());

} }

kristof.beylsUnsubmitted

Not Done

declare <2 x float> @llvm.fma.v2f32(<2 x float>, <2 x float>, <2 x float>)

define <2 x float> @test_vfma_lane_f32_a57(<2 x float> %a, <2 x float> %b, <2 x float> %v)
"target-cpu"="cortex-a57" {
entry:
  %lane = shufflevector <2 x float> %v, <2 x float> undef, <2 x i32> <i32 1, i32 1>
  %0 = tail call <2 x float> @llvm.fma.v2f32(<2 x float> %lane, <2 x float> %b, <2 x float> %a)
  ret <2 x float> %0
}

define <2 x float> @test_vfma_lane_f32_exynos(<2 x float> %a, <2 x float> %b, <2 x float> %v)
"target-cpu"="exynos-m1" {
entry:
  %lane = shufflevector <2 x float> %v, <2 x float> undef, <2 x i32> <i32 1, i32 1>
  %0 = tail call <2 x float> @llvm.fma.v2f32(<2 x float> %lane, <2 x float> %b, <2 x float> %a)
  ret <2 x float> %0
}

I have no idea how to get a stable key from TargetSubtargetInfo to use in this cache.

At this point, I'd measure the overhead of this pass e.g. using CTMark to make sure that the caching mechanism is actually needed before investing more time in it.

kristof.beyls: The cache here will also need to take into account the TargetSubtargetInfo as the key into the…

azAuthorUnsubmitted

Not Done

I am adding subtarget as part of key.

az: I am adding subtarget as part of key.

kristof.beylsUnsubmitted

Not Done

Thanks! Could you also add a regression test similar to the LLVM-IR I wrote in my original comment about this to make sure this functionality is tested?

kristof.beyls: Thanks! Could you also add a regression test similar to the LLVM-IR I wrote in my original…

azAuthorUnsubmitted

Not Done

Added your test as is (with a change to function names).

az: Added your test as is (with a change to function names).

/// Based only on latency of instructions, determine if it is cost efficient /// Based only on latency of instructions, determine if it is cost efficient

/// to replace the instruction InstDesc by the two instructions InstDescRep1 /// to replace the instruction InstDesc by the instructions stored in the

kristof.beylsUnsubmitted

Done

The comment states "InstDescRep1", but the argument name in the code is "InstDescRepl"

kristof.beyls: The comment states "InstDescRep1", but the argument name in the code is "InstDescRepl"

/// and InstDescRep2. /// array InstDescRepl.

/// Return true if replacement is recommended. /// Return true if replacement is expected to be faster.

bool bool shouldReplaceInst(MachineFunction *MF, const MCInstrDesc *InstDesc,

shouldReplaceInstruction(MachineFunction *MF, const MCInstrDesc *InstDesc, SmallVectorImpl<const MCInstrDesc*> &ReplInstrMCID);

const MCInstrDesc *InstDescRep1,

const MCInstrDesc *InstDescRep2, /// Determine if we need to exit the instruction replacement optimization

std::map<unsigned, bool> &VecInstElemTable) const; /// subpasses early. This makes sure that Targets with no need for this

/// optimization do not spend any compile time on this subpass other than the

/// Determine if we need to exit the vector by element instruction /// simple check performed here. This simple check is done by comparing the

/// optimization pass early. This makes sure that Targets with no need /// latency of the original instruction to the latency of the replacement

/// for this optimization do not spent any compile time on this pass. /// instructions. We only check for a representative instruction in the class

/// This check is done by comparing the latency of an indexed FMLA /// of instructions and not all concerned instructions. For the VectorElem

/// instruction to the latency of the DUP + the latency of a vector /// subpass, we check for the FMLA instruction while for the interleave subpass

kristof.beylsUnsubmitted

Done

duplicated line in comment.

kristof.beyls: duplicated line in comment.

/// FMLA instruction. We do not check on other related instructions such /// we check for the st2.4s instruction.

/// as FMLS as we assume that if the situation shows up for one /// Return true if early exit of the subpass is recommended.

/// instruction, then it is likely to show up for the related ones. bool shouldExitEarly(MachineFunction *MF, Subpass SP);

kristof.beylsUnsubmitted

Done

I guess as is, the comment cannot be fully true anymore?
Either this checks also the "Interleave" patterns, or this optimization does spend some compile time on this pass for a Target that does not need this optimization?

kristof.beyls: I guess as is, the comment cannot be fully true anymore? Either this checks also the…

/// Return true if early exit of the pass is recommended.

bool earlyExitVectElement(MachineFunction *MF);

/// Check whether an equivalent DUP instruction has already been /// Check whether an equivalent DUP instruction has already been

/// created or not. /// created or not.

/// Return true when the dup instruction already exists. In this case, /// Return true when the dup instruction already exists. In this case,

/// DestReg will point to the destination of the already created DUP. /// DestReg will point to the destination of the already created DUP.

bool reuseDUP(MachineInstr &MI, unsigned DupOpcode, unsigned SrcReg, bool reuseDUP(MachineInstr &MI, unsigned DupOpcode, unsigned SrcReg,

unsigned LaneNumber, unsigned *DestReg) const; unsigned LaneNumber, unsigned *DestReg) const;

/// Certain SIMD instructions with vector element operand are not efficient. /// Certain SIMD instructions with vector element operand are not efficient.

/// Rewrite them into SIMD instructions with vector operands. This rewrite /// Rewrite them into SIMD instructions with vector operands. This rewrite

/// is driven by the latency of the instructions. /// is driven by the latency of the instructions.

/// Return true if the SIMD instruction is modified. /// Return true if the SIMD instruction is modified.

bool optimizeVectElement(MachineInstr &MI, bool optimizeVectElement(MachineInstr &MI);

std::map<unsigned, bool> *VecInstElemTable) const;

/// Process The REG_SEQUENCE instruction, and extract the source

/// operands of the st2/4 instruction from it.

/// Example of such instructions.

/// %dest = REG_SEQUENCE %st2_src1, dsub0, %st2_src2, dsub1;

/// Return true when the instruction is processed successfully.

bool processSeqRegInst(MachineInstr *DefiningMI, unsigned* StReg,

unsigned* StRegKill, unsigned NumArg) const;

/// Load/Store Interleaving instructions are not always beneficial.

/// Replace them by zip instructionand classical load/store.

/// Return true if the SIMD instruction is modified.

bool optimizeLdStInterleave(MachineInstr &MI);

/// Return the number of useful source registers for this

/// instruction (2 for st2 and 4 for st4).

unsigned determineSrcReg(MachineInstr &MI) const;

bool runOnMachineFunction(MachineFunction &Fn) override; bool runOnMachineFunction(MachineFunction &Fn) override;

StringRef getPassName() const override { StringRef getPassName() const override {

return AARCH64_VECTOR_BY_ELEMENT_OPT_NAME; return AARCH64_VECTOR_BY_ELEMENT_OPT_NAME;

} }

}; };

char AArch64VectorByElementOpt::ID = 0; char AArch64SIMDInstrOpt::ID = 0;

} // end anonymous namespace } // end anonymous namespace

INITIALIZE_PASS(AArch64VectorByElementOpt, "aarch64-vectorbyelement-opt", INITIALIZE_PASS(AArch64SIMDInstrOpt, "aarch64-simdinstr-opt",

AARCH64_VECTOR_BY_ELEMENT_OPT_NAME, false, false) AARCH64_VECTOR_BY_ELEMENT_OPT_NAME, false, false)

kristof.beylsUnsubmitted

Done

Making this a global variable indeed very probably isn't acceptable.
However, I was thinking to have this be a member variable of the AArch64SIMDInstrOpt MachineFunctionPass.
Wouldn't that still allow to cache/reuse this table between different calls of "runOnMachineFunction"?
I still haven't had time to check if this pattern shows up anywhere else in any other Pass.

kristof.beyls: Making this a global variable indeed very probably isn't acceptable. However, I was thinking to…

/// Based only on latency of instructions, determine if it is cost efficient /// Based only on latency of instructions, determine if it is cost efficient

/// to replace the instruction InstDesc by the two instructions InstDescRep1 /// to replace the instruction InstDesc by the instructions stored in the

kristof.beylsUnsubmitted

Done

s/InstDescRep1/InstDescRepl/?

kristof.beyls: s/InstDescRep1/InstDescRepl/?

kristof.beylsUnsubmitted

Done

I guess there is a good reason why a vector cannot be used here instead of a pointer and a separate NumRepl argument?

kristof.beyls: I guess there is a good reason why a vector cannot be used here instead of a pointer and a…

kristof.beylsUnsubmitted

Done

"Return true if replacement is expected to be faster" instead?
Also, I wouldn't duplicate the comments on both the function declarations and definitions - it's hard to keep them consistent.

kristof.beyls: "Return true if replacement is expected to be faster" instead? Also, I wouldn't duplicate the…

/// and InstDescRep2. Note that it is assumed in this fuction that an /// array InstDescRepl.

/// instruction of type InstDesc is always replaced by the same two /// Return true if replacement is expected to be faster.

/// instructions as results are cached here. bool AArch64SIMDInstrOpt::

/// Return true if replacement is recommended. shouldReplaceInst(MachineFunction *MF, const MCInstrDesc *InstDesc,

bool AArch64VectorByElementOpt::shouldReplaceInstruction( SmallVectorImpl<const MCInstrDesc*> &InstDescRepl) {

MachineFunction *MF, const MCInstrDesc *InstDesc, // Check if replacement decision is already available in the cached table.

kristof.beylsUnsubmitted

Done

typos

kristof.beyls: typos

kristof.beylsUnsubmitted

Done

"replacement"

kristof.beyls: "replacement"

const MCInstrDesc *InstDescRep1, const MCInstrDesc *InstDescRep2,

std::map<unsigned, bool> &VecInstElemTable) const {

// Check if replacment decision is alredy available in the cached table.

// if so, return it. // if so, return it.

if (!VecInstElemTable.empty() && std::string Subtarget = SchedModel.getSubtargetInfo()->getCPU();

VecInstElemTable.find(InstDesc->getOpcode()) != VecInstElemTable.end()) std::pair<unsigned, std::string> InstID = std::make_pair(InstDesc->getOpcode(), Subtarget);

return VecInstElemTable[InstDesc->getOpcode()]; if (!SIMDInstrTable.empty() &&

kristof.beylsUnsubmitted

Not Done

Is the only reason you need to drop const from shouldReplaceInst and other methods that you're using operator[] here to get to an element that's guaranteed to be in the map?
If so, wouldn't it be better to use syntax "*SIMDInstrTable.find(InstDesc->getOpcode())", so that the const can remain on all those methods?

kristof.beyls: Is the only reason you need to drop const from shouldReplaceInst and other methods that you're…

azAuthorUnsubmitted

Not Done

Yes, I was having trouble setting SIMDInstrTable[...] to true or false (few lines below the line of code you mention above). It thinks that the map member variable I added to the const class has the constant modifier too.

az: Yes, I was having trouble setting SIMDInstrTable[...] to true or false (few lines below the…

kristof.beylsUnsubmitted

Not Done

Thanks, that makes sense to me now.

kristof.beyls: Thanks, that makes sense to me now.

SIMDInstrTable.find(InstID) != SIMDInstrTable.end())

return SIMDInstrTable[InstID];

unsigned SCIdx = InstDesc->getSchedClass(); unsigned SCIdx = InstDesc->getSchedClass();

unsigned SCIdxRep1 = InstDescRep1->getSchedClass();

unsigned SCIdxRep2 = InstDescRep2->getSchedClass();

const MCSchedClassDesc *SCDesc = const MCSchedClassDesc *SCDesc =

kristof.beylsUnsubmitted

Done

This probably should be a SmallVector and the constant MaxNumRepl should disappear?

kristof.beyls: This probably should be a SmallVector and the constant MaxNumRepl should disappear?

SchedModel.getMCSchedModel()->getSchedClassDesc(SCIdx); SchedModel.getMCSchedModel()->getSchedClassDesc(SCIdx);

const MCSchedClassDesc *SCDescRep1 =

SchedModel.getMCSchedModel()->getSchedClassDesc(SCIdxRep1);

const MCSchedClassDesc *SCDescRep2 =

SchedModel.getMCSchedModel()->getSchedClassDesc(SCIdxRep2);

kristof.beylsUnsubmitted

Done

Similarly, this probably needs to be a SmallVector?

kristof.beyls: Similarly, this probably needs to be a SmallVector?

kristof.beylsUnsubmitted

Done

I think it's clearer if the above 2 loops would be fused, making the cryptic variable array variable SCIdxRepl unnecessary.

kristof.beyls: I think it's clearer if the above 2 loops would be fused, making the cryptic variable array…

// If a subtarget does not define resources for any of the instructions // If a subtarget does not define resources for the instructions

// of interest, then return false for no replacement. // of interest, then return false for no replacement.

if (!SCDesc->isValid() || SCDesc->isVariant() || !SCDescRep1->isValid() || const MCSchedClassDesc *SCDescRepl;

SCDescRep1->isVariant() || !SCDescRep2->isValid() || if (!SCDesc->isValid() || SCDesc->isVariant())

SCDescRep2->isVariant()) { {

VecInstElemTable[InstDesc->getOpcode()] = false; SIMDInstrTable[InstID] = false;

return false;

}

for (auto IDesc : InstDescRepl)

javed.absarUnsubmitted

Done

nitpick. Please replace with range_loop (unless there is a strong reason not to)

javed.absar: nitpick. Please replace with range_loop (unless there is a strong reason not to)

{

SCDescRepl = SchedModel.getMCSchedModel()->getSchedClassDesc(

IDesc->getSchedClass());

if (!SCDescRepl->isValid() || SCDescRepl->isVariant())

{

SIMDInstrTable[InstID] = false;

return false; return false;

kristof.beylsUnsubmitted

Done

My guess is that this logic would also be clearer if it would be fused with the previous loop populating SCDescRepl[i].

kristof.beyls: My guess is that this logic would also be clearer if it would be fused with the previous loop…

} }

}

// Replacement cost.

unsigned ReplCost = 0;

for (auto IDesc :InstDescRepl)

ReplCost += SchedModel.computeInstrLatency(IDesc->getOpcode());

if (SchedModel.computeInstrLatency(InstDesc->getOpcode()) > if (SchedModel.computeInstrLatency(InstDesc->getOpcode()) > ReplCost)

SchedModel.computeInstrLatency(InstDescRep1->getOpcode()) + {

SchedModel.computeInstrLatency(InstDescRep2->getOpcode())) { SIMDInstrTable[InstID] = true;

VecInstElemTable[InstDesc->getOpcode()] = true;

return true; return true;

} }

VecInstElemTable[InstDesc->getOpcode()] = false; else

{

SIMDInstrTable[InstID] = false;

return false; return false;

} }

}

/// Determine if we need to exit the vector by element instruction /// Determine if we need to exit the instruction replacement optimization

/// optimization pass early. This makes sure that Targets with no need /// subpasses early. This makes sure that Targets with no need for this

/// for this optimization do not spent any compile time on this pass. /// optimization do not spend any compile time on this subpass other than the

/// This check is done by comparing the latency of an indexed FMLA /// simple check performed here. This simple check is done by comparing the

/// instruction to the latency of the DUP + the latency of a vector /// latency of the original instruction to the latency of the replacement

/// FMLA instruction. We do not check on other related instructions such /// instructions. We only check for a representative instruction in the class of

/// as FMLS as we assume that if the situation shows up for one /// instructions and not all concerned instructions. For the VectorElem subpass,

/// instruction, then it is likely to show up for the related ones. /// we check for the FMLA instruction while for the interleave subpass we check

kristof.beylsUnsubmitted

Done

As pointed out earlier, a vector/SmallVector rather than an array would result in not needing to manually calculate and specify the length of the array here?

kristof.beyls: As pointed out earlier, a vector/SmallVector rather than an array would result in not needing…

/// Return true if early exit of the pass is recommended. /// for the st2.4s instruction.

bool AArch64VectorByElementOpt::earlyExitVectElement(MachineFunction *MF) { /// Return true if early exit of the subpass is recommended.

kristof.beylsUnsubmitted

Done

It seems that "VectElement" isn't relevant in this function name anymore. Maybe it should become "shouldExitEarly" or something similar?

kristof.beyls: It seems that "VectElement" isn't relevant in this function name anymore. Maybe it should…

std::map<unsigned, bool> VecInstElemTable; bool AArch64SIMDInstrOpt::shouldExitEarly(MachineFunction *MF, Subpass SP) {

kristof.beylsUnsubmitted

Done

if this function would us a switch to handle different enum values, the assert here wouldn't be needed; having just an assert or llvm_unreachable on the default case. I think that's more idiomatic in the LLVM code base.

kristof.beyls: if this function would us a switch to handle different enum values, the assert here wouldn't be…

kristof.beylsUnsubmitted

Done

According to the coding standard, sp should probably be SP?

kristof.beyls: According to the coding standard, sp should probably be SP?

const MCInstrDesc *IndexMulMCID = &TII->get(AArch64::FMLAv4i32_indexed); const MCInstrDesc* OriginalMCID;

const MCInstrDesc *DupMCID = &TII->get(AArch64::DUPv4i32lane); SmallVector<const MCInstrDesc*, MaxNumRepl> ReplInstrMCID;

const MCInstrDesc *MulMCID = &TII->get(AArch64::FMULv4f32);

switch (SP) {

case VectorElem:

OriginalMCID = &TII->get(AArch64::FMLAv4i32_indexed);

ReplInstrMCID.push_back(&TII->get(AArch64::DUPv4i32lane));

ReplInstrMCID.push_back(&TII->get(AArch64::FMULv4f32));

if (shouldReplaceInst(MF, OriginalMCID, ReplInstrMCID))

return false;

break;

case Interleave:

// Check if early exit decision is already available in the cached

// table or not.

std::string Subtarget = SchedModel.getSubtargetInfo()->getCPU();

if (!InterlEarlyExit.empty() &&

kristof.beylsUnsubmitted

Done

This still seems to be doing a bit of work, even though the information could already be cached from processing an earlier function.
I guess this could be improved by caching per Subpass PS, rather than per MCInstrDesc*?
I'm not sure if this is needed. Could you benchmark the compile-time impact as is on e.g. CTMark both when targeting Exynos (i.e. when shouldExitEarly returns false), and for another target (i.e. when shouldExitEarly returns true).

kristof.beyls: This still seems to be doing a bit of work, even though the information could already be cached…

azAuthorUnsubmitted

Not Done

Let me know if you have any more comments.

Benchmark Exynos-m1 cortex-a57 exynos-m1 (pass disabled) cortex-a57 (pass disabled)

7zip 123.7240 123.1800 122.6360 122.6640

Bullet 82.6360 82.1200 82.0840 82.3680

ClamAV 51.3680 51.1360 50.3680 50.7920

consumer-typeset 34.4440 34.2800 34.3760 34.3800

kimwitu++ 34.3680 34.2480 34.6080 34.2800

lencod 49.8160 49.7760 49.2880 49.2560

mafft 23.0160 23.0360 23.1760 23.0200

SPASS 42.2720 42.1440 41.8800 41.7720

sqlite3 25.0840 25.1520 24.9920 25.1200

tramp3d-v4 37.1520 37.4240 37.4560 37.0240

az: Here are some compile time numbers I got by running CTMark. Every number represents the best of…

kristof.beylsUnsubmitted

Done

Thanks for the data!
I've added some geomean calculations:

Benchmark	Exynos-m1	cortex-a57	exynos-m1(pass disabled)	cortex-a57(pass disabled)
7zip	123.724	123.18	122.636	122.664
Bullet	82.636	82.12	82.084	82.368
ClamAV	51.368	51.136	50.368	50.792
consumer-typeset	34.444	34.28	34.376	34.38
kimwitu++	34.368	34.248	34.608	34.28
lencod	49.816	49.776	49.288	49.256
mafft	23.016	23.036	23.176	23.02
SPASS	42.272	42.144	41.88	41.772
sqlite3	25.084	25.152	24.992	25.12
tramp3d-v4	37.152	37.424	37.456	37.024
GEOMEAN	44.14092425	44.06844708	43.9700723	43.90935266
COMPARED TO DISABLED	100.39%	100.36%

kristof.beyls: Thanks for the data! I've added some geomean calculations: | Benchmark |Exynos-m1 |cortex…