This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Fix over-eager fusing of NEON SIMD MUL/ADD
ClosedPublic

Authored by sanwou01 on Nov 25 2019, 8:37 AM.

Details

Summary

The ISel pattern for SIMD MLA is a bit too eager: it replaces the ADD with an
MLA even when the MUL cannot be eliminated, e.g. when it has another use. An
MLA is usually has a higher latency than an ADD (and there are fewer pipes
available that can execute it), so trading an MLA for an ADD is not great.

ISel is not taking the number of uses of the MUL result into account, nor any
other factors such as the length of the critical path or other resource pressure.

The MachineCombiner is able to make these judgments so this patch ports the ISel
pattern for MUL/ADD fusing to the MachineCombiner.

Similarly for MUL/SUB -> MLS, as well as the indexed variants.

The change has no impact on SPEC CPU© intrate nor fprate.

Diff Detail

Event Timeline

sanwou01 created this revision.Nov 25 2019, 8:37 AM
sanwou01 updated this revision to Diff 230913.Nov 25 2019, 8:51 AM
sanwou01 edited the summary of this revision. (Show Details)

Added performance evaluation results

This is because a mla is usually more expensive than a add? Seems sensible to me.

I don't know this code very well though.

llvm/lib/Target/AArch64/AArch64InstrInfo.td
3827–3831

You might want to keep these on separate lines, with the comment making them quite long.

sanwou01 updated this revision to Diff 231024.Nov 26 2019, 1:45 AM
sanwou01 edited the summary of this revision. (Show Details)

Clarified commit message; fixed long lines.

sanwou01 marked an inline comment as done.Nov 26 2019, 1:46 AM

Is the plan to add the indexed variants too? To treat them in the same way.

llvm/include/llvm/CodeGen/MachineCombinerPattern.h
42

Do these need V's on the front? The type is already in the name.

llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
4197

Is ReplacedAddend ever not nullptr?

sanwou01 updated this revision to Diff 231449.Nov 28 2019, 9:24 AM
sanwou01 marked 3 inline comments as done.
sanwou01 retitled this revision from [AArch64] Fix over-eager fusing of integer NEON MUL/ADD to [AArch64] Fix over-eager fusing of NEON SIMD MUL/ADD.
sanwou01 edited the summary of this revision. (Show Details)

Addressed @dmgreen's comments.

Is the plan to add the indexed variants too? To treat them in the same way.

Good shout. I've given the indexed variants the same treatment.

dmgreen added inline comments.Nov 28 2019, 10:47 AM
llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
4086

Can you explain what this means? Are these not already SIMD MULADDs?

sanwou01 marked 2 inline comments as done.Nov 29 2019, 1:40 AM
sanwou01 added inline comments.
llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
4086

Ah, sorry, that TODO was there to remind me to add them :-)

sanwou01 updated this revision to Diff 231495.Nov 29 2019, 1:43 AM
sanwou01 marked an inline comment as done.

Remove redundant TODO

dmgreen accepted this revision.Nov 30 2019, 1:42 AM

Thanks! LGTM

This revision is now accepted and ready to land.Nov 30 2019, 1:42 AM
This revision was automatically updated to reflect the committed changes.

This broke vectorized code in one case. To repro, fetch https://martin.st/temp/g723.c and compile with clang -target aarch64-w64-mingw32 -c -O3 g723.c. The diff in generated code is available at https://martin.st/temp/g723.s.diff.

The key change is this:

-       mul     v4.2s, v3.2s, v4.2s
-       mls     v4.2s, v5.2s, v2.2s
+       mul     v5.2s, v5.2s, v2.2s
+       mls     v5.2s, v3.2s, v4.2s

Thus it seems the sign is flipped; it used to calculate v3*v4-v5*v2, now it calculates v5*v2-v3*v4.

This broke vectorized code in one case. To repro, fetch https://martin.st/temp/g723.c and compile with clang -target aarch64-w64-mingw32 -c -O3 g723.c. The diff in generated code is available at https://martin.st/temp/g723.s.diff.

The key change is this:

-       mul     v4.2s, v3.2s, v4.2s
-       mls     v4.2s, v5.2s, v2.2s
+       mul     v5.2s, v5.2s, v2.2s
+       mls     v5.2s, v3.2s, v4.2s

Thus it seems the sign is flipped; it used to calculate v3*v4-v5*v2, now it calculates v5*v2-v3*v4.

Ah, I messed up MUL + SUB fusing when the MUL is the first operand of the SUB. Looks like those cases weren't being tested either, ah well! A patch is on the way.