This is an archive of the discontinued LLVM Phabricator instance.

[CodeGen] Improve handling -Ofast generated code by ComplexDeinterleaving pass
ClosedPublic

Authored by igor.kirillov on Apr 17 2023, 11:58 AM.

Details

Summary

Code generated with -Ofast and -O3 -ffp-contract=fast (add
-ffinite-math-only to enable vectorization) can differ significantly.
Code compiled with -O3 can be deinterleaved using patterns as the
instruction order is preserved. However, with the -Ofast flag, there
can be multiple changes in the computation sequence, and even the real
and imaginary parts may not be calculated in parallel.
For more details, refer to
llvm/test/CodeGen/AArch64/complex-deinterleaving-*-fast.ll and
llvm/test/CodeGen/AArch64/complex-deinterleaving-*-contract.ll tests.
This patch implements a more general approach and enables handling most
-Ofast cases.

Depends on D148703

Diff Detail

Event Timeline

igor.kirillov created this revision.Apr 17 2023, 11:58 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 17 2023, 11:58 AM
igor.kirillov requested review of this revision.Apr 17 2023, 11:58 AM

I would like to bring up some questions:

  1. What is your opinion on the approach taken to handle -Ofast? While it may seem complicated, it attempts to address a broad range of scenarios and I'm not sure how it could be simplified.
  2. Due to the lack of direct link between ComplexDeinterleavingNode and IR Instrucitons when using -Ofast, I decided to use IRBuilder to create all new instructions in ordered way. This code can be extracted into a separate patch.
  3. Do we need additional tests? I have already included tests that demonstrate the difference between -Ofast and -O3 -ffp-contract=fast -ffinite-math-only in D148550. However, now we have some redundant tests in other files.

Due to the lack of direct link between ComplexDeinterleavingNode and IR Instrucitons when using -Ofast, I decided to use IRBuilder to create all new instructions in ordered way. This code can be extracted into a separate patch.

A separate patch would be ideal. There's a lot of changes, and it would help to include only what is related to the -Ofast handling here.

igor.kirillov edited the summary of this revision. (Show Details)

Re-adjust on top of D148703 and D148550. Add a complicated test to show that a sophisticated common sub-expression is processed well

A separate patch would be ideal. There's a lot of changes, and it would help to include only what is related to the -Ofast handling here.

Done, now we have pre-commit patch - D148550 and node replacement refactoring - D148703

Can you precommit the test changes in complex-deinterleaving-multiuses.ll?

llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp
902

Nit: Don't think I've ever seen this way to negate a bool. It certainly works, but I'd argue that !IsPositive is a more conventional/readable way

970

One of the original design goals was to not assume full multiplications, instead representing them as 2 partial multiplys. Is that something that is possible to preserve while still accounting for reassoc and -Ofast?

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
24710 ↗(On Diff #514945)

Do we want to be discarding the fastmath flags here?

llvm/lib/Target/AArch64/AArch64ISelLowering.h
23–24 ↗(On Diff #514945)

Is this still necessary?

llvm/lib/Target/ARM/ARMISelLowering.cpp
22105 ↗(On Diff #514945)

Do we want to be discarding the fastmath flags here?

igor.kirillov marked 2 inline comments as done.

Address some of the recent comments

igor.kirillov added inline comments.Apr 28 2023, 8:25 AM
llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp
970

Consider the case of complex number multiplication, such as (x, y) * (u, v). From each partial multiplication, we can only obtain three out of the four values (x, y, u, v). Consequently, it appears that performing a single partial multiplication without considering the second part is imposible. Also, it seems that current code for handling contract flag cases is capable of extracting only full multiplications, despite the detection process being divided into two methods: identifyPartialMul and identifyNodeWithImplicitAdd.

Long story short, I don't know how to separate multiplications from each other.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
24710 ↗(On Diff #514945)

Yeah, you're right. Even though the code works fine without the fastmath flags and they might not be important at this point in the optimization pipeline, it's probably better to keep them. But it's not that simple. First of all, we don't know which instruction to get the flags from when dealing with -Ofast code. My idea is to use the Collect lambda function from identifyReassocNodes and AND all the flags together and store them in the ComplexDeinterleavingNode.

The second issue is that currently, replaceSymmetricNode only uses flags from the Real Instruction. But what if the flags for the Imag Instruction are different?

The third thing to consider is whether we should add flags to the architecture-specific intrinsics generated by createComplexDeinterleavingIR.

I think the best way to handle all this is to add a Flags member to ComplexDeinterleavingNode and then apply those flags when using replaceNode.

What do you think?

Matt added a subscriber: Matt.Apr 28 2023, 3:17 PM
NickGuy added inline comments.May 3 2023, 3:39 AM
llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp
970

Due to how complex multiplication works, the second partial multiply will always succeed the first, however the first does not always indicate the presence of a second. While not currently implemented, the design does allow for the partial multiplys to be decoupled, theoretically matching cases like o = a * b.real(). The reason why it wasn't implemented initially is down to the arrangement of the shuffles (see below). Given this, I'm happy to defer the implementation for now, but as this patch aims to resolve that, I'd like to revisit the idea once this has landed :)

%strided.vec26 = shufflevector <8 x float> %wide.vec25, <8 x float> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
%14 = fmul fast <8 x float> %wide.vec25, %wide.vec
%15 = shufflevector <8 x float> %14, <8 x float> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
%16 = fmul fast <4 x float> %strided.vec26, %strided.vec24
%interleaved.vec = shufflevector <4 x float> %15, <4 x float> %16, <8 x i32> <i32 0, i32 4, i32 1, i32 5, i32 2, i32 6, i32 3, i32 7>
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
24710 ↗(On Diff #514945)

My idea is to use the Collect lambda function from identifyReassocNodes and AND all the flags together and store them in the ComplexDeinterleavingNode.

ANDing them feels like the most sensible to me, either that or ensuring they're all equal during identification.

I think the best way to handle all this is to add a Flags member to ComplexDeinterleavingNode and then apply those flags when using replaceNode.

That makes sense to me, but I'd be wary of how to apply the flags themselves. We'd either need to change the target hook to provide the flags, or have some way for a target to specify which instructions get which flags.
If we do decide to apply the flags on the common side, there's also the problem of applying them to all relevant instructions, we only get the one returned from createComplexDeinterleavingIR, but it could create an entire subgraph of instructions if the target deems it necessary (try the existing implementation with an unrealistic <32 x double> for an example of this).

The second issue is that currently, replaceSymmetricNode only uses flags from the Real Instruction. But what if the flags for the Imag Instruction are different?

That's a fair point. I'd err on the side of caution, and only perform the replacement if the flags are consistent on both the Real and Imag sides.

The third thing to consider is whether we should add flags to the architecture-specific intrinsics generated by createComplexDeinterleavingIR.

That is something I don't feel qualified to answer :D
My instinct says that it wouldn't be a problem, as nothing would be checking the flags on intrinsics, but can't say for sure.

  • Updated the algorithm for detecting partial multiplications. While it still supports only full complex number multiplications, it has been improved to process partial multiplications in the future.
  • Resolve the issue of flag loss by using Symmetric operation
igor.kirillov added inline comments.May 11 2023, 4:14 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
24710 ↗(On Diff #514945)

Added flag check to identifySymmetricOperation. And added two field to ComplexDeinterleavingCompositeNode Opcode and Flags that are used only by replaceSymmetricNode. Also, now I generate Symmetric nodes rather then CAdd with 0 and 180 degree rotation.

igor.kirillov added inline comments.May 11 2023, 4:23 AM
llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp
970

I made some changes to the algorithm. Now it collects all possible partial multiplications and identify complex numbers by using common instructions across different partial multiplications. I also added a place with TODO where we could handle independent (non-paired) partial multiplications in the future

1018

I lost this assertion as we don't have Instruction anymore and there are no isUnaryOp / isBinaryOp alternative for opcode

NickGuy accepted this revision.May 23 2023, 8:02 AM

LGTM

This revision is now accepted and ready to land.May 23 2023, 8:02 AM

Rebase, small refactoring replaceNode