This is an archive of the discontinued LLVM Phabricator instance.

[AArch64][SVE] Break false dependencies for inactive lanes of unary operations
ClosedPublic

Authored by bsmith on Jul 13 2021, 5:42 AM.

Diff Detail

Event Timeline

bsmith created this revision.Jul 13 2021, 5:42 AM
bsmith requested review of this revision.Jul 13 2021, 5:42 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 13 2021, 5:42 AM
llvm/lib/Target/AArch64/AArch64ExpandPseudoInsts.cpp
580

Nit: Asserts are only present if not NDEBUG anyway, so the ifndef seems superfluous (I guess, delete it from the nearby ones since we're here?)

See https://en.cppreference.com/w/cpp/error/assert

Matt added a subscriber: Matt.Jul 13 2021, 8:48 AM
bsmith updated this revision to Diff 359291.Jul 16 2021, 5:26 AM
  • Remove unnecessary NDEBUG ifdefs
efriedma added inline comments.Jul 16 2021, 4:55 PM
llvm/lib/Target/AArch64/AArch64ExpandPseudoInsts.cpp
485

This "#ifndef NDEBUG" is probably still necessary.

bsmith updated this revision to Diff 359787.Jul 19 2021, 7:34 AM
  • Opt to put the #ifndef DEBUG back in, rationale being that it's clearer even if technically the asserts don't need to be surrounded by them.
paulwalker-arm added inline comments.Jul 21 2021, 9:33 AM
llvm/lib/Target/AArch64/AArch64ExpandPseudoInsts.cpp
500

This looks weird to me because there's nothing to say DOPReg is actually unique. Here I think ConstructiveUnaryPassthru is special as for it's associated pseudo instructions the passthru operand is effectively irrelevant because all active lanes come from the data operand with the inactive lanes being undef or zero.

I imagine you are forcing true so that the new assert doesn't fire, but I think that new assert is also irrelevant because we don't care about the other operands [1]. I don't have a problem forcing true but the assert just seems like dead code.

That said I'm then wondering what value there is referencing the passthrough value at all, but instead treating the data operand as DOP instead. This seems weird but at the same time consistent with other instructions in that DOP contains real data for the active lanes. If you did this then you will not need special handling within the Create the destructive operation block and in fact the if (FalseZero) part of that block will now be correct for if we even need zeroing pseudos for these unary ops.

What do you think?

I guess I'm suggesting:
std::tie(PredIdx, DOPIdx, SrcIdx) = std::make_tuple(2, 3, 1); or perhaps even std::tie(PredIdx, DOPIdx, SrcIdx) = std::make_tuple(2, 3, 3); which means you could use the same logic as DestructiveBinaryImm when regenerating the unary instruction. This might seem more palatable if changing the name for ConstructiveUnaryPassthru is agreeable.

[1] For what it worth I suspect there's a bug somewhere in this logic whereby DstReg is part of some of the DOPRegIsUnique computations but that's not your problem.

llvm/lib/Target/AArch64/AArch64InstrFormats.td
39

I'll admit that I had no issues with this name until I wrote the large comment for expand_DestructiveOp. That said it has me wondering if these instructions can truely be considered constructive. As I see it the unary passthrough instructions take two operands and munges them to produce a result. This result is then used to overwrite the first operand (the one that is used when calculating the result for inactive lanes), which to my mind means that it is still a destructive operation. So DestructiveUnaryPassthru is a more accurate name. What do you think?

llvm/test/CodeGen/AArch64/sve2-unary-movprfx.ll
8

I don't think you need to test all instructions but there should be a test for each new pattern. Can you ensure there is a test for one instruction within each modified group. For example, I don't see any tests for the changes made to sve_int_un_pred_arit_1_fp, or is it the case there are existing tests for these?

10

For the first few tests can you add a short comment to describe what the test is verifying. I ask because I was initially confused with the output of sqabs_i8_active until I realised _active meant the predicate is all active and thus making the passthrough value dead.

bsmith added inline comments.Jul 22 2021, 7:38 AM
llvm/lib/Target/AArch64/AArch64ExpandPseudoInsts.cpp
500

I think it possibly makes sense to do std::tie(PredIdx, DOPIdx, SrcIdx) = std::make_tuple(2, 3, 3);, as that would make it more consistent with the others, and would remove the special case in the movprfx construction (I assume you meant there rather than the Create the destructive operation block?), 2,3,1 seems strange to me as the passthru is not the source operand?

llvm/lib/Target/AArch64/AArch64InstrFormats.td
39

I agree, now I that I think about it like that, I'll change the name of this.

llvm/test/CodeGen/AArch64/sve2-unary-movprfx.ll
8

sve_int_un_pred_arit_1_fp is tested in sve-unary-movprfx.ll (which you possibly missed because it is being hidden since it's large), but point taken, I probably have gone overboard here on tests..

bsmith updated this revision to Diff 360817.Jul 22 2021, 8:05 AM
bsmith marked 4 inline comments as done.
  • Rename ConstructiveUnaryPassthru to DestructiveUnaryPassthru
  • Alter/Simplfy pseudo expand to treat the source register as the data operand
  • Reduce number of tests to only cover each pattern once
  • Add comments clarifying what the tests are checking
paulwalker-arm accepted this revision.Jul 24 2021, 8:12 AM
This revision is now accepted and ready to land.Jul 24 2021, 8:12 AM
This revision was landed with ongoing or failed builds.Jul 26 2021, 8:01 AM
This revision was automatically updated to reflect the committed changes.

Hi, @bsmith, sorry to bother you, could you tell me the background of the patch?
I have a question about the patch, as I notice that the patch adds movprfx before some instructions, and the optimization may affect the performance(degradation).
; CHECK-NEXT: ptrue p0.d
+; CHECK-NEXT: movprfx z0, z1
; CHECK-NEXT: fsqrt z0.d, p0/m, z1.d

Herald added a project: Restricted Project. · View Herald TranscriptMay 6 2023, 12:17 AM

Hi, @bsmith, sorry to bother you, could you tell me the background of the patch?
I have a question about the patch, as I notice that the patch adds movprfx before some instructions, and the optimization may affect the performance(degradation).
; CHECK-NEXT: ptrue p0.d
+; CHECK-NEXT: movprfx z0, z1
; CHECK-NEXT: fsqrt z0.d, p0/m, z1.d

The rational for the patch is to ensure the inactive lanes of unary operations don't have a dependence on previous operations when the results for inactive lanes is not important (i.e. undef). Taking the above example, the problem with not breaking the dependency is that if z0.d is used as the result for a preceding long latency instruction then it can negatively impact the issue latency of this instructions despite there being no "real" dependency.

For this reason it is preferred to always break such dependencies via a movprfx instruction. The intent here is that movprfx is suppose to be free because the architecture prefers implementations to combine movprfx'd destructive instructions into constructive instructions. However, we are aware of issues where register allocation is making some poor decisions that is leading to movprfx being over used because dedicated destination registers are being allocated even though it would be advantageous to reuse an operand register.

Hi, @bsmith, sorry to bother you, could you tell me the background of the patch?
I have a question about the patch, as I notice that the patch adds movprfx before some instructions, and the optimization may affect the performance(degradation).
; CHECK-NEXT: ptrue p0.d
+; CHECK-NEXT: movprfx z0, z1
; CHECK-NEXT: fsqrt z0.d, p0/m, z1.d

The rational for the patch is to ensure the inactive lanes of unary operations don't have a dependence on previous operations when the results for inactive lanes is not important (i.e. undef). Taking the above example, the problem with not breaking the dependency is that if z0.d is used as the result for a preceding long latency instruction then it can negatively impact the issue latency of this instructions despite there being no "real" dependency.

For this reason it is preferred to always break such dependencies via a movprfx instruction. The intent here is that movprfx is suppose to be free because the architecture prefers implementations to combine movprfx'd destructive instructions into constructive instructions. However, we are aware of issues where register allocation is making some poor decisions that is leading to movprfx being over used because dedicated destination registers are being allocated even though it would be advantageous to reuse an operand register.

I got it, thanks for your reply!