This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Known bits for AArch64ISD::DUP
ClosedPublic

Authored by dmgreen on Jun 19 2022, 10:55 AM.

Details

Summary

An AArch64ISD::DUP is just a splat, where the known bits for each lane are the same as the input. This teaches that to computeKnownBitsForTargetNode.

Problems arise for constants though, as a constant BUILD_VECTOR can be lowered to an AArch64ISD::DUP, which SimplifyDemandedBits would then turn back into a constant BUILD_VECTOR leading to an infinite cycle. This has been prevented by adding a isCanonicalConstantNode node to prevent the conversion back into a BUILD_VECTOR.

Diff Detail

Event Timeline

dmgreen created this revision.Jun 19 2022, 10:55 AM
dmgreen requested review of this revision.Jun 19 2022, 10:55 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 19 2022, 10:55 AM

Thanks for isCanonicalConstantNode - hopefully it will work for some similar cases we have on X86.

Would AArch64 benefit from overriding isSplatValueForTargetNode to handle AArch64ISD::DUP?

Thanks for isCanonicalConstantNode - hopefully it will work for some similar cases we have on X86.

Would AArch64 benefit from overriding isSplatValueForTargetNode to handle AArch64ISD::DUP?

Yeah that might be useful - thanks for the suggestion, I'll take a look. In the long run I think I would like to remove AArch64ISD::DUP and just use ISD::SPLAT_VECTOR for all vectors across AArch64. That would take some time though, and would hit the same problem of canonicalising into a BUILD_VECTOR.

RKSimon added inline comments.Jun 20 2022, 12:09 AM
llvm/include/llvm/CodeGen/TargetLowering.h
1866

Add doxygen comment

1867

Does this mean we can/should add ISD::SPLAT_VECTOR handling to TargetLowering::SimplifyDemandedBits ?

llvm/lib/Target/AArch64/AArch64ISelLowering.h
1156

We probably want to do something like this in case this gets extended in the future:

return Opc == AArch64ISD::DUP || TargetLoweringBase::isCanonicalConstantNode(Opc, VT);
dmgreen added inline comments.Jun 20 2022, 12:49 AM
llvm/include/llvm/CodeGen/TargetLowering.h
1867

Yeah that sounds good, but we will need to enable SimplifyDemandedBits for scalable vectors first, which is in a followup patch. Enabling ComputeKnownBits for SPLAT_VECTOR is a part of D128159.

dmgreen updated this revision to Diff 438278.Jun 20 2022, 12:50 AM

Update as per review comments - Added doxygen and call the base function from AArch64. Also moved the definition of the function closer to other *ForTargetNode functions, and renamed to isTargetCanonicalConstantNode.

RKSimon accepted this revision.Jun 20 2022, 1:16 AM

LGTM cheers

This revision is now accepted and ready to land.Jun 20 2022, 1:16 AM

I'll take a look. In the long run I think I would like to remove AArch64ISD::DUP and just use ISD::SPLAT_VECTOR for all vectors across AArch64. That would take some time though, and would hit the same problem of canonicalising into a BUILD_VECTOR.

A while back I tried this and there were only a few places that relied on DAGCombine not messing with AArch64ISD::DUP in order to emit good code. I didn't push on with it as I was unsure how others would feel. Given the statement above I think I'll dig it because the easy first step is to restrict AArch64ISD::DUP to only fixed length vectors. Please let me know if you'd rather me hold off.

RKSimon added inline comments.Jun 20 2022, 3:46 AM
llvm/include/llvm/CodeGen/TargetLowering.h
3790

After some testing to reuse this on X86 we might need to change this to take a SDValue Op instead.

X86 shares broadcasted constants across different vector widths (i.e. we broadcast to v8i32 and reuse it for v4i32 as well because we can freely access the bottom subvector) - which means we still have infinite loops from extract_subvector(broadcast_load(constant_pool), 0) patterns.

Not sure if any other targets does anything similar? If not we can keep this as it is for now and I might need to tweak it when x86 support gets added.

A while back I tried this and there were only a few places that relied on DAGCombine not messing with AArch64ISD::DUP in order to emit good code. I didn't push on with it as I was unsure how others would feel. Given the statement above I think I'll dig it because the easy first step is to restrict AArch64ISD::DUP to only fixed length vectors. Please let me know if you'd rather me hold off.

Yeah - that does sound like a good incremental step.

llvm/include/llvm/CodeGen/TargetLowering.h
3790

SDValue sounds good to me - the EVT was always unused anyway and was just added in case it was useful. An SDValue sounds more general than that. I'll make that change as I commit it, feel free to adjust as you need in the future.

This revision was landed with ongoing or failed builds.Jun 20 2022, 11:12 AM
This revision was automatically updated to reflect the committed changes.

This breaks compilation for me, causing hangs when compiling some source files. Repro with https://martin.st/temp/rdopt-preproc.c, built with clang -target aarch64-w64-mingw32 -w -c -O3 rdopt-preproc.c. Previously this completed in ~5 seconds, now it doesn't terminate.

Thanks for the report, I'll take a look and get it fixed. I think it's due to a similar issue Simon mentioned, and our combining of half vectors into full vectors from D126449.

This patch seems to miss the case where the DUP first operand is not an integer type

The following example crashes

define void @dot_product(double %a) {
entry:
  %fadd = call double @llvm.vector.reduce.fadd.v3f64(double %a, <3 x double> <double 1.000000e+00, double 1.000000e+00, double 0.000000e+00>)
  %sqrt = call double @llvm.sqrt.f64(double %fadd)
  %insert = insertelement <3 x double> zeroinitializer, double %sqrt, i64 0
  %shuffle = shufflevector <3 x double> %insert, <3 x double> zeroinitializer, <3 x i32> zeroinitializer
  %mul = fmul <3 x double> %shuffle, <double 1.000000e+00, double 1.000000e+00, double 0.000000e+00>
  %shuffle.1 = extractelement <3 x double> %mul, i64 0
  %shuffle.2 = extractelement <3 x double> %mul, i64 1
  %cmp = fcmp ogt double %shuffle.2, 0.000000e+00
  br i1 %cmp, label %exit, label %bb.1

bb.1:
  %mul.2 = fmul double %shuffle.1, 0.000000e+00
  br label %exit

exit:
  ret void
}

declare double @llvm.sqrt.f64(double)
declare double @llvm.vector.reduce.fadd.v3f64(double, <3 x double>)

link: https://godbolt.org/z/dPc68eWfz