This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Improve cost of vector division by constant
ClosedPublic

Authored by zatrazz on Apr 24 2018, 6:46 AM.

Details

Summary

With custom lowering for vector MULLH{S,U} [1], it is now profitable to
vectorize a divide by constant loop for the custom types (v16i8, v8i16,
and v4i32). The cost if based on TargetLowering::Build{S,U}DIV which
uses a multiply by constant plus adjustment to express a divide by
constant.

Both {u,s}mull{2} are expressed as Instruction::Mul and shifts by
Instruction::AShr.

[1] With custom lowering for vector MULLH{S,U}, it is now profitable to
vectorize a divide by constant loop for the custom types (v16i8, v8i16,
and v4i32). The cost if based on TargetLowering::Build{S,U}DIV which
uses a multiply by constant plus adjustment to express a divide by
constant.

Both {u,s}mull{2} are expressed as Instruction::Mul and shifts by
Instruction::AShr.

Diff Detail

Event Timeline

zatrazz created this revision.Apr 24 2018, 6:46 AM

Do we have similar cost tests for x86, Arm and others? It would be nice to make sure the change to BasicTTIImplBase didn't break other targets' costs.

Do we have similar cost tests for x86, Arm and others? It would be nice to make sure the change to BasicTTIImplBase didn't break other targets' costs.

I should not change other architectures than AArch64 because 'isArithmeticDivFast' is a new method (only used by this patch). I initially added this logic on AArch64TargetTransformInfo.cpp, but since the division by constant is a platform agnostic transformation I though it would make sense to check the same check on a platform agnostic manner.

Ping now that https://reviews.llvm.org/D46009 has been committed.

I should not change other architectures than AArch64 because 'isArithmeticDivFast' is a new method (only used by this patch).

This still looks very AArch64-y to me. Those two ISD nodes are specific to the case you're working on. I'm sure there are a lot more "divs" than that. :)

but since the division by constant is a platform agnostic transformation I though it would make sense to check the same check on a platform agnostic manner.

It is platform agnostic if you cover all cases that a platform agnostic constant divide function should. Right now, it's specific to this one issue in this one arch.

Alternatively, a simple static function here would be more appropriate. We can expand later if it proves to be more than just this case/arch.

I would mention that this is relying on the code generator to pick this exact pattern and hope nothing goes wrong between cost analysis and generation, but this is not new, so not for this patch. :)

--renato

I should not change other architectures than AArch64 because 'isArithmeticDivFast' is a new method (only used by this patch).

This still looks very AArch64-y to me. Those two ISD nodes are specific to the case you're working on. I'm sure there are a lot more "divs" than that. :)

but since the division by constant is a platform agnostic transformation I though it would make sense to check the same check on a platform agnostic manner.

It is platform agnostic if you cover all cases that a platform agnostic constant divide function should. Right now, it's specific to this one issue in this one arch.

The divide-by-constant is lowered in a platform agnostic way at TargetLowering::BuildUDIV and TargetLowering::BuildSDIV (lib/CodeGen/SelectionDAG/TargetLowering.cpp). You can see that if target can lower either ISD::MULHS or ISD::SMUL_LOHI the multiply by inverse plus adjustment algorithm will be used. AArch64 is the case where it currently lowers ISD::MULHS.

What I am not sure if I need to handle the TargetTransformInfo::OP_PowerOf2 case, since DAGCombiner::visitSDIV will apply this transformation prior BuildSDIV/BuildUDIV.

Alternatively, a simple static function here would be more appropriate. We can expand later if it proves to be more than just this case/arch.

I would mention that this is relying on the code generator to pick this exact pattern and hope nothing goes wrong between cost analysis and generation, but this is not new, so not for this patch. :)

Yes, ideally this kind of cost analysis would be better *after* the BuildUDIV/BuildSDIV expansion.

zatrazz updated this revision to Diff 145223.May 4 2018, 10:34 AM

But I do agree that there is not much gain in making the division by constant cost logic generic. I changed by moving the logic on aarch64 target cost class.

lib/Target/AArch64/AArch64TargetTransformInfo.cpp
477

And this new line shouldn't be here, I will remove it.

rengolin accepted this revision.May 4 2018, 11:37 AM

LGTM with the line removed. :)

Thanks!

This revision is now accepted and ready to land.May 4 2018, 11:37 AM

I tired this patch on Exynos and the performance of SPEC CPU2000 was virtually neutral, even if slightly negative overall in the integer score. Perhaps the defaults costs are too optimistic about the relative difference between multiply and add?

I tired this patch on Exynos and the performance of SPEC CPU2000 was virtually neutral, even if slightly negative overall in the integer score. Perhaps the defaults costs are too optimistic about the relative difference between multiply and add?

Any idea if the overall result is due the patch itself of system jitter/noise? Does any component outliers significantly?

In fact I couldn't find any code generation different with and without this patch on speccpu2000. Most likely the performance difference is expected runtime variance from speccpu2000 (I do see a lot in some components).

zatrazz closed this revision.May 9 2018, 6:21 AM