This is an archive of the discontinued LLVM Phabricator instance.

[CostModel] Express cost(urem) as cost(div+mul+sub) when set to Expand.
ClosedPublic

Authored by sdesmalen on Jun 7 2021, 3:01 AM.

Details

Summary

The Legalizer expands the operations of urem/srem into a div+mul+sub or divrem
when those are legal/custom. This patch changes the cost-model to reflect that
cost.

Since there is no 'divrem' Instruction in LLVM IR, the cost of divrem
is assumed to be the same as div+mul+sub since the three operations will
need to be executed at runtime regardless.

Patch co-authored by David Sherwood (@david-arm)

Diff Detail

Event Timeline

sdesmalen created this revision.Jun 7 2021, 3:01 AM
sdesmalen requested review of this revision.Jun 7 2021, 3:01 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 7 2021, 3:01 AM

This mimics the implementation in SelectionDAG which uses TargetLowering::expandREM (see e.g. D81511)

This sounds OK to me, so long as the X86 numbers are not bonkers.

llvm/include/llvm/CodeGen/BasicTTIImpl.h
788

I don't think you need the brackets here.

This sounds OK to me, so long as the X86 numbers are not bonkers.

Thanks!

@RKSimon do the X86 cost numbers look sensible to you?

I had another look at the costs for X86.

These are a factor three bigger because the cost of a scalar REM is now calculated to be expanded to div+sub+mul rather than the default 1 for 'unknown', which is then multiplied in X86TTIImpl::getArithmeticInstrCost by an arbitrary 20 * LT.first * LT.second.getVectorNumElements(). The comment of that code (https://github.com/llvm/llvm-project/blob/6f3f9535fcafcde11d3b3ef72fdc0f357813e9da/llvm/lib/Target/X86/X86TargetTransformInfo.cpp#L1044) says:

// It is not a good idea to vectorize division. We have to scalarize it and
// in the process we will often end up having to spilling regular
// registers. The overhead of division is going to dominate most kernels
// anyways so try hard to prevent vectorization of division - it is
// generally a bad idea. Assume somewhat arbitrarily that we have to be able
// to hide "20 cycles" for each lane.

The options are to:

  1. Tweak this number '20' to keep the cost similar to what they are now (that would require special-casing the code in X86TTI for UREM/SREM, since the condition is currently bundled with SDIV/UDIV).
  2. Leave it as-is, since the values are already largely arbitrary an arbitrary number and the higher number satisfies the goal of preventing vectorization of division.

Does anyone have a suggestion and/or preference?

I had another look at the costs for X86.

....

The options are to:

  1. Tweak this number '20' to keep the cost similar to what they are now (that would require special-casing the code in X86TTI for UREM/SREM, since the condition is currently bundled with SDIV/UDIV).
  2. Leave it as-is, since the values are already largely arbitrary an arbitrary number and the higher number satisfies the goal of preventing vectorization of division.

Does anyone have a suggestion and/or preference?

This is actually on my backlog - now that we've improved scalarization costs on x86 I'm hoping to add more realistic scalar div/rem costs instead of relying on this kludge. So I'm happy with (2) for now.

This is actually on my backlog - now that we've improved scalarization costs on x86 I'm hoping to add more realistic scalar div/rem costs instead of relying on this kludge. So I'm happy with (2) for now.

Thanks for confirming!

Would you or @dmgreen be happy to accept the patch in that case?

paulwalker-arm accepted this revision.Jul 7 2021, 3:01 AM
This revision is now accepted and ready to land.Jul 7 2021, 3:01 AM
RKSimon accepted this revision.Jul 7 2021, 3:16 AM

LGTM cheers

This revision was landed with ongoing or failed builds.Jul 7 2021, 6:41 AM
This revision was automatically updated to reflect the committed changes.