This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Fix cost model for `udiv` instruction when one of the operands is a uniform constant
ClosedPublic

Authored by zjaffal on Oct 14 2022, 2:17 PM.

Details

Summary

Currently the model over estimates the cost of a udiv instruction with one constant. The correct cost for a udiv instruction is
insert_cost * extract_cost * num_elements

Diff Detail

Event Timeline

zjaffal created this revision.Oct 14 2022, 2:17 PM
Herald added a project: Restricted Project. · View Herald TranscriptOct 14 2022, 2:17 PM
zjaffal requested review of this revision.Oct 14 2022, 2:17 PM
Herald added a project: Restricted Project. · View Herald TranscriptOct 14 2022, 2:17 PM
dmgreen added inline comments.Oct 17 2022, 12:32 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2188

Using getArithmeticInstrCost here seems very odd.

I would expect the base cost to be N*getVectorInstrCost(Instruction::ExtractElement) + N* getVectorInstrCost(Instruction::InsertElement) + N*getArithmeticInstrCost(Div). Sometimes we use a cheaper cross-register-bank copy cost than getVectorInstrCost would give.

I would expect the getVectorInstrCost's to become removed if one of the operations was uniform or constant. And maybe the Cost += Cost shouldn't be needed if the vector instructions have correct costs. All the code here seems to be trying to give a very high cost for vector divides, as they will just be scalarized.

llvm/test/Analysis/CostModel/AArch64/div.ll
181–183

It seems odd that doing 2 divides, plus some cross-register-bank copies, would be cheaper than doing 1

david-arm added inline comments.Oct 17 2022, 1:05 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2147

I think this is broken because you're assuming that all vectors are of type FixedVectorType. This code will crash if you one of the operands is a constant and we're dealing with scalable vector types. Perhaps it's better to move this to the else case below, i.e.

if (TLI->isOperationLegalOrCustom(ISD, LT.second) && ST->hasSVE()) {
  ...
} else {
  if ((Op1Info.isConstant() && Op1Info.isUniform()) ||
    ...
  // On AArch64, without SVE, vector divisions are expanded

We shouldn't be dealing with scalable vectors at this point so the cast<FixedVectorType>(Ty) ought to be fine.

llvm/test/Analysis/CostModel/AArch64/div.ll
181–183

I agree with @dmgreen, this does look odd.

RKSimon added inline comments.
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2188

You should be able to use getScalarizationOverhead to handle all of the extract/insert costs for you.

Matt added a subscriber: Matt.Oct 19 2022, 5:19 AM
zjaffal added inline comments.Oct 19 2022, 6:39 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2147

I will move it after that block then

2188

I think this probably needs fixing because the assumed result is over a 100 for the cases that I was dealing with if you look at the test cases down below. I think there is repeated counting
because first the base cost is

InstructionCost Cost = BaseT::getArithmeticInstrCost(
    Opcode, Ty, CostKind, Op1Info, Op2Info);

And in the BaseT it calculates the extract and insert cost. This is the relevant snippet from llvm/include/llvm/CodeGen/BasicTTIImpl.h

if (auto *VTy = dyn_cast<FixedVectorType>(Ty)) {
  InstructionCost Cost = thisT()->getArithmeticInstrCost(
      Opcode, VTy->getScalarType(), CostKind, Opd1Info, Opd2Info,
      Args, CxtI);
  // Return the cost of multiple scalar invocation plus the cost of
  // inserting and extracting the values.
  SmallVector<Type *> Tys(Args.size(), Ty);
  return getScalarizationOverhead(VTy, Args, Tys) +
         VTy->getNumElements() * Cost;
}
llvm/test/Analysis/CostModel/AArch64/div.ll
181–183

I assumed that the insert and extract cost is 1. This is because getVectorInstrCost(Instruction::ExtractElement) and getVectorInstrCost(Instruction::InsertElement) is set to 2. I am not sure if I should use those values or keep it as it is.

zjaffal updated this revision to Diff 471125.Oct 27 2022, 4:50 AM

Move the code block to make sure that vectors are fixed size.
Update the cost model to use insert and extract cost

Can you explain more about the motivating case you have? Why should the cost of the vector divide be so low?

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2178

getArithmeticInstrCost should not be used with InsertElement or ExtractElement. It is used for arithmetic instructions like Add and Div. getVectorInstrCost should be used with InsertElement and ExtractElement.

For the i64 Mul case below we hard-coded the values 2 instead, due to the more regular nature of the scalarization. (It was originally 1, but we had to increase it as it was not quite high enough in cases).

2183

Remove the debug message in the final version.

2185

This is assuming that the cost of the actual divide is 1? That sounds too low in some cases, compared to scalar. Should it be (InsertCost + ExtractCost + Cost) * VTy->getNumElements()?

zjaffal marked 5 inline comments as done.Oct 31 2022, 2:42 AM
zjaffal added inline comments.
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2178

The code before my patch used getArithmeticInstrCost should we only use getVectorInstrCost then?

2185

The problem is that cost here counts the insert and extract cost so we will end up counting twice.

InstructionCost Cost = BaseT::getArithmeticInstrCost(
    Opcode, Ty, CostKind, Op1Info, Op2Info);

Takes into account the scalarization overhead. For the following example

define <8 x i32> @foo(<8 x i32> %v) {
  %res = udiv <8 x i32> %v, <i32 255, i32 255, i32 255, i32 255, i32 255, i32 255, i32 255, i32 255>
  ret <8 x i32> %res
}

The cost was 44. I don't think it is correct

dmgreen added inline comments.Oct 31 2022, 3:59 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2178

I think that using getArithmeticInstrCost in the old code was wrong, yeah. But all the old costmode code is pretty odd, like the way it does Cost += Cost;. Using getVectorInstrCost might end up giving too high a cost, if it does I would not be against using a cost of 2, like we do for Mul.

I would expect the basic cost of a vector divide that we expand to be:
(getArithmeticInstrCost(Div, ScalarTy) + 2 * getVectorInstrCost(Extract) + getVectorInstrCost(Insert)) * VF. The 2*ExractCost could be reduced to 1*ExtractCost if the operand is Uniform.

I haven't done any checking of that scheme though. That's where benchmarks come in, to make sure the theory matches practice and it doesn't end up making things worse. There may have been a reason why the old code got things "wrong".

2185

I see, because we convert the divide into a series of multiplies and shifts. That would apply to scalar divide too, right?

Do we need some code like this?
https://github.com/llvm/llvm-project/blob/72e9447e29abf111c742da59afe4152150a2f8e7/llvm/lib/Target/X86/X86TargetTransformInfo.cpp#L302
It only handles powers-of-2 though, so might need extending for your usecase. Having it somewhere generic would be good too.

zjaffal added inline comments.Nov 2 2022, 3:28 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2178

I tried doing the following now for uniform constants

FixedVectorType *VTy = cast<FixedVectorType>(Ty);
InstructionCost InsertExtractCost;
for(unsigned Idx = 0; Idx < VTy->getNumElements() ; Idx ++){
  InsertExtractCost += getVectorInstrCost(Instruction::InsertElement, VTy, Idx);
  InsertExtractCost += getVectorInstrCost(Instruction::ExtractElement, VTy, Idx);
}
InstructionCost DivCost = getArithmeticInstrCost(Opcode, VTy->getScalarType(), CostKind);
return InsertExtractCost + DivCost * VTy->getNumElements();

It overly estimates the cost of div instructions, take the following example:

%V64i8 = udiv <64 x i8> undef, <i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7>

The original estimate was: 880
My current patch estimate: 80
Using getArithmeticInstrCost: 424

I haven't done any checking of that scheme though. That's where benchmarks come in, to make sure the theory matches practice and it doesn't end up making things worse. There may have been a reason why the old code got things "wrong".

I will put a microbenchmark patch to test the cost result. I think this might give a clear indicator of the cost estimate

dmgreen added inline comments.Nov 2 2022, 8:46 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2178

424 seems like it might it might not be too far off, just counting instructions: https://godbolt.org/z/x8fsYP5Pb. Any 64x vector is going to fairly expensive if we need to scalarize it.

If it was me, I might be tempted to just use (DivCost + 4) * VF, to match the mul variant below. 4 is 2 for insert, 2 for extract. It would be 6 for an instruction with variable operands. It might work better for regular scalarization of integers.

That would rely on getArithmeticInstrCost(Div, ScalarTy) getting the correct cost for non-power2 constants to be precise for the example above, which it might not at the moment.

zjaffal updated this revision to Diff 473975.Nov 8 2022, 6:02 AM

Use scalar div cost + 4 to calcuate uniform costant division cost

Thanks. The existing code is pretty odd, we may want to clean it up at some point. The new code seems better though. If you can clean it up a little as suggested and explain the difference with the scalar costs then this sounds good to me.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
18

Are these changes necessary?

2147

Remove this change now?

2174

if -> If

2182

Should UDiv be Opcode, to handle SDiv a little better?

llvm/test/Analysis/CostModel/AArch64/div.ll
181–183

Can you explain why this isn't (7+4)*2 = 22? Is the scalar cost treated as 1?

zjaffal added inline comments.Nov 9 2022, 6:49 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
18

I don't think so. Automatic imports from vscode

2182

Do you mean using ISD::UDiv instead of Instruction::UDiv.
I tried that but it get the following error if I do that

Assertion failed: (ISD && "Invalid opcode"), function getArithmeticInstrCost, file BasicTTIImpl.h, line 834.

We can do the following instead

int Opcdoe = (ISD == ISD::UDIV) ? Instruction::UDiv : Instruction::SDiv;
dmgreen added inline comments.Nov 9 2022, 7:05 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2182

I just meant the variable/argument Opcode - as far as I understand it could be UDiv or SDiv here.

zjaffal marked 2 inline comments as done.Nov 9 2022, 7:07 AM
zjaffal added inline comments.
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2182

We can do the following then

int Opcdoe = (ISD == ISD::UDIV) ? Instruction::UDiv : Instruction::SDiv;
zjaffal updated this revision to Diff 477764.Nov 24 2022, 6:04 AM
zjaffal marked 9 inline comments as done.

Remove unecessary imports, fix comments.
Opcode can be either UDiv or SDiv

dmgreen accepted this revision.Nov 25 2022, 12:30 AM

Other than changing the parameter passed through, this LGTM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2090

This one :)

2182

I think the Opcode parameter will already be the UDiv/SDiv, and can be used directly. Correct me if I'm wrong.

This revision is now accepted and ready to land.Nov 25 2022, 12:30 AM
zjaffal updated this revision to Diff 478141.Nov 28 2022, 12:35 AM

Use opcode to get div cost

This revision was landed with ongoing or failed builds.Nov 28 2022, 12:42 AM
This revision was automatically updated to reflect the committed changes.