Currently the model overestimates the cost of a udiv instruction with one constant operand. The correct cost for such a udiv instruction is
(insert_cost + extract_cost) * num_elements
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
Line 2190: Using getArithmeticInstrCost here seems very odd. I would expect the base cost to be N*getVectorInstrCost(Instruction::ExtractElement) + N*getVectorInstrCost(Instruction::InsertElement) + N*getArithmeticInstrCost(Div). Sometimes we use a cheaper cross-register-bank copy cost than getVectorInstrCost would give. I would expect the getVectorInstrCosts to be removed if one of the operands is uniform or constant, and maybe the Cost += Cost shouldn't be needed if the vector instructions have correct costs. All the code here seems to be trying to give a very high cost to vector divides, as they will just be scalarized.
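A minimal standalone sketch of the per-element accounting described in that comment. The numeric costs and the helper below are hypothetical placeholders for the TTI hooks (getVectorInstrCost and getArithmeticInstrCost), not the real AArch64 values:

```
#include <cstdio>

// Hypothetical per-operation costs standing in for the TTI hooks:
// getVectorInstrCost(ExtractElement/InsertElement) and
// getArithmeticInstrCost(Div) on the scalar type.
struct CostModel {
  unsigned ExtractCost = 2;
  unsigned InsertCost = 2;
  unsigned ScalarDivCost = 1;
};

// Expected base cost of an expanded (scalarized) vector divide:
// one extract per lane for each non-constant operand, one scalar divide
// per lane, and one insert per lane for the results. The extracts are
// dropped for a uniform/constant operand, mirroring the comment above.
unsigned expandedDivCost(const CostModel &M, unsigned NumElts,
                         bool Op1IsUniformConstant, bool Op2IsUniformConstant) {
  unsigned Cost = 0;
  if (!Op1IsUniformConstant)
    Cost += NumElts * M.ExtractCost;
  if (!Op2IsUniformConstant)
    Cost += NumElts * M.ExtractCost;
  Cost += NumElts * M.ScalarDivCost;
  Cost += NumElts * M.InsertCost;
  return Cost;
}

int main() {
  CostModel M;
  // e.g. an <8 x i32> udiv where the divisor is a uniform constant.
  printf("cost = %u\n", expandedDivCost(M, 8, false, true));
  return 0;
}
```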
llvm/test/Analysis/CostModel/AArch64/div.ll
Lines 181–183: It seems odd that doing 2 divides, plus some cross-register-bank copies, would be cheaper than doing 1.
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
Line 2147: I think this is broken because you're assuming that all vectors are of type FixedVectorType. This code will crash if one of the operands is a constant and we're dealing with scalable vector types. Perhaps it's better to move this to the else case below, i.e.

```
if (TLI->isOperationLegalOrCustom(ISD, LT.second) && ST->hasSVE()) {
  ...
} else {
  if ((Op1Info.isConstant() && Op1Info.isUniform()) || ...
    // On AArch64, without SVE, vector divisions are expanded
```

We shouldn't be dealing with scalable vectors at this point, so the cast<FixedVectorType>(Ty) ought to be fine.
llvm/test/Analysis/CostModel/AArch64/div.ll
Lines 181–183: I agree with @dmgreen, this does look odd.
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
Line 2190: You should be able to use getScalarizationOverhead to handle all of the extract/insert costs for you.
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
Line 2147: I will move it after that block then.
Line 2190: I think this probably needs fixing, because the assumed result is over 100 for the cases I was dealing with, if you look at the test cases down below. I think there is repeated counting:

```
InstructionCost Cost = BaseT::getArithmeticInstrCost(
    Opcode, Ty, CostKind, Op1Info, Op2Info);
```

In BaseT this already calculates the extract and insert cost. This is the relevant snippet from llvm/include/llvm/CodeGen/BasicTTIImpl.h:

```
if (auto *VTy = dyn_cast<FixedVectorType>(Ty)) {
  InstructionCost Cost = thisT()->getArithmeticInstrCost(
      Opcode, VTy->getScalarType(), CostKind, Opd1Info, Opd2Info, Args, CxtI);
  // Return the cost of multiple scalar invocation plus the cost of
  // inserting and extracting the values.
  SmallVector<Type *> Tys(Args.size(), Ty);
  return getScalarizationOverhead(VTy, Args, Tys) +
         VTy->getNumElements() * Cost;
}
```
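To make the double counting concrete, here is a small standalone sketch. The per-operation numbers are hypothetical placeholders, not the values the AArch64 cost model actually returns; the point is only that adding insert/extract costs on top of the BaseT result charges the scalarization overhead twice:

```
#include <cstdio>

int main() {
  // Hypothetical per-operation costs (placeholders, not real TTI values).
  unsigned InsertCost = 2, ExtractCost = 2, ScalarDivCost = 1;
  unsigned NumElts = 8; // e.g. the <8 x i32> udiv-by-constant case

  // What BaseT::getArithmeticInstrCost already returns for a scalarized op:
  // the scalarization overhead (inserts + extracts) plus NumElts scalar divides.
  unsigned ScalarizationOverhead = NumElts * (InsertCost + ExtractCost);
  unsigned BaseCost = ScalarizationOverhead + NumElts * ScalarDivCost;

  // Adding per-element insert/extract costs again on top of BaseCost
  // counts the same overhead twice.
  unsigned DoubleCounted = BaseCost + NumElts * (InsertCost + ExtractCost);

  printf("base cost: %u, with duplicated overhead: %u\n", BaseCost, DoubleCounted);
  return 0;
}
```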
llvm/test/Analysis/CostModel/AArch64/div.ll
Lines 181–183: I assumed that the insert and extract cost is 1; however, getVectorInstrCost(Instruction::ExtractElement) and getVectorInstrCost(Instruction::InsertElement) are set to 2. I am not sure if I should use those values or keep it as it is.
Move the code block to make sure that vectors are fixed size.
Update the cost model to use insert and extract cost
Can you explain more about the motivating case you have? Why should the cost of the vector divide be so low?
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
Line 2178: getArithmeticInstrCost should not be used with InsertElement or ExtractElement; it is used for arithmetic instructions like Add and Div. getVectorInstrCost should be used with InsertElement and ExtractElement. For the i64 Mul case below we hard-coded the value 2 instead, due to the more regular nature of the scalarization. (It was originally 1, but we had to increase it as it was not quite high enough in some cases.)
Line 2183: Remove the debug message in the final version.
Line 2185: This is assuming that the cost of the actual divide is 1? That sounds too low in some cases, compared to scalar. Should it be (InsertCost + ExtractCost + Cost) * VTy->getNumElements()?
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
Line 2178: The code before my patch used getArithmeticInstrCost; should we only use getVectorInstrCost then?
Line 2185: The problem is that the cost here already counts the insert and extract cost, so we will end up counting twice:

```
InstructionCost Cost = BaseT::getArithmeticInstrCost(
    Opcode, Ty, CostKind, Op1Info, Op2Info);
```

takes the scalarization overhead into account. For the following example:

```
define <8 x i32> @foo(<8 x i32> %v) {
  %res = udiv <8 x i32> %v, <i32 255, i32 255, i32 255, i32 255, i32 255, i32 255, i32 255, i32 255>
  ret <8 x i32> %res
}
```

the cost was 44. I don't think that is correct.
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
Line 2178: I think that using getArithmeticInstrCost in the old code was wrong, yeah. But all the old cost model code is pretty odd, like the way it does Cost += Cost;. Using getVectorInstrCost might end up giving too high a cost; if it does, I would not be against using a cost of 2, like we do for Mul. I would expect the basic cost of a vector divide that we expand to be the per-element extract, scalar divide and insert costs, as above. I haven't done any checking of that scheme though. That's where benchmarks come in, to make sure the theory matches practice and it doesn't end up making things worse. There may have been a reason why the old code got things "wrong".
Line 2185: I see, because we convert the divide into a series of multiplies and shifts. That would apply to a scalar divide too, right? Do we need some code like this?
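For context on the "multiplies and shifts" point: a divide by a constant such as 255 (as in the <8 x i32> example above) is typically strength-reduced to a multiply-high and a shift rather than an actual division. A minimal standalone sketch of the standard technique follows; the magic constant is the usual Granlund-Montgomery multiplier for an unsigned 32-bit divide by 255, not necessarily the exact sequence the AArch64 backend emits:

```
#include <cassert>
#include <cstdint>
#include <cstdio>

// Unsigned divide by the constant 255 via multiply-high and shift:
// x / 255 == (x * 0x80808081) >> 39 for all 32-bit unsigned x.
uint32_t udiv255(uint32_t X) {
  return (uint32_t)(((uint64_t)X * 0x80808081ULL) >> 39);
}

int main() {
  // Spot-check the expansion against a real division.
  for (uint64_t X = 0; X <= UINT32_MAX; X += 65537)
    assert(udiv255((uint32_t)X) == (uint32_t)X / 255);
  printf("udiv255(1020) = %u\n", udiv255(1020)); // prints 4
  return 0;
}
```

The same strength reduction applies to the scalar case, which is why the vector cost for a constant divisor needs to stay consistent with the scalar cost.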
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
Line 2178: I tried doing the following now for uniform constants:

```
FixedVectorType *VTy = cast<FixedVectorType>(Ty);
InstructionCost InsertExtractCost;
for (unsigned Idx = 0; Idx < VTy->getNumElements(); Idx++) {
  InsertExtractCost += getVectorInstrCost(Instruction::InsertElement, VTy, Idx);
  InsertExtractCost += getVectorInstrCost(Instruction::ExtractElement, VTy, Idx);
}
InstructionCost DivCost =
    getArithmeticInstrCost(Opcode, VTy->getScalarType(), CostKind);
return InsertExtractCost + DivCost * VTy->getNumElements();
```

It overestimates the cost of div instructions. Take the following example:

```
%V64i8 = udiv <64 x i8> undef, <i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7>
```

The original estimate was 880; the code above gives 424.

I will put up a microbenchmark patch to test the cost result. I think this might give a clear indicator of the cost estimate.
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
Line 2178: 424 seems like it might not be too far off, just counting instructions: https://godbolt.org/z/x8fsYP5Pb. Any 64x vector is going to be fairly expensive if we need to scalarize it. If it was me, I might be tempted to just use (DivCost + 4) * VF, to match the mul variant below. The 4 is 2 for insert, 2 for extract; it would be 6 for an instruction with variable operands. It might work better for regular scalarization of integers. That would rely on getArithmeticInstrCost(Div, ScalarTy) getting the correct cost for non-power-of-2 constants to be precise for the example above, which it might not at the moment.
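A minimal standalone sketch of the heuristic suggested above. The cost values are illustrative placeholders; the 4/6 constants are the insert+extract allowance from the comment (2 each, plus 2 more when both operands are variable):

```
#include <cstdio>

// Suggested scalarization heuristic: (DivCost + InsExtAllowance) * VF,
// where the allowance is 4 (2 insert + 2 extract) for a constant divisor
// and 6 when both operands are variable and must be extracted.
unsigned scalarizedDivCost(unsigned DivCost, unsigned VF, bool ConstantDivisor) {
  unsigned InsExtAllowance = ConstantDivisor ? 4 : 6;
  return (DivCost + InsExtAllowance) * VF;
}

int main() {
  // Illustrative scalar divide cost; the real value would come from
  // getArithmeticInstrCost(Div, ScalarTy).
  unsigned DivCost = 1;
  // e.g. the <64 x i8> udiv-by-7 example above.
  printf("cost = %u\n", scalarizedDivCost(DivCost, 64, /*ConstantDivisor=*/true));
  return 0;
}
```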
Thanks. The existing code is pretty odd, we may want to clean it up at some point. The new code seems better though. If you can clean it up a little as suggested and explain the difference with the scalar costs then this sounds good to me.
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
Line 18: Are these changes necessary?
Line 2147: Remove this change now?
Line 2174: if -> If
Line 2182: Should UDiv be Opcode, to handle SDiv a little better?
llvm/test/Analysis/CostModel/AArch64/div.ll
Lines 181–183: Can you explain why this isn't (7+4)*2 = 22? Is the scalar cost treated as 1?
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
Line 18: I don't think so. Automatic imports from vscode.
Line 2182: Do you mean using ISD::UDIV instead of Instruction::UDiv? That hits an assertion:

```
Assertion failed: (ISD && "Invalid opcode"), function getArithmeticInstrCost, file BasicTTIImpl.h, line 834.
```

We can do the following instead:

```
int Opcode = (ISD == ISD::UDIV) ? Instruction::UDiv : Instruction::SDiv;
```
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
Line 2182: I just meant the variable/argument Opcode - as far as I understand it could be UDiv or SDiv here.
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
Line 2182: We can do the following then:

```
int Opcode = (ISD == ISD::UDIV) ? Instruction::UDiv : Instruction::SDiv;
```