This is an archive of the discontinued LLVM Phabricator instance.

SystemZTargetTransformInfo cost functions and some common code changes
ClosedPublic

Authored by jonpa on Feb 7 2017, 4:49 AM.

Details

Summary

SystemZTargetTransformInfo methods implemented:

int getArithmeticInstrCost()
int getCastInstrCost()
int getCmpSelInstrCost()

Common code changes:

getCmpSelInstrCost() has gotten an extra parameter to make it possible to pass the actual instruction if it exists. The motivation for this is that the actual cost on SystemZ for compare and select instructions depend on the scalar widths of the vector elements. The vector element compare instruction produces a bitmask for the elements, and the vector select operates with that bitmask. However, if the widths of the elements of the compare / select operands differ, the bitmask must be adjusted with pack or unpack instructions for instance. Therefore it is useful to look at the "other" instruction when evaluating cost for the compare or select instruction.

LoopVectorizer.cpp: Don't consider a vectorized cost for the compare if it is for the conditional back branch in the loop latch.

New ovlerloaded method getScalarizationOverhead() which was factored out of getArithmeticInstrCost(). This method is useful also in the SystemZ backend.

The fix in InstCombineVectorOps.cpp is "in progress", which is obviously needed in some cases. See https://llvm.org/bugs/show_bug.cgi?id=30630

Diff Detail

Event Timeline

jonpa created this revision.Feb 7 2017, 4:49 AM
jonpa updated this revision to Diff 87605.Feb 8 2017, 1:10 AM

At second thought, I removed the InstCombiner changes from this review, since it is not yet quite ready, and the other changes do not depend on it.

jonpa updated this revision to Diff 87794.Feb 9 2017, 4:43 AM

getShuffleCost() implemented. This method is currently in a state of mild confusion: http://lists.llvm.org/pipermail/llvm-dev/2017-February/109978.html.

Minor fixes here and there.

Costs are currently based on number of vectors, which is intuitive to me. Perhaps the more correct way to do this is to use TLI->getTypeLegalizationCost()?

Values from these cost functions now seem quite precise overall. Exception to this is for z10 (generic), values for fptui (i8/i16) and uitofp (i64) might need some tweaking if found needed.

Arbitrary constant cost values have been used for: fp-select: 4 (conditional jump), libcall: 30.

The values for compare / select are now reflecting an upcoming patch for VSELECT, so before this can be commited, it must be approved: https://reviews.llvm.org/D29489.

jonpa updated this revision to Diff 88339.Feb 14 2017, 3:01 AM

Use getNumberOfParts() instead of dividing with 128, to get number of vector registers needed.
Add default value of nullptr for Instruction* argument of getCmpSelInstrCost() in targets.

jonpa updated this revision to Diff 88355.Feb 14 2017, 5:22 AM

getMemoryOpCost() implemented.
SK_Broadcast cost adjusted to reflect vlrep type instructions, which loads and replicates in a single instruction.

jonpa updated this revision to Diff 89092.Feb 20 2017, 1:57 AM

Allow any VF > 16 in SystemZTargetTransformInfo cost functions, since it may actually be queried (at least VF==32).

uweigand edited edge metadata.Feb 20 2017, 6:33 AM

I've looked only at the SystemZ parts ... look basically good to me, but see a number of inline comments.

lib/Target/SystemZ/SystemZTargetTransformInfo.cpp
267

This seems to partially duplicate the logic in getMemoryOpCost ... is there a way to unify those?

337

Shouldn't we have And and Xor here as well?

391

Likewise ...

480

Can this not be handled generically via getElSizeLog2Diff like below for the unpack case?

525

These tables look odd ... but I guess if they accurately reflect the cost of the code that is currently being generated, this is fine with me until we improve codegen.

615

When do we get libcalls? At least with z196 and above this should never happen, so we might want to take the ISA level into account here.

731

This needs more explanation why this adjustment is needed in this case (and only in this case).

jonpa updated this revision to Diff 89362.Feb 22 2017, 8:00 AM
jonpa marked 2 inline comments as done.

Patch updated according to review.

CostModel tests added. The CostModel tests reflect the current costsl while using undef operands extensively, which makes the returned values lower than what is typical, for the scalarized instructions. Would it be better to rewrite these tests to use a separate function with arguments for each case, to include the operands extraction cost?

The tests depend on a patch for getOperandsScalarizationOverhead(), for it to handle vector type arguments, when it's called by CostModel:

http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20170220/432062.html:
diff --git a/include/llvm/CodeGen/BasicTTIImpl.h b/include/llvm/CodeGen/BasicTTIImpl.h
index d9131da..12f67f9 100644

  • a/include/llvm/CodeGen/BasicTTIImpl.h

+++ b/include/llvm/CodeGen/BasicTTIImpl.h
@@ -312,9 +312,16 @@ public:

unsigned Cost = 0;
SmallPtrSet<const Value*, 4> UniqueOperands;
for (const Value *A : Args) {
  • if (UniqueOperands.insert(A).second)
  • Cost += getScalarizationOverhead(VectorType::get(A->getType(), VF),
  • false, true);

+ if (!isa<UndefValue>(A) && UniqueOperands.insert(A).second) {
+ Type *VecTy = nullptr;
+ if (A->getType()->isVectorTy()) {
+ assert (VF == A->getType()->getVectorNumElements());
+ VecTy = A->getType();
+ }
+ else
+ VecTy = VectorType::get(A->getType(), VF);
+ Cost += getScalarizationOverhead(VecTy, false, true);
+ }

  }
  return Cost;
}
lib/Target/SystemZ/SystemZTargetTransformInfo.cpp
267

Yes - replaced it with a call to getMemoryOpCost() which returns the number of instructions just the same.

337

No - And and Xor are 'Legal', while the ones handled here are 'Custom'. Only for 'Custom' is this needed, since the default implementation then assumes it is twice as expensive, while it is actually not on z13.

391

(same reason as above)

480

Computed by means of a loop instead. Special case also handled.

525

Yes - that was the intent.

615

This was for i128, which shouldn't be there I suppose, so I removed it. You get a libcall if you generate a function like that, but I have not seen it in benchmarks.

731

I refactored the cost computation for a vector truncation into a new function, used both for vector truncate and vector select, where the selected element type is smaller than the compared elements.

See inline comments.

The new test cases look good to me, thanks!

lib/Target/SystemZ/SystemZTargetTransformInfo.cpp
337

I see. However, for vector types, Or is also marked as "Legal" -- only for scalar i64 is it Custom.

770

The original code in getUnrollingPreferences also used 2 for 128-bit integer types. Should this be done now here as well?

jonpa updated this revision to Diff 89470.Feb 22 2017, 11:39 PM
jonpa marked an inline comment as done.

Updated per review. See inline comments.

lib/Target/SystemZ/SystemZTargetTransformInfo.cpp
337

Ahh, thanks. Removed Or.

770

That's never seen by the LoopVectorizer, but I think you are right that it should be handled here since it's actually handled by the backend. Thanks.

Changed to check for scalars of 128 bits.

The SystemZ parts LGTM now.

jonpa added a comment.Feb 23 2017, 6:59 AM

Great!

The rest of the review shouldn't be to much to go through:

  • getCmpSelInstrCost() has gotten a new default nullptr parameter for the Instruction. This is so that when the instruction is available, it can be passed. This is needed for SystemZ to estimate how many instructions are needed in addition to the vector compare and vector select instructions (it depends on element widths). LoopVectorizer and CostModel passes the instruction, and elswhere it is just ignored. Target implementation has also gotten the (unused) argument in declaration and definition of the method.
  • BasicTTIImpl has gotten a new method getScalarizationOverhead(), that contains factored-out code from getArithmeticInstrCost(), so that the SystemZ (and potentially others) implementation can use it.
  • LoopVectorizer: Don't consider the compare in the loop latch (used by the conditional back-branch), to be vectorized. This doesn't make sense, it will always be scalar, right? Renato?
rengolin edited edge metadata.
  • getCmpSelInstrCost() has gotten a new default nullptr parameter for the Instruction. This is so that when the instruction is available, it can be passed. This is needed for SystemZ to estimate how many instructions are needed in addition to the vector compare and vector select instructions (it depends on element widths). LoopVectorizer and CostModel passes the instruction, and elswhere it is just ignored. Target implementation has also gotten the (unused) argument in declaration and definition of the method.

This doesn't look bad, but maybe Matthew/Michael have a different plan for such problem. I'm personally ok with this.

  • BasicTTIImpl has gotten a new method getScalarizationOverhead(), that contains factored-out code from getArithmeticInstrCost(), so that the SystemZ (and potentially others) implementation can use it.

I think you could simplify that by implementing the "empty args" logic inside getOperandsScalarizationOverhead.

  • LoopVectorizer: Don't consider the compare in the loop latch (used by the conditional back-branch), to be vectorized. This doesn't make sense, it will always be scalar, right? Renato?

I'm not sure it has to be always scalar. If you have masked vector instructions, the latch could be a vector comparison. Or maybe I didn't get what the problem is. Can you give an example?

--renato

include/llvm/CodeGen/BasicTTIImpl.h
347

This sounds like it should be implemented inside getOperandsScalarizationOverhead

lib/Target/SystemZ/SystemZTargetTransformInfo.cpp
711

Code style, try clang-format.

lib/Target/SystemZ/SystemZTargetTransformInfo.h
30

Shouldn't this just be the cost of a call, rather than being a magic constant?

58

Where is this used?

mssimpso added inline comments.Feb 24 2017, 9:25 AM
lib/Transforms/Vectorize/LoopVectorize.cpp
7277–7285

Hi,

I've only looked at the vectorizer change here, but this code is not needed. Before computing costs, we collect the uniform values in collectLoopUniforms(). ICmp instructions of the kind here are marked uniform. Then in getInstructionCost(), we check if an instruction is uniform (isUniformAfterVectorization()), and if so, always return "1" for the cost, regardless of VF.

Also, floating-point induction variables aren't allowed to be "primary" induction variables. So it shouldn't be the case that you would have a FCmp feeding the back edge branch.

mssimpso added inline comments.Feb 24 2017, 9:30 AM
lib/Transforms/Vectorize/LoopVectorize.cpp
7277–7285

Correction: we always return the cost of the *scalar* compare if it is uniform, regardless of VF.

jonpa updated this revision to Diff 90678.Mar 6 2017, 4:37 AM

Removed experimental heuristic that checked the context of compare instruction in LoopVectorize.cpp.

Just as with getCmpSelInstrCost() and for the same reasons, I added a a new Instruction* argument to getCastInstrCost(). Instruction pointers now passed from everywhere possible, I hope. Moved the assert of the right opcode for passed instruction up in class hiearchy into TargetTransformInfo.cpp.

SystemZTargetTransformInfo.cpp:

  • Handling of i1 extensions (temporary tables removed)
  • factored out a new function getVectorBitMaskConversionCost() from getCmpSelInstrCost(), and reused it for the i1 vector zext/sext instructions.
  • Scalar XOr cost fixed.
  • Allow a noop-truncation query.
  • The cost for a compare has been adjusted to reflect the improvement in the dependent vselect patch: There will no longer always be a vector compare for each vector select.
  • Removed experimental heuristic that checked the context of compare instruction.
  • New tests cmp-ext.ll and scalar-cmp-cmp-log-sel.ll
jonpa updated this revision to Diff 90684.Mar 6 2017, 5:52 AM
jonpa marked an inline comment as done.

@Renato:

I think you could simplify that by implementing the "empty args" logic inside getOperandsScalarizationOverhead()

I tried this by added a third argument to getOperandsScalarizationOverhead(). LoopVectorizer calls this when it knows it has the arguments, so it doesn't need to pass it, so therefore it is default nullptr.

Is this looking better than before?

I'm not sure it has to be always scalar. If you have masked vector instructions, the latch could be a vector comparison. Or maybe I didn't get what the problem is. Can you give an example?

Removed

jonpa updated this revision to Diff 91703.Mar 14 2017, 6:27 AM

The suggested refactoring of getScalarizationOverhead() was changed back, because it was tricky enough to keep track of the different contexts getOperandsScalarizationOverhead() was used in, without a third argument. This seemed right after r297705.

SystemZ part:

  • sdiv/udiv scalar costs adjusted.
  • getVectorInstrCost() implemented for SystemZ (vlvgp)

Does anybody have time to look over if the common code changes are ok? Even though these changes are not very lengthy at all, I wonder if it would help if I split it up into separate reviews?

SystemZ back-end changes and test cases still LGTM. Thanks!

jonpa added inline comments.Mar 17 2017, 12:40 AM
include/llvm/CodeGen/BasicTTIImpl.h
317–329

This part has been approved by Hal Finkel on llvm-commits already (just the first part here that handles scalar/vector argument types).

lib/Target/SystemZ/SystemZTargetTransformInfo.h
30

Maybe, but that is also a magic constant of 10 :-)
Is there any point? This is used just for FRem.

lib/Transforms/Vectorize/LoopVectorize.cpp
7277–7285

Thanks for explaining - I removed this from patch.

jonpa updated this revision to Diff 93835.Apr 3 2017, 3:00 AM

Patch improved:

  • SystemZ memory accesses inteleaving enabled, and getInterleavedMemoryOpCost() implemented. Tested with: test/Transforms/LoopVectorize/SystemZ/mem-interleaving-costs.ll
  • getMemoryOpCost() improved by passing the Instruction pointer, so that operations that fold a load can be considered. Tested with: test/Analysis/CostModel/SystemZ/memop-folding-int-arith.ll
  • BasicTTIImpl.h getCastInstrCost() improved to check for legal extending loads, in which case the cost of the z/sext instruction becomes 0. Tested with: test/Analysis/CostModel/SystemZ/ext-load.ll

Sofar, the passing of the Instruction to the cost methods have been added for:
getCastInstrCost(): For z/sext: check if the operand is a load, and in case target supports a legal matching Z/SEXTLOAD, return 0.
getCmpSelInstrCost(): SystemZ needs to check the def of the vector select mask, in order to compute the extra cost in case the vector compare had a different operand type.
getMemoryOpCost(): SystemZ checks if a load has a single user which is an (arithmetic) instruction that will fold the load into one of its operands

getOperandsScalarizationOverhead(): the assert has been changed so that the passed VF can either be 1 or match the VecTy number of elements. This is needed so that things work from different call contexts.

jonpa updated this revision to Diff 94639.Apr 10 2017, 12:13 AM

Added a slight penalty for moving registers out of vector pipeline into FXU units.
'bool isFPVectorizationPotentiallyUnsafe()' removed from SystemZ, since it has the same default implementation.
New CostModel tests for shuffles and extract/insert element.

Other:
Experimental option removed (CheckFoldedReloads)
New comments about fp32 -- they are expanded and Shuffles and Extracts should actually be free.

PING!

Could anyone please take a minute to review the common code changes?

See one inline comment. Otherwise the SystemZ changes still LGTM, and we now also have positive benchmark results for this change ...

lib/Target/SystemZ/SystemZTargetTransformInfo.cpp
758

Why just index == 0 ? Shouldn't the penalty apply to any element?

jonpa added inline comments.Apr 10 2017, 11:39 PM
lib/Target/SystemZ/SystemZTargetTransformInfo.cpp
758

While the extraction of any element still has a modeled cost of 1 per element, my idea with this is to penalize further the delay of the act of moving out of the vector pipeline. This would then not be per element, but rather for the whole vector register (thus added only once, at index 0). My assumption is that this happens at the point of scalarization, when all vector elements are extracted.

uweigand added inline comments.Apr 11 2017, 3:27 PM
lib/Target/SystemZ/SystemZTargetTransformInfo.cpp
758

Why would this not be per element? Moving from the vector to the integer pipeline always has the higher latency, for every element. In the end any such extracton gets implemented using a VLVG* instruction, which simply is more expensive than other vector instructions ... (Note that in the scheduler, the higher latency for VLVG* is already modeled correctly.)

jonpa added inline comments.Apr 11 2017, 10:27 PM
lib/Target/SystemZ/SystemZTargetTransformInfo.cpp
758

I agree that this should be experimented further with, but the practical argument right now is that I added the smallest possible extra penalty to get rid of a particular regression while affecting code as little as possible. I consider this to be just a first step with obvious improvements in sight for the vectorizers decisions.

I thought the cost here was mainly that the vector register has to pass through the whole vector pipeline. I also imagine the VLGV* instructions would be pipelined, so that the cost of them is not a linear sum in the number of elements? Or is it really twice as expensive to scalarize a <4 x i32> rather than a <2 x i64>?

ping!

The common code parts needs review -- the SystemZ parts are done and proven on benchmarks.

The instruction pointer has been added as a default nullptr argument to: getCastInstrCost(), getCmpSelInstrCost(), getMemoryOpCost(), and the instruction pointer is passed whenever possible by the caller.
@Hal: If I understood you correctly, you are ok with this?

BasicTTIImpl.h:

  • getCastInstrCost() improved to check for legal extending loads, in which case the cost of the z/sext instruction becomes 0. Tested with: test/Analysis/CostModel/SystemZ/ext-load.ll This is good for SystemZ, and hopefully for any target?
  • new overloaded method getScalarizationOverhead() contains factored-out code from getArithmeticInstrCost(), so that the SystemZ (and potentially others) implementation can use it.

@Renato: I tried your suggestion, but then went back because it became messy. Are you ok with this at least for now?

  • in getOperandsScalarizationOverhead(): the assert has been changed so that the passed VF can either be 1 or match the VecTy number of elements. This is needed so that things work from different call contexts.
rengolin accepted this revision.Apr 12 2017, 2:31 AM

Hi Jonas,

Really sorry for not getting back at this earlier. The generic changes are not intrusive enough to warrant a full refactoring, and they're mostly mechanical than existential.

However, I do spot a few problems with this approach, which can be fixed in following patches to the generic parts, since this is now holding a big change in SystemZ.

Right now, some cost functions receive the opcode only, thus not having the same conflict resolution power. Most don't need it, but the choice is currently based on one target's specific needs. Other targets may need to inspect operands and uses of other instructions, and having a different interface for the same kind of functions will confuse the hell out of anyone trying to add costs to their targets. :)

We need to make this more generic and actually pass the instruction instead of the opcode for all, which would clean up the whole mess. But that's for another day.

As is, Hal seems happy with the generic parts, so am I. Ulrich is happy with the System Z parts, so there's no reason not to approve this patch.

Thanks for the large contribution and the patience, and sorry it took so long. LGTM.

cheers,
--renato

This revision is now accepted and ready to land.Apr 12 2017, 2:31 AM
jonpa added a comment.Apr 12 2017, 5:29 AM

Thanks for review.
r300052

Hum, seems some of the costs are wrong in your tests:

http://lab.llvm.org:8011/builders/clang-cmake-aarch64-39vma/builds/5564

Weird that you didn't see those in your machine...

cheers,
--renato

jonpa added a comment.Apr 12 2017, 6:09 AM

Hum, seems some of the costs are wrong in your tests:

http://lab.llvm.org:8011/builders/clang-cmake-aarch64-39vma/builds/5564

Weird that you didn't see those in your machine...

cheers,
--renato

Sorry - one test update unfortunately didn't go in at the first attempt, but it's fixed now.

Sorry - one test update unfortunately didn't go in at the first attempt, but it's fixed now.

Still isn't:

http://lab.llvm.org:8011/builders/clang-cmake-armv7-a15/builds/6066

http://lab.llvm.org:8011/builders/clang-cmake-thumbv7-a15/builds/6059

http://lab.llvm.org:8011/builders/clang-cmake-aarch64-quick/builds/5372

The errors are like:

; CHECK: Cost Model: Found an estimated cost of 2 for instruction: %res11 = and <4 x i64> undef, undef

^

<stdin>:13:1: note: scanning from here
Cost Model: Found an estimated cost of 1 for instruction: %res11 = and <4 x i64> undef, undef

All of them. This should have been caught in your testing and I'm not sure why it hasn't. This breaks both 32 and 64 bits ARM.

Let's revert the patch and I can help you understand the problem once the bots are green.

cheers,
--renato

All these tests fail on Hexagon as well. From what I found out, it seems that "opt -analyze" ignores the mtriple option---you could set it to xyz and the test would still run and produce some output. The Hexagon bot only builds the Hexagon backend, and sets the default triple to hexagon-unknown-elf. I'm guessing that this causes Hexagon's costs to be printed regardless of the mtriple setting, which could lead to these failures.

All these tests fail on Hexagon as well. From what I found out, it seems that "opt -analyze" ignores the mtriple option---you could set it to xyz and the test would still run and produce some output. The Hexagon bot only builds the Hexagon backend, and sets the default triple to hexagon-unknown-elf. I'm guessing that this causes Hexagon's costs to be printed regardless of the mtriple setting, which could lead to these failures.

Ah! Makes sense. Then maybe just adding the lit.config file to SystemZ directory would do?

Ah! Makes sense. Then maybe just adding the lit.config file to SystemZ directory would do?

If you can force a specific triple to be used, then yes. If you can only require a specific target to be present, then it may happen to avoid the failures, but there is no guarantee that it will always work.

Ah! Makes sense. Then maybe just adding the lit.config file to SystemZ directory would do?

If you can force a specific triple to be used, then yes. If you can only require a specific target to be present, then it may happen to avoid the failures, but there is no guarantee that it will always work.

Ok, I created the lit.config file like the others in r300078

Ok, I created the lit.config file like the others in r300078

And r300081 takes care of the rest... :(

And r300081 takes care of the rest... :(

Thanks Renato!

Sorry for this and thank you!

jonpa closed this revision.Apr 12 2017, 10:18 PM