Page MenuHomePhabricator

[CostModel] remove cost-kind predicate for intrinsics in basic TTI implementation
ClosedPublic

Authored by spatel on Nov 1 2020, 5:45 AM.

Details

Summary

This is the last step in removing cost-kind as a consideration in the basic class model for intrinsics. See D89461 for the start of that. Subsequent commits dealt with each of the special-case intrinsics that had customization here in the basic class. This should remove a barrier to retrying D87188 (canonicalization to the abs intrinsic).

The ARM and x86 cost diffs seen here are likely wrong because the target-specific overrides have their own bugs, but I hope this is less wrong - if something has a significant throughput cost, then it should have a significant size / blended cost too by default.

The only behavioral diff in current regression tests is shown in the x86 scatter-gather test (which is misplaced or broken because it runs the entire -O3 pipeline) - we unrolled less, and that's a good thing if I'm seeing it correctly.

Diff Detail

Event Timeline

spatel created this revision.Nov 1 2020, 5:45 AM
Herald added a project: Restricted Project. · View Herald TranscriptNov 1 2020, 5:45 AM
spatel requested review of this revision.Nov 1 2020, 5:45 AM

Looks like the Arm backend makes almost no attempt to cost intrinsics, so I'm assuming that BasicTTI is basically producing everything here... So, it looks like there's some scalarization cost missing in this nebulous code?

RKSimon added inline comments.Mon, Nov 2, 4:45 AM
llvm/test/Analysis/CostModel/X86/fmaxnum-size-latency.ll
18

Not sure why ymm cost is higher than scalar/xmm - on AVX2 they should be the same (same for f64)

What a glorious mess. I noticed this seemed to be exposing some changes from target independent intrinsics getting different costs than before. D90597 is an attempt to fix that by making them all cheap (but it's hard to tell if it's in exactly the right part of this maze).

spatel added inline comments.Mon, Nov 2, 6:05 AM
llvm/test/Analysis/CostModel/X86/fmaxnum-size-latency.ll
18

Looks like the entries for FMAXNUM are missing in the AVX2CostTbl.
I'll draft a fix.

spatel updated this revision to Diff 302350.Mon, Nov 2, 12:02 PM

Rebased - the x86 diffs changed with D90613 and some cleanups in the test files.

For the ARM tests, I looked at @llvm.sadd.with.overflow() specifically, and the costs are based on this in the basic class:

Cost += thisT()->getArithmeticInstrCost(Opcode, SumTy, CostKind);
Cost += 3 * thisT()->getCmpSelInstrCost(
                BinaryOperator::ICmp, SumTy, OverflowTy,
                CmpInst::BAD_ICMP_PREDICATE, CostKind);
Cost += 2 * thisT()->getCmpSelInstrCost(
                BinaryOperator::ICmp, OverflowTy, OverflowTy,
                CmpInst::BAD_ICMP_PREDICATE, CostKind);
Cost += thisT()->getArithmeticInstrCost(BinaryOperator::And, OverflowTy,
                                        CostKind);

So we get for i32 as an example: 1 (add) + 3 (cmp) + 2 (select) + 2 (and) = 8.
For vector types, it's similarly a sum of those basic ops, so it's up to the target to handle the scalarization costs, and that doesn't seem to be happening. For <4 x i32>, I see: 1 + 3 + 2 + 1 = 7.
And that may be drastically different than the expected throughput cost because we have this in ARMTTIImpl::getArithmeticInstrCost():

// TODO: Handle more cost kinds.
if (CostKind != TTI::TCK_RecipThroughput)

Do we need to fix that before trying this patch?

Cost += 2 * thisT()->getCmpSelInstrCost(
                BinaryOperator::ICmp, OverflowTy, OverflowTy,
                CmpInst::BAD_ICMP_PREDICATE, CostKind);

It looks like this second ICmp should be a select/

Do we need to fix that before trying this patch?

I think it would be wise, just returning the scalarization cost for non-vector targets should get us in the right ball park.

spatel added a comment.Tue, Nov 3, 5:26 AM
Cost += 2 * thisT()->getCmpSelInstrCost(
                BinaryOperator::ICmp, OverflowTy, OverflowTy,
                CmpInst::BAD_ICMP_PREDICATE, CostKind);

It looks like this second ICmp should be a select/

Nice catch...and wow, this is even worse than I imagined. :)

spatel updated this revision to Diff 302869.Wed, Nov 4, 9:28 AM

Rebased after other cost model fixes - D90597, D90681, D90692

I'm still not sure what to make of the ARM diffs, but we probably should make a change to cmp/sel costs either before or after this patch.

Rebased after other cost model fixes - D90597, D90681, D90692

I'm still not sure what to make of the ARM diffs, but we probably should make a change to cmp/sel costs either before or after this patch.

Proposal for the ARM change is here: D90781
If we do want to make that change, I can update/rebase this patch to depend on that.

spatel updated this revision to Diff 303225.Thu, Nov 5, 12:07 PM

Rebased after another ARM change: D90781 / 264a6df353

Anything else visible in here that should be fixed first?

Rebased after another ARM change: D90781 / 264a6df353

Anything else visible in here that should be fixed first?

The less unrolling in scatter-gather test seems good. But I don't have much experience in cost model. There's no objection from me.

llvm/test/Analysis/CostModel/X86/intrinsic-cost-kinds.ll
46–48

Why no diffs in latency?

spatel added inline comments.Fri, Nov 6, 4:51 AM
llvm/test/Analysis/CostModel/X86/intrinsic-cost-kinds.ll
46–48

Latency is currently using a different path before we ever reach functions like getIntrinsicInstrCost(), so it is unaffected. We need to see how latency is used in its callers to decide if it should be changed too.

dmgreen accepted this revision.Mon, Nov 9, 10:13 AM

The Arm numbers look better to me. This LGTM if Sam is not unhappy with it.

This revision is now accepted and ready to land.Mon, Nov 9, 10:13 AM

Thanks, all!
It's almost inevitable that some target will regress on some intrinsic with this change, so I'll be watching for fallout.

Thanks, all!
It's almost inevitable that some target will regress on some intrinsic with this change, so I'll be watching for fallout.

I noticed a change in behaviour due to this commit - on ARM and AArch64, for cases involving the log2 intrinsic - previously, llvm was able to evaluate a big pile of code to deduce that a comparison involving log2 always is true, but after this change, it no longer does that. Is that something you want to have a look into, or is that expected? I wouldn't classify it as a bug in itself, as the code does what it's supposed to, it just optimizes differently than before.

Thanks, all!
It's almost inevitable that some target will regress on some intrinsic with this change, so I'll be watching for fallout.

I noticed a change in behaviour due to this commit - on ARM and AArch64, for cases involving the log2 intrinsic - previously, llvm was able to evaluate a big pile of code to deduce that a comparison involving log2 always is true, but after this change, it no longer does that. Is that something you want to have a look into, or is that expected? I wouldn't classify it as a bug in itself, as the code does what it's supposed to, it just optimizes differently than before.

If you can reduce a test that shows the difference, that would be great. I'm not sure which pass was responsible for the optimization.
My first guess is that we should create a basic class size-cost exception for all of the scalar intrinsics that correspond to libm functions.

I noticed a change in behaviour due to this commit - on ARM and AArch64, for cases involving the log2 intrinsic - previously, llvm was able to evaluate a big pile of code to deduce that a comparison involving log2 always is true, but after this change, it no longer does that. Is that something you want to have a look into, or is that expected? I wouldn't classify it as a bug in itself, as the code does what it's supposed to, it just optimizes differently than before.

If you can reduce a test that shows the difference, that would be great. I'm not sure which pass was responsible for the optimization.
My first guess is that we should create a basic class size-cost exception for all of the scalar intrinsics that correspond to libm functions.

I'm not sure how to reduce it meaningfully, but it's at least observable with https://martin.st/temp/celp_math-preproc.c and https://martin.st/temp/celp_math.ll. With the bitcode, opt -O3 -S celp_math.ll used to produce a no-op main() function before this change, same goes if compiling the preprocessed C code with clang -target aarch64-w64-mingw32 -c -O3 celp_math-preproc.c -fno-math-errno.

I noticed a change in behaviour due to this commit - on ARM and AArch64, for cases involving the log2 intrinsic - previously, llvm was able to evaluate a big pile of code to deduce that a comparison involving log2 always is true, but after this change, it no longer does that. Is that something you want to have a look into, or is that expected? I wouldn't classify it as a bug in itself, as the code does what it's supposed to, it just optimizes differently than before.

If you can reduce a test that shows the difference, that would be great. I'm not sure which pass was responsible for the optimization.
My first guess is that we should create a basic class size-cost exception for all of the scalar intrinsics that correspond to libm functions.

I'm not sure how to reduce it meaningfully, but it's at least observable with https://martin.st/temp/celp_math-preproc.c and https://martin.st/temp/celp_math.ll. With the bitcode, opt -O3 -S celp_math.ll used to produce a no-op main() function before this change, same goes if compiling the preprocessed C code with clang -target aarch64-w64-mingw32 -c -O3 celp_math-preproc.c -fno-math-errno.

Thanks for the test. The difference with @main is in the unroller:
COMPLETELY UNROLLING loop %do.body6 with trip count 11!

And with that, it seems to vanish. Interestingly, the optimization was only happening with -O3 (not -O2) even before this patch because we set the unroll threshold higher at -O3.
I will try to reduce the example and adjust the cost model to make that work again.

Thanks for the test. The difference with @main is in the unroller:
COMPLETELY UNROLLING loop %do.body6 with trip count 11!

And with that, it seems to vanish. Interestingly, the optimization was only happening with -O3 (not -O2) even before this patch because we set the unroll threshold higher at -O3.
I will try to reduce the example and adjust the cost model to make that work again.

Ah, interesting, that explains why this seemingly unrelated change had such an effect.

FWIW I don't have much of a need for the specific case to be optimized the way it was before - I just happened to notice because I had a log2() function implementation that gave approximate results and failed this testcase, but it had been succeeding for ages due to clang optimizing it out altogether before.

Thanks, all!
It's almost inevitable that some target will regress on some intrinsic with this change, so I'll be watching for fallout.

I noticed a change in behaviour due to this commit - on ARM and AArch64, for cases involving the log2 intrinsic - previously, llvm was able to evaluate a big pile of code to deduce that a comparison involving log2 always is true, but after this change, it no longer does that. Is that something you want to have a look into, or is that expected? I wouldn't classify it as a bug in itself, as the code does what it's supposed to, it just optimizes differently than before.

If you can reduce a test that shows the difference, that would be great. I'm not sure which pass was responsible for the optimization.
My first guess is that we should create a basic class size-cost exception for all of the scalar intrinsics that correspond to libm functions.

First guess was close - the C example should be restored with:
8ec7ea3d
Unrolling/inlining are always going to be tricky, so we will likely see other examples like that.

As a heads-up: This is causing a lot of Clang segfaults in Google's builds with sanitizers enabled. We're working work on a reduced testcase, but wanted to let you know while we do that.

As a heads-up: This is causing a lot of Clang segfaults in Google's builds with sanitizers enabled. We're working work on a reduced testcase, but wanted to let you know while we do that.

opt -O2 < a.ll on the following IR triggers an assertion failure:

616│ FixedVectorType *FixedVectorType::get(Type *ElementType, unsigned NumElts) {
617│   assert(NumElts > 0 && "#Elements of a VectorType must be greater than 0");
////////////////// ElementType is MetadataTy, not Integer/FloatingPoint/Pointer
618│   assert(isValidElementType(ElementType) && "Element type of a VectorType must " 
619│                                             "be an integer, floating point, or "
620│                                             "pointer type.");
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

define void @wombat() {
bb:
  call void @baz()
  ret void
}

define dso_local void @baz() {
bb:
  %tmp = call <2 x double> @llvm.experimental.constrained.fadd.v2f64(<2 x double> undef, <2 x double> undef, metadata !"round.dynamic", metadata !"fpexcept.ignore")
  ret void
}

; Function Attrs: inaccessiblememonly nofree nosync nounwind willreturn
declare <2 x double> @llvm.experimental.constrained.fadd.v2f64(<2 x double>, <2 x double>, metadata, metadata) #0

attributes #0 = { inaccessiblememonly nofree nosync nounwind willreturn }

As a heads-up: This is causing a lot of Clang segfaults in Google's builds with sanitizers enabled. We're working work on a reduced testcase, but wanted to let you know while we do that.

opt -O2 < a.ll on the following IR triggers an assertion failure:

616│ FixedVectorType *FixedVectorType::get(Type *ElementType, unsigned NumElts) {
617│   assert(NumElts > 0 && "#Elements of a VectorType must be greater than 0");
////////////////// ElementType is MetadataTy, not Integer/FloatingPoint/Pointer
618│   assert(isValidElementType(ElementType) && "Element type of a VectorType must " 
619│                                             "be an integer, floating point, or "
620│                                             "pointer type.");

Thanks, all. Looks like we are blindly walking the intrinsic params (similar problem as 74ffc823ed21) without checking types. I'll work on a fix today.

define dso_local void @baz() {
bb:

%tmp = call <2 x double> @llvm.experimental.constrained.fadd.v2f64(<2 x double> undef, <2 x double> undef, metadata !"round.dynamic", metadata !"fpexcept.ignore")
ret void

}

So to clarify policy, we reverted patches when (1) a test based on an experimental intrinsic was crashing and (2) that crash is easily reproducible independent of this patch as shown here:
7ae346434

I understand that we revert first and ask questions later, but should that be the rule for experimental code?

MaskRay added a subscriber: echristo.EditedFri, Nov 20, 10:38 AM

define dso_local void @baz() {
bb:

%tmp = call <2 x double> @llvm.experimental.constrained.fadd.v2f64(<2 x double> undef, <2 x double> undef, metadata !"round.dynamic", metadata !"fpexcept.ignore")
ret void

}

So to clarify policy, we reverted patches when (1) a test based on an experimental intrinsic was crashing and (2) that crash is easily reproducible independent of this patch as shown here:
7ae346434

I understand that we revert first and ask questions later, but should that be the rule for experimental code?

Hi Sanjay, apologies for the revert churn you have experienced. I have not been with the community long enough, so I'll defer the policy question to @echristo ..

I'll just confirm two facts:

  • llvm.experimental.constrained.fadd.* is due to -frounding-math. The intrinsics have been in-tree for nearly one year now. I am not clear about its stability.
  • At 2583d8eb08073d3c1e06b21b1c4216d0ab7a0909 (the commit before Temporarily Revert "[CostModel] remove cost-kind predicate for intrinsics in basic TTI implementation"), cherry picking either the revert or [CostModel] avoid crashing while finding scalarization overhead fixes the crash for the original IR reproduce (very large), so I agree that this issue is independent (and thanks much for fixing it!). The code path causing the assertion failure was not exercised before this patch, though.

So to clarify policy, we reverted patches when (1) a test based on an experimental intrinsic was crashing and (2) that crash is easily reproducible independent of this patch as shown here:
7ae346434

I understand that we revert first and ask questions later, but should that be the rule for experimental code?

In that case it seems like there was user-visible impact for some clang user according to @brooksmoses comment above? It isn't like an arbitrary Fuzzer was plugged to the system. IMO a revert is "low cost" and easy to re-land with a fix here, little downside to do this?

This comment was removed by kpn.