Page MenuHomePhabricator

[TTI CostModel] change default cost of FP ops to 1 (PR36280)
AbandonedPublic

Authored by spatel on Feb 8 2018, 9:09 AM.

Details

Summary

This change was mentioned at least as far back as:
https://bugs.llvm.org/show_bug.cgi?id=26837#c26
...and I found a real program that directly shows the harm. Himeno running on AMD Jaguar gets 6% slower with SLP vectorization:
https://bugs.llvm.org/show_bug.cgi?id=36280

I don't know the history here. Maybe this was set in the Pentium 4 days, or there's just confusion about which cost we're modelling.

I've added a comment to make it clear that this is the throughput cost of a math instruction.

The div/rem costs for x86 look very wrong in some cases, but I think that's already true, so we can fix those in follow-up patches. There's also evidence that more cost model changes are needed to solve SLP problems as shown in D42981, but I think that's an independent problem (though the solution may be adjusted assuming this change is approved).

Diff Detail

Repository
rL LLVM

Event Timeline

spatel created this revision.Feb 8 2018, 9:09 AM
spatel retitled this revision from [TTI CostModel] change default cost of FP ops to 1 (PR to [TTI CostModel] change default cost of FP ops to 1 (PR36280).Feb 8 2018, 9:09 AM
ABataev added inline comments.Feb 8 2018, 9:28 AM
include/llvm/CodeGen/BasicTTIImpl.h
494 ↗(On Diff #133434)

Maybe it is better to add a virtual method that will return the default cost of the integer and floating point operations for the target? The default implementation should keep 1 and 2, but for X86 it should return 1 in both cases.

spatel added inline comments.Feb 8 2018, 9:49 AM
include/llvm/CodeGen/BasicTTIImpl.h
494 ↗(On Diff #133434)

I don't know of any target where anything but '1' is the right default answer, so I'd rather not perpetuate this. I think it's better to correct the problems revealed by this change.

hfinkel added inline comments.Feb 8 2018, 12:01 PM
include/llvm/CodeGen/BasicTTIImpl.h
494 ↗(On Diff #133434)

I agree, but recall that these are measuring reciprocal throughputs, and it's not unreasonable to assume that floating-point ops will have lower throughputs than integer operations in some general sense. Nevertheless, '2' seems somewhat arbitrary in this regard.

spatel added inline comments.Feb 8 2018, 1:28 PM
include/llvm/CodeGen/BasicTTIImpl.h
494 ↗(On Diff #133434)

Ah, the code comment isn't accurate then. And since we're using integers, this is really a relative reciprocal throughput (can't go under 1) rather than an actual cycle-based value?

hfinkel added inline comments.Feb 8 2018, 1:37 PM
include/llvm/CodeGen/BasicTTIImpl.h
494 ↗(On Diff #133434)

Not sure what you mean by "cycle based." All cost model results are relative, but this cost model used by the vectorizer is supposed to return (relative) reciprocal throughputs. We now have a separate cost model for latency (and a separate "user cost" model, which essentially models micro-op count).

spatel added inline comments.Feb 8 2018, 1:50 PM
include/llvm/CodeGen/BasicTTIImpl.h
494 ↗(On Diff #133434)

In the backend, the scheduler model shows instruction timing like this (in test/CodeGen/X86/sse2-schedule.ll):
; SKYLAKE-NEXT: vaddpd %xmm1, %xmm0, %xmm0 # sched: [4:0.50]

The "0.50" means we can execute 2 of these per cycle because there are 2 units that process these instructions.

We're integer 1-based here, so if we want to show that a target has 3 integer adders and 1 FP adder, we would make FP ops cost 3 rather than 1 (instead of the 0.33 for integer adds that we would expect in the scheduler model).

Wouldn't we be better off avoiding affecting generic targets and just adding tuned FADD/FSUB/FMUL costs to getArithmeticInstrCost in X86TargetTransformInfo.cpp? The general rule we used for FDIV/FSQRT was to use the 'worst hardware' cost for a given SSE level - so AVX1 (SB/JG) might have a better cost than SSE42 (P4) etc. which seemed to work pretty well.

If we go that route we might want to add extra cpu targets to some of the x86 slp tests.

test/Transforms/SLPVectorizer/X86/PR36280.ll
4 ↗(On Diff #133434)

It'd probably be useful to include a comment that this code snippet is from the himeno benchmark?

spatel updated this revision to Diff 133605.Feb 9 2018, 6:29 AM
spatel marked 2 inline comments as done.

Patch updated:
No functional changes from the last rev, but tried to make the code comment accurate and added a comment to the PR36280 test to explain the motivation.

spatel added a comment.Feb 9 2018, 6:37 AM

Wouldn't we be better off avoiding affecting generic targets and just adding tuned FADD/FSUB/FMUL costs to getArithmeticInstrCost in X86TargetTransformInfo.cpp?

I'm not opposed to making that change (and it seems clear that we need to make other x86 cost model changes), but I think this one is independent of that. This is just supposed to make the default cost less necessary to require an override.

The fact that it helps the motivating (PR36280) case that led me here appears to be coincidence. I'm not understanding something fundamental about how SLP uses this cost to decide profitability, but I'll try to sort that out in D42981 or later.

@fhahn Do you have any ARM/AARCH64 concerns? Otherwise, I'm happy with this change.

fhahn added a comment.Feb 19 2018, 5:36 AM

I think this change should be fine for modern AArch64 chips. For example, Cortex-A72 has 2 integer and 2 fp units. Other modern cores have 3 integer units and 2 fp units. For both, those settings should be more realistic. I can't run any benchmarks at the moment though, as I am travelling.

RKSimon accepted this revision.Feb 19 2018, 6:47 AM

LGTM - as you've said in the comment I'd still like us to look into improving target specific cost values instead of relying on these defaults.

This revision is now accepted and ready to land.Feb 19 2018, 6:47 AM
fhahn accepted this revision.Feb 19 2018, 6:53 AM
spatel updated this revision to Diff 134920.Feb 19 2018, 8:04 AM

Rebase. The test that previously regressed in X86/insert-element-build-vector.ll does not anymore, so that's good - probably due to D42657.

This revision was automatically updated to reflect the committed changes.
eastig added a subscriber: eastig.Feb 20 2018, 8:33 AM

Hi Sanjay,

The patch caused regressions in the LLVM benchmarks and in Spec2k/Spec2k6 benchmarks on AArch64 Cortex-A53:

SingleSource/Benchmarks/Misc/matmul_f64_4x4: 49%
MultiSource/Benchmarks/TSVC/LoopRerolling-flt/LoopRerolling-flt: 5.32%
CFP2000/188.ammp/188.ammp: 3.58%
CFP2000/177.mesa/177.mesa: 2.48%
CFP2006/444.namd/444.namd: 2.49%

The regression of SingleSource/Benchmarks/Misc/matmul_f64_4x4 can also be seen on the public bot: http://lnt.llvm.org/db_default/v4/nts/90636
It is 128.85%.

The main difference in generated code is FMUL(FP, scalar) instead of FMUL(SIMD, scalar):

fmul d20, d16, d2

instead of

fmul v17.2d, v1.2d, v5.2d

This also caused code size increase: 6.04% in SingleSource/Benchmarks/Misc/matmul_f64_4x4

I am working on a reproducer.

Thanks,
Evgeny Astigeevich

Hi Sanjay,

The patch caused regressions in the LLVM benchmarks and in Spec2k/Spec2k6 benchmarks on AArch64 Cortex-A53:

SingleSource/Benchmarks/Misc/matmul_f64_4x4: 49%
MultiSource/Benchmarks/TSVC/LoopRerolling-flt/LoopRerolling-flt: 5.32%
CFP2000/188.ammp/188.ammp: 3.58%
CFP2000/177.mesa/177.mesa: 2.48%
CFP2006/444.namd/444.namd: 2.49%

The regression of SingleSource/Benchmarks/Misc/matmul_f64_4x4 can also be seen on the public bot: http://lnt.llvm.org/db_default/v4/nts/90636
It is 128.85%.

The main difference in generated code is FMUL(FP, scalar) instead of FMUL(SIMD, scalar):

fmul d20, d16, d2

instead of

fmul v17.2d, v1.2d, v5.2d

This also caused code size increase: 6.04% in SingleSource/Benchmarks/Misc/matmul_f64_4x4

I am working on a reproducer.

Thanks. We knew this change was likely to cause perf regressions based on some of the x86 diffs, so having those reductions will help tune the models in general and specifically for AArch64.

Ie, we should be able to solve the AArch64 problems with AArch64-specific cost model changes rather than reverting this. For example as @fhahn mentioned, we might want to make the int-to-FP ratio 3:2 for some cores. Another possibility is overriding the fmul/fsub/fadd AArch64 costs to be more realistic (as we also probably have to do for x86).

anemet added a subscriber: anemet.Feb 20 2018, 4:34 PM

Hi Sanjay,

The patch caused regressions in the LLVM benchmarks and in Spec2k/Spec2k6 benchmarks on AArch64 Cortex-A53:

SingleSource/Benchmarks/Misc/matmul_f64_4x4: 49%
MultiSource/Benchmarks/TSVC/LoopRerolling-flt/LoopRerolling-flt: 5.32%
CFP2000/188.ammp/188.ammp: 3.58%
CFP2000/177.mesa/177.mesa: 2.48%
CFP2006/444.namd/444.namd: 2.49%

The regression of SingleSource/Benchmarks/Misc/matmul_f64_4x4 can also be seen on the public bot: http://lnt.llvm.org/db_default/v4/nts/90636
It is 128.85%.

The main difference in generated code is FMUL(FP, scalar) instead of FMUL(SIMD, scalar):

fmul d20, d16, d2

instead of

fmul v17.2d, v1.2d, v5.2d

This also caused code size increase: 6.04% in SingleSource/Benchmarks/Misc/matmul_f64_4x4

I am working on a reproducer.

Thanks. We knew this change was likely to cause perf regressions based on some of the x86 diffs, so having those reductions will help tune the models in general and specifically for AArch64.

Ie, we should be able to solve the AArch64 problems with AArch64-specific cost model changes rather than reverting this. For example as @fhahn mentioned, we might want to make the int-to-FP ratio 3:2 for some cores. Another possibility is overriding the fmul/fsub/fadd AArch64 costs to be more realistic (as we also probably have to do for x86).

Please revert until these things get worked out so that we can properly track performance. We are seeing many regressions including 17% on 444.namd and 12% on 482.sphinx3 in SPECfp 2006.

Please revert until these things get worked out so that we can properly track performance. We are seeing many regressions including 17% on 444.namd and 12% on 482.sphinx3 in SPECfp 2006.

Hi Adam -

Rather than reverting for all targets, can we just hack ARM/AArch with patches like this (if we can add at least one test to show what changed, that would be better of course) :

Index: lib/Target/ARM/ARMTargetTransformInfo.cpp
===================================================================
--- lib/Target/ARM/ARMTargetTransformInfo.cpp	(revision 325579)
+++ lib/Target/ARM/ARMTargetTransformInfo.cpp	(working copy)
@@ -514,6 +514,13 @@
   int Cost = BaseT::getArithmeticInstrCost(Opcode, Ty, Op1Info, Op2Info,
                                            Opd1PropInfo, Opd2PropInfo);
 
+  // Assume that floating point arithmetic operations cost twice as much as
+  // integer operations.
+  // FIXME: This is a win on several perf benchmarks running on CPU model ???,
+  // but there are no regression tests that show why or how this is good.
+  if (Ty->isFPOrFPVectorTy())
+    Cost *= 2;
+
   // This is somewhat of a hack. The problem that we are facing is that SROA
   // creates a sequence of shift, and, or instructions to construct values.
   // These sequences are recognized by the ISel and have zero-cost. Not so for

Please revert until these things get worked out so that we can properly track performance. We are seeing many regressions including 17% on 444.namd and 12% on 482.sphinx3 in SPECfp 2006.

Hi Adam -

Rather than reverting for all targets, can we just hack ARM/AArch with patches like this (if we can add at least one test to show what changed, that would be better of course) :

Index: lib/Target/ARM/ARMTargetTransformInfo.cpp
===================================================================
--- lib/Target/ARM/ARMTargetTransformInfo.cpp	(revision 325579)
+++ lib/Target/ARM/ARMTargetTransformInfo.cpp	(working copy)
@@ -514,6 +514,13 @@
   int Cost = BaseT::getArithmeticInstrCost(Opcode, Ty, Op1Info, Op2Info,
                                            Opd1PropInfo, Opd2PropInfo);
 
+  // Assume that floating point arithmetic operations cost twice as much as
+  // integer operations.
+  // FIXME: This is a win on several perf benchmarks running on CPU model ???,
+  // but there are no regression tests that show why or how this is good.
+  if (Ty->isFPOrFPVectorTy())
+    Cost *= 2;
+
   // This is somewhat of a hack. The problem that we are facing is that SROA
   // creates a sequence of shift, and, or instructions to construct values.
   // These sequences are recognized by the ISel and have zero-cost. Not so for

Seeing such major swings, my preference would be to revert and put the new version up for review (I think that your hack works). Then commit the new combined version after a few days so that the perf bots got a chance to recover. What do you think?

Adam

spatel reopened this revision.Feb 20 2018, 5:49 PM

Seeing such major swings, my preference would be to revert and put the new version up for review (I think that your hack works). Then commit the new combined version after a few days so that the perf bots got a chance to recover. What do you think?

Ok, this was too ambitious. Reverted at rL325658 and reopened:
https://bugs.llvm.org/show_bug.cgi?id=36280

This revision is now accepted and ready to land.Feb 20 2018, 5:49 PM
spatel planned changes to this revision.Feb 20 2018, 5:49 PM

@fhahn wrote (but it didn't transfer here):

I have not thought this through fully yet, but couldn't we use the scheduling model to get the number of units available
for certain instructions for backends using the machine scheduler? And determine the throughput based on that?

Yes, I thought about that too. We already use the sched model to get instruction latency and to let the unroller know the size of the reorder buffer, so using that for throughputs and/or uops shouldn't be too hard.

But we'll have to check how exactly these costs are being used. By using a more realistic cost model, we may get into more trouble because clients may have bent the costs to match their own cost formula rather than an accurate hardware model.

Hi Sanjay,

I attached a regression test to https://bugs.llvm.org/show_bug.cgi?id=36280 .

Thanks,
Evgeny Astigeevich

Seeing such major swings, my preference would be to revert and put the new version up for review (I think that your hack works). Then commit the new combined version after a few days so that the perf bots got a chance to recover. What do you think?

Ok, this was too ambitious. Reverted at rL325658 and reopened:
https://bugs.llvm.org/show_bug.cgi?id=36280

Thanks! I can confirm that our perf numbers have recovered.

spatel abandoned this revision.May 2 2018, 3:13 PM

Abandoning. I think we can reach the goal of more accurate cost models with target-specific and CPU-specific steps like D46276.