
[LV] Take overhead of run-time checks into account during vectorization.
Needs ReviewPublic

Authored by ebrevnov on Dec 5 2019, 2:28 AM.

Details

Summary

Currently the vectorizer rejects short trip count loops in the presence of run-time checks. That is caused by the inability of the cost model to take the overhead of run-time checks into account. While the ideal solution would be to calculate the entire cost of the VPlan, there is no infrastructure in place for that. I see two possibilities to mitigate this.

  1. Make a dry run of SCEV expansion during the main phase of cost modeling: instead of actually expanding the SCEVs, calculate the cost such an expansion would have.
  2. Defer the overhead calculation for run-time checks, and the final decision, until after the run-time checks have been expanded. If the overhead turns out to make vectorization unprofitable, emit an unconditional bypass of the vectorized loop and rely on subsequent optimizations to remove the trivially dead code.

While 1) may look like the better approach, I see it as very problematic in practice. That's why I decided to go with 2).

Please note that loops vectorized before this change remain vectorized in the same way afterwards. Only short trip count loops that were not vectorized before may now get vectorized (if the cost model proves it profitable). Thus the change has a pretty narrow scope and is not expected to have a wide performance impact (neither positive nor negative).

The main motivation for the change is a performance gain on a benchmark based on a publicly available math library for linear algebra (https://github.com/fommil/netlib-java). The following simple loop over the DSPR method from Netlib shows a +55% gain on Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz:

  for (int i = 0; i < 1000; i++) {
    blas.dspr("l", 9, 1.0, vectorD, 1, result);
  }

I measured the performance impact on the LLVM test-suite, but I don't know how to read the results. I ran the original and patched compilers 3 times each, and the results are unstable: even for consecutive runs of the original compiler I observe +100/-100% variation. This is a machine dedicated to performance testing, and no other applications were running.

Besides that, I evaluated performance on SPECjvm98, DaCapo 9.12 and Netlib. All measurements were performed on Linux with an Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz. No performance changes were detected. I can share the numbers if anyone wants to see them.

Note that the feature is disabled by default for now. It is going to be enabled downstream first, to give it extensive testing on a wider range of real-world applications.

Event Timeline

ebrevnov created this revision.Dec 5 2019, 2:28 AM
Herald added a project: Restricted Project.

Is this change inspired by a real world case? If so, how relevant / pervasive is this case?

Also, how different are the run time checks from each other?

If we assume most of them end up being an additional comparison, sometimes with a pointer dereference, and that the values will already be in registers (as they're about to be looped over), I think their costs end up being roughly the same, no?

Or am I missing something obvious?

Thanks!
--renato

> Is this change inspired by a real world case? If so, how relevant / pervasive is this case?

Yes, this case came from a real-world benchmark. This change gives +15% on it.

> Also, how different are the run time checks from each other?
>
> If we assume most of them end up being an additional comparison, sometimes with pointer dereference, and assuming the values will be already in registers (as they're about to be looped over), I think their costs end up being roughly the same, no?
>
> Or am I missing something obvious?

Run-time checks may involve much more than just a comparison and some pointer arithmetic. Here is an example of what a typical overflow check looks like:

  vector.scevcheck:                                 ; preds = %tc.check
    %mul = call { i32, i1 } @llvm.umul.with.overflow.i32(i32 %1, i32 %indvar)
    %mul.result = extractvalue { i32, i1 } %mul, 0
    %mul.overflow = extractvalue { i32, i1 } %mul, 1
    %54 = add i32 %loop-predication.iv, %mul.result
    %55 = sub i32 %loop-predication.iv, %mul.result
    %56 = icmp ugt i32 %55, %loop-predication.iv
    %57 = icmp ult i32 %54, %loop-predication.iv
    %58 = select i1 false, i1 %56, i1 %57
    %59 = or i1 %58, %mul.overflow
    %60 = or i1 false, %59
    %mul268 = call { i32, i1 } @llvm.umul.with.overflow.i32(i32 %2, i32 %indvar)
    %mul.result269 = extractvalue { i32, i1 } %mul268, 0
    %mul.overflow270 = extractvalue { i32, i1 } %mul268, 1
    %61 = add i32 %arg4, %mul.result269
    %62 = sub i32 %arg4, %mul.result269
    %63 = icmp ugt i32 %62, %arg4
    %64 = icmp ult i32 %61, %arg4
    %65 = select i1 false, i1 %63, i1 %64
    %66 = or i1 %65, %mul.overflow270
    %67 = or i1 %60, %66
    br i1 %67, label %scalar.ph, label %vector.memcheck

Thanks
Evgeniy

Thanks!
--renato

> Yes, this case came from the real world benchmark. This change gives +15% on it.

Well, 15% on an unnamed benchmark on an unnamed architecture shouldn't be a reason to change core functionality of the vectoriser.

The first step is to show that, on LLVM's own test-suite (in benchmarking mode), multiple programs are changed and the effect is largely positive.

But you also need to make sure that the change has no significant negative effect on any of the benchmarks people care about, on the major architectures supported by LLVM.

Obvious candidates are SPEC, x86_64 and Arm, but it'd be nice if the architecture owners can chime in with their own acks.

Be wary of geomeans. Large regressions can be hidden by a number of smaller improvements, and that's a bad direction to go. The other direction is also worrying, where one benchmark shows massive improvements and lots of others show regressions. Benchmarks tend to be very specific to one task and their numbers are artificial, whereas regressions on unrelated code mean you've added a base cost to *all other* programs.

With this patch, this is especially true. You need to make sure that the cost you're adding is "insignificant" overall, which I find hard to prove.

> Run-time checks may involve much more than just comparison and some pointer arithmetic. Here is an example how typical overflow check looks like:

My point wasn't how small or big they are, but how constant.

If that's the usual size of a range check, then it almost certainly eclipses the benefits of small loop vectorisation in the majority of cases.

But if the checks range from one ICMP to a chain of MULs, ADDs, etc., then it gets harder to predict the check size, and static assumptions will be hard.

Bear in mind that the following golden rule of compilers applies: whenever something is hard to do statically, we do it conservatively. To make it smarter, we *must* prove it's still safe in the general case.

Today we're in the former state, so you need to prove your change is safe in the general case.

cheers,
--renato

ebrevnov retitled this revision from [LV] Take overhead of run-time checks into account during vectorization. to [WIP][LV] Take overhead of run-time checks into account during vectorization..Dec 10 2019, 3:57 AM
ebrevnov updated this revision to Diff 236998.Thu, Jan 9, 2:44 AM

Introduce control flag for the feature and disable by default + Rebase.

ebrevnov edited the summary of this revision. (Show Details)Thu, Jan 9, 4:37 AM
ebrevnov edited the summary of this revision. (Show Details)Thu, Jan 9, 4:42 AM
ebrevnov retitled this revision from [WIP][LV] Take overhead of run-time checks into account during vectorization. to [LV] Take overhead of run-time checks into account during vectorization..
ebrevnov added reviewers: Ayal, hsaito, fhahn, anna.

Hi Evgeniy,

Hiding it behind a flag defaulting to false is a good idea. Also, now I understand what the benchmark and the architecture were, thanks!

I think the idea is not wholly unreasonable, especially needing a flag to run (for now). This will help people on other architectures test, too.

However, I'm not comfortable reviewing the code, as this seems pretty invasive and VPlan is changing constantly. The people already added as reviewers, and the ones I'm adding now, should have a look and review, too.

About the test-suite: it's not just running it, you need to run it in benchmark mode. It's not trivial, so I'm hoping people can help you with that. At least check the standard C++ benchmarks to see if there's any unexpected negative impact.

Thanks!
--renato