This is an archive of the discontinued LLVM Phabricator instance.

[LV] Take overhead of run-time checks into account during vectorization.
Needs ReviewPublic

Authored by ebrevnov on Dec 5 2019, 2:28 AM.

Details

Summary

Currently the vectorizer rejects short trip count loops in the presence of run-time checks. That is caused by the cost model's inability to take the overhead of run-time checks into account. While the ideal solution would be to calculate the entire cost of a VPlan, there is no infrastructure in place for that. I see two possibilities to mitigate this:

  1. Make a dry run of SCEV expansion during the main phase of cost modeling. Instead of real SCEV expansion, we would only calculate the cost of such an expansion.
  2. Defer the overhead calculation for run-time checks, and the final decision, until after the run-time checks have been expanded. If the overhead turns out to make vectorization unprofitable, emit an unconditional bypass of the vectorized loop and rely on subsequent optimizations to remove the trivially dead code.

While 1) may look like the better approach, I see it as very problematic in practice. That's why I decided to go with 2).
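
A minimal sketch of what 2) boils down to, in C++ with purely illustrative names (this is not the actual interface added by the patch):

  // Sketch only: the profitability decision is re-evaluated after the
  // run-time checks have been expanded, when their real cost is known.
  #include <cstdint>

  struct Decision {
    bool Profitable;        // keep the vector loop reachable?
    bool BypassVectorLoop;  // if not, branch unconditionally to the scalar loop
  };

  Decision decideAfterCheckExpansion(uint64_t RTCheckCost,
                                     uint64_t VectorLoopCost,
                                     uint64_t ScalarLoopCost) {
    bool Profitable = RTCheckCost + VectorLoopCost < ScalarLoopCost;
    // When unprofitable, the already-emitted vector loop is not deleted here;
    // the guarding branch is folded to always take the scalar path, and later
    // passes (SimplifyCFG/DCE) remove the now-dead vector code.
    return {Profitable, /*BypassVectorLoop=*/!Profitable};
  }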

Please note that loops vectorized before this change remain vectorized in the same way after it. Only short trip count loops that were not vectorized before may potentially get vectorized (if proven profitable by the cost model). Thus the change has a pretty narrow scope and is not expected to have a wide performance impact (neither positive nor negative).

The main motivation for the change is a performance gain on a benchmark based on a publicly available linear algebra library (https://github.com/fommil/netlib-java). The following simple loop over the DSPR method from Netlib shows a +55% gain on an Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz:

for (int i = 0; i < 1000; i++) {
  blas.dspr("l", 9, 1.0, vectorD, 1, result);
}

I measured the performance impact on the LLVM test-suite, but I don't know how to interpret the results. I ran the original and patched compilers 3 times each, and the results are unstable. Even for consecutive runs of the original compiler I observe +100/-100% variation. This is a dedicated performance-testing machine and no other applications were running.

In addition, I evaluated performance on SPECjvm98, DaCapo 9.12 and Netlib. All measurements were performed on Linux with an Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz. No performance changes were detected. I can share the numbers if anyone wants to see them.

Note that the feature is disabled by default for now. It is going to be enabled downstream first, to give it extensive testing on a wider range of real-world applications.

Event Timeline

ebrevnov created this revision.Dec 5 2019, 2:28 AM
Herald added a project: Restricted Project.

Is this change inspired by a real world case? If so, how relevant / pervasive is this case?

Also, how different are the run time checks from each other?

If we assume most of them end up being an additional comparison, sometimes with a pointer dereference, and assuming the values will already be in registers (as they're about to be looped over), I think their costs end up being roughly the same, no?

Or am I missing something obvious?

Thanks!
--renato

Is this change inspired by a real world case? If so, how relevant / pervasive is this case?

Yes, this case came from a real-world benchmark. This change gives +15% on it.

Also, how different are the run time checks from each other?

If we assume most of them end up being an additional comparison, sometimes with a pointer dereference, and assuming the values will already be in registers (as they're about to be looped over), I think their costs end up being roughly the same, no?

Or am I missing something obvious?

Run-time checks may involve much more than just a comparison and some pointer arithmetic. Here is an example of what a typical overflow check looks like:

vector.scevcheck:                                 ; preds = %tc.check
%mul = call { i32, i1 } @llvm.umul.with.overflow.i32(i32 %1, i32 %indvar)
%mul.result = extractvalue { i32, i1 } %mul, 0
%mul.overflow = extractvalue { i32, i1 } %mul, 1
%54 = add i32 %loop-predication.iv, %mul.result
%55 = sub i32 %loop-predication.iv, %mul.result
%56 = icmp ugt i32 %55, %loop-predication.iv
%57 = icmp ult i32 %54, %loop-predication.iv
%58 = select i1 false, i1 %56, i1 %57
%59 = or i1 %58, %mul.overflow
%60 = or i1 false, %59
%mul268 = call { i32, i1 } @llvm.umul.with.overflow.i32(i32 %2, i32 %indvar)
%mul.result269 = extractvalue { i32, i1 } %mul268, 0
%mul.overflow270 = extractvalue { i32, i1 } %mul268, 1
%61 = add i32 %arg4, %mul.result269
%62 = sub i32 %arg4, %mul.result269
%63 = icmp ugt i32 %62, %arg4
%64 = icmp ult i32 %61, %arg4
%65 = select i1 false, i1 %63, i1 %64
%66 = or i1 %65, %mul.overflow270
%67 = or i1 %60, %66
br i1 %67, label %scalar.ph, label %vector.memcheck
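
Roughly, each umul.with.overflow / add / sub / icmp group above verifies that one {start,+,step} SCEV expression does not wrap over the vector trip count, and the trailing or instructions merge the per-expression overflow flags into the single condition that, if set, branches to the scalar path and otherwise falls through to the memory checks.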

Thanks
Evgeniy

Thanks!
--renato

Yes, this case came from a real-world benchmark. This change gives +15% on it.

Well, 15% on an unnamed benchmark on an unnamed architecture shouldn't be a reason to change core functionality of the vectoriser.

The first step is to show that, on LLVM's own test-suite (in benchmarking mode), multiple programs are changed and the effect is largely positive.

But you also need to make sure that the change has no significant negative effect on any of the benchmarks people care about, on the major architectures supported by LLVM.

Obvious candidates are SPEC, x86_64 and Arm, but it'd be nice if the architecture owners can chime in with their own acks.

Be aware of geomeans. Large regressions can be hidden by a number of smaller improvements, and that's a bad direction to go. The other way is also worrying, where one benchmark shows massive improvements and lots of others show regressions. Benchmarks tend to be very specific to one task and their numbers are artificial, whereas regressions on unrelated code mean you've added a base cost to *all other* programs.

With this patch, this is especially true. You need to make sure that the cost you're adding is "insignificant" overall, which I find hard to prove.

Run-time checks may involve much more than just a comparison and some pointer arithmetic. Here is an example of what a typical overflow check looks like:

My point wasn't how small or big they are, but how constant they are.

If that's the usual size of a range check, then it almost certainly eclipses the benefits of small loop vectorisation in the majority of cases.

But if the checks range from one ICMP to a range of MUL, ADD, etc., then it gets harder to predict the check size, and static assumptions will be hard.

Bear in mind that the following golden rule of compilers applies: whenever something is hard to do statically, we do it conservatively. To make it smarter, we *must* prove it's still safe in the general case.

Today we're in the former state, so you need to prove your change is safe in the general case.

cheers,
--renato

ebrevnov retitled this revision from [LV] Take overhead of run-time checks into account during vectorization. to [WIP][LV] Take overhead of run-time checks into account during vectorization..Dec 10 2019, 3:57 AM
ebrevnov updated this revision to Diff 236998.Jan 9 2020, 2:44 AM

Introduce control flag for the feature and disable by default + Rebase.

ebrevnov edited the summary of this revision. (Show Details)Jan 9 2020, 4:37 AM
ebrevnov edited the summary of this revision. (Show Details)Jan 9 2020, 4:42 AM
ebrevnov retitled this revision from [WIP][LV] Take overhead of run-time checks into account during vectorization. to [LV] Take overhead of run-time checks into account during vectorization..
ebrevnov added reviewers: Ayal, hsaito, fhahn, anna.

Hi Evgeniy,

Hiding it behind a flag defaulting to false is a good idea. Also, now I understand what the benchmark and the architecture were, thanks!

I think the idea is not wholly unreasonable, especially needing a flag to run (for now). This will help people on other architectures test, too.

However, I'm not comfortable reviewing the code, as this seems pretty invasive and VPlan is changing constantly. The people already added as reviewers, and the ones I'm adding now, should have a look and review, too.

About the test-suite: it's not just about running it, you need to run it in benchmark mode. That's not trivial, so I'm hoping people can help you with it. At least check the standard C++ benchmarks to see if there's any unexpected negative impact.

Thanks!
--renato

Can anyone take a look at this? Thanks in advance!

In the commit message:

I see to possibilities to mitigate that.

to -> two

bypass of vectorized loop and relay on the following optimizations to remove trivially dead code.

relay->rely?

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
4909

I think that the "forced" check should be done before this, and if it is forced, no further checks are needed. With that, control flow within the if branches should simplify a lot.

4962

nit: whitespace after "with" missing.

6540

Maybe rename C -> Blocks?

6591

Can be just CM.BestVF.Width > 1; the != 0 part will simply be implied.

6613

Assert that it was at least 1 before it?

6638

Also assert that it is conditional? This fact seems to be used below on line 6766.

Just nits from me, I don't know this part well enough to approve.

Thanks, Maxim, for looking at this! It would be really nice to hear from anyone who is well familiar with the code.

Evgeniy

fhahn added a comment.Feb 16 2020, 2:24 PM

While 1) may look like the better approach, I see it as very problematic in practice. That's why I decided to go with 2).

Sorry if I missed an earlier discussion about this, but why do you think that expanding the RT checks is very problematic in practice? Is it the cost of expanding the RT checks early (and maybe throwing them away later)?
I am currently experimenting with this approach to see if it is feasible to use size estimates to better decide on the number of runtime checks (including mem ptr checks), regardless of the vectorisation mode.

The approach in this patch seems to delay & interleave parts of cost-modelling with codegen and adds quite a bit of additional plumbing through the codebase to do so.

IIUC, the basic idea is to enable vectorisation if the cost of runtime checks + vector cost < scalar cost? I think this might be too relaxed, as the cost of the RT checks has to be paid even if they fail and we execute the scalar loop. And I guess the larger the number of checks, the larger the probability of one failing. Something related also came up in https://bugs.llvm.org/show_bug.cgi?id=44662. I am not yet sure what better alternatives there are, though.
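
To put a rough formula on that concern (illustrative symbols only, not something the patch computes): with N checks that each pass with probability p, the expected cost of taking the vectorized path is about CheckCost + p^N * VectorCost + (1 - p^N) * ScalarCost, so vectorising only pays off when CheckCost < p^N * (ScalarCost - VectorCost). The simple "CheckCost + VectorCost < ScalarCost" test corresponds to assuming p^N is close to 1, and the more checks there are, the less likely that is.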

ebrevnov added a comment.EditedFeb 16 2020, 9:25 PM

@fhahn, thanks for taking the time to look at this!

While 1) may look like the better approach, I see it as very problematic in practice. That's why I decided to go with 2).

Sorry if I missed an earlier discussion about this, but why do you think that expanding the RT checks is very problematic in practice? Is it the cost of expanding the RT checks early (and maybe throwing them away later)?
I am currently experimenting with this approach to see if it is feasible to use size estimates to better decide on the number of runtime checks (including mem ptr checks), regardless of the vectorisation mode.

You slightly misread the approach suggested in 1). The wild idea was to not generate any instructions at all. The suggestion is to make a dry run of SCEV expansion, but instead of generating code we would evaluate the cost of the instructions requested to be built. To make it less invasive we could, for example, pass a custom Builder which would do that (not really sure this is possible, though). Another issue is that TTI still needs an instruction to evaluate the cost.

IIUC, you suggest actually building the IR generated by SCEVExpander for the run-time checks, evaluating its cost and then discarding it if not needed. I would be more than happy if we could do that. But I believe the current infrastructure doesn't allow generating side IR to play with and then discard. AFAIK this limitation was the main motivation for the introduction of VPlan.

The approach in this patch seems to delay & interleave parts of cost-modelling with codegen and adds quite a bit of additional plumbing through the codebase to do so.

Agreed. It would be really great if we could find something more straightforward. Just to note that some of the complexity of the current patch comes from the need to support both the current and the new way of doing cost modeling. In addition, at the moment the patch targets short trip count loops only. We could streamline the code a bit more if we made it the default for all loops.

IIUC, the basic idea is to enable vectorisation if the cost of runtime checks + vector cost < scalar cost? I think this might be too relaxed, as the cost of the RT checks has to be paid even if they fail and we execute the scalar loop. And I guess the larger the number of checks, the larger the probability of one failing. Something related also came up in https://bugs.llvm.org/show_bug.cgi?id=44662. I am not yet sure what better alternatives there are, though.

Agreed. Unfortunately, I don't have a good idea either. Just a note that we currently disregard this for the trip count check, which is always generated for loops with a non-constant trip count.

IIUC, the basic idea is to enable vectorisation if the cost of runtime checks + vector cost < scalar cost? I think this might be too relaxed, as the cost of the RT checks has to be paid even if they fail and we execute the scalar loop. And I guess the larger the number of checks, the larger the probability of one failing.

I think the general assumption is that these runtime checks will never fail in typical programs and only have to be added to not miscompile edge cases (array aliasing, unsigned overflow, MIN_INT).

fhahn added a comment.Mar 11 2020, 4:12 AM

@fhahn, thanks for taking the time to look at this!

While 1) may look like the better approach, I see it as very problematic in practice. That's why I decided to go with 2).

Sorry if I missed an earlier discussion about this, but why do you think that expanding the RT checks is very problematic in practice? Is it the cost of expanding the RT checks early (and maybe throwing them away later)?
I am currently experimenting with this approach to see if it is feasible to use size estimates to better decide on the number of runtime checks (including mem ptr checks), regardless of the vectorisation mode.

You slightly misread the approach suggested in 1). The wild idea was to not generate any instructions at all. The suggestion is to make a dry run of SCEV expansion, but instead of generating code we would evaluate the cost of the instructions requested to be built. To make it less invasive we could, for example, pass a custom Builder which would do that (not really sure this is possible, though). Another issue is that TTI still needs an instruction to evaluate the cost.

IIUC, you suggest actually building the IR generated by SCEVExpander for the run-time checks, evaluating its cost and then discarding it if not needed. I would be more than happy if we could do that. But I believe the current infrastructure doesn't allow generating side IR to play with and then discard. AFAIK this limitation was the main motivation for the introduction of VPlan.

I finally had time to wrap up a proof-of-concept patch that creates the runtime checks up-front and drops them if not required: D75980. It seems to work quite well in practice.

The approach in this patch seems to delay & interleave parts of cost-modelling with codegen and adds quite a bit of additional plumbing through the codebase to do so.

Agreed. It would be really great if we could find something more straightforward. Just to note that some of the complexity of the current patch comes from the need to support both the current and the new way of doing cost modeling. In addition, at the moment the patch targets short trip count loops only. We could streamline the code a bit more if we made it the default for all loops.

IIUC, the basic idea is to enable vectorisation if the cost of runtime checks + vector cost < scalar cost? I think this might be too relaxed, as the cost of the RT checks has to be paid even if they fail and we execute the scalar loop. And I guess the larger the number of checks, the larger the probability of one failing. Something related also came up in https://bugs.llvm.org/show_bug.cgi?id=44662. I am not yet sure what better alternatives there are, though.

Agreed. Unfortunately, I don't have a good idea either. Just a note that we currently disregard this for the trip count check, which is always generated for loops with a non-constant trip count.

Building on D75980, I have put up D75981, which allows larger trip counts if the cost of the runtime checks is less than 0.5% of the expected scalar cost of the loop. The threshold is quite arbitrarily chosen (and would need further analysis/benchmarking to tune), but it might be a viable general idea.
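
For a sense of scale (illustrative numbers only, not taken from D75981): if the expected scalar cost of a loop is about 20,000 cost units, e.g. a trip count of 1,000 at roughly 20 units per iteration, the 0.5% threshold would admit runtime checks costing up to about 100 units.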

I think it would be relatively straightforward to implement the logic in this patch on top of D75980 as well, if there is agreement that building the RT checks up-front is preferable.