This is an archive of the discontinued LLVM Phabricator instance.

[WebAssembly] Enable loop unrolling
ClosedPublic

Authored by samparker on Jan 21 2021, 3:57 AM.

Details

Summary

I've started looking into the performance of wasm vs native code using a small selection of the LLVM test suite, focusing on kernels and the basic operations that could be expected in algorithms. The performance of tight kernels was particularly bad in V8 due to the stack guard checks it inserts so that any loop iteration can be interrupted. Loop unrolling seemed like the sensible way to reduce this overhead, but the benefit would need to outweigh the added binary size and associated compilation time. The results for wasmtime and node, using wasi-sdk, are shown below:

[Results charts for AdobeC++, Polybench, and TSVC were attached to the original review; they are not preserved in this archive.]

This patch enables loop unrolling by default for wasm, using a threshold of 30, since the results suggest the gains generally tail off beyond that point while binary size continues to increase. Polybench on the Raspberry Pi via wasmtime doesn't show the same performance uplift as the other benchmarks on the other platforms, but I'm not sure why.
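To make the stack-guard point concrete, here is a hypothetical tight kernel of the kind these suites contain (a sketch for illustration, not taken from the suites; floating-point reassociation is ignored). In V8 the compiled loop pays for an inserted guard check at each back edge, so unrolling by 4 executes that check a quarter as often:

// Baseline: one back edge, and hence one guard check, per element.
float sum(const float *a, int n) {
  float s = 0.0f;
  for (int i = 0; i < n; ++i)
    s += a[i];
  return s;
}

// Unrolled by 4: one back edge (and one guard check) per four elements.
float sum_unrolled4(const float *a, int n) {
  float s = 0.0f;
  int i = 0;
  for (; i + 3 < n; i += 4)
    s += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
  for (; i < n; ++i) // scalar remainder for the leftover 0-3 elements
    s += a[i];
  return s;
}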

Diff Detail

Event Timeline

samparker created this revision.Jan 21 2021, 3:57 AM
samparker requested review of this revision.Jan 21 2021, 3:57 AM
Herald added a project: Restricted Project.Jan 21 2021, 3:57 AM

cc @kripken what do you think about the size-performance trade-off here?

llvm/lib/Target/WebAssembly/WebAssemblyTargetTransformInfo.cpp
108

Is this because loops with calls are presumed to not be hot enough to be worth unrolling?

Very interesting perf data!

Overall it seems like a speedup of around 10% for a size cost of 10% (with much more in one case, showing that outliers are possible). That is significant and would support enabling it when not optimizing for size.

However, wasm VMs may eventually do loop unrolling themselves, so the speedup here may evaporate over time, while the code size cost will not. That, and the possibility of outliers on code size, worry me. So personally I would suggest considering not doing this for now, and approaching the VM people with the data, as motivation for them to do it.

Overall my personal philosophy is that anything that regresses code size should preferably be done in the wasm VM. But there are exceptions of course, like inlining, and maybe this is another.

Thanks for taking a look!

While I agree that we should be wary of code size increases, it would seem a shame to me to move the burden into all the runtimes when LLVM is the shared resource and a very good place to do unrolling and the subsequent simplifications that it enables. I liken this to vectorization, except that unrolling can deliver performance even on targets whose runtimes haven't yet spent the time implementing their SIMD code generators. I've had a quick look at how vectorization stacks up against unrolling on my Ryzen machine:

[Results chart attached in the original review; not preserved in this archive.]

TSVC is a great example of the worst-case scenario for unrolling code size increases: it's a single file consisting of 100+ loops that iterate through fixed-size arrays. For vectorization, I suspect that many of the vectorized loops require neither a scalar remainder nor runtime pointer checks, and so avoid most of the code size increases that usually come with vectorization.
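For illustration, here is a hypothetical TSVC-style loop (a sketch in that style, not an actual TSVC kernel): the arrays are distinct fixed-size globals with a compile-time trip count, so the vectorizer can prove there is no aliasing without runtime pointer checks, and when the trip count is a multiple of the vector width no scalar remainder is needed either:

// Distinct fixed-size globals: no possible aliasing, constant trip count.
constexpr int N = 32000; // a multiple of any realistic vector width
float a[N], b[N], c[N];

// This can become straight-line SIMD code with no runtime pointer checks
// and no scalar remainder loop, so the usual code-size overheads of
// vectorization mostly disappear.
void kernel() {
  for (int i = 0; i < N; ++i)
    a[i] = b[i] * c[i] + b[i];
}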

llvm/lib/Target/WebAssembly/WebAssemblyTargetTransformInfo.cpp
108

I think the general reason is that a call acts as a scheduling barrier and so limits the ILP gains of unrolling; this implementation is basically copied from BasicTTI. And from my basic understanding of V8, calls can also serve as interrupt points, so the extra stack-guard logic doesn't have to be inserted into loops with calls either.
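Roughly, the hook under discussion looks like this (a minimal sketch modelled on the BasicTTI heuristic, assuming the TTI interface of this era; the committed code may differ, e.g. BasicTTI also checks whether a callee is actually lowered to a call):

// In WebAssemblyTargetTransformInfo.cpp (sketch only).
void WebAssemblyTTIImpl::getUnrollingPreferences(
    Loop *L, ScalarEvolution &SE, TTI::UnrollingPreferences &UP) {
  // Skip loops containing calls: a call is a scheduling barrier that
  // limits the ILP gain from unrolling, and in V8 a call site already
  // serves as an interrupt point, so these loops don't carry the
  // stack-guard overhead that motivates unrolling in the first place.
  for (BasicBlock *BB : L->blocks())
    for (Instruction &I : *BB)
      if (isa<CallInst>(I) || isa<InvokeInst>(I))
        return;

  UP.Partial = UP.Runtime = UP.UpperBound = true;

  // The benchmarking above suggests gains tail off beyond ~30 while
  // binary size keeps growing.
  UP.PartialThreshold = 30;

  // Avoid unrolling when optimizing for size (-Os/-Oz).
  UP.OptSizeThreshold = 0;
  UP.PartialOptSizeThreshold = 0;
}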

@dschuff, do you have an opinion about this?

I'm inclined to merge this and leave it to users for whom the code size increase is unacceptable to use -Os/-Oz/-fno-unroll-loops to disable it. I'd be curious to see what the code size and perf impacts are on the Emscripten benchmark suite, too.

I'm inclined to merge this and leave it to users for whom the code size increase is unacceptable to use -Os/-Oz/-fno-unroll-loops to disable it.

Is it possible to do this in -O3 but not -O2 or -O1?

I'd be curious to see what the code size and perf impacts are on the Emscripten benchmark suite, too.

Good idea! There are several real-world codebases there that are easy to measure for size and speed. Let me know if I can help.

I'm inclined to merge this and leave it to users for whom the code size increase is unacceptable to use -Os/-Oz/-fno-unroll-loops to disable it.

The unrolling would be disabled at -Os and -Oz, see

// Avoid unrolling when optimizing for size.
UP.OptSizeThreshold = 0;
UP.PartialOptSizeThreshold = 0;

Is it possible to do this in -O3 but not -O2 or -O1?

AFAIK, loop unrolling is disabled at -O1 for all targets. I don't think there's currently a backend mechanism to make decisions based on the opt level, but the loop unroller does use it to adjust its thresholds to be more aggressive in some situations.
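To sketch what I mean (a paraphrase of the generic LoopUnrollPass logic; the names and default values here are assumptions from my reading and may differ between releases), the unroller picks its base threshold from the opt level before the target hook runs:

// Hypothetical paraphrase of gatherUnrollingPreferences-style logic.
struct Preferences { unsigned Threshold; };

Preferences gatherPreferences(int OptLevel) {
  Preferences P;
  // More aggressive at -O3 than at -O2; 150/300 are the long-standing
  // generic defaults (assumed here). The target's own
  // getUnrollingPreferences override is applied afterwards and can
  // raise or lower the result.
  P.Threshold = OptLevel > 2 ? 300 : 150;
  return P;
}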

The unrolling would be disabled at -Os and -Oz, see

AFAIK, loop unrolling is disabled at -O1 for all targets. I don't think there's currently a backend mechanism to make decisions based on the opt level, but the loop unroller does use it to adjust its thresholds to be more aggressive in some situations.

I see, thanks samparker.

What concerns me is that some existing users will definitely be harmed by this, if the default for -O2 changes. That's the default in many build systems, so we will be getting bug reports on "why is my code larger", I bet.

With that said, if I am the only holdout, then I won't stand in the way. But if we do this I'd recommend we at least update changelogs and docs in preparation, to try to minimize the annoyance for users.

But if we do this I'd recommend we at least update changelogs and docs in preparation, to try to minimize the annoyance for users.

Sure, and users should definitely be recommended to compile at -Os if code size is their most important metric. From my (admittedly limited) benchmarking, the difference between the optimisation levels is quite negligible, yet -Os can reduce size without suffering the performance impact often associated with -Oz.

WebAssembly and runtimes are not my field, just saying so you can take my remarks with a bit (or a lot) of salt, but I do have a few opinions on loop-unrolling. :-)
I look at this problem slightly differently:

What concerns me is that some existing users will definitely be harmed by this, if the default for -O2 changes.

I.e., the way I look at this is that we deliver on one of WebAssembly's promises:

WebAssembly aims to execute at native speed by taking advantage of common hardware capabilities available on a wide range of platforms.

If this is the goal, if we want parity with native execution, that means we would have to apply the same tricks as ahead-of-time compiled code; otherwise we would never achieve it. Loop-unrolling is known to greatly improve performance, and also to be an enabler for other optimisations. Thus, without loop-unrolling, I don't think near-native execution speeds will be feasible.

I agree that a speedup of around 10% for a size cost of 10% is a good trade-off. But it doesn't mean users will get this. I.e., the perf data is shown for a few selected, compute-intensive loops, but may not be representative of existing code (not yet). One possibility is that people see minor code-size increases but large perf gains; another is that there is no increase in code size and no improvement in performance; it all depends. One way to find out and quantify this, if necessary, is to check an existing and representative code-base.

So this sounds like a good proposal to me, but I would turn it into something positive:

But if we do this I'd recommend we at least update changelogs and docs in preparation, to try to minimize the annoyance for users.

I.e., some good changes and good performance improvements are coming! :-) If people would like to opt out, then there are a few possibilities:

  • Use optimisation level -Os, which makes a more sensible decision about the performance vs. code-size trade-off.
  • Use the compiler flag -fno-unroll-loops, though I don't know how to set that in this environment. Probably it could be set with some environment or make variable?

If deployment is the issue, then there are a few other strategies. The switch can be made with a possibility to opt out, see above.
Or people can opt in if they e.g. set an EXPERIMENTAL_OPTIMIZATIONS flag/value/option, thus getting a taste of the perf uplifts if they are curious, while setting the clear expectation that this will become the default at some point.

Just my 2 cents.

WebAssembly and runtimes are not my field, just saying so you can take my remarks with a bit (or a lot) of salt, but I do have a few opinions on loop-unrolling. :-)
I look at this problem slightly differently:

What concerns me is that some existing users will definitely be harmed by this, if the default for -O2 changes.

I.e., the way I look at this is that we deliver on one of WebAssembly's promises:

WebAssembly aims to execute at native speed by taking advantage of common hardware capabilities available on a wide range of platforms.

If this is the goal, if we want parity with native execution, that means we would have to apply the same tricks as ahead-of-time compiled code; otherwise we would never achieve it. Loop-unrolling is known to greatly improve performance, and also to be an enabler for other optimisations. Thus, without loop-unrolling, I don't think near-native execution speeds will be feasible.

Well said, I agree.

But I think the question is where loop unrolling is done: on the wasm producer (LLVM) or the wasm consumer (the VM that runs it). As mentioned earlier, VMs are working on this, and the benefit of them doing it is that code size matters quite a bit on wasm, so doing it there would be better.

But I think the question is where loop unrolling is done: on the wasm producer (LLVM) or the wasm consumer (the VM that runs it).

Ok, thanks, fair enough.

As mentioned earlier, VMs are working on this, and the benefit of them doing it is that code size matters quite a bit on wasm, so doing it there would be better.

Like I said, I am not from this field, but I don't see by which criteria this is better:

  • In terms of implementation, I guess letting the LLVM-based producer do the loop-unrolling is better: loop-unrolling is a non-trivial optimisation to (re)implement properly, this leverages the existing LLVM infrastructure, and enabling it is a few lines of code, which must beat reimplementing loop-unrolling in a consumer.
  • Generated code-quality: assuming two loop-unrollers do their job equally well, there shouldn't be any performance difference, unless the runtime uses profile information that is fed back or has more target information.
  • Code-size: this should be the same in the end? I.e., for people interested in maximum performance, doing unrolling in the producer shouldn't increase code-size compared to when this is done at the consumer side. For folks that want better code-size, there are different knobs to steer this.
  • Compile times: doing loop-unrolling at the consumer side will impact start-up or install times?
  • Availability of loop-unrolling: doing the transformation in the wasm consumer (once) avoids needing it in the different producers and would thus be better. That is, if there is one consumer; if there are multiple consumer implementations, then there is no difference?

As mentioned earlier, VMs are working on this, and the benefit of them doing it is that code size matters quite a bit on wasm, so doing it there would be better.

Like I said, I am not from this field, but I don't see by which criteria this is better:

  • In terms of implementation, I guess letting the LLVM-based producer do the loop-unrolling is better: loop-unrolling is a non-trivial optimisation to (re)implement properly, this leverages the existing LLVM infrastructure, and enabling it is a few lines of code, which must beat reimplementing loop-unrolling in a consumer.

Fair point. Note, however, that LLVM is not the only producer, and some do not do loop unrolling. But as you said lower down, some VMs might not do it either, so there are definitely tradeoffs here.

  • Code-size: this should be the same in the end? I.e., for people interested in maximum performance, doing unrolling in the producer shouldn't increase code-size compared to when this is done at the consumer side. For folks that want better code-size, there are different knobs to steer this.

Are you thinking of machine code size? That matters too, but the bigger concern is wasm binary size, which is what is downloaded. Doing unrolling after the download is better for that.

  • Compile times: doing loop-unrolling at the consumer side will impact start-up or install times?

Yes, that is definitely a downside, and it is why we usually prefer to optimize in the producer. Optimizations that increase code size, however, are where the tradeoffs become more interesting, like here.

SjoerdMeijer added a comment (edited).Jan 26 2021, 9:29 AM

Are you thinking of machine code size? That matters too, but the bigger concern is wasm binary size, which is what is downloaded. Doing unrolling after the download is better for that.

Yep, I was thinking about the transformed/unrolled wasm, or the machine code, which has to be the same or similar. But yes, agreed of course that the input, the downloaded wasm, will be smaller compared to when the unrolling is done by the producer.

Thanks for clarifying and elaborating on this. It wasn't clear to me from previous comments what exactly was meant by "regressing code-size": it is not only about the code-size/perf ratio (the 10% size for 10% speed) and whether that is good enough, but also about the download size. That's clear now.

I will now leave the rest of this discussion to the wasm professionals. :-) Cheers.

Gentle ping... and I'd like to clarify my point about code size: wasm is actually no different from any other backend in that respect, where the user can dictate whether code size is the most important factor through optimisation levels. The ARM backend is a good example of this, where we spend a considerable amount of time tuning our microcontrollers for both -Oz and -O3. The usage of optimisation levels nearly always requires some education, but I think this is easier than re-implementing loop unrolling elsewhere and/or letting users figure out how to enable sensible unrolling via the command line.

I ran Emscripten's larger benchmarks with this patch applied and these are the results:

[Results table attached in the original review; not preserved in this archive.]

The results are great on coremark but more modest on the other benchmarks. Interestingly, loop unrolling has no effect at all on lzma or poppler. It's also good to see that the increases in the compressed sizes are uniformly smaller than the increases in the uncompressed sizes.

Thanks @tlively. I just took a look at lzma and yes, it looks like it should benefit from unrolling... What runtime are you running on? I've just done a few runs on wasmtime and, frankly, the benchmark looks far too noisy to be useful: my execution times vary between ~6.5 and 8 seconds!

And to qualify that, native execution runs vary by tens of milliseconds, and with -fno-unroll-loops it is ~9% slower.

My node results are far more stable than wasmtime's, on both x86_64 and aarch64, with and without Liftoff. Those results correlate with yours and show no benefit from loop unrolling, but I'm seeing a ~4% binary size increase too. So I'm surprised and confused, and I guess I will investigate further.

I've spent the day looking at lzma and my conclusion is that some loops are vectorized and/or unrolled but they're not on the hot path; I now believe the improvement I saw on native was purely from secondary side effects. Loop-unrolled code is ~4% larger and with -msimd128 enabled it's ~2.5% larger, so I'm still surprised you're not seeing a change there.

tlively accepted this revision.Feb 9 2021, 2:26 PM

Sorry to leave you hanging on this, @samparker! This LGTM, but in case V8 or other engines land loop unrolling in the future, we might want to re-evaluate this.

This revision is now accepted and ready to land.Feb 9 2021, 2:26 PM
This revision was landed with ongoing or failed builds.Feb 10 2021, 12:26 AM
This revision was automatically updated to reflect the committed changes.

Thanks! Yes, agreed. I've also added a couple of tests before committing, just to check that unrolling is indeed disabled at -Os and -Oz.