I've started looking into the performance of wasm vs native code using a small selection of the LLVM test suite, focusing on kernels and the basic operations that typical algorithms are built from. The performance of tight kernels was particularly bad in V8 due to the stack-guard checks it inserts so that any loop iteration can be interrupted. Loop unrolling seemed like the sensible way to reduce this overhead, but the benefit would need to outweigh the added binary size and the associated compilation time. The results for wasmtime and node, using wasi-sdk, are shown below:
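To illustrate the overhead being targeted, here is a sketch in plain C (the function names are illustrative, not from the patch): engines insert an interruption/stack-guard check on each loop back-edge, so unrolling by a factor of four executes four iterations' worth of work per check, at the cost of larger code.

```c
#include <stddef.h>

/* Tight kernel: when compiled to wasm, the engine inserts a
 * stack-guard check on each back-edge, i.e. once per element. */
long sum(const int *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; ++i)
        s += a[i];
    return s;
}

/* 4x-unrolled form: one back-edge (and thus one guard check)
 * per four elements, with a remainder loop for the tail. */
long sum_unrolled(const int *a, size_t n) {
    long s = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        s += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    for (; i < n; ++i)  /* leftover 0-3 elements */
        s += a[i];
    return s;
}
```

This is the transformation LLVM's unroller performs automatically; the question for wasm is whether the runtime win justifies the code-size and compile-time cost.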
Adobe-C++
Polybench
TSVC
This patch enables loop unrolling by default for wasm, using a threshold of 30, as the results suggest the gains generally tail off beyond that point while the binary size continues to grow. Polybench on the Raspberry Pi via wasmtime doesn't show the same performance uplift as the other benchmarks on the other platforms, but I'm not sure why.
Is this because loops containing calls are presumed not to be hot enough to be worth unrolling?