Page MenuHomePhabricator

Improve loop unrolling performance on T99
ClosedPublic

Authored by steleman on Nov 30 2017, 5:19 PM.

Details

Summary

This patch improves performance on T99 as shown here (libquantum 0.2.4):

https://docs.google.com/spreadsheets/d/1Lo1o2E1NjrpkwS7DvYYWsiVvPdd93h7KBaqeptMrZPY/edit?usp=sharing

By increasing the LoopMicroOpsBufferSize in the T99 Scheduler file, loop unrolling becomes more aggressive. This helps performance on T99.

Test case included.

Diff Detail

Repository
rL LLVM

Event Timeline

steleman created this revision.Nov 30 2017, 5:19 PM

128 is a really big number for LoopMicroOpBufferSize. You might want to consider modifying AArch64TTIImpl::getUnrollingPreferences with some more tailored heuristics.

Also, it would be nice to see the impact across a wider set of benchmarks, like the LLVM testsuite, so it's clear what impact more aggressive unrolling has in general.

128 is a really big number for LoopMicroOpBufferSize.

It appears to be the optimal number for T99, after having spent a lot of time moving it up and down in steps of 2 and testing the effects.

Larger than 128 doesn't improve performance or increase unrolling -- at least on SPECcpu2017, libquantum and Google Protobufs, on T99. Smaller than 128 impacts performance - 128 is always better than anything smaller than.

What would be nice is being able to set this parameter as a compile-time -mllvm -aarch64-loop-micro-ops-buffer-size option.

I have not tested other AArch64 micro-arch's, as I have no access to them.

You might want to consider modifying AArch64TTIImpl::getUnrollingPreferences with some more tailored heuristics.

That's something I would be happy to take a look at, but I am reluctant, for now, to make changes in AArch64TTIImpl::getUnrollingPreferences. That's a more involved change.

Also, it would be nice to see the impact across a wider set of benchmarks, like the LLVM testsuite, so it's clear what impact more aggressive unrolling has in general.

I make no claims that every single ISA or AArch64 micro-arch will benefit from increasing their LoopMicroOpsBufferSize. This is a micro-arch specific change for T99.

Also, for this T99 specific change, the LLVM testsuite probably isn't the best benchmark. The specific type of loops that benefit most from this change are loops that contain a large number of nested conditionals. There are many loops of this type in SPECcpu2017 and in libquantum (quantum_toffoli). I'm not sure this type of deeply-nested loop is that widespread in the LLVM testsuite.

efriedma accepted this revision.Nov 30 2017, 7:11 PM

LGTM.

Larger than 128 doesn't improve performance or increase unrolling -- at least on SPECcpu2017, libquantum and Google Protobufs, on T99.

For benchmarking, just wanted to make sure you did some wider testing; I don't care about the LLVM testsuite specifically.

That's something I would be happy to take a look at, but I am reluctant, for now, to make changes in AArch64TTIImpl::getUnrollingPreferences. That's a more involved change.

Sure, that's fine.

This revision is now accepted and ready to land.Nov 30 2017, 7:11 PM
fhahn added a subscriber: fhahn.Dec 1 2017, 1:27 AM
fhahn accepted this revision.Dec 1 2017, 7:57 AM

LGTM, as long as you are happy with the speedup.

This revision was automatically updated to reflect the committed changes.