This is an archive of the discontinued LLVM Phabricator instance.

Disable loop unrolling in loop vectorization pass when VF is 1 on x86
ClosedPublic

Authored by wmi on May 5 2015, 10:36 PM.

Details

Summary

This patch fixes the problem described in https://llvm.org/bugs/show_bug.cgi?id=23217

Loop unrolling in the loop vectorization pass has two kinds of benefits: 1. For a loop that needs to be both vectorized and unrolled, unrolling integrated with the loop vectorization pass can generate less prologue/epilogue code. 2. Unrolling in loop vectorization generates a memory boundary check for the unrolled loop version, which is useful for better scheduling on some architectures.

However, x86 performance is not very sensitive to compile-time scheduling. Unrolling in loop vectorization when VF==1 therefore introduces the extra cost of an overflow check, a memory boundary check, and sometimes extra prologue/epilogue code when the regular unroller unrolls the loop a second time. These are harmful for performance on x86.
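To make the overhead described above concrete, here is a hand-written sketch of the shape an interleaved but not vectorized loop (VF==1, interleave count 2) tends to take. It is an illustration only, not compiler output: the function name `addOneInterleavedBy2`, the loop body, and the interleave count are assumptions chosen for the example, and the trip-count overflow check is omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical example: computes a[i] = b[i] + 1 for n elements, laid out the
// way an interleaved scalar loop is structured. Returns true if the
// interleaved fast path ran, false if the plain scalar fallback ran.
bool addOneInterleavedBy2(int *a, const int *b, std::size_t n) {
  assert((a != nullptr && b != nullptr) || n == 0);
  constexpr std::size_t IC = 2; // interleave count, assumed for illustration

  // Runtime memory boundary check: the fast path hoists loads above stores,
  // which is only safe when the two ranges do not overlap.
  const auto aBegin = reinterpret_cast<std::uintptr_t>(a);
  const auto bBegin = reinterpret_cast<std::uintptr_t>(b);
  const std::uintptr_t aEnd = aBegin + n * sizeof(int);
  const std::uintptr_t bEnd = bBegin + n * sizeof(int);
  const bool overlap = aBegin < bEnd && bBegin < aEnd;

  if (overlap || n < IC) {
    // Fallback: the original scalar loop.
    for (std::size_t i = 0; i < n; ++i)
      a[i] = b[i] + 1;
    return false;
  }

  // Main loop: two iterations per trip, with both loads issued first.
  std::size_t i = 0;
  const std::size_t mainEnd = n - (n % IC);
  for (; i < mainEnd; i += IC) {
    const int x0 = b[i];
    const int x1 = b[i + 1];
    a[i] = x0 + 1;
    a[i + 1] = x1 + 1;
  }

  // Epilogue: leftover iterations when n is not a multiple of IC.
  for (; i < n; ++i)
    a[i] = b[i] + 1;
  return true;
}
```

The check, the fallback loop, and the epilogue are all code the plain scalar loop does not need. If the regular unroller later unrolls the main loop again, it may add a further prologue or epilogue on top of these.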

The patch disables unrolling in the loop vectorization pass when VF==1 on the x86 architecture by setting MaxInterleaveFactor to 1.
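The idea of capping the interleave factor when VF==1 can be sketched in isolation as follows. This is a standalone model, not the LLVM source: the function names, the `defaultFactor` parameter, and the clamping helper are hypothetical and exist only to show how a target-reported maximum of 1 turns interleaving off.

```cpp
#include <cassert>

// Hypothetical stand-in for a target hook that reports the maximum
// interleave factor. The name and signature are assumptions for this sketch.
inline unsigned maxInterleaveFactorX86(unsigned VF, unsigned defaultFactor) {
  assert(VF >= 1 && defaultFactor >= 1);
  // VF == 1 means the loop stays scalar. Reporting 1 tells the vectorizer
  // not to interleave, leaving unrolling to the regular unroller.
  if (VF == 1)
    return 1;
  return defaultFactor;
}

// Hypothetical helper: clamp a requested interleave count to the target's
// reported maximum.
inline unsigned chooseInterleaveCount(unsigned VF, unsigned requested,
                                      unsigned defaultFactor) {
  const unsigned maxIC = maxInterleaveFactorX86(VF, defaultFactor);
  return requested < maxIC ? requested : maxIC;
}
```

With this shape, any requested interleave count collapses to 1 for a scalar loop, while vectorized loops (VF > 1) keep whatever limit the target would otherwise report.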

Performance is neutral on SPEC2000. On Google internal benchmarks, detection improved by 5% on Sandy Bridge and 9% on Westmere, and saw improved by 1.5% on both platforms.

Diff Detail

Repository
rL LLVM

Event Timeline

wmi updated this revision to Diff 25008. May 5 2015, 10:36 PM
wmi retitled this revision from to Disable loop unrolling in loop vectorization pass when VF is 1 on x86.
wmi updated this object.
wmi edited the test plan for this revision.
wmi added a reviewer: hfinkel.
wmi set the repository for this revision to rL LLVM.
wmi added a subscriber: Unknown Object (MLST).

Hi Wei,

The example you have shown would produce bad vectorized code on any architecture, and I don't think anything you said (multiple unrolling and prologue loops) would make much difference on other archs. Maybe you're trying to fix a global problem locally, and creating some unnecessary constraints for the cases that do work.

However, your performance improvements are really impressive, so I think we ought to check other archs, and maybe try to detect the problematic case on a generic level?

cheers,
--renato

hfinkel accepted this revision. May 6 2015, 9:01 AM
hfinkel edited edge metadata.

The problem is fairly generic, but does need per-target tuning. Interleaving can be quite beneficial for VF == 1 on in-order chips with fairly long pipelines (especially for floating point), but on those architectures you end up unrolling a lot to get good performance. On X86, the constraints are different. On X86 we can't unroll a lot (in some sense), because you're unrolling to fill the loop-stream detector's associated dispatch buffer, and there is a large performance cliff if the loop no longer fits into the buffer. As a result, all of this unrolling is minor, and the extra prologues really hurt a lot.

This LGTM.

This revision is now accepted and ready to land. May 6 2015, 9:01 AM

ok, cheers!

This revision was automatically updated to reflect the committed changes.