This is an archive of the discontinued LLVM Phabricator instance.

[X86] Enable the post-RA-scheduler for 32-bit cpus
ClosedPublic

Authored by mbodart on Apr 14 2016, 4:11 PM.

Details

Summary

This change set enables the post RA scheduler for pentium4 and SSE3 class cpus.

The intent is that a vanilla "clang -m32 -O2" compilation, with no specific
-march setting, will get post scheduled. This has demonstrated significant
performance improvements when the code is run on Silvermont, with
essentially neutral performance on avx2 systems.

Some silvermont highlights include over 20% improvements
in 456.hmmer and several EEMBC benchmarks, and 4 to 15%
improvements in over a dozen other industry benchmarks.
There were only a few drops, in the 2 to 4% range.

On an AVX2 system, performance was generally flat, with
a balance of smallish gains and losses (in the 2 to 4% range).

The key scheduling improvement is that loads get separated
from their users.

As coded, the change should not affect "clang -m64" compilations.

With -m32, clang currently defaults to -mcpu=pentium4.
So I changed the scheduling model for pentium4 from the
GenericModel, to a new GenericPostRAModel (same properties,
but also enables the PostRAScheduler).

As clang's default CPU could theoretically change, I also
made the same change for "nearby" cpus pentium-m, pentium4m,
prescott and nocona. Arguably yonah should be in this set
as well, for consistency. But I'd like to get feedback on
whether this is an OK approach overall, as changing yonah will
affect many lit tests. I suspect most of these older cpus
are no longer used in practice, but don't really know.

Diff Detail

Event Timeline

mbodart updated this revision to Diff 53798.Apr 14 2016, 4:11 PM
mbodart retitled this revision from to [X86] Enable the post-RA-scheduler for 32-bit cpus.
mbodart updated this object.
mbodart added reviewers: nadav, aaboud.
mbodart added a subscriber: llvm-commits.
mbodart updated this object.Apr 20 2016, 8:38 AM
mbodart added a reviewer: spatel.
aaboud added inline comments.Apr 21 2016, 9:06 AM
lib/Target/X86/X86.td
290

Can you create a class for GenereicPostRAModel, similar to class Proc, something like this:

class ProcPostRA<string Name, list<SubtargetFeature> Features>
: ProcessorModel<Name, GenereicPostRAModel, Features>;

test/CodeGen/X86/pr16360.ll
2

Why did not you simply add the "-post-RA-scheduler=false" flag like in other tests?

mbodart added inline comments.Apr 21 2016, 12:12 PM
lib/Target/X86/X86.td
290

I think that might add more confusion than clarity.

There are currently two base classes used to define the cpus, Proc and ProcessorModel.
ProcessorModel is used in cases where we want to override the scheduling model.
Post-RA scheduling is just one property of the scheduling model.

In the cases where we do create a new class derived from ProcessorModel, it is done to encapsulate the processor "features" in one place.

So my preference is to keep the current approach, though it's a minor enough detail
that I can change it if others feel strongly.

test/CodeGen/X86/pr16360.ll
2

As a general rule I would think we want to avoid adding special options to tests,
as that can sometimes mask issues. But there are some exceptions.

I disabled the post-RA-scheduler in misched-ilp.ll because that test is looking for
a specific behavior from an earlier scheduling phase.

As for machine-cp.ll, I could go either way (updating the test or disabling the post scheduler).
The test is built with the x86_64 triple and nocona cpu. If built by clang with an x86_64 triple,
the post scheduler would not be enabled as the cpu would be set to "x86_64", not nocona.
So I arbitrarily chose to match this behavior. But I could simply update the test as well.
Is there any precedence or BKM here?

spatel added inline comments.Apr 25 2016, 1:14 PM
test/CodeGen/X86/pr16360.ll
2

I don't think this is actually written anywhere as an official guideline, but yes, I think we should prefer to only use special options when they are supposed to affect the outcome of the test. That gives us better test coverage for the common case.

There's a huge pile of x86 tests that excessively specify a CPU model when they really should specify an attribute (for example as seen in this patch, -mcpu=pentium4 rather than -mattr=sse2). We should try to fix those as we encounter them.

So my vote would be to update all of the affected test files with the flags that they are really testing (these would be test-file-only commits ahead of this one). It will make this patch smaller and the intended effect of this patch will be made clear rather than confused by a bunch of unintentional changes because test files were poorly specified.

mbodart added inline comments.Apr 26 2016, 4:36 PM
test/CodeGen/X86/pr16360.ll
2

Thanks for the suggestions Sanjay.

I created D19568 for the test changes, and will remove them from this review.

But note that the new test still needs use of -mcpu. The scheduler behavior
is tied to the cpu, not -mattr, AFAICT. If you know of a better mechanism for
verifying the post-scheduling behavior for a vanilla "clang -m32 -O2" compilation,
I'd be happy to use it.

mbodart updated this revision to Diff 55305.Apr 27 2016, 1:32 PM

Rebased, and removed several lit tests from this change set as I updated them separately.

spatel accepted this revision.Apr 27 2016, 3:27 PM
spatel edited edge metadata.

LGTM.

A couple of options to make sure the behavior is as expected:

  1. To confirm the codegen difference with -post-RA-scheduler, you could add a RUN (and corresponding CHECKs) that disables -post-RA-scheduler for one of the tested CPUs. Or add a run line for a newer CPU that does not have -post-RA-scheduler turned on by default.
  1. To confirm that the default i386 CPU model has -post-RA-scheduler enabled, you could add a RUN line that doesn't explicitly specify the CPU. Then if someone comes along and changes the default i386 target to something not in the current list, it should cause this test to fail.
This revision is now accepted and ready to land.Apr 27 2016, 3:27 PM

Thanks for the testing suggestions Sanjay, but I think I'll stick with the current test for now.

Regarding 1), that would only test that some other scheduling paradigm did not separate the loads, which may or may not be a desired behavior, and may change from time to time. So I prefer to limit the test to getting the expected behavior in the known settings where it is desired.

As for 2), the default cpu varies by tool. With clang -m32 it is pentium4 and with clang -m64 it is x86_64,
but with llc it is generic. So the test would indeed fail if I added an llc RUN case with no explicit cpu.

This revision was automatically updated to reflect the committed changes.