This is an archive of the discontinued LLVM Phabricator instance.

[X86] Add subtarget features prefer-avx256 and prefer-avx128 and use them to limit vector width presented by TTI
Abandoned · Public

Authored by craig.topper on Nov 2 2017, 3:20 PM.

Details

Summary

This is the first patch based on the RFC sent to llvm-dev to enable support for a preferred vector width for the vectorizer. I'll also submit a clang patch shortly to hook it up to the driver.

RFC: http://lists.llvm.org/pipermail/llvm-dev/2017-November/118734.html

This stores the preference as an enum in the subtarget, ordered by increasing strictness, because the autogenerated subtarget code needs to be able to take the max of the encoding values when both prefer-avx128 and prefer-avx256 are specified, in either order.
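
Roughly, the ordering works like this (a sketch only; the enumerator names are illustrative and not necessarily the ones in the patch):

// Sketch: stricter preferences get larger encodings, so merging two
// requested features with std::max picks the stricter one regardless of
// the order they appear on the command line.
#include <algorithm>

enum PreferVectorWidth { NoPreference = 0, Prefer256 = 1, Prefer128 = 2 };

static PreferVectorWidth mergePreference(PreferVectorWidth A,
                                         PreferVectorWidth B) {
  return std::max(A, B);
}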

From initial experiments there's still more work to do to prevent zmm register usage, but this gives us a baseline and plumbing that we can build on.
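
For context, the TTI hook being limited is the register-width query. A simplified sketch of the clamping (not the patch itself; the function and parameter names are made up):

// Simplified sketch, not the actual patch: clamp the vector register width
// that TTI reports to the vectorizer when a narrower width is preferred.
#include <algorithm>

static unsigned getRegisterBitWidthSketch(bool Vector, unsigned LegalWidthBits,
                                          unsigned PreferredWidthBits) {
  if (!Vector)
    return 64; // scalar GPR width on x86-64
  if (PreferredWidthBits != 0) // 0 means no preference was requested
    return std::min(LegalWidthBits, PreferredWidthBits);
  return LegalWidthBits;
}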

What's the best way to test the register width output from TTI in a lit test?

Diff Detail

Event Timeline

craig.topper created this revision. Nov 2 2017, 3:20 PM
DavidKreitzer edited edge metadata. Nov 3 2017, 7:00 AM

This looks reasonable to me, Craig. For testing, the obvious thing would be to test the effect on the LoopVectorizer, since it is the primary consumer of this information.

lib/Target/X86/X86Subtarget.h
64

It might be worth expanding upon this a bit to make it clear why Prefer128 is "stricter" than Prefer256.

I have no objection to this choice of ordering, but it does make the code at lines 535-539 read a little funny.

spatel added a subscriber: spatel. Nov 3 2017, 7:26 AM

Copying a note I sent to the dev-list; I assume we should just continue the discussion here now that we have a patch to review:

Why make this an x86-specific "feature" of the target? We already have options like this in LoopVectorize.cpp:

static cl::opt<unsigned> ForceTargetNumVectorRegs(
    "force-target-num-vector-regs", cl::init(0), cl::Hidden,
    cl::desc("A flag that overrides the target's number of vector registers."));

Can we add an equivalent target-independent override for vector width? Any target with >1 potential register width will benefit from having this option for experimentation.
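
For concreteness, such an override might look something like this (the flag name is hypothetical, not an existing option):

// Hypothetical analogue of ForceTargetNumVectorRegs; the flag name and
// wording are illustrative only.
#include "llvm/Support/CommandLine.h"
using namespace llvm;

static cl::opt<unsigned> ForceTargetVectorWidth(
    "force-target-vector-width", cl::init(0), cl::Hidden,
    cl::desc("A flag that overrides the vector register width (in bits) "
             "reported by the target. Zero means no override."));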

In the longer term we may end up turning -mprefer-avx256 on by default for skylake-avx512, so I think a vectorizer-specific command-line option would prevent us from having that control.

Even with the current patch I'm still seeing larger shuffles from the interleaved-store part of the vectorizer. I think that code was designed to create oversized vectors on the assumption that the backend would legalize them. I'm not sure whether we'll need to stop the vectorizer from doing that, or teach the backend to split 512-bit shuffles when this flag is on even though the type is "legal".

I'm also seeing gathers with an 8 x i32 type and an 8 x i64 index. I suspect the index is i64 because in IR we need it to match the pointer size, but I'm not sure. I'm hoping some known-bits analysis will allow us to narrow this back down in the backend, but if not we may have to split it when -mprefer-avx256 is on.
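
The check I have in mind would be along these lines (illustrative only, not code from the patch):

// Illustrative only: the narrowing test phrased over KnownBits. If the top
// 32 bits of the 64-bit gather index are known zero, an i32 index would be
// equivalent and the gather could be narrowed.
#include "llvm/Support/KnownBits.h"

static bool gatherIndexFitsIn32Bits(const llvm::KnownBits &IndexKnown) {
  return IndexKnown.getBitWidth() == 64 &&
         IndexKnown.countMinLeadingZeros() >= 32;
}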

craig.topper added inline comments. Nov 3 2017, 10:30 AM
lib/Target/X86/X86Subtarget.h
64

It's not really a choice; the tablegen-generated code only knows how to handle boolean features and features that represent a level (like the SSE level). For the level case it always takes the max. That way, if a CPU specifies a particular SSE level and the command line specifies a higher one, the command line will be able to override it.

Though in those cases we make the higher SSE levels imply the lower ones, so if you say no-sse2 we disable all the SSE levels 2 and above. I didn't do that here, so I'm not sure what would happen for the vector width levels.

I think I need to rethink this a bit.

In the longer term we may end up turning -mprefer-avx256 on by default for skylake-avx512, so I think a vectorizer-specific command-line option would prevent us from having that control.

I have to look at how the plumbing works again, but I didn't follow that reasoning. Let me explain how I see this and suggest alternatives to an x86 feature flag.

By exposing this as a clang command-line option and telling programmers to recompile their code and test perf themselves, we admit that neither the hardware nor the compiler can always dynamically or statically determine the optimum vector size, but the hardware and compiler should get this right most of the time (maybe we can even model the warm-up cost of turning on the vector unit?). Therefore, we should give programmers the flexibility to use this option very selectively. Ideally, it would be controlled by a function-level or loop-level pragma.

If that's the right description, then we should model this as metadata or a function attribute. Is there something from the OpenMP pragmas that can be adapted? If we need the backend to know the preference, we have precedent for that situation too: see the function attribute for square-root estimates.
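
As a sketch of that direction (the attribute name below is hypothetical, just following the string-attribute pattern used for the reciprocal-estimate settings):

// Hypothetical sketch only: reading a per-function width preference from a
// string function attribute. The attribute name is illustrative.
#include "llvm/IR/Attributes.h"
#include "llvm/IR/Function.h"

static unsigned getPreferredVectorWidthInBits(const llvm::Function &F) {
  llvm::Attribute A = F.getFnAttribute("prefer-vector-width");
  unsigned Width = 0; // 0 = no preference expressed
  if (A.isStringAttribute())
    if (A.getValueAsString().getAsInteger(10, Width))
      Width = 0; // reset if the attribute value failed to parse
  return Width;
}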

RKSimon edited edge metadata. Nov 5 2017, 4:50 AM

Should we be looking at this from a cost-model POV? We could introduce the concept of a cost for the "first use" of a vector type (in this case 512-bit vectors), increase the cost of 512-bit ops, and make the cost models more 'state aware' (use of a 512-bit vector causes a cost-multiplier effect on ALL vector ops).
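
One way to picture the 'state aware' idea (a purely conceptual sketch, not an existing TTI interface):

// Conceptual sketch only, not an existing TTI hook. Once any 512-bit op has
// been costed for a candidate loop, every later vector op pays a multiplier
// to model the whole-core effect of entering the 512-bit "state".
struct StatefulVectorCost {
  bool Seen512 = false;

  unsigned getCost(unsigned BaseCost, unsigned WidthBits) {
    if (WidthBits >= 512)
      Seen512 = true;
    return Seen512 ? BaseCost * 2 /* hypothetical penalty */ : BaseCost;
  }
};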

Having said that, the vectorizers don't seem to do a good job of comparing the costs of different vector widths - AVX1 targets (Jaguar/Bulldozer/SandyBridge, etc.) often end up with 256-bit integer vector code despite the fact that the costs already flag the 4x cost compared to the 128-bit integer equivalents (and cause nasty register-spill issues).

hfinkel edited edge metadata. Nov 5 2017, 7:01 AM

Should we be looking at this from a cost-model POV? We could introduce the concept of a cost for the "first use" of a vector type (in this case 512-bit vectors), increase the cost of 512-bit ops, and make the cost models more 'state aware' (use of a 512-bit vector causes a cost-multiplier effect on ALL vector ops).

It's not clear to me that this can be a local decision. It's not just the frequency effect on the core in question; it's also the power draw and how that affects the speed of everything else.

Having said that, the vectorizers don't seem to do a good job of comparing the costs of different vector widths - AVX1 targets (Jaguar/Bulldozer/SandyBridge, etc.) often end up with 256-bit integer vector code despite the fact that the costs already flag the 4x cost compared to the 128-bit integer equivalents (and cause nasty register-spill issues).

Not to get too far off topic, but could you elaborate somewhere? Are there bug reports? If it's 4x the cost and only 2x the width, I'm surprised that we'd get that wrong (assuming that's true for most of the instructions in the loop). I'm curious whether this is a deficiency in the register-pressure estimation heuristic in the vectorizer (which matters only for interleaving, but perhaps that's part of the problem?).

Not to get too far off topic, but could you elaborate somewhere? Are there bug reports? If it's 4x the cost and only 2x the width, I'm surprised that we'd get that wrong (assuming that's true for most of the instructions in the loop). I'm curious whether this is a deficiency in the register-pressure estimation heuristic in the vectorizer (which matters only for interleaving, but perhaps that's part of the problem?).

I'll create a proper bug for this, but an example is in some umin reduction code I'm working on: https://godbolt.org/g/Cb1Crb - with as many subvector extract/insert ops as there are vpmin calls.

craig.topper abandoned this revision. Dec 14 2017, 9:45 AM

Replaced by D41096.