This is an archive of the discontinued LLVM Phabricator instance.

[X86] Add subtarget features prefer-avx256 and prefer-avx128 and use them to limit vector width presented by TTI
Abandoned · Public

Authored by craig.topper on Nov 2 2017, 3:20 PM.

Details

Summary

This is the first patch based on the RFC sent to llvm-dev to enable support for a preferred vector width for the vectorizer. I'll also submit a clang patch shortly to hook it up to the driver.

RFC: http://lists.llvm.org/pipermail/llvm-dev/2017-November/118734.html

This stores the preference as an enum in the subtarget, ordered by increasing strictness, because the autogenerated subtarget code needs to be able to take the max of the encoding values when both prefer-avx128 and prefer-avx256 are specified, in either order.
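
Roughly, the ordering works like this (a sketch only; the enumerator names are illustrative and not necessarily the ones in the patch):

// Sketch: stricter preferences get larger encodings, so merging two
// requested features with std::max picks the stricter one regardless of
// the order they appear on the command line.
#include <algorithm>

enum PreferVectorWidth { NoPreference = 0, Prefer256 = 1, Prefer128 = 2 };

static PreferVectorWidth mergePreference(PreferVectorWidth A,
                                         PreferVectorWidth B) {
  return std::max(A, B);
}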

From initial experiments there's still more work to do to prevent zmm register usage, but this gives us a baseline and plumbing that we can build on.
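
For context, the TTI hook being limited is the register-width query. A simplified sketch of the clamping (not the patch itself; the function and parameter names are made up):

// Simplified sketch, not the actual patch: clamp the vector register width
// that TTI reports to the vectorizer when a narrower width is preferred.
#include <algorithm>

static unsigned getRegisterBitWidthSketch(bool Vector, unsigned LegalWidthBits,
                                          unsigned PreferredWidthBits) {
  if (!Vector)
    return 64; // scalar GPR width on x86-64
  if (PreferredWidthBits != 0) // 0 means no preference was requested
    return std::min(LegalWidthBits, PreferredWidthBits);
  return LegalWidthBits;
}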

What's the best way to test the register width output from TTI in a lit test?

Diff Detail

Event Timeline

craig.topper created this revision. Nov 2 2017, 3:20 PM
DavidKreitzer edited edge metadata. Nov 3 2017, 7:00 AM

This looks reasonable to me, Craig. For testing, the obvious thing would be to test the effect on the LoopVectorizer, since it is the primary consumer of this information.

lib/Target/X86/X86Subtarget.h
64

It might be worth expanding upon this a bit to make it clear why Prefer128 is "stricter" than Prefer256.

I have no objection to this choice of ordering, but it does make the code at lines 535-539 read a little funny.

spatel added a subscriber: spatel. Nov 3 2017, 7:26 AM

Copying a note I sent to the dev-list; I assume we should just continue the discussion here now that we have a patch to review:

Why make this an x86-specific "feature" of the target? We already have options like this in LoopVectorize.cpp:

static cl::opt<unsigned> ForceTargetNumVectorRegs(
    "force-target-num-vector-regs", cl::init(0), cl::Hidden,
    cl::desc("A flag that overrides the target's number of vector registers."));

Can we add an equivalent target-independent override for vector width? Any target with >1 potential register width will benefit from having this option for experimentation.
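
For concreteness, such an override might look something like this (the flag name is hypothetical, not an existing option):

// Hypothetical analogue of ForceTargetNumVectorRegs; the flag name and
// wording are illustrative only.
#include "llvm/Support/CommandLine.h"
using namespace llvm;

static cl::opt<unsigned> ForceTargetVectorWidth(
    "force-target-vector-width", cl::init(0), cl::Hidden,
    cl::desc("A flag that overrides the vector register width (in bits) "
             "reported by the target. Zero means no override."));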

In the longer term we may end up turning -mprefer-avx256 on by default for skylake-avx512, so I think a vectorizer-specific command-line option would prevent us from having that control.

Even with the current patch I'm still seeing larger shuffles from the interleaved-store part of the vectorizer. I think that code was designed to create oversized vectors on the assumption that the backend would legalize them. I'm not sure whether we'll need to stop the vectorizer from doing that, or teach the backend to split 512-bit shuffles when this flag is on even though the type is "legal".

I'm also seeing gathers with an 8 x i32 type and an 8 x i64 index. I suspect the index is i64 because in IR we need it to match the pointer size, but I'm not sure. I'm hoping some known-bits analysis will allow us to narrow this back down in the backend, but if not we may have to split it when -mprefer-avx256 is on.
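
The check I have in mind would be along these lines (illustrative only, not code from the patch):

// Illustrative only: the narrowing test phrased over KnownBits. If the top
// 32 bits of the 64-bit gather index are known zero, an i32 index would be
// equivalent and the gather could be narrowed.
#include "llvm/Support/KnownBits.h"

static bool gatherIndexFitsIn32Bits(const llvm::KnownBits &IndexKnown) {
  return IndexKnown.getBitWidth() == 64 &&
         IndexKnown.countMinLeadingZeros() >= 32;
}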

craig.topper added inline comments. Nov 3 2017, 10:30 AM
lib/Target/X86/X86Subtarget.h
64

It's not really a choice; the tablegen-generated code only knows how to handle boolean features and features that represent a level (like the SSE level). For the level case it always takes the max. That way, if a CPU specifies a particular SSE level and the command line specifies a higher one, the command line will be able to override it.

Though in those cases we make the higher SSE levels imply the lower ones, so if you say no-sse2 we disable all the SSE levels 2 and above. I didn't do that here, so I'm not sure what would happen for the vector width levels.

I think I need to rethink this a bit.

In the longer term we may end up turning -mprefer-avx256 on by default for skylake-avx512, so I think a vectorizer-specific command-line option would prevent us from having that control.

I have to look at how the plumbing works again, but I didn't follow that reasoning. Let me explain how I see this and suggest alternatives to an x86 feature flag.

By exposing this as a clang command-line option and telling programmers to recompile their code and test perf themselves, we admit that neither the hardware nor the compiler can always dynamically or statically determine the optimum vector size, but the hardware and compiler should get this right most of the time (maybe we can even model the warm-up cost of turning on the vector unit?). Therefore, we should give programmers the flexibility to use this option very selectively. Ideally, it would be controlled by a function-level or loop-level pragma.

If that's the right description, then we should model this as metadata or a function attribute. Is there something from the OpenMP pragmas that can be adapted? If we need the backend to know the preference, we have precedent for that situation too: see the function attribute for square-root estimates.
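
As a sketch of that direction (the attribute name below is hypothetical, just following the string-attribute pattern used for the reciprocal-estimate settings):

// Hypothetical sketch only: reading a per-function width preference from a
// string function attribute. The attribute name is illustrative.
#include "llvm/IR/Attributes.h"
#include "llvm/IR/Function.h"

static unsigned getPreferredVectorWidthInBits(const llvm::Function &F) {
  llvm::Attribute A = F.getFnAttribute("prefer-vector-width");
  unsigned Width = 0; // 0 = no preference expressed
  if (A.isStringAttribute())
    if (A.getValueAsString().getAsInteger(10, Width))
      Width = 0; // reset if the attribute value failed to parse
  return Width;
}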

RKSimon edited edge metadata. Nov 5 2017, 4:50 AM

Should we be looking at this from a cost-model POV? We could introduce the concept of a cost for the "first use" of a vector type (in this case 512-bit vectors), increase the cost of 512-bit ops, and make the cost models more 'state aware' (use of a 512-bit vector causes a cost-multiplier effect on ALL vector ops).
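
One way to picture the 'state aware' idea (a purely conceptual sketch, not an existing TTI interface):

// Conceptual sketch only, not an existing TTI hook. Once any 512-bit op has
// been costed for a candidate loop, every later vector op pays a multiplier
// to model the whole-core effect of entering the 512-bit "state".
struct StatefulVectorCost {
  bool Seen512 = false;

  unsigned getCost(unsigned BaseCost, unsigned WidthBits) {
    if (WidthBits >= 512)
      Seen512 = true;
    return Seen512 ? BaseCost * 2 /* hypothetical penalty */ : BaseCost;
  }
};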

Having said that, the vectorizers don't seem to do a good job of comparing the costs of different vector widths - AVX1 targets (Jaguar/Bulldozer/SandyBridge, etc.) often end up with 256-bit integer vector code despite the fact that the costs already flag the 4x cost compared to the 128-bit integer equivalents (and cause nasty register-spill issues).

hfinkel edited edge metadata. Nov 5 2017, 7:01 AM

Should we be looking at this from a cost-model POV? We could introduce the concept of a cost for the "first use" of a vector type (in this case 512-bit vectors), increase the cost of 512-bit ops, and make the cost models more 'state aware' (use of a 512-bit vector causes a cost-multiplier effect on ALL vector ops).

It's not clear to me that this can be a local decision. It's not just the frequency effect on the core in question; it's also the power draw and how that affects the speed of everything else.

Having said that, the vectorizers don't seem to do a good job of comparing the costs of different vector widths - AVX1 targets (Jaguar/Bulldozer/SandyBridge, etc.) often end up with 256-bit integer vector code despite the fact that the costs already flag the 4x cost compared to the 128-bit integer equivalents (and cause nasty register-spill issues).

Not to get too far off topic, but could you elaborate somewhere? Are there bug reports? If it's 4x the cost and only 2x the width, I'm surprised that we'd get that wrong (assuming that's true for most of the instructions in the loop). I'm curious whether this is a deficiency in the register-pressure estimation heuristic in the vectorizer (which matters only for interleaving, but perhaps that's part of the problem?).

Not to get too far off topic, but could you elaborate somewhere? Are there bug reports? If it's 4x the cost and only 2x the width, I'm surprised that we'd get that wrong (assuming that's true for most of the instructions in the loop). I'm curious whether this is a deficiency in the register-pressure estimation heuristic in the vectorizer (which matters only for interleaving, but perhaps that's part of the problem?).

I'll create a proper bug for this, but an example is in some umin reduction code I'm working on: https://godbolt.org/g/Cb1Crb - with as many subvector extract/insert ops as there are vpmin calls.

craig.topper abandoned this revision. Dec 14 2017, 9:45 AM

Replaced by D41096.