This is an archive of the discontinued LLVM Phabricator instance.

[X86] Add pass to infer required-vector-width attribute based on size of function arguments and use of intrinsics
Abandoned · Public

Authored by craig.topper on Feb 14 2018, 11:24 AM.

Details

Summary

This patch adds a pass that determines whether 512-bit registers are definitely needed, based on the sizes of the function arguments it detects. It explicitly ignores target-independent intrinsics, on the assumption that SelectionDAG can split those gracefully.

I'm hoping what's here is enough to protect the ABI, inline assembly, and intrinsic requirements. Is there anything else we need for correctness that I haven't thought of?

This doesn't help with explicit vector code in the original source that gets lowered to native IR without going through target-specific intrinsics. We'll need to set the attribute from clang or an earlier IR pass to detect that. This pass is intended to provide something testable so we can see whether we can enable the vector-width preference and have the legalizer clip to 256 bits safely on real source code. Nothing happens unless -mprefer-vector-width=256 is passed to clang.
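
As a rough illustration, here is a minimal sketch (in current LLVM C++ API terms, not the 2018 patch itself) of how a pass might infer the width from ABI-visible types; the helper names are hypothetical:

#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/Function.h"
#include <algorithm>

static unsigned widthOfType(llvm::Type *Ty) {
  if (auto *VTy = llvm::dyn_cast<llvm::FixedVectorType>(Ty))
    return VTy->getPrimitiveSizeInBits().getFixedValue();
  return 0;
}

static unsigned inferRequiredVectorWidth(const llvm::Function &F) {
  // Arguments and return values participate in the ABI, so a 512-bit
  // vector in either position forces 512-bit registers regardless of
  // any width preference.
  unsigned Width = widthOfType(F.getReturnType());
  for (const llvm::Argument &A : F.args())
    Width = std::max(Width, widthOfType(A.getType()));
  return Width;
}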

Diff Detail

Event Timeline

craig.topper created this revision.Feb 14 2018, 11:24 AM

I'm really not a fan of this approach in general. Can we step back and figure out what we'd like to do with this first? I don't really agree with the need for required-vector-width in the first place.

@echristo, I know we talked some. I know you don't like the attribute, are you also against the legalizer limit?

Fix the name of the attribute to be 'required-vector-width' to match what's already in X86TargetMachine.cpp

lebedev.ri added inline comments.
lib/Target/X86/X86.h
123 ↗(On Diff #135322)

require*d*-vector-width

Fix the attribute name in one more place

Attempting to clarify some things.

Prior to any of my prefer-vector-width changes, the TTI interface informed the loop vectorizer about the maximum register width in hardware via the getRegisterBitWidth method. The loop vectorizer uses this information as a guideline, but not as a strict limit on the types it can create. My understanding is that there are two modes: the default mode finds the largest scalar width among the loads, stores, and phis within the loop, then determines how many elements of that scalar type fit in the register width reported by TTI; this determines the vectorization factor (VF). The second mode, called maximize bandwidth, uses the smallest scalar width instead. I'm not sure whether that's still restricted to loads, stores, and phis.
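
A toy model of the VF computation described above; the real logic lives in LoopVectorize.cpp and is considerably more involved:

// Toy model of the two modes: the default uses the largest scalar width
// seen among loads, stores, and phis; maximize-bandwidth uses the smallest.
unsigned pickVF(unsigned RegisterBitWidth, unsigned LargestScalarBits,
                unsigned SmallestScalarBits, bool MaximizeBandwidth) {
  unsigned Scalar = MaximizeBandwidth ? SmallestScalarBits : LargestScalarBits;
  return RegisterBitWidth / Scalar; // e.g. 256-bit registers, i32 -> VF = 8
}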

So for both modes, if there are any larger scalar types in the loop, those types will be multiplied by the VF, causing the vectorizer to create types that are larger than the maximum hardware register size. These types exist all the way to codegen, and it is up to the SelectionDAG type legalization process or target-specific DAG combines to split these larger operations. In some cases the larger types are used to represent idioms that contain extends and truncates that we are able to combine into more complex instructions without doing a split. X86 probably has more pre-type-legalization target-specific DAG combines here than most other targets, and we do our own splitting for type legalization in some of these combines. This keeps the type legalizer from needing to deal with target-specific nodes, though it is capable of calling ReplaceNodeResults and LowerOperation for them.

In addition to the vectorizer, I believe I've seen evidence of InstCombine creating larger types when it canonicalizes getelementptrs to have pointer-width index types by inserting sign extends. This shows up in vector GEPs used by gather and scatter intrinsics: if the GEP used a v16i32 index, InstCombine will turn it into v16i64. X86 tries to remove some of these extends with a DAG combine, but we're not great at finding a uniform base address for gathers, which causes the extend to be hidden. So we end up splitting gathers whose result type is the maximum hardware register width but whose index is twice as large.

Given this current behavior of IR passes with respect to hardware vector width, and the fact that codegen already has to handle illegal types, the first part of my changes (prefer-vector-width) makes the vectorizer behave the same for prefer-vector-width=256 on an AVX512F target as on an AVX2 target, by making getRegisterBitWidth return 256 in both cases. So under prefer-vector-width=256 the vectorizer will calculate the same VF it would for an AVX2 target, and it will still generate wider types for some things, just as it does for AVX2. With just prefer-vector-width, when the IR reaches codegen we still end up emitting 512-bit operations for these wider types, because the type legalization process can't enforce prefer-vector-width: codegen has no knowledge of where the vectors in the IR came from. They may have come from the vectorizer, or they may have been present in the original source code. If they were in the original source, they might be used by target-specific intrinsics, function parameters, or return types. In those cases we must emit 512-bit vectors, or the intrinsics won't assemble, or we could end up with a mismatched ABI between two translation units if one is compiled with prefer-vector-width and one isn't.

The patch presented here overcomes this limitation by providing a required-vector-width attribute that can be used to tell codegen there is nothing in the IR that strictly requires wider vectors to avoid compilation failures or ABI mismatches. With it, we can tell the type legalization process and the X86 DAG combines that only 256-bit vectors are legal, so type legalization proceeds much as it does for AVX2, splitting the wider types from the vectorizer in the same way. But we also keep the AVX512VL-specific instructions, like scatter and the masked operations, that aren't present in AVX2.
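
A hedged sketch of how codegen might consult such an attribute; the helper is hypothetical, and the default of 512 is just the conservative full hardware width:

#include "llvm/IR/Function.h"

// Returns the width codegen may treat as the widest legal vector type.
// Without the attribute we conservatively assume the full hardware width.
static unsigned getRequiredVectorWidth(const llvm::Function &F) {
  unsigned Width = 512; // conservative default: full AVX-512 width
  llvm::Attribute A = F.getFnAttribute("required-vector-width");
  if (A.isStringAttribute())
    A.getValueAsString().getAsInteger(0, Width);
  return Width;
}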

This patch isn't a perfect solution, because it doesn't capture source code containing vectors that don't pass through intrinsics, function arguments, or return values. But we could emit the same function attribute earlier, in the frontend, to convey that information.

So, based on my conversations with others, there seem to be some questions about the baseline assumptions here:

-Should the vectorizer be creating wider vectors than what getRegisterBitWidth returns? Changing this would likely mean introducing splitting into the vectorizer to avoid hurting SSE/AVX/AVX2 vectorized code. But I don't think that removes the need for the backend to handle illegal types in IR, so we would just be introducing another type legalization process.
-Some idioms require larger types to represent in IR. For example, average uses (trunc (lshr (add (add (zext X, zext Y), 1), 1))), but X86 can do that with a single instruction; the 512-bit version of that instruction requires zexts wider than 512 bits in IR (see the scalar sketch after this list).
-Should InstCombine be creating wide illegal vector types to canonicalize GEPs? Are there other places or passes that do similar things?
-Is X86 too aggressive about creating target-specific nodes early, requiring many places to do custom type legalization? The checks are getting centralized in SplitBinaryOpsAndApply, so hopefully we won't have to think about all the different split sizes in multiple places going forward.
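
To make the average bullet concrete, here is the idiom in scalar form (a C++ sketch; the 512-bit mapping to vpavgb is as described above):

#include <cstdint>

// Scalar form of the rounding average. The sum x + y + 1 can overflow
// 8 bits, so IR must compute it in a wider type (zext to i16) before
// truncating back. Vectorized to 64 x i8 (512 bits), the zexted
// intermediates are 1024-bit vector types in IR, even though vpavgb
// performs the whole idiom in one 512-bit instruction.
uint8_t roundingAvg(uint8_t x, uint8_t y) {
  return static_cast<uint8_t>((uint16_t(x) + uint16_t(y) + 1) >> 1);
}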

What I'm looking for is a way to guarantee that with prefer-vector-width=256, starting from scalar code and vectorizing it, we won't end up with 512-bit registers in the output. Even a few of them are enough to trigger the frequency-reduction penalty I am trying to avoid. Given the current behavior of IR passes and their expectations of what codegen can handle, I don't see a way to make this guarantee from IR, and it would be easy to break in the future as new passes are added or existing ones are modified. The codegen type legalization process seems the best place to handle this, but we need to be able to communicate when it is safe to do so.

Would this patch be more acceptable in the near term if I did one of the following?
-Get rid of the required-vector-width attribute and have this pass force a prefer-vector-width attribute to be at least as large as any ABI or intrinsic constraints, then make X86 codegen use prefer-vector-width to control the type legalizer. This would mean some places in X86 codegen would create new 512-bit operations that weren't in the IR, since we'd be overriding the preference.
-Rename the required-vector-width attribute to x86-required-vector-width so that it is clear it is x86-specific for now. We would only give it a generic name if we decide to emit it earlier than X86 codegen.

Either of these would allow people to start experimenting with prefer-vector-width=256. We would still need to revisit this and make changes before making prefer-vector-width=256 the default for SKX.

So for both modes, if there are any larger scalar types in the loop, those types will be multiplied by the VF, causing the vectorizer to create types that are larger than the maximum hardware register size. These types exist all the way to codegen, and it is up to the SelectionDAG type legalization process or target-specific DAG combines to split these larger operations. In some cases the larger types are used to represent idioms that contain extends and truncates that we are able to combine into more complex instructions without doing a split. X86 probably has more pre-type-legalization target-specific DAG combines here than most other targets, and we do our own splitting for type legalization in some of these combines. This keeps the type legalizer from needing to deal with target-specific nodes, though it is capable of calling ReplaceNodeResults and LowerOperation for them.

This is interesting, and perhaps not what we should do.

In addition to the vectorizer, I believe I've seen evidence of InstCombine creating larger types when it canonicalizes getelementptrs to have pointer-width index types by inserting sign extends. This shows up in vector GEPs used by gather and scatter intrinsics: if the GEP used a v16i32 index, InstCombine will turn it into v16i64. X86 tries to remove some of these extends with a DAG combine, but we're not great at finding a uniform base address for gathers, which causes the extend to be hidden. So we end up splitting gathers whose result type is the maximum hardware register width but whose index is twice as large.

This is perhaps also not what we should do with vector-accessed operations.

Given this current behavior of IR passes with respect to hardware vector width, and the fact that codegen already has to handle illegal types, the first part of my changes (prefer-vector-width) makes the vectorizer behave the same for prefer-vector-width=256 on an AVX512F target as on an AVX2 target, by making getRegisterBitWidth return 256 in both cases. So under prefer-vector-width=256 the vectorizer will calculate the same VF it would for an AVX2 …

The patch presented here overcomes this limitation by providing a required-vector-width attribute that can be used to tell codegen there is nothing in the IR that strictly requires wider vectors to avoid compilation failures or ABI mismatches. With it, we can tell the type legalization process and the X86 DAG combines that only 256-bit vectors are legal, so type legalization proceeds much as it does for AVX2, splitting the wider types from the vectorizer in the same way. But we also keep the AVX512VL-specific instructions, like scatter and the masked operations, that aren't present in AVX2.

I don't think this part is what we should do. We shouldn't change the size of a register, but rather change the optimizers to prefer smaller things. Legal versus preferred is a big difference.

What I'm looking for is a way to guarantee that with prefer-vector-width=256, starting from scalar code and vectorizing it, we won't end up with 512-bit registers in the output. Even a few of them are enough to trigger the frequency-reduction penalty I am trying to avoid. Given the current behavior of IR passes and their expectations of what codegen can handle, I don't see a way to make this guarantee from IR, and it would be easy to break in the future as new passes are added or existing ones are modified. The codegen type legalization process seems the best place to handle this, but we need to be able to communicate when it is safe to do so.

I think it might be a lot of work, but possible. I think we might abuse the legalizer here too much. For example, let's say I want to limit the vector size to 256 bits, but in some specific code I use intrinsics or other explicit vector code to ask for 512-bit vectors. I should still be able to get that, I think? Right now you'll say that the code isn't legal and try to lower it.

-eric

So for both modes, if there are any larger scalar types in the loop, those types will be multiplied by the VF, causing the vectorizer to create types that are larger than the maximum hardware register size. These types exist all the way to codegen, and it is up to the SelectionDAG type legalization process or target-specific DAG combines to split these larger operations. In some cases the larger types are used to represent idioms that contain extends and truncates that we are able to combine into more complex instructions without doing a split. X86 probably has more pre-type-legalization target-specific DAG combines here than most other targets, and we do our own splitting for type legalization in some of these combines. This keeps the type legalizer from needing to deal with target-specific nodes, though it is capable of calling ReplaceNodeResults and LowerOperation for them.

This is interesting, and perhaps not what we should do.

FWIW, I really like the ability of passes to create larger-than-legal types and rely on common type legalization to split and fit them into useful vector registers. This proves very useful. That's only one part of what this paragraph talks about, though, and the rest I don't have a strong opinion about.

In addition to the vectorizer, I believe I've seen evidence of InstCombine creating larger types when it canonicalizes getelementptrs to have pointer-width index types by inserting sign extends. This shows up in vector GEPs used by gather and scatter intrinsics: if the GEP used a v16i32 index, InstCombine will turn it into v16i64. X86 tries to remove some of these extends with a DAG combine, but we're not great at finding a uniform base address for gathers, which causes the extend to be hidden. So we end up splitting gathers whose result type is the maximum hardware register width but whose index is twice as large.

This is perhaps also not what we should do with vector-accessed operations.

This does seem much more questionable. IMO, not because of the illegal vector types, but because extending the scalar component seems much more troubling in a vector-GEP context than in a scalar-GEP context, regardless of the arity of the vector.

Given this current behavior of IR passes with respect to hardware vector width, and the fact that codegen already has to handle illegal types, the first part of my changes (prefer-vector-width) makes the vectorizer behave the same for prefer-vector-width=256 on an AVX512F target as on an AVX2 target, by making getRegisterBitWidth return 256 in both cases. So under prefer-vector-width=256 the vectorizer will calculate the same VF it would for an AVX2 …

The patch presented here overcomes this limitation by providing a required-vector-width attribute that can be used to tell codegen there is nothing in the IR that strictly requires wider vectors to avoid compilation failures or ABI mismatches. With it, we can tell the type legalization process and the X86 DAG combines that only 256-bit vectors are legal, so type legalization proceeds much as it does for AVX2, splitting the wider types from the vectorizer in the same way. But we also keep the AVX512VL-specific instructions, like scatter and the masked operations, that aren't present in AVX2.

I don't think this part is what we should do. We shouldn't change the size of a register, but rather change the optimizers to prefer smaller things. Legal versus preferred is a big difference.

I disagree here... I think type legalization is a necessary tool to meaningfully achieve a preferred type. If we removed the ability of the current type legalizer logic to handle this, I think we would end up building a mini legalizer elsewhere in the optimizer, and that doesn't seem ideal. I also think there is a better way to split this hair...

What I'm looking for is a way to guarantee that with prefer-vector-width=256, starting from scalar code and vectorizing it, we won't end up with 512-bit registers in the output. Even a few of them are enough to trigger the frequency-reduction penalty I am trying to avoid. Given the current behavior of IR passes and their expectations of what codegen can handle, I don't see a way to make this guarantee from IR, and it would be easy to break in the future as new passes are added or existing ones are modified. The codegen type legalization process seems the best place to handle this, but we need to be able to communicate when it is safe to do so.

I think this core requirement gets at the solution I would propose to this.

I think at the IR level we should simply replace the preferred width with required width. We should require the frontend to pick the width that it wants code generated for, and stick to it.

Then we can demand that the frontend, if it wants to explicitly generate 512-bit code through things like intrinsics, *increases* the required width for that function. This allows explicit AVX-512 _mm_* intrinsics to work correctly.

Then we teach the inliner how to cope with this intelligently. My guess is that this won't be trivial to do because of patterns like:

if (DynamicTestIfTargetHasAVX512()) {
  CallFunctionWithAVX512Instrinsics();
} else {
  CallFunctionWithLoopThatShouldBeAutoVectorizedTo256BitVectors();
}

Here, both naive solutions may prove unfortunate: increasing the caller's required width in the IR to match the callee's and allowing the inlining might in theory regress the generated code for the other function, but blocking the inlining would almost certainly regress the code.

However, I think in practice inlining will be the correct call for the vast majority of cases. Most of the really interesting cases won't be inline candidates anyway, due to the different *subtargets* that already influence inlining decisions. So generally, I think it is safe to let the inliner take the max of the required widths and proceed to inline. If we find interesting cases that require more complex logic, we can address them once we have a test case.
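
A hypothetical sketch of that max-merge rule; this is not the actual inliner code, and the attribute spelling follows this thread:

#include "llvm/IR/Function.h"
#include <algorithm>
#include <string>

static unsigned widthAttr(const llvm::Function &F) {
  unsigned W = 0;
  llvm::Attribute A = F.getFnAttribute("required-vector-width");
  if (A.isStringAttribute())
    A.getValueAsString().getAsInteger(0, W);
  return W;
}

// On inlining, the caller must be prepared to hold the callee's widest
// vectors, so take the max of the two widths.
static void mergeRequiredWidth(llvm::Function &Caller,
                               const llvm::Function &Callee) {
  unsigned Merged = std::max(widthAttr(Caller), widthAttr(Callee));
  if (Merged)
    Caller.addFnAttr("required-vector-width", std::to_string(Merged));
}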

One possible alternative is that we could make an explicit vs. implicit distinction and allow inlining to increase an implicit width, but block inlining that would increase an explicitly narrow width. This might make it easier to write code like:

if (NumberOfItemsIsVeryLarge) {
  CallFunctionWithWideVectorsButLowerFreq();
} else {
  CallFunctionWithNarrowVectorsAndHigherFreq();
}

The critical result here is that if the user code *already contains* explicit 512-bit vector code, it seems reasonable for the vectorizer to also produce 512-bit vector code. So I don't think the IR needs to model a preference and a requirement independently for the purposes of the code generator. At most, it might want that distinction for the purpose of controlling inlining and merging.

One thing I should clarify here:

What I'm imagining is reasonable to be per-subtarget, and to inform the legalizer, is the *set of legal vector types*. This is subtly different from the *size of legal registers*. Does that still make sense?

One thing I should clarify here:

What I'm imagining is reasonable to be per-subtarget, and to inform the legalizer, is the *set of legal vector types*. This is subtly different from the *size of legal registers*. Does that still make sense?

This split-hair makes sense to me and works for what I'm imagining we do.

Remove the required-vector-width attribute related code. Change legalization to be based on prefer-vector-width.

Add a pre-codegen IR pass to force the preferred vector width for any constructs that we know would fail isel if the preferred vector width were set to something smaller than required. The goal here is to protect correctness while I work on adding the attribute to clang's frontend. It will also protect against bad IR. I've also used it to experiment with running all X86 lit tests with the preferred vector width forced to 256 to make sure nothing crashes or causes an isel failure.
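
A sketch of that "force the preference" behavior, using only ABI-visible vector arguments as a stand-in for the full set of constructs the pass would check (intrinsics, inline assembly, and so on); the names are illustrative:

#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/Function.h"
#include <algorithm>
#include <string>

// If the function contains something that cannot be narrowed (here just
// 512-bit ABI-visible vectors), raise prefer-vector-width to match so
// that isel cannot fail.
static void forcePreferredWidth(llvm::Function &F) {
  unsigned Needed = 0;
  for (const llvm::Argument &A : F.args())
    if (auto *VTy = llvm::dyn_cast<llvm::FixedVectorType>(A.getType()))
      Needed = std::max<unsigned>(
          Needed, VTy->getPrimitiveSizeInBits().getFixedValue());

  unsigned Preferred = 0;
  llvm::Attribute Attr = F.getFnAttribute("prefer-vector-width");
  if (Attr.isStringAttribute())
    Attr.getValueAsString().getAsInteger(0, Preferred);

  if (Needed > Preferred)
    F.addFnAttr("prefer-vector-width", std::to_string(Needed));
}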

I do wonder if we should have some kind of disable flag for the legalization change, independent of the preference. Is it useful to be able to test constraining TTI to a preferred vector width without constraining the legalizer? It could be useful if the IR pass added here doesn't catch everything that affects correctness.

Next steps:
-Make sure the TTI cost model understands whether 512 bits are legal, based on the preference, at the time the vectorizer runs. In particular, it looks like X86TTIImpl::getCastInstrCost might not handle this correctly, as most of its lookups aren't done with the result of getTypeLegalizationCost.
-Teach the inliner to merge the attribute.

Given the direction this is going, I'd suggest renaming the attribute. I'm thinking something like legal_vector_width, or otherwise framing this as fundamentally a legalization constraint rather than anything more fundamental. I would also really suggest doing more than just one width shift in this patch, because that will help ensure the mechanism works generally, as intended.

I have a bunch of other nit-picky comments but can save those until this is more baked and you're looking to polish.

One last, somewhat longer-term question: have you thought about making this more targeted? Specifically, rather than making it a *type* legalization parameter, how crazy would it be to make it apply only to specific operations? I'm interested in potentially preferring narrower vector widths only for non-mov instructions, so that loading and storing wider types can continue to occur. Thoughts?

lib/Target/X86/X86VectorWidthInfer.cpp
58–62 ↗(On Diff #140358)

I think if you just delete this code, everything below already works for arbitrary vector widths?

116–117 ↗(On Diff #140358)

You should be able to bound this (in both directions) by looking at the subtarget's register sizes. Which would somewhat illustrate that this is just parameterizing *legalization*, not the underlying physical machine description.
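
A tiny sketch of the clamping suggested here; the 128/512 bounds are illustrative for an AVX-512 subtarget, and the helper is hypothetical:

#include <algorithm>

// Clamp a requested legalization width to what the subtarget can provide:
// never below 128 bits (XMM) and never above 512 bits (ZMM). This
// illustrates that we are parameterizing legalization, not describing
// the physical machine.
unsigned clampLegalWidth(unsigned Requested) {
  return std::clamp(Requested, 128u, 512u);
}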

-Rename required-vector-width to legal-vector-width.
-Add a command line option to use prefer-vector-width for legal-vector-width if legal-vector-width is not already provided.
-Simplify test to mostly just use the command line option so we don't have to repeat the test bodies with different attributes.

The intention here is to work towards removing prefer-vector-width in favor of legal-vector-width. This will require clang changes to set the legal-vector-width attribute. The command line option added here enables experimenting with the backend support while the clang changes are being developed.

craig.topper abandoned this revision.Oct 25 2018, 2:37 PM
test/CodeGen/X86/legal-vector-width.ll