This is an archive of the discontinued LLVM Phabricator instance.

[X86] Initial support for prefer-vector-width function attribute
AbandonedPublic

Authored by craig.topper on Dec 11 2017, 3:18 PM.

Details

Summary

This patch adds initial support for the prefer-vector-width function attribute by processing it in getSubtargetImpl and translating it into a subtarget feature, like we do for soft float.

I've implemented it this way specifically to allow skylake-avx512 to eventually enable this feature by default by adding the subtarget feature to the CPU definition. If the attribute isn't present, we'll take the CPU default; if the attribute is present and specifies larger than 256 bits, we'll append a -disable-prefer-avx256 to the feature string, which will override the CPU default.
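For concreteness, here is a standalone sketch of that feature-string translation. It is illustrative only: the helper name is made up, and the feature spelling is taken verbatim from the summary above.

#include <string>

// Hedged sketch: mimic getSubtargetImpl's handling of prefer-vector-width.
// A missing attribute keeps the CPU default; a width above 256 appends the
// overriding feature to the feature string.
static std::string applyPreferVectorWidth(std::string FS, bool AttrPresent,
                                          unsigned AttrWidth) {
  if (AttrPresent && AttrWidth > 256) {
    if (!FS.empty())
      FS += ",";
    FS += "-disable-prefer-avx256"; // spelling from the summary; overrides the CPU default
  }
  return FS;
}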

I've then passed this information out to TTI's getRegisterBitWidth() method. We probably also want to add support for the function attribute by itself to the vectorizers so that it works for non-x86 targets, but I've left that for a separate patch since it's not directly required by my final goal and I'm less familiar with the vectorizers. X86 will still need to expose something via subtarget/TTI no matter what due to the skylake-avx512 requirement.
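As a rough illustration of the TTI side, a sketch along the lines of the existing X86 width checks; preferAVX256() is an assumed accessor for the new subtarget feature, not code from this patch:

unsigned X86TTIImpl::getRegisterBitWidth(bool Vector) const {
  if (Vector) {
    if (ST->hasAVX512() && !ST->preferAVX256()) // assumed accessor
      return 512; // only report 512 when ZMM usage is actually wanted
    if (ST->hasAVX())
      return 256;
    if (ST->hasSSE1())
      return 128;
    return 0;
  }
  return ST->is64Bit() ? 64 : 32;
}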

After this patch, I plan to start using this subtarget feature in X86ISelLowering.cpp to tell the type legalizer and assorted lowering code not to use 512-bit vectors. This seems the easiest way to ensure no 512-bit vectors are used and still allow the vectorizer to use larger types for interleaved accesses and other things. This will make the code as similar to AVX2 legalization as possible while still allowing xmm16-31, masking, gather/scatter, etc.

In order to support user code that uses target specific intrinsics that require wider vectors, I plan to add an X86 IR pass just before isel that will detect such intrinsics and explicitly add a prefer-vector-width=512 function attribute or replace an existing lower attribute with a higher value. This way we won't constrain the legalizer and will allow the wider types. Unfortunately, AVX512 C intrinsics that we can represent with native IR would not trigger this pass and would be subject to being split by legalization.
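A hedged sketch of what such a pass might look like, using the LLVM API of the time (e.g. arg_operands() and an unsigned getPrimitiveSizeInBits()); the pass name and the exact width computation are illustrative, not part of this patch:

#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/InstIterator.h"
#include "llvm/IR/IntrinsicInst.h"
#include "llvm/Pass.h"
#include <algorithm>
#include <string>
using namespace llvm;

namespace {
struct X86RaisePreferVectorWidth : public FunctionPass {
  static char ID;
  X86RaisePreferVectorWidth() : FunctionPass(ID) {}

  bool runOnFunction(Function &F) override {
    unsigned Needed = 0;
    auto Update = [&](Type *T) {
      if (auto *VT = dyn_cast<VectorType>(T))
        Needed = std::max(Needed, (unsigned)VT->getPrimitiveSizeInBits());
    };
    for (Instruction &I : instructions(F)) {
      auto *II = dyn_cast<IntrinsicInst>(&I);
      // Only x86-specific intrinsics force a width here; native IR vector
      // code deliberately does not (the gap discussed above).
      if (!II || !II->getCalledFunction()->getName().startswith("llvm.x86."))
        continue;
      Update(II->getType());
      for (Value *Op : II->arg_operands())
        Update(Op->getType());
    }
    unsigned Existing = 0;
    F.getFnAttribute("prefer-vector-width").getValueAsString()
        .getAsInteger(0, Existing);
    if (Needed <= Existing)
      return false; // existing preference already covers the intrinsics
    F.addFnAttr("prefer-vector-width", std::to_string(Needed));
    return true;
  }
};
} // end anonymous namespace

char X86RaisePreferVectorWidth::ID = 0;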

Longer term we should add an IR pass earlier in the pipeline that alters the vector width based on any vectors that were present in the original IR. This would fix the native IR based intrinsic problem mentioned above. We would probably still need the X86 specific pass as a protection against not running the IR optimization pipeline.

This plan is derived from a conversation I had with Chandler and Eric Christopher on IRC a few weeks ago. Hopefully I've correctly captured what they were suggesting.

Diff Detail

Event Timeline

craig.topper created this revision. Dec 11 2017, 3:18 PM
craig.topper added subscribers: hsaito, egarcia.
echristo edited edge metadata. Dec 11 2017, 4:32 PM

I think one of my concerns here is that this also shows up on haswell with > 128-bit vectors. How would you want to fit that in here? A "Prefer128SIMD"?

Yeah we should be able to add another subtarget feature and process it in getSubtargetImpl as well.

I don't think we'll want to enable it by default on any Intel CPUs though since the issue there isn't as severe. So hopefully 256 is still a win in many workloads. And I don't know if I can sign up to do the work to implement the X86 isel legalization support that it would require to fully enforce it.

After this patch, I plan to start using this subtarget feature in X86ISelLowering.cpp to tell the type legalizer and assorted lowering code not to use 512-bit vectors.

I'm not sure this is a good idea. As you mention, it gets complicated around user code which uses intrinsics, and it also gets complicated in cases where the ABI requires zmm registers. And I'm not sure what benefit you're getting. Are you worried about DAGCombine introducing 512-bit vector operations? The vectorizer using 512-bit vectors even though the target says not to? Or some pre-isel pass other than the vectorizer randomly deciding to use 512-bit operations? Or something else I'm not thinking of?

The loop vectorizer definitely creates wider vectors even though it's told not to.

For one, it only considers the scalar types of loads, stores, and phis when determining the VF. So if all your loads/stores use i32, but some operations like compares or address calculations use i64 types due to zext/sext, the vectorizer doesn't see them when determining the VF. I don't know enough about the vectorizer to say if that should be fixed or not.
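For example, in an illustrative loop like this (not from the review), the i32 memory accesses drive the VF while the i64 induction and compare arithmetic gets widened right past the target's stated maximum:

// i32 loads/stores determine the VF; the i64 induction/compare math is
// invisible to that computation and ends up wider than the preference.
void scale(int *a, const int *b, long long n) {
  for (long long i = 0; i < n; ++i) // i64 induction and compare
    a[i] = 2 * b[i];                // i32 memory accesses
}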

There's also the interleaved load/store optimization in the vectorizer that very deliberately creates large loads, stores, and shuffles.

Good point on the ABI requirements.

hfinkel edited edge metadata.Dec 11 2017, 5:57 PM

The loop vectorizer definitely creates wider vectors even though it's told not to.

For one, it only considers the scalar types of loads, stores, and phis when determining the VF. So if all your loads/stores use i32, but some operations like compares or address calculations use i64 types due to zext/sext, the vectorizer doesn't see them when determining the VF. I don't know enough about the vectorizer to say if that should be fixed or not.

Interesting. Under normal circumstances, vectorizing for the smallest/smaller type can make sense. This way you maximally use the vector lanes at all points in the calculation, and the larger types just take more than one underlying register. If using wider vectors affects the clock rate, for example, there's a large unaccounted-for cost (it's not a splitting cost, but an overall, potentially-large, penalty).

There's also the interleaved load/store optimization in the vectorizer that very deliberately creates large loads, stores, and shuffles.

Good point on the ABI requirements.

I've read the description, but it's still not clear to me what the desired result should be for this:

#include <immintrin.h>

__m512i add16elts(__m512i x, __m512i y) {
  return _mm512_add_epi32(x, y);
}

$ ./clang add512.c -O1 -S -o - -mavx512f -mprefer-vector-width=256 | grep padd
vpaddd %zmm0, %zmm1, %zmm0

I understand this wouldn't be affected by this patch as-is, but do we want the pref to override source that explicitly asked for a certain vector op?

@spatel based on my conversation with Chandler and Eric, I think they said we should have a pass before the vectorizers that detects any explicit vector code in the IR. If it finds any vector code, the pass would override the prefer-vector-width attribute to the length of the widest vector in that code.

If it finds any vector code, the pass would override the prefer-vector-width attribute to the length of the widest vector in that code.

That makes sense... sort of. If this target-specific pass runs as part of the optimization pass pipeline (as opposed to the codegen pass pipeline), we need to make sure we generate correct code even if the optimization pass doesn't run (if someone is using -O0, opt-bisect, etc.). So we probably need two attributes: one clang attribute "try to avoid zmm registers if possible" and one attribute "disable all avx-512 instructions which use zmm registers", and we only set the second attribute if we've analyzed the function and proved we don't need to use zmm registers.

Minor typo: I meant "one function attribute set by clang", not "one clang attribute".
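A hedged sketch of that two-attribute split (both attribute names here are placeholders from this discussion, not an agreed interface): the first is the soft preference set by the frontend; the second is only added once analysis has proven the function needs no zmm registers, so codegen stays correct even if the analysis pass never runs.

#include "llvm/IR/Function.h"

// Illustrative only: stamp the soft preference always, the hard guarantee
// only when proven safe (e.g. at -O0 or under opt-bisect the proof never
// happens and only the soft preference is present).
static void markFunction(llvm::Function &F, bool ProvedNoZMMNeeded) {
  F.addFnAttr("prefer-vector-width", "256"); // soft: try to avoid zmm
  if (ProvedNoZMMNeeded)
    F.addFnAttr("legal-vector-width", "256"); // hard: legalizer may rely on it
}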

If it finds any vector code, the pass would override the prefer-vector-width attribute to the length of the widest vector in that code.

That makes sense... sort of. If this target-specific pass runs as part of the optimization pass pipeline (as opposed to the codegen pass pipeline), we need to make sure we generate correct code even if the optimization pass doesn't run (if someone is using -O0, opt-bisect, etc.). So we probably need two attributes: one clang attribute "try to avoid zmm registers if possible" and one attribute "disable all avx-512 instructions which use zmm registers", and we only set the second attribute if we've analyzed the function and proved we don't need to use zmm registers.

Hrmm. I thought that this "preference" attribute was only going to affect TTI and other cost-model-based, optimizer-driven vectorization, not the lowering of explicit vectors. Does it do the latter?

If it finds any vector code, the pass would override the prefer-vector-width attribute to the length of the widest vector in that code.

That makes sense... sort of. If this target-specific pass runs as part of the optimization pass pipeline (as opposed to the codegen pass pipeline), we need to make sure we generate correct code even if the optimization pass doesn't run (if someone is using -O0, opt-bisect, etc.). So we probably need two attributes: one clang attribute "try to avoid zmm registers if possible" and one attribute "disable all avx-512 instructions which use zmm registers", and we only set the second attribute if we've analyzed the function and proved we don't need to use zmm registers.

Hrmm. I thought that this "preference" attribute was only going to affect TTI and other cost-model-based, optimizer-driven vectorization, not the lowering of explicit vectors. Does it do the latter?

In other words, just because I have some explicit vector code in some part of a function, that doesn't mean I want to vectorize to that vector width elsewhere in the function. If the problem is that, because figuring out when to use wide vectors is hard because it can affect the clock rate, etc., then that's still true regardless of what some other code path in a function might do.

@hfinkel, my goal is to get skylake-avx512 codegen as close to avx2 codegen as possible while still allowing avx512vl features like masking, scatter, gather, etc. The current code generated by the loop vectorizer on avx2 will contain vectors larger than 256 bits even though the TTI interface says the max width is 256. Those large vectors will be split by the legalizer in codegen.

The easiest way I see to match close to avx2 codegen is to get the type legalizer to behave as if 512-bit vectors aren't legal so that everything will be split in a similar way. But doing that requires making sure nothing exists in the IR that truly requires 512-bit vectors. Function arguments and 512-bit x86-specific intrinsics are the big things that we can't codegen correctly with the legalizer constrained.

Lying to the legalizer would also have the effect of splitting user code that doesn't use x86 specific intrinsics and instead uses native IR.

@hfinkel, sorry our updates passed each other.

I see your point; we shouldn't release the preference just because the user did something explicit. So in that case we would need to unconstrain the legalizer, but still keep the TTI interface reporting 256 so the vectorizer won't go out of its way to generate large vectors in the same function. And potentially add more cost modeling enhancements and spot fixes to the codegen lowering.

So we probably do need another function attribute to indicate the safe width for the legalizer.

@hfinkel, sorry our updates passed each other.

I see your point; we shouldn't release the preference just because the user did something explicit. So in that case we would need to unconstrain the legalizer, but still keep the TTI interface reporting 256 so the vectorizer won't go out of its way to generate large vectors in the same function. And potentially add more cost modeling enhancements and spot fixes to the codegen lowering.

So we probably do need another function attribute to indicate the safe width for the legalizer.

I think that we should have the preference separate from legalization, but also have a hard cap that the vectorizer uses. The vectorizer will generate larger vectors than the current max for loops with a mixture of types, under the assumption that it's better to fill the vectors of the smaller types, even if that means legalization splitting for the vectors of larger types. We could cap this behavior, or have the vectorizer do manual legalization in that case. I think that the latter is probably the best that we can do (but a cap will do in the meantime).

How do you propose to control the cap? I don't think we want to default it to 256 for skx as that would make our codegen worse than (or at the very least very different from) our avx2 codegen.

@hfinkel, sorry our updates passed each other.

I see your point; we shouldn't release the preference just because the user did something explicit. So in that case we would need to unconstrain the legalizer, but still keep the TTI interface reporting 256 so the vectorizer won't go out of its way to generate large vectors in the same function. And potentially add more cost modeling enhancements and spot fixes to the codegen lowering.

So we probably do need another function attribute to indicate the safe width for the legalizer.

I think that we should have the preference separate from legalization, but also have a hard cap that the vectorizer uses. The vectorizer will generate larger vectors than the current max for loops with a mixture of types, under the assumption that it's better to fill the vectors of the smaller types, even if that means legalization splitting for the vectors of larger types. We could cap this behavior, or have the vectorizer do manual legalization in that case. I think that the latter is probably the best that we can do (but a cap will do in the meantime).

Also, as I understand it, the other complication is that there is some AVX-512 usage that's okay (specifically, things that are simple (i.e., no floating point, and no integer multiplication, but a lot of other integer ops)). Maybe we want to account for that somehow?

How do you propose to control the cap? I don't think we want to default it to 256 for skx as that would make our codegen worse than (or at the very least very different from) our avx2 codegen.

I really do want to make sure that I understand the problem (before I continue suggesting solutions). To summarize:

  • AVX-512 is twice the length of AVX2, and so using AVX-512 over AVX2 should give a 2x speedup, but...

Skylake microarchitecture has two port schemes, one for using 256-bit or less registers, and another for using 512-bit registers.

When using registers up to or including 256 bits, FMA operations dispatch to ports 0 and 1 and SIMD operations dispatch to ports 0, 1 and 5. When using 512-bit register operations, both FMA and SIMD operations dispatch to ports 0 and 5.
The maximum register width in the reservation station (RS) determines the 256 or 512 port scheme. Notice that when you use AVX-512 encoded instructions with YMM registers, the instructions are considered to be 256-bit wide.

The result of the 512-bit port scheme is that XMM or YMM code dispatches to 2 ports (0 and 5) instead of 3 ports (0, 1 and 5) and may have lower throughput and longer latency compared to the 256-bit port scheme.

  • But there's an additional complication, discussed in 15.20 of the optimization manual:

Some processors based on Skylake microarchitecture have two Intel AVX-512 FMA units, on ports 0 and 5, while other processors based on Skylake microarchitecture have a single Intel AVX-512 FMA unit, which is located on port 0.

Code that is optimized to run on a processor with two FMA units might not be optimal when run on a processor with one FMA unit.

If these are the relevant causes of the problem, then I suggest that we do the following:

  1. We really have two different Skylake cores for optimization purposes: The ones that execute AVX-512 only on port 0/1, and the ones that also can execute AVX-512 on port 5. We should stop pretending that these are the same cores, calling them both skylake-avx512, and call the former something else (say skylake-avx512-1p). These cores should have different scheduling models. We should benchmark these differently, and for the port-0-only variant, we should increase the TTI costs for all AVX-512 instructions by a factor of 2 (which is right because TTI returns reciprocal throughputs for vectorization), or maybe a little more than 2 (to account for the extra ILP constraints) if necessary.
  2. We need a way for the X86 backend to penalize mixing 512-bit vectors with smaller vector types. It can do this by increasing the cost of the smaller non-FMA vector operations, when 512-bit vectors are around, by 30% (to account for the fact that, when 512-bit instructions are around, the peak throughput of those instructions is decreased from 3/cycle to 2/cycle). We can add a TTI interface that allows the target to analyze the loop, generate some state object, and then use that state object when generating the per-instruction costs.
  3. We need a way for the target to specify a penalty factor for a loop (etc.) when vectorizing. Right now, if there appears to be a positive speedup, then the associated vectorization factor is chosen. To adjust for the clock-rate decrease, if we're generating 512-bit instructions, we should apply a penalty factor of, say, 0.7, so that estimating that vectorization will be profitable includes the effect of the (potentially) decreasing clock rate.
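A hedged sketch of point 3 (the 0.7 factor is the suggestion above; the helper and its inputs are hypothetical):

// Illustrative only: discount the estimated speedup when 512-bit vectors are
// involved, so the profitability decision accounts for possible frequency
// throttling rather than just instruction counts.
static bool isVectorizationProfitable(float ScalarCost, float VectorCost,
                                      unsigned WidestVectorBits) {
  float Speedup = ScalarCost / VectorCost;
  if (WidestVectorBits >= 512)
    Speedup *= 0.7f; // penalty factor for the potential clock-rate decrease
  return Speedup > 1.0f;
}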

All of that might not be enough, however, because the clock-rate effects are not entirely local. We could have an absolute cap for small-trip-count loops, and for SLP vectorization. For the loop vectorizer, look at LoopVectorizationCostModel::computeFeasibleMaxVF (also, I'll take back something I said: picking a vectorization factor based on the smallest type, not the largest one, doesn't seem to be enabled by default right now, because -vectorizer-maximize-bandwidth is false by default, although it looks like we'd like it to be on, see r306936). There are a couple of places in the SLP vectorizer where the VF is computed, and capping those seems straightforward.

Now maybe none of this really helps, because you end up with loops with dynamically-small trip counts, where the vectorization speedup is negligible, but the presence of the wide vector instructions causes clock-rate decreases. Maybe you can't even reliably multiversion and branch around the vector code (because even speculatively executing the vector instructions triggers the problem); then we need to decide how much we care about these cases vs. speedups in other areas. However, I think that we should start by modeling what we can model, and then evaluate things from there.

I really do want to make sure that I understand the problem (before I continue suggesting solutions). To summarize:

  • AVX-512 is twice the length of AVX2, and so using AVX-512 over AVX2 should give a 2x speedup, but...

I want to avoid using the terms AVX-512 and AVX2 here and use ZMM and YMM or vector width. There are new instructions introduced after AVX512F as part of the AVX512VL instruction set that use only XMM and YMM registers and are not subject to this frequency issue. Our documentation really doesn't make that clear as it uses "AVX2". Enabling the avx512vl subtarget feature implies support for AVX512F and thus the support for 512-bit vectors.

Skylake microarchitecture has two port schemes, one for using 256-bit or less registers, and another for using 512-bit registers.

When using registers up to or including 256 bits, FMA operations dispatch to ports 0 and 1 and SIMD operations dispatch to ports 0, 1 and 5. When using 512-bit register operations, both FMA and SIMD operations dispatch to ports 0 and 5.
The maximum register width in the reservation station (RS) determines the 256 or 512 port scheme. Notice that when you use AVX-512 encoded instructions with YMM registers, the instructions are considered to be 256-bit wide.

The result of the 512-bit port scheme is that XMM or YMM code dispatches to 2 ports (0 and 5) instead of 3 ports (0, 1 and 5) and may have lower throughput and longer latency compared to the 256-bit port scheme.

  • But there's an additional complication, discussed in 15.20 of the optimization manual:

Some processors based on Skylake microarchitecture have two Intel AVX-512 FMA units, on ports 0 and 5, while other processors based on Skylake microarchitecture have a single Intel AVX-512 FMA unit, which is located on port 0.

Code that is optimized to run on a processor with two FMA units might not be optimal when run on a processor with one FMA unit.

If these are the relevant causes of the problem, then I suggest that we do the following:

  1. We really have two different Skylake cores for optimization purposes: The ones that execute AVX-512 only on port 0/1, and the ones that also can execute AVX-512 on port 5. We should stop pretending that these are the same cores, calling them both skylake-avx512, and call the former something else (say skylake-avx512-1p). These cores should have different scheduling models. We should benchmark these differently, and for the port-0-only variant, we should increase the TTI costs for all AVX-512 instructions by a factor of 2 (which is right because TTI returns reciprocal throughputs for vectorization), or maybe a little more than 2 (to account for the extra ILP constraints) if necessary.

Yes, there are two variants of skylake-avx512, but there doesn't seem to be a good way of autodetecting this for -march=native.

In general, I don't think our cost models distinguish based on CPUs, do they? Aren't they based only on subtarget features?

  1. We need a way for the X86 backend to penalize mixing 512-bit vectors with smaller vector types. It can do this by increasing the cost of the smaller non-FMA vector operations, when 512-bit vectors are around, by 30% (to account for the fact that, when 512-bit instructions are around, the peak throughput of those instructions is decreased from 3/cycle to 2/cycle). We can add a TTI interface that allows the target to analyze the loop, generate some state object, and then use that state object when generating the per-instruction costs.
  2. We need a way for the target to specify a penalty factor for a loop (etc.) when vectorizing. Right now, if there appears to be a positive speedup, then the associated vectorization factor is chosen. To adjust for the clock-rate decrease, if we're generating 512-bit instructions, we should apply a penalty factor of, say, 0.7, so that estimating that vectorization will be profitable includes the effect of the (potentially) decreasing clock rate.

All of that might not be enough, however, because the clock-rate effects are not entirely local. We could have an absolute cap for small-trip-count loops, and for SLP vectorization. For the loop vectorizer, look at LoopVectorizationCostModel::computeFeasibleMaxVF (also, I'll take back something I said: picking a vectorization factor based on the smallest type, not the largest one, doesn't seem to be enabled by default right now, because -vectorizer-maximize-bandwidth is false by default, although it looks like we'd like it to be on, see r306936). There are a couple of places in the SLP vectorizer where the VF is computed, and capping those seems straightforward.

Yeah, the VF is calculated from the largest scalar type of loads, stores, and phis, I think.

Now maybe none of this really helps, because you end up with loops with dynamically-small trip counts, where the vectorization speedup is negligible, but the presence of the wide vector instructions causes clock-rate decreases. Maybe you can't even reliably multiversion and branch around the vector code (because even speculatively executing the vector instructions triggers the problem); then we need to decide how much we care about these cases vs. speedups in other areas. However, I think that we should start by modeling what we can model, and then evaluate things from there.

As you said, the effect is not local, because once you trigger it there is a timer during which the penalty stays in effect. I'm sure speculation would trigger it too, since this penalty is about power delivery to the execution units.

I appreciate the cost modeling suggestions and I think there could be a good long term solution in doing that, but I think that will require a lot more tuning effort and it's unclear if that could be made to work.

What Intel wants to see implemented right now is a way to remove as much zmm register usage as possible by default on skylake-avx512 without losing the avx512vl capabilities. If the enabling of avx512vl didn't automatically imply the availability of avx512f and 512-bit intrinsics, we would probably just turn the 512-bit support off by default in the legalizer very easily. But the dependencies don't work that way.

I really do want to make sure that I understand the problem (before I continue suggesting solutions). To summarize:

  • AVX-512 is twice the length of AVX2, and so using AVX-512 over AVX2 should give a 2x speedup, but...

I want to avoid using the terms AVX-512 and AVX2 here and use ZMM and YMM or vector width. There are new instructions introduced after AVX512F as part of the AVX512VL instruction set that use only XMM and YMM registers and are not subject to this frequency issue. Our documentation really doesn't make that clear as it uses "AVX2".

Sure. Let's do that.

Enabling the avx512vl subtarget feature implies support for AVX512F and thus the support for 512-bit vectors.

Maybe we should change that?

Skylake microarchitecture has two port schemes, one for using 256-bit or less registers, and another for using 512-bit registers.

When using registers up to or including 256 bits, FMA operations dispatch to ports 0 and 1 and SIMD operations dispatch to ports 0, 1 and 5. When using 512-bit register operations, both FMA and SIMD operations dispatch to ports 0 and 5.
The maximum register width in the reservation station (RS) determines the 256 or 512 port scheme. Notice that when you use AVX-512 encoded instructions with YMM registers, the instructions are considered to be 256-bit wide.

The result of the 512-bit port scheme is that XMM or YMM code dispatches to 2 ports (0 and 5) instead of 3 ports (0, 1 and 5) and may have lower throughput and longer latency compared to the 256-bit port scheme.

  • But there's an additional complication, discussed in 15.20 of the optimization manual:

Some processors based on Skylake microarchitecture have two Intel AVX-512 FMA units, on ports 0 and 5, while other processors based on Skylake microarchitecture have a single Intel AVX-512 FMA unit, which is located on port 0.

Code that is optimized to run on a processor with two FMA units might not be optimal when run on a processor with one FMA unit.

If these are the relevant causes of the problem, then I suggest that we do the following:

  1. We really have two different Skylake cores for optimization purposes: The ones that execute AVX-512 only on port 0/1, and the ones that also can execute AVX-512 on port 5. We should stop pretending that these are the same cores, calling them both skylake-avx512, and call the former something else (say skylake-avx512-1p). These cores should have different scheduling models. We should benchmark these differently, and for the port-0-only variant, we should increase the TTI costs for all AVX-512 instructions by a factor of 2 (which is right because TTI returns reciprocal throughputs for vectorization), or maybe a little more than 2 (to account for the extra ILP constraints) if necessary.

Yes, there are two variants of skylake-avx512, but there doesn't seem to be a good way of autodetecting this for -march=native.

Your optimization manual suggests a relative timing test, so I'm guessing there's not (I wouldn't want to use that for -march=native because it wouldn't be deterministic). As a result, I think we'll just need to make a choice based on some combination of which is likely to be most common among our users and which is likely best on future hardware. Users will need to explicitly specify the architecture to get the other one.

In general, I don't think our cost models distinguish based on CPUs, do they? Aren't they based only on subtarget features?

Generally, we add subtarget features as necessary. Some things get pulled from the scheduling models, and scheduling models are per-CPU. We could do other things on a per-CPU basis if appropriate.

  1. We need a way for the X86 backend to penalize mixing 512-bit vectors with smaller vector types. It can do this by increasing the cost of the smaller non-FMA vector operations, when 512-bit vectors are around, by 30% (to account for the fact that, when 512-bit instructions are around, the peak throughput of those instructions is decreased from 3/cycle to 2/cycle). We can add a TTI interface that allows the target to analyze the loop, generate some state object, and then use that state object when generating the per-instruction costs.
  2. We need a way for the target to specify a penalty factor for a loop (etc.) when vectorizing. Right now, if there appears to be a positive speedup, then the associated vectorization factor is chosen. To adjust for the clock-rate decrease, if we're generating 512-bit instructions, we should apply a penalty factor of, say, 0.7, so that estimating that vectorization will be profitable includes the effect of the (potentially) decreasing clock rate.

All of that might not be enough, however, because the clock-rate effects are not entirely local. We could have an absolute cap for small-trip-count loops, and for SLP vectorization. For the loop vectorizer, look at LoopVectorizationCostModel::computeFeasibleMaxVF (also, I'll take back something I said: picking a vectorization factor based on the smallest type, not the largest one, doesn't seem to be enabled by default right now, because -vectorizer-maximize-bandwidth is false by default, although it looks like we'd like it to be on, see r306936). There are a couple of places in the SLP vectorizer where the VF is computed, and capping those seems straightforward.

Yeah, the VF is calculated from the largest scalar type of loads, stores, and phis, I think.

Now maybe none of this really helps, because you end up with loops with dynamically-small trip counts, where the vectorization speedup is negligible, but the presence of the wide vector instructions causes clock-rate decreases. Maybe you can't even reliably multiversion and branch around the vector code (because even speculatively executing the vector instructions triggers the problem); then we need to decide how much we care about these cases vs. speedups in other areas. However, I think that we should start by modeling what we can model, and then evaluate things from there.

As you said, the effect is not local, because once you trigger it there is a timer during which the penalty stays in effect. I'm sure speculation would trigger it too, since this penalty is about power delivery to the execution units.

I appreciate the cost modeling suggestions and I think there could be a good long term solution in doing that, but I think that will require a lot more tuning effort and it's unclear if that could be made to work.

What Intel wants to see implemented right now is a way to remove as much zmm register usage as possible by default on skylake-avx512 without losing the avx512vl capabilities. If the enabling of avx512vl didn't automatically imply the availability of avx512f and 512-bit intrinsics, we would probably just turn the 512-bit support off by default in the legalizer very easily. But the dependencies don't work that way.

Why don't we just change the dependencies? I realize that's work, but if that's what's desired, then that should be our preferred direction. I suppose it means separating the predicates on a bunch of instructions, but it doesn't seem otherwise complicated. Is it?

The dependency isn't really an llvm dependency. You're absolutely right that if it were just that, it would be fixable by changing our instruction patterns. The problem here is about how the features and dependencies are defined in our instruction manuals. avx512f means "avx512 foundation", but only includes 512-bit instructions. The later features (like avx512vl) imply the foundation is present.

A user may expect that if they use "-march=skylake-avx512" that they can use 512-bit intrinsics in x86intrin.h. But at the same time we should generate good performing code for other users who specify -march=skylake-avx512 and didn't write any vector code in their source. Based on the data we've seen so far, the best way to do that is to disable 512-bit instructions.

So what I'm looking for is a solution that optimizes for the likely case that the user code doesn't contain 512-bit vector intrinsics and tells the legalizer to use 256-bit registers only. But if the user code does contain explicit 512-bit intrinsics, we still allow that code to compile and ideally generate what the user expected. I don't want existing 512-bit intrinsic code to suddenly stop compiling with -march=skylake-avx512.

Yes, there are two variants of skylake-avx512, but there doesn't seem to be a good way of autodetecting this for -march=native.

Your optimization manual suggests a relative timing test, so I'm guessing there's not (I wouldn't want to use that for -march=native because it wouldn't be deterministic). As a result, I think we'll just need to make a choice based on some combination of which is likely to be most common among our users and which is likely best on future hardware. Users will need to explicitly specify the architecture to get the other one.

There is avx512_2ndFMA (PIROM offset 70h bit 0), see: https://www.intel.com/content/dam/www/public/us/en/documents/datasheets/xeon-scalable-datasheet-vol-1.pdf

The dependency isn't really an llvm dependency. You're absolutely right that if it were just that, it would be fixable by changing our instruction patterns. The problem here is about how the features and dependencies are defined in our instruction manuals. avx512f means "avx512 foundation", but only includes 512-bit instructions. The later features (like avx512vl) imply the foundation is present.

A user may expect that if they use "-march=skylake-avx512" that they can use 512-bit intrinsics in x86intrin.h. But at the same time we should generate good performing code for other users who specify -march=skylake-avx512 and didn't write any vector code in their source. Based on the data we've seen so far, the best way to do that is to disable 512-bit instructions.

So what I'm looking for is a solution that optimizes for the likely case that the user code doesn't contain 512-bit vector intrinsics and tells the legalizer to use 256-bit registers only. But if the user code does contain explicit 512-bit intrinsics, we still allow that code to compile and ideally generate what the user expected. I don't want existing 512-bit intrinsic code to suddenly stop compiling with -march=skylake-avx512.

Instead of lying to the legalizer, I'd recommend the 2nd option that was proposed (but not exactly in these terms) - create an imaginary skylake CPU variant that does what you want: it has the new instructions, but it doesn't have 512-bit flavors of anything. Let's call this hypothetical subtarget 'skylake-avx257'. In the common case where there's no explicit 512-bit instruction usage in a function, we would just translate -march=skylake-avx512 to -march=skylake-avx257. That guarantees no 512-bit ops for that function. In the rare case where there is explicit 512-bitness, we fall back to using prefer-vector-width. We can't guarantee what will happen in that case, but we do our best and hopefully it's too rare to care about.

The upside is that it's probably just a matter of time before the fake target becomes a real product anyway, so we're going to need a skylake-avx257 sooner or later. :)

How/where do you propose detecting the presence or absence of 512-bit instructions to change the CPU name?

How/where do you propose detecting the presence or absence of 512-bit instructions to change the CPU name?

I think the path through clang's CodeGenFunction::checkTargetFeatures() is where we tell the user if they've used an intrinsic without the required CPU attribute. So somewhere around there might work?

That would miss intrinsics implemented using macros, and it would miss function arguments that are 512 bits wide and need to be passed in zmm registers.

How/where do you propose detecting the presence or absence of 512-bit instructions to change the CPU name?

I think the path through clang's CodeGenFunction::checkTargetFeatures() is where we tell the user if they've used an intrinsic without the required CPU attribute. So somewhere around there might work?

I think this would go in sys::getHostCPUName in lib/Support/Host.cpp (or in sys::getHostCPUFeatures as appropriate).

getHostCPUName/getHostCPUFeatures is called by the clang driver long before any code is parsed. How would it know if the code contained any 512-bit instructions?

Yes, there are two variants of skylake-avx512, but there doesn't seem to be a good way of autodetecting this for -march=native.

Your optimization manual suggests a relative timing test, so I'm guessing there's not (I wouldn't want to use that for -march=native because it wouldn't be deterministic). As a result, I think we'll just need to make a choice based on some combination of which is likely to be most common among our users and which is likely best on future hardware. Users will need to explicitly specify the architecture to get the other one.

There is avx512_2ndFMA (PIROM offset 70h bit 0), see: https://www.intel.com/content/dam/www/public/us/en/documents/datasheets/xeon-scalable-datasheet-vol-1.pdf

Is that accessible in user space somehow?

The dependency isn't really an llvm dependency. You're absolutely right that if it were just that, it would be fixable by changing our instruction patterns. The problem here is about how the features and dependencies are defined in our instruction manuals. avx512f means "avx512 foundation", but only includes 512-bit instructions. The later features (like avx512vl) imply the foundation is present.

A user may expect that if they use "-march=skylake-avx512" that they can use 512-bit intrinsics in x86intrin.h. But at the same time we should generate good performing code for other users who specify -march=skylake-avx512 and didn't write any vector code in their source. Based on the data we've seen so far, the best way to do that is to disable 512-bit instructions.

Understood. However, we can separate this within the implementation. Specifically, Clang does not need to tag all generated functions with the same target attribute. On the LLVM side, we can separate these, and then use that finer-grained control.

So what I'm looking for is a solution that optimizes for the likely case that the user code doesn't contain 512-bit vector intrinsics and tells the legalizer to use 256-bit registers only. But if the user code does contain explicit 512-bit intrinsics, we still allow that code to compile and ideally generate what the user expected. I don't want existing 512-bit intrinsic code to suddenly stop compiling with -march=skylake-avx512.

I think that we're on the same page. In addition to intrinsics, I'm also worried about OpenMP SIMD loops/functions (or other explicit vectorization). We may want those to also give 512-bit vectors by default (presumably, such specifically-tagged loops are likely to run long enough to amortize the clock-rate effects).

getHostCPUName/getHostCPUFeatures is called by the clang driver long before any code is parsed. How would it know if the code contained any 512-bit instructions?

I recommend that we do this in Clang during CodeGen. We should add a special callback that will allow TargetInfo to examine the AST and adjust the target (or target features) on a per-function basis. Any function using AVX-512 intrinsics on 512-bit vectors, explicit vector types, or OpenMP SIMD (unless a small simdlen clause is present) will stay as-is, and everything else will be modified to turn off 512-bit vectors.

The real question is whether to block inlining on mismatch here. I don't think that we should (and we'll need to enable 512-bit vectors in the caller). The problem is that people write C++ wrappers around vector intrinsics, and we need the compiler to remove the abstraction layer. Generating that code poorly will be a significant problem. This will have the unfortunate "action at a distance" effects we discussed earlier (because having some 512-bit vectors in some function, even after inlining, will suddenly enable it elsewhere in the function), but I don't see any good way to prevent that in undesirable cases without causing significant problems elsewhere.

The dependency isn't really an llvm dependency. You're absolutely right that if it were just that, it would be fixable by changing our instruction patterns. The problem here is about how the features and dependencies are defined in our instruction manuals. avx512f means "avx512 foundation", but only includes 512-bit instructions. The later features (like avx512vl) imply the foundation is present.

A user may expect that if they use "-march=skylake-avx512" that they can use 512-bit intrinsics in x86intrin.h. But at the same time we should generate good performing code for other users who specify -march=skylake-avx512 and didn't write any vector code in their source. Based on the data we've seen so far, the best way to do that is to disable 512-bit instructions.

Understood. However, we can separate this within the implementation. Specifically, Clang does not need to tag all generated functions with the same target attribute. On the LLVM side, we can separate these, and then use that finer-grained control.

So what I'm looking for is a solution that optimizes for the likely case that the user code doesn't contain 512-bit vector intrinsics and tells the legalizer to use 256-bit registers only. But if the user code does contain explicit 512-bit intrinsics, we still allow that code to compile and ideally generate what the user expected. I don't want existing 512-bit intrinsic code to suddenly stop compiling with -march=skylake-avx512.

I think that we're on the same page. In addition to intrinsics, I'm also worried about OpenMP SIMD loops/functions (or other explicit vectorization). We may want those to also give 512-bit vectors by default (presumably, such specifically-tagged loops are likely to run long enough to amortize the clock-rate effects).

getHostCPUName/getHostCPUFeatures is called by the clang driver long before any code is parsed. How would it know if the code contained any 512-bit instructions?

I recommend that we do this in Clang during CodeGen. We should add a special callback that will allow TargetInfo to examine the AST and adjust the target (or target features) on a per-function basis. Any function using AVX-512 intrinsics on 512-bit vectors, explicit vector types, or OpenMP SIMD (unless a small simdlen clause is present) will stay as-is, and everything else will be modified to turn off 512-bit vectors.

I definitely don't want to change the target-cpu string added by clang. Maybe a target feature. This problem may continue in follow-on CPUs after skx, and I don't want to start coding a specific list of CPUs into clang. We already have more information about CPU feature mapping in clang than I think we would really like.

The real question is whether to block inlining on mismatch here. I don't think that we should (and we'll need to enable 512-bit vectors in the caller). The problem is that people write C++ wrappers around vector intrinsics, and we need the compiler to remove the abstraction layer. Generating that code poorly will be a significant problem. This will have the unfortunate "action at a distance" effects we discussed earlier (because having some 512-bit vectors in some function, even after inlining, will suddenly enable it elsewhere in the function), but I don't see any good way to prevent that in undesirable cases without causing significant problems elsewhere.

I agree we don't want to block inlining on a mismatch. Do we have a way to allow targets to control the merging behavior? If we do this as part of the "target-feature" or "target-cpu" attribute we would need custom merging.

Alternatively, I was thinking about a separate "required-vector-width" attribute. Either clang codegen, or an early IR pass, could analyze the function and set the attribute based on the largest vector width, or OpenMP SIMD, ABI requirement, etc. The inliner could merge this by maxing the caller and callee values. This could be generated independent of the target being X86. This attribute could be consumed by the X86 backend to limit the legalizer if it's present and the value is 256 or less and the CPU is skylake-avx512 or any CPU with the frequency issue.
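A hedged sketch of the max-merging step described here (the attribute name is the one proposed above; the helper and its placement in the inliner are illustrative):

#include "llvm/ADT/StringExtras.h"
#include "llvm/IR/Function.h"
using namespace llvm;

// Illustrative only: take the max of the caller's and callee's
// "required-vector-width" when the callee is inlined. A missing attribute
// parses as 0, i.e. "no requirement".
static void mergeRequiredVectorWidth(Function &Caller, const Function &Callee) {
  uint64_t CallerWidth = 0, CalleeWidth = 0;
  Caller.getFnAttribute("required-vector-width")
      .getValueAsString().getAsInteger(0, CallerWidth);
  Callee.getFnAttribute("required-vector-width")
      .getValueAsString().getAsInteger(0, CalleeWidth);
  if (CalleeWidth > CallerWidth)
    Caller.addFnAttr("required-vector-width", utostr(CalleeWidth));
}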

The dependency isn't really an llvm dependency. You're absolutely right that if it were just that, it would be fixable by changing our instruction patterns. The problem here is about how the features and dependencies are defined in our instruction manuals. avx512f means "avx512 foundation", but only includes 512-bit instructions. The later features (like avx512vl) imply the foundation is present.

A user may expect that if they use "-march=skylake-avx512" that they can use 512-bit intrinsics in x86intrin.h. But at the same time we should generate good performing code for other users who specify -march=skylake-avx512 and didn't write any vector code in their source. Based on the data we've seen so far, the best way to do that is to disable 512-bit instructions.

Understood. However, we can separate this within the implementation. Specifically, Clang does not need to tag all generated functions with the same target attribute. On the LLVM side, we can separate these, and then use that finer-grained control.

So what I'm looking for is a solution that optimizes for the likely case that the user code doesn't contain 512-bit vector intrinsics and tells the legalizer to use 256-bit registers only. But if the user code does contain explicit 512-bit intrinsics, we still allow that code to compile and ideally generate what the user expected. I don't want existing 512-bit intrinsic code to suddenly stop compiling with -march=skylake-avx512.

I think that we're on the same page. In addition to intrinsics, I'm also worried about OpenMP SIMD loops/functions (or other explicit vectorization). We may want those to also give 512-bit vectors by default (presumably, such specifically-tagged loops are likely to run long enough to amortize the clock-rate effects).

getHostCPUName/getHostCPUFeatures is called by the clang driver long before any code is parsed. How would it know if the code contained any 512-bit instructions?

I recommend that we do this in Clang during CodeGen. We should add a special callback that will allow TargetInfo to examine the AST and adjust the target (or target features) on a per-function basis. Any function using AVX-512 intrinsics on 512-bit vectors, explicit vector types, or OpenMP SIMD (unless a small simdlen clause is present) will stay as-is, and everything else will be modified to turn off 512-bit vectors.

I definitely don't want to change the target-cpu string added by clang. Maybe a target feature. This problem may continue in follow-on CPUs after skx, and I don't want to start coding a specific list of CPUs into clang. We already have more information about CPU feature mapping in clang than I think we would really like.

The real question is whether to block inlining on mismatch here. I don't think that we should (and we'll need to enable 512-bit vectors in the caller). The problem is that people write C++ wrappers around vector intrinsics, and we need the compiler to remove the abstraction layer. Generating that code poorly will be a significant problem. This will have the unfortunate "action at a distance" effects we discussed earlier (because having some 512-bit vectors in some function, even after inlining, will suddenly enable it elsewhere in the function), but I don't see any good way to prevent that in undesirable cases without causing significant problems elsewhere.

I agree we don't want to block inlining on a mismatch. Do we have a way to allow targets to control the merging behavior? If we do this as part of the "target-feature" or "target-cpu" attribute we would need custom merging.

We don't currently; we have only areInlineCompatible in TTI. This is called like this:

return TTI.areInlineCompatible(Caller, Callee) &&
       AttributeFuncs::areInlineCompatible(*Caller, *Callee);

we also have a:

AttributeFuncs::mergeAttributesForInlining(*Caller, *Callee);

adding a corresponding TTI function and calling it in the two places where AttributeFuncs::mergeAttributesForInlining is called would be straightforward.
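In the same snippet style, the call site after adding such a hook might look like this (the hook name is hypothetical; only the AttributeFuncs call exists today):

AttributeFuncs::mergeAttributesForInlining(*Caller, *Callee);
TTI.mergeInlineAttributes(*Caller, *Callee); // new, target-specific merging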

Alternatively, I was thinking about a separate "required-vector-width" attribute. Either clang codegen, or an early IR pass

I prefer that we do this in Clang's CodeGen. We just don't have enough information at the IR level to differentiate between things the user explicitly requested and things that have been added by some earlier stage automatically (plus, the pass method would need to rely on pass injection, or similar, and that won't work with frontends with custom pipelines anyway).

could analyze the function and set the attribute based on the largest vector width, or OpenMP SIMD, ABI requirement, etc. The inliner could merge this by maxing the caller and callee values. This could be generated independent of the target being X86. This attribute could be consumed by the X86 backend to limit the legalizer if it's present and the value is 256 or less and the CPU is skylake-avx512 or any CPU with the frequency issue.

That makes sense to me.

In the backend, I imagine you'll still essentially end up splitting the features and then setting things up similarly to this patch in getSubtargetImpl. Is that the idea?

The dependency isn't really an llvm dependency. You're absolutely right that if it were just that, it would be fixable by changing our instruction patterns. The problem here is about how the features and dependencies are defined in our instruction manuals. avx512f means "avx512 foundation", but only includes 512-bit instructions. The later features (like avx512vl) imply the foundation is present.

A user may expect that if they use "-march=skylake-avx512" that they can use 512-bit intrinsics in x86intrin.h. But at the same time we should generate good performing code for other users who specify -march=skylake-avx512 and didn't write any vector code in their source. Based on the data we've seen so far, the best way to do that is to disable 512-bit instructions.

Understood. However, we can separate this within the implementation. Specifically, Clang does not need to tag all generated functions with the same target attribute. On the LLVM side, we can separate these, and then use that finer-grained control.

So what I'm looking for is a solution that optimizes for the likely case that the user code doesn't contain 512-bit vector intrinsics and tells the legalizer to use 256-bit registers only. But if the user code does contain explicit 512-bit intrinsics, we still allow that code to compile and ideally generate what the user expected. I don't want existing 512-bit intrinsic code to suddenly stop compiling with -march=skylake-avx512.

I think that we're on the same page. In addition to intrinsics, I'm also worried about OpenMP SIMD loops/functions (or other explicit vectorization). We may want those to also give 512-bit vectors by default (presumably, such specifically-tagged loops are likely to run long enough to amortize the clock-rate effects).

getHostCPUName/getHostCPUFeatures is called by the clang driver long before any code is parsed. How would it know if the code contained any 512-bit instructions?

I recommend that we do this in Clang during CodeGen. We should add a special callback that will allow TargetInfo to examine the AST and adjust the target (or target features) on a per-function basis. Any function using AVX-512 intrinsics on 512-bit vectors, explicit vector types, or OpenMP SIMD (unless a small simdlen clause is present) will stay as-is, and everything else will be modified to turn off 512-bit vectors.

I definitely don't want to change the target-cpu string added by clang. Maybe a target feature. This problem may continue in follow-on CPUs after skx, and I don't want to start coding a specific list of CPUs into clang. We already have more information about CPU feature mapping in clang than I think we would really like.

The real question is whether to block inlining on mismatch here. I don't think that we should (and we'll need to enable 512-bit vectors in the caller). The problem is that people write C++ wrappers around vector intrinsics, and we need the compiler to remove the abstraction layer. Generating that code poorly will be a significant problem. This will have the unfortunate "action at a distance" effects we discussed earlier (because having some 512-bit vectors in some function, even after inlining, will suddenly enable it elsewhere in the function), but I don't see any good way to prevent that in undesirable cases without causing significant problems elsewhere.

I agree we don't want to block inlining on a mismatch. Do we have a way to allow targets to control the merging behavior? If we do this as part of the "target-feature" or "target-cpu" attribute we would need custom merging.

We don't currently; we have only areInlineCompatible in TTI. This is called like this:

return TTI.areInlineCompatible(Caller, Callee) &&
       AttributeFuncs::areInlineCompatible(*Caller, *Callee);

we also have a:

AttributeFuncs::mergeAttributesForInlining(*Caller, *Callee);

adding a corresponding TTI function and calling it in the two places where AttributeFuncs::mergeAttributesForInlining is called would be straightforward.

Alternatively, I was thinking about a separate "required-vector-width" attribute. Either clang codegen, or an early IR pass

I prefer that we do this in Clang's CodeGen. We just don't have enough information at the IR level to differentiate between things the user explicitly requested and things that have been added by some earlier stage automatically (plus, the pass method would need to rely on pass injection, or similar, and that won't work with frontends with custom pipelines anyway).

Does "pass injection" here mean having a hook to put in a target specific pass or something else?

could analyze the function and set the attribute based on the largest vector width, or OpenMP SIMD, ABI requirement, etc. The inliner could merge this by maxing the caller and callee values. This could be generated independent of the target being X86. This attribute could be consumed by the X86 backend to limit the legalizer if it's present and the value is 256 or less and the CPU is skylake-avx512 or any CPU with the frequency issue.

That makes sense to me.

In the backend, I imagine you'll still essentially end up splitting the features and then setting things up similarly to this patch in getSubtargetImpl. Is that the idea?

Yes, I'd translate the attribute in getSubtargetImpl into a subtarget feature that I can look at in X86ISelLowering.cpp.

If we say that "required-vector-width" is a target independent attribute, can we do the clang codegen part in a target independent way? Could we just keep track of the requirement as we're codegening the function and tack on the attribute at the end of emitting the function? Then we don't have to separately walk the AST

The dependency isn't really an llvm dependency. You're absolutely right that if it were just that, it would be fixable by changing our instruction patterns. The problem here is about how the features and dependencies are defined in our instruction manuals. avx512f means "avx512 foundation", but only includes 512-bit instructions. The later features (like avx512vl) imply the foundation is present.

A user may expect that if they use "-march=skylake-avx512" that they can use 512-bit intrinsics in x86intrin.h. But at the same time we should generate good performing code for other users who specify -march=skylake-avx512 and didn't write any vector code in their source. Based on the data we've seen so far, the best way to do that is to disable 512-bit instructions.

Understood. However, we can separate this within the implementation. Specifically, Clang does not need to tag all generated functions with the same target attribute. On the LLVM side, we can separate these, and then use that finer-grained control.

So what I'm looking for is a solution that optimizes for the likely case that the user code doesn't contain 512-bit vector intrinsics and tells the legalizer to use 256-bit registers only. But if the user code does contain explicit 512-bit intrinsics, we still allow that code to compile and ideally generate what the user expected. I don't want existing 512-bit intrinsic code to suddenly stop compiling with -march=skylake-avx512.

I think that we're on the same page. In addition to intrinsics, I'm also worried about OpenMP SIMD loops/functions (or other explicit vectorization). We may want those to also give 512-bit vectors by default (presumably, such specifically-tagged loops are likely to run long enough to amortize the clock-rate effects).

getHostCPUName/getHostCPUFeatures is called by the clang driver long before any code is parsed. How would it know if the code contained any 512-bit instructions?

I recommend that we do this in Clang during CodeGen. We should add a special callback that will allow TargetInfo to examine the AST and adjust the target (or target features) on a per-function basis. Any function using AVX-512 intrinsics on 512-bit vectors, explicit vector types, or OpenMP SIMD (unless a small simdlen clause is present) will stay as-is, and everything else will be modified to turn off 512-bit vectors.

I definitely don't want to change the target-cpu string added by clang. Maybe a target feature. This problem may continue in follow-on CPUs after skx, and I don't want to start coding a specific list of CPUs into clang. We already have more information about CPU feature mapping in clang than I think we would really like.

The real question is whether to block inlining on mismatch here. I don't think that we should (and we'll need to enable 512-bit vectors in the caller). The problem is that people write C++ wrappers around vector intrinsics, and we need the compiler to remove the abstraction layer. Generating that code poorly will be a significant problem. This will have the unfortunate "action at a distance" effects we discussed earlier (because having some 512-bit vectors in some function, even after inlining, will suddenly enable it elsewhere in the function), but I don't see any good way to prevent that in undesirable cases without causing significant problems elsewhere.

I agree we don't want to block inlining on a mismatch. Do we have a way to allow targets to control the merging behavior? If we do this as part of the "target-feature" or "target-cpu" attribute we would need custom merging.

We don't currently; we only have areInlineCompatible in TTI, which is called like this:

return TTI.areInlineCompatible(Caller, Callee) &&
       AttributeFuncs::areInlineCompatible(*Caller, *Callee);

we also have a:

AttributeFuncs::mergeAttributesForInlining(*Caller, *Callee);

adding a corresponding TTI function and calling it in the two places where AttributeFuncs::mergeAttributesForInlining is called would be straightforward.
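
As a sketch, the new hook would parallel the existing AttributeFuncs call (the TTI method name here is hypothetical; only areInlineCompatible exists in TTI today):

return TTI.areInlineCompatible(Caller, Callee) &&
       AttributeFuncs::areInlineCompatible(*Caller, *Callee);

// ... and at the two merge points:
AttributeFuncs::mergeAttributesForInlining(*Caller, *Callee);
TTI.mergeAttributesForInlining(Caller, Callee); // hypothetical new TTI hook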

Alternatively, I was thinking about a separate "required-vector-width" attribute. Either clang codegen, or an early IR pass

I prefer that we do this in Clang's CodeGen. We just don't have enough information at the IR level to differentiate between things the user explicitly requested and things that have been added by some earlier stage automatically (plus, the pass method would need to rely on pass injection, or similar, and that won't work with frontends with custom pipelines anyway).

Does "pass injection" here mean having a hook to put in a target specific pass or something else?

Yes (or just adding it into the default pipeline, both suffer from the same problems).

could analyze the function and set the attribute based on the largest vector width, or OpenMP SIMD, ABI requirements, etc. The inliner could merge this by taking the max of the caller and callee values. This could be generated independently of the target being X86. This attribute could be consumed by the X86 backend to limit the legalizer if it's present, the value is 256 or less, and the CPU is skylake-avx512 or any other CPU with the frequency issue.
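
(As an aside, a minimal sketch of that max-merge, assuming the attribute value is stored as a decimal string; the helper name is made up for illustration:)

// Requires llvm/ADT/StringExtras.h for utostr.
static void mergeRequiredVectorWidth(Function &Caller,
                                     const Function &Callee) {
  uint64_t CallerW = 0, CalleeW = 0;
  // A missing attribute parses as 0, i.e. "no known requirement".
  Caller.getFnAttribute("required-vector-width")
      .getValueAsString()
      .getAsInteger(10, CallerW);
  Callee.getFnAttribute("required-vector-width")
      .getValueAsString()
      .getAsInteger(10, CalleeW);
  if (CalleeW > CallerW)
    Caller.addFnAttr("required-vector-width", utostr(CalleeW));
}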

That makes sense to me.

In the backend, I imagine you'll still essentially end up splitting the features and then setting things up in getSubtargetImpl much as in this patch. Is that the idea?

Yes, I'd translate the attribute in getSubtargetImpl into a subtarget feature that I can look at in X86ISelLowering.cpp

If we say that "required-vector-width" is a target independent attribute, can we do the clang codegen part in a target independent way?

Yes, I think so. We'll know the size of any vector types used (which should cover explicit uses of vector types and intrinsics). For explicitly-vectorized loops, we don't quite have the right thing -- we have SimdDefaultAlign in TargetInfo, which is essentially correct, albeit with the wrong name:

SimdDefaultAlign =
    hasFeature("avx512f") ? 512 : hasFeature("avx") ? 256 : 128;

Could we just keep track of the requirement as we're codegening the function and tack on the attribute at the end of emitting the function? Then we don't have to separately walk the AST.

Yes, I think so.
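
(A sketch of that bookkeeping; the LargestVectorWidth member and the hook placement are assumptions about where this would live in clang's CodeGen:)

// Hypothetical CodeGenFunction member: unsigned LargestVectorWidth = 0;
// Called wherever a vector type is emitted for the current function.
void noteVectorType(llvm::Type *Ty) {
  if (auto *VT = llvm::dyn_cast<llvm::VectorType>(Ty))
    LargestVectorWidth =
        std::max(LargestVectorWidth, VT->getPrimitiveSizeInBits());
}

// At the end of emitting the function, tack on the attribute once.
if (LargestVectorWidth)
  CurFn->addFnAttr("required-vector-width", llvm::utostr(LargestVectorWidth));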

Ok, so now that we have a plan, can we review this patch on its content?

My personal preference is for additive features, not subtractive ones. Can you split AVX-512 into the 512-bit-register part and the rest, or do you prefer to do it this way? There'll need to be a follow-up patch to adjust the calls to addRegisterClass, etc., in X86ISelLowering, and I suspect the code will look cleaner in the end with additive features.

Normally I would prefer additive features too.

For the prefer-avx256 feature in this patch, I need the preference to only apply with -march=skylake-avx512/native and not with -mavx512f or with -march=knl. So I think that means it needs to be set in the default features of the skylake-avx512 CPU definition, and the prefer-vector-width attribute needs to remove it if the user specifies a higher preference.

For the next patch, which will deal with legalization, I think I still have to do something weird because I can't see the features implied by the CPU name string in getSubtargetImpl. So I can't just blindly enable a feature flag for 512-bit register support there. I need to add something like "+requires512bitvectors" to the feature string based on the attribute, but the lack of a "required-vector-width" attribute implies we don't know for sure. So maybe it's better to have "+no512bitvectors" if the attribute is present and set to a value of 256 or less. Then in X86ISelLowering we would enable 512-bit types with "hasAVX512 && !(no512bitvectors && prefer-avx256)".

Do you see any better way?
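
(For concreteness, the gating I have in mind would look roughly like this in X86ISelLowering.cpp; the subtarget accessors are placeholders for whatever the features end up being named:)

// Only set up 512-bit register classes when wide vectors are allowed.
bool Use512BitVectors =
    Subtarget.hasAVX512() &&
    !(Subtarget.hasNo512BitVectors() && Subtarget.hasPreferAVX256());
if (Use512BitVectors) {
  addRegisterClass(MVT::v16i32, &X86::VR512RegClass);
  addRegisterClass(MVT::v16f32, &X86::VR512RegClass);
  addRegisterClass(MVT::v8i64,  &X86::VR512RegClass);
  addRegisterClass(MVT::v8f64,  &X86::VR512RegClass);
}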

Adjust a comment.

I would add a target feature, like:

def FeatureAVX512WideVectors : SubtargetFeature<"avx512-wide-vectors",
    "HasAVX512WideVectors", "true",
    "Enable use of 512-bit Vector Registers", [FeatureAVX512]>;

Then I'd add this feature, separately, to the relevant CPU definitions (KNLFeatures and SKXFeatures).

Then, in getSubtargetImpl, if the CPU name is skylake-avx512 or skx, then look at the required-vector-width attribute. If it is present, and is <= 256, then add "-avx512-wide-vectors" to the feature string.

I think this gives us what we want with only additive features.
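
(A rough sketch of that getSubtargetImpl logic; the attribute handling is illustrative only:)

// F is the Function we're compiling; CPU and FS come from its
// "target-cpu" and "target-features" attributes as usual.
if (CPU == "skylake-avx512" || CPU == "skx") {
  unsigned Width = 512; // no attribute means "don't know", so stay wide
  Attribute A = F.getFnAttribute("required-vector-width");
  if (A.isStringAttribute())
    A.getValueAsString().getAsInteger(10, Width);
  if (Width <= 256)
    FS += ",-avx512-wide-vectors";
}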

and not with -mavx512f

Do you mean that you want -march=skylake -mavx512f should effectively set a preference for 512-bit vectors? If so, I think we can handle that in the frontend (by just setting the preference there appropriately).

I don't really want a separate list of CPUs with this issue that we have to remember to update each time we add a CPU. It's likely "cannonlake" and "icelake" need this too, so that's four strings we need to check already. At least in the td file we'll probably copy a CPU when we add the next one, see the PreferAVX256 feature, and make a conscious decision. We already have several places that check a subtarget feature called isSLM(), which is like checking CPU=="silvermont", and none of those places have been audited for goldmont. And https://reviews.llvm.org/D40282 fixed another place that tried to do effectively a CPU string comparison and got it wrong.

No, I was only referring to the case of -mavx512f with no -march. If a user has specified -march, that preference should stay.

Also doesn't this mean that we have to explicitly force +avx512-wide-vectors somewhere if the user passes "-mavx512f" with no -march?

Good point. We'll need a list somewhere (either here or in the frontend). knl/knm are the odd ones out here. We could always add -avx512-wide-vectors unless the CPU is knl/knm. Or we always do it here and adjust how we add the attribute in the frontend. What do you think?

Yes, we could do that in the frontend. Or, we can make "core" AVX512 have a different name, add that to the processor feature lists, and then have a tablegen "avx512f" feature that implies both (that way -mavx512f will still act as it does today).

Well, we already have an implies relationship that needs to keep working: +avx512vl (and the other AVX-512 features) implies +avx512f. But if you make "+avx512f" imply +avx512-wide-vectors in tablegen, you also get the opposite implication: -avx512-wide-vectors implies -avx512f, which would then imply -avx512vl. That disables everything.

We could update clang and getHostCPUFeatures, but then all our lit tests need to be updated and we break JIT users.

We need to update Clang and getHostCPUFeatures anyway, but I think we can arrange the features so that we don't need to update the lit tests or break JIT users. If we need +avx512f to imply +avx512-wide-vectors, and we need +avx512vl (and other avx512 features) to imply +avx512f, and we don't want those names to change (so that we don't need to update lit tests and don't break JITs, etc.), then we'll need to split all of the features. So +avx512f functionally becomes +avx512f-core and +avx512f becomes a TableGen definition implying +avx512f-core and +avx512-wide-vectors. +avx512vl becomes a TableGen definition implying +avx512vl-core and +avx512-wide-vectors, and so on. We'll double the number of feature definitions in TableGen, but leave the actual number of subtarget members the same (plus one for HasAVX512WideVectors).
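
(In TableGen syntax, the split might look like the following; the "-core" feature names and the umbrella's field name are placeholders:)

def FeatureAVX512FCore : SubtargetFeature<"avx512f-core", "HasAVX512",
    "true", "Enable AVX-512 instructions without 512-bit registers">;
def FeatureAVX512WideVectors : SubtargetFeature<"avx512-wide-vectors",
    "HasAVX512WideVectors", "true",
    "Enable use of 512-bit Vector Registers", [FeatureAVX512FCore]>;
// The user-visible name keeps its old meaning by implying both halves,
// so -mattr=+avx512f and existing lit tests are unaffected.
def FeatureAVX512F : SubtargetFeature<"avx512f", "HasAVX512FUmbrella",
    "true", "Enable AVX-512 Foundation",
    [FeatureAVX512FCore, FeatureAVX512WideVectors]>;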

Ok, I'll look into doing it this way, but I think "-mattr=+avx512vl,-avx512vl" would leave avx512vl-core and avx512-wide-vectors enabled rather than turning them off.

craig.topper abandoned this revision.Jan 9 2018, 1:41 PM