Details
- Reviewers
- craig.topper - efriedma - echristo 
Diff Detail
- Repository
- rG LLVM Github Monorepo
Unit Tests
Event Timeline
Isn't X86TTIImpl::areTypesABICompatible supposed to be preventing promotion when it was cause a mismatch?
It doesn't work in complicated situations, e.g.: https://godbolt.org/z/386n6ErcY
On the other hand, I think it makes sense to always update min-legal-vector-width to the largest vector width.
That test case needs a triple and -mcpu that has avx512vl and prefer-vector-width=256. There's no ABI mismatch otherwise https://godbolt.org/z/h3vd3TvsW
Thanks Craig. I didn't notice it. It still fails if we switch the order of -inline and -argpromotion. https://godbolt.org/z/zo3hba8xW
Oh, I see the ArgumentPromotion wasn't wrong because the @foo and @bar both have min-legal-vector-width=0. Then @baz gets inlined into @baz and increases the min-legal-vector-width of @foo.
Is that really an inliner problem? Or are we assuming that if there is a vector argument, then min-legal-vector-width should always be >= the size of the vector argument?
Do we have documentation for min-legal-vector-width anywhere in-tree?
I guess the hierarchy currently works like the following:
- CPU attributes control what register width codegen for a function can assume is available.
- prefer-vector-width reduces the max width of registers codegen actually uses, for the sake of avoiding vector ops that force the CPU to downclock etc.
- min-legal-vector-width makes codegen ignore prefer-vector-width, so intrinsics etc. work transparently.
Then for calls:
- If a call a function with a very wide vector argument, it's split according to prefer-vector-width/min-legal-vector-width of the caller.
- If a function definition has a very wide vector argument, it's split according to prefer-vector-width/min-legal-vector-width on that definition.
Therefore, if a function has a vector argument, the caller and the callee must have the same values of min-legal-vector-width/prefer-vector-width...? I guess this follows from the above rules, but I'm not sure it's the final state we actually want.
I'm tempted to say we should have a separate abi-vector-width; this controls how vector arguments to functions are split, and no other function attributes have an effect. If we have that, we can ensure the ABI is consistent without worrying about prefer-vector-width/min-legal-vector-width/target features.
No. I didn't investigate it much, but I think it's not an inliner problem. Increasing "min-legal-vector-width" of a function should always be safe if we never touch the arguments. And we should always increasing it to make sure the correctness of the inlined functions. Here the ArgumentPromotion changes arguments without setting the correct "min-legal-vector-width" that looks a problem to me.
Or are we assuming that if there is a vector argument, then min-legal-vector-width should always be >= the size of the vector argument?
Yes and no. We can't always set the "min-legal-vector-width" to the size of  the largest vector argument. From the perspective of ABI, the "min-legal-vector-width" is the largest vector that meets the current ABI.
But when the ABI doesn't matter, we are free to set it to the size of  the largest vector argument.
ArgumentPromotion is a good example, where we can break ABI due to we do it for internal functions only. E.g., https://godbolt.org/z/dabbdcYPb
As you can see, __m1024 should be passed by memory according to ABI. But we will spilt it if we build a 1024 bit vector in arguments.
As @andrew.w.kaylor has pointer on https://discourse.llvm.org/t/vector-abi-and-min-legal-vector-width/60615, middle end usually doesn't have the ability to check the ABI and decides the proper vector or memory thing. So I think it's practicable to use arbitary vector type in arguments in middle end (as long as they never link externally, i.e., ABI doesn't matter) and set "min-legal-vector-width" to the size of the largest vector. It also conforms to the behavior with no "min-legal-vector-width", which we will set it to the largest integer.
Unfortunately, I don't think we have documentation for it. @craig.topper, right?
I guess the hierarchy currently works like the following:
- CPU attributes control what register width codegen for a function can assume is available.
- prefer-vector-width reduces the max width of registers codegen actually uses, for the sake of avoiding vector ops that force the CPU to downclock etc.
- min-legal-vector-width makes codegen ignore prefer-vector-width, so intrinsics etc. work transparently.
I'd take "min-legal-vector-width" as a awful supplementary due to we implemented "prefer-vector-width" in an odd way. I talked with GCC folks, who told me GCC implemented it through providing different cost model. I wonder why didn't we doing the same way at the beginning. I can see it's more concise and won't introduce any ABI problems.
Then for calls:
- If a call a function with a very wide vector argument, it's split according to prefer-vector-width/min-legal-vector-width of the caller.
- If a function definition has a very wide vector argument, it's split according to prefer-vector-width/min-legal-vector-width on that definition.
That's not correct. Splitting is not allowed according to ABI. Splitting is just the mechanism how backend handles illegal vectors. So one of the problem prefer-vector-width/min-legal-vector-width is solviong is to avoid the backend takes ABI legal vactors as illegal and split them.
Therefore, if a function has a vector argument, the caller and the callee must have the same values of min-legal-vector-width/prefer-vector-width...?
It's not correct either. Caller and callee can have different values of min-legal-vector-width/prefer-vector-width as long as all the arguments are legal to their min-legal-vector-width. For example, caller can have min-legal-vector-width=512 and callee can have min-legal-vector-width=0 if there's no vector arguments passing. prefer-vector-width should not affect the arguments, because we have set min-legal-vector-width to the correct value. But if we didn't, the different prefer-vector-width might result in ABI mismatch.
I'm tempted to say we should have a separate abi-vector-width; this controls how vector arguments to functions are split, and no other function attributes have an effect. If we have that, we can ensure the ABI is consistent without worrying about prefer-vector-width/min-legal-vector-width/target features.
I don't think we need a abi-vector-width. For one thing, abi-vector-width can be inferred by target. For another, we never split arguments according to ABI.
Is this IR illegal because @bar has a min-legal-vector-width of 256 but has 512 bit argument? The inliner can't create broken code from it.
target triple = "x86_64"
define i32 @foo() #0 {
  %.val = load <32 x half>, <32 x half>* undef, align 4
  call void @bar(<32 x half> %.val)
  call void @baz()
  ret i32 0
}
define internal void @bar(<32 x half> %.0.val) #0 {
  ret void
}
declare void @qux() #1
define internal void @baz() #2 {
  call void @qux()
  ret void
}
attributes #0 = { noinline uwtable "min-legal-vector-width"="256" "target-cpu"="skylake-avx512" }
attributes #1 = { "target-cpu"="skylake-avx512" }
attributes #2 = { uwtable "min-legal-vector-width"="512" "target-cpu"="skylake-avx512" }I guess not.
I guess the hierarchy currently works like the following:
- CPU attributes control what register width codegen for a function can assume is available.
- prefer-vector-width reduces the max width of registers codegen actually uses, for the sake of avoiding vector ops that force the CPU to downclock etc.
- min-legal-vector-width makes codegen ignore prefer-vector-width, so intrinsics etc. work transparently.
I'd take "min-legal-vector-width" as a awful supplementary due to we implemented "prefer-vector-width" in an odd way. I talked with GCC folks, who told me GCC implemented it through providing different cost model. I wonder why didn't we doing the same way at the beginning. I can see it's more concise and won't introduce any ABI problems.
The vectorizer even with AVX2 frequently generates vectors with more than 256 bits and relies on the backend type legalizer to split them. In order to get similar codegen with prefer-vector-width=256 we had to create the same type legalizer behavior without taking away the vectorizer's freedom to generate oversized vectors. The min-legal-vector-width was trying to capture where the vectors came from to distinquish vectorizer code from user code.
Then for calls:
- If a call a function with a very wide vector argument, it's split according to prefer-vector-width/min-legal-vector-width of the caller.
- If a function definition has a very wide vector argument, it's split according to prefer-vector-width/min-legal-vector-width on that definition.
That's not correct. Splitting is not allowed according to ABI. Splitting is just the mechanism how backend handles illegal vectors. So one of the problem prefer-vector-width/min-legal-vector-width is solviong is to avoid the backend takes ABI legal vactors as illegal and split them.
Therefore, if a function has a vector argument, the caller and the callee must have the same values of min-legal-vector-width/prefer-vector-width...?
It's not correct either. Caller and callee can have different values of min-legal-vector-width/prefer-vector-width as long as all the arguments are legal to their min-legal-vector-width. For example, caller can have min-legal-vector-width=512 and callee can have min-legal-vector-width=0 if there's no vector arguments passing. prefer-vector-width should not affect the arguments, because we have set min-legal-vector-width to the correct value. But if we didn't, the different prefer-vector-width might result in ABI mismatch.
I'm tempted to say we should have a separate abi-vector-width; this controls how vector arguments to functions are split, and no other function attributes have an effect. If we have that, we can ensure the ABI is consistent without worrying about prefer-vector-width/min-legal-vector-width/target features.
I don't think we need a abi-vector-width. For one thing, abi-vector-width can be inferred by target. For another, we never split arguments according to ABI.
I'd take "min-legal-vector-width" as a awful supplementary due to we implemented "prefer-vector-width" in an odd way. I talked with GCC folks, who told me GCC implemented it through providing different cost model. I wonder why didn't we doing the same way at the beginning.
I think given the penalty for even a single zmm instruction, the original authors wanted to make sure it was impossible for SelectionDAG to generate zmm registers by accident. So instead of just trying to avoid them in IR, prefer-vector-width actually marks the types illegal in SelectionDAG. Given that, min-legal-vector-width exists to ensure that code that explicitly uses zmm intrinsics actually uses zmm registers.
In theory, the backend could try to deduce min-legal-vector-width, but accurately scanning is a little tricky, and it might lead to weird results like "user uses zmm intrinsic, but the backend decides to split it into multiple ymm instructions".
Not sure I really agree with this design, but that's the motivation.
For another, we never split arguments according to ABI.
To clarify about ABI here: we don't only care about the formal x86 ABI specification here, which only specifies rules for m128/m256/etc. vector types. We also care about doing something sane for languages/frontends that have language-level support for arbitrary vector types. We don't necessarily have to "split" here, but we have to do something, and splitting is the natural default given the way SelectionDAG works.
Is this IR illegal because @bar has a min-legal-vector-width of 256 but has 512 bit argument? The inliner can't create broken code from it.
For one thing, if the IR is illegal, shouldn't we fix the place where it turns illegal, i.e., ArgumentPromotion in this example?
For another, we are inlining baz to foo rather than bar. Do you mean we should scan all other callees of foo before the inlining? It doesn't make sense to me.
I think I wrote "inliner can't create" when I mean to write "inliner can create". I'm fine with fixing the ArgumentPromotion pass if the answer to the question of this IR illegal/unsupported is yes.
Oh, I see your point. It's true inliner breaks the code but it's not inliner's problem. As for IR illegal or not, I think it depends on how we explain "min-legal-vector-width".
The min-legal-vector-width was trying to capture where the vectors came from to distinquish vectorizer code from user code.
I see the initial purpose of adding "min-legal-vector-width", from that point of view, the IR is legal because it's compiler rather than user creates vector arguments.
But the backend implementaion actually takes it as ABI indicator, which determines whether we will use AVX2 calling conversion of AVX512. So I think it is not legal when vector size doesn't match with "min-legal-vector-width".
The vectorizer even with AVX2 frequently generates vectors with more than 256 bits and relies on the backend type legalizer to split them. In order to get similar codegen with prefer-vector-width=256 we had to create the same type legalizer behavior without taking away the vectorizer's freedom to generate oversized vectors.
I think there are 2 different cases here:
- Vectorizer with AVX2 generates 512 bits vector in the function body only;
- Vectorizer generates 512 bits vector in function arguments;
Both will be affected by prefer-vector-width/min-legal-vector-width. But 2) will result in ABI compatibility issue, which is a bug to codegen. ArgPromotion + Inliner is the first one I met with community code, but we have more in our down stream compiler.
If middle end pass cares about ABI compatibility, it should handle ABI for the arguments it modifing. That says, when vectorizer with AVX2 generates 512 bits vector, it should turns it into byval rather than put a 512 bits vector in arguments directly. If it creates 256 bits vector with AVX2 or 512 bits vector with AVX512, it should make sure min-legal-vector-width >= the size of the vector argument.
When a pass doesn't care ABI compatibility, it is free to create arbitrary vector, but it still needs to make sure min-legal-vector-width >= the size of the vector argument. In case it won't mismatch within the module which might be caused by other optimization, like the inliner problem we met here. Large value of min-legal-vector-width is always safe from the perspective of ABI.
Does it make sense to you?
Yeah, it doesn't matter to be vectorizer. Any pass touches function arguments has the problem.
But it's too strict. We are not "force" but "prefer" it :)
In theory, the backend could try to deduce min-legal-vector-width, but accurately scanning is a little tricky, and it might lead to weird results like "user uses zmm intrinsic, but the backend decides to split it into multiple ymm instructions".
Not sure I really agree with this design, but that's the motivation.
Thanks! I see the motivation. I think it's still fine if we don't modify arguments in middle end. But the problem this patch tries to fix as well as some down stream scenarios show the defect of the design and I want to see if we can find a better way.
For another, we never split arguments according to ABI.
To clarify about ABI here: we don't only care about the formal x86 ABI specification here, which only specifies rules for m128/m256/etc. vector types. We also care about doing something sane for languages/frontends that have language-level support for arbitrary vector types. We don't necessarily have to "split" here, but we have to do something, and splitting is the natural default given the way SelectionDAG works.
Yeah, I particularly meant to X86 psABI when I mentioned ABI somewhere. I think it's interesting to distinguish ABI and LLVM implementations. An ABI has its implementation, but not all implementations are specified by ABI documents.
For example, we have arbitrary scalar and vector types in IR, it's true we always be able to lower them. However, some of them are not ABI supported type, e.g, <8 x i1>, i192.
The important difference between ABI implementation and non ABI implementation is we are free to change the latter but have to always keep it the same for the former. We are continually changing the non ABI implementations once we added new supported type into ABI. We have done for 512 bit vector, we are doing for half type, we will going to do for i192 etc.
In a word, we don't have ABI for arbitrary types. We should call it ABI only for types defined on ABI documents.
In a word, we don't have ABI for arbitrary types. We should call it ABI only for types defined on ABI documents.
Not sure I agree with this... but I'll refer to the "ABI for types which aren't specified in the ABI document" as "unstable ABI". Anyway, it's really beside the point. My point was that we need some way ensure that all function/translation units in a given program agree on the "unstable ABI".
Due to the way "min-legal-vector-width" propagates, depending on it to determine the ABI inherently leads to weird results. For example, in general, it isn't legal to propagate "min-legal-vector-width" like this patch is doing: if a function contains any "unstable ABI" calls, or is itself "unstable ABI", it's illegal to increase its "min-legal-vector-width".
Hence my proposal for "abi-vector-width"; if the "unstable ABI" handling is determined by a separate attribute, the caller and callee aren't forced to have the same "min-legal-vector-width".
I don't think "unstable ABI" makes much sense. As far as I understand it, ABI is a protocol that makes functions callable and linkable. ABI can be upgraded, but the first priority is compatibility, both forward and backward. Because there's no ABI information left once we compile it into binary.
"abi-vector-width" doesn't help with that , unless we mangle the function with the ABI, something like foo.avx2, bar.avx512 etc.
So my point is there are only ABI and ABI undefined behaviors, i.e, non ABI implementations. "unstable ABI" is not ABI.
Here are concrete examples: https://godbolt.org/z/zna8f6631
What I want to show from the examples are:
- All the first 3 functions are not linkable by functions comply with AVX512 ABI;
- Both abi256_minlegal256 and abi256_minlegal512 are ABI undefined behaviors, they shouldn't be generated by user's code or any compiler optimizations. We don't need to care about them;
- abi512_minlegal256 is the bug we are fixing here;
- "target-cpu" has the same functionality with "abi-vector-width";
- Large "min-legal-vector-width" is always safe though vector width out of the capability of "target-cpu" is ABI undefined behavior;
Another reason increasing "min-legal-vector-width" is safe is we take no "min-legal-vector-width" attribute as UINT32_MAX: https://github.com/llvm/llvm-project/blob/3db9fd51b5158afae7352f60f5a89fecd0638210/llvm/lib/Target/X86/X86TargetMachine.cpp#L272-L283
So my point is there are only ABI and ABI undefined behaviors, i.e, non ABI implementations. "unstable ABI" is not ABI.
I'm translating this to "The formal ABI has a maximum vector width, and vectors in arguments are restricted to that width. The compiler should report a fatal error if it sees an vector wider than the maximum width." Am I understanding correctly? (I can't come up with any other interpretation of your use of "undefined" here.)
No. My understanding on "undefined" is it is allowed (for the sake of optimization) but not guaranteed (sacrificing ABI compatibility).
In this example https://godbolt.org/z/P59Gjjohv, I take the optimization as ABI undefined behavior, because it creates 512bit vector under AVX2 target. But it mostly doesn't matter because both caller and callee are in the same module, so it is allowed.
How about they are in different modules, or one of them is built into library? I believe we don't have real cases for it, because it's mostly error out by front end and middle end won't do such optimizations. But this is the reason why I don't consider them as ABI.
Anyway, I think our controversy on "unstable ABI" and "ABI undefined" won't affect what's the patch trying to fix, if we can make agreement on abi512_minlegal256 is a bug where arguments should be passed by ZMM register.
If a vector wider than the ABI width is not rejected, the behavior needs to be defined. Whatever the stability story, at least for any given caller and callee, they have to agree. (Undefined behavior doesn't make sense here; it should be either well-defined, or an error.)
Consider the following:
target triple = "x86_64"
define i32 @caller(<32 x i16>* %a) #0 {
  call void @callee1(<32 x i16>* %a)
  call void @callee2(<32 x i16> zeroinitializer)
  ret i32 0
}
define internal void @callee1(<32 x i16>*) #0 {
  %2 = load <32 x i16>, <32 x i16>* %0, align 4
  ret void
}
define void @callee2(<32 x i16>) #0 {
  ret void
}
attributes #0 = { noinline "target-cpu"="skylake-avx512" "min-legal-vector-width"="256" }Without this patch, we run -argpromotion, everything is fine. With this patch, we mark up caller and callee1, but not callee2, so we've broken the call to callee2.
Given that, I think there are only two possibilities here:
- This patch is wrong: we can't increase min-vector-width like this patch is doing.
- The IR is ill-formed in some way, and the verifier should emit an error.
Note that this is complicated by the fact that we support IR where different functions have different target attributes, so a function that doesn't support avx-512 can call one that does, or vice versa. (This is usually used with runtime CPU detection.)
I see the point. So we should make sure no callee2 that with ill-formed min-legal-vector-width at any cases.
Given that, I think there are only two possibilities here:
- This patch is wrong: we can't increase min-vector-width like this patch is doing.
- The IR is ill-formed in some way, and the verifier should emit an error.
I stand for 2 :)
I put other 2 patches. D123125 has the same problem like this one, which I didn't notice without D123124. So the verifier is helpful.
Note that this is complicated by the fact that we support IR where different functions have different target attributes, so a function that doesn't support avx-512 can call one that does, or vice versa. (This is usually used with runtime CPU detection.)
I insist on my point: a vector wider than the ABI width is not guaranteed by ABI. Under such case, assigning min-legal-vector-width to the vector width is safe and better than keep it to 0.