The load/store instructions will be transformed to AMX intrinsics in the AMX type lowering pass. Prohibiting the pointer cast keeps that pass happy.
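For illustration, a minimal sketch of the two IR forms at issue (pointer and value names hypothetical): the lowering pass expects the load to stay in vector form with the cast applied to the loaded value, not moved onto the pointer.

```
; Form the AMX type lowering pass handles: load the vector, then
; bitcast the value to x86_amx.
%v = load <256 x i32>, <256 x i32>* %ptr
%t = bitcast <256 x i32> %v to x86_amx

; Form this patch prevents: the cast moved onto the pointer, leaving
; a load of x86_amx, which has no direct hardware instruction.
%p = bitcast <256 x i32>* %ptr to x86_amx*
%t2 = load x86_amx, x86_amx* %p
```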
Event Timeline
I'm not really convinced this is moving in the right direction; the other side of the problem should be fixed.
Thanks for the review. Prohibiting the transform of a pointer cast from <256 x i32>* to x86_amx* looks straightforward. I also tried another way in patch https://reviews.llvm.org/D93788, but it seems not as good as prohibiting the pointer cast. Do you have any suggestions? I would like to try it if you have any good ideas.
Hm. Two things: pointer types are going away, and we can't make any optimizations based on the pointer type regardless.
> I also tried another way in patch https://reviews.llvm.org/D93788, but it seems not as good as prohibiting the pointer cast. Do you have any suggestions? I would like to try it if you have any good ideas.
> Hm. Two things: pointer types are going away, and we can't make any optimizations based on the pointer type regardless.
Could you post a link to the information that pointer types are going away? I'd like to investigate how it will impact AMX.
This patch is not about optimization. Since there is no HW instruction corresponding to "load x86_amx, x86_amx*", I'd like to keep the load in the "load <256 x i32>, <256 x i32>*" form.
See e.g. https://lists.llvm.org/pipermail/llvm-dev/2019-December/137684.html
> This patch is not about optimization. Since there is no HW instruction corresponding to "load x86_amx, x86_amx*", I'd like to keep the load in the "load <256 x i32>, <256 x i32>*" form.
Let me reword: we cannot use the pointer element type to decide whether or not to perform a change.
> See e.g. https://lists.llvm.org/pipermail/llvm-dev/2019-December/137684.html
There are already several pointer-cast transforms in LLVM. Does opaque pointers require changing all pointer casts in LLVM? In any case, opaque pointers seem unrelated to this patch.
> This patch is not about optimization. Since there is no HW instruction corresponding to "load x86_amx, x86_amx*", I'd like to keep the load in the "load <256 x i32>, <256 x i32>*" form.
> Let me reword: we cannot use the pointer element type to decide whether or not to perform a change.
Do you have any proposal for my case? The transform breaks the lowering of "load x86_amx, x86_amx*". I would appreciate any good ideas.
> Let me reword: we cannot use the pointer element type to decide whether or not to perform a change.
It's not precise to say we are using the pointer element type. The DestTy is the destination type of a bitcast after a phi; see the comment at line 2364.
What we are doing is blocking the phi+bitcast optimization when it would change a load to the x86_amx* type. I don't think the opaque-pointer work solves this problem. An x86_amx load is quite different from other loads; you can think of it as expensive. We need to reduce such loads, or at least stop optimizing toward them.
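A minimal sketch of the phi+bitcast combine being blocked (block labels and value names hypothetical):

```
; Before the combine: both incoming loads stay in <256 x i32> form.
bb1:
  %v1 = load <256 x i32>, <256 x i32>* %p1
  br label %merge
bb2:
  %v2 = load <256 x i32>, <256 x i32>* %p2
  br label %merge
merge:
  %phi = phi <256 x i32> [ %v1, %bb1 ], [ %v2, %bb2 ]
  %t = bitcast <256 x i32> %phi to x86_amx

; After the combine, the bitcast is hoisted through the phi and each
; incoming load becomes a load of x86_amx through a cast pointer, e.g.:
;   %c1 = bitcast <256 x i32>* %p1 to x86_amx*
;   %t1 = load x86_amx, x86_amx* %c1
```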
Inline comment on llvm/lib/Transforms/InstCombine/InstCombineCasts.cpp, line 2398:
Better to add a comment for the reason.
I do not understand how load x86_amx can cause trouble while load <256 x i32> works.
More generally, if this proceeds, what will happen if I directly feed the current (bad) IR to llc?
It will still have issues, will it not?
If they are interchangeable like that, why can the lowering not be extended to be agnostic of the actual type?
It is a long story, but let me try to explain.
- We have a dedicated x86_amx type for the AMX hardware.
- LLVM provides several intrinsics for AMX, and Clang has corresponding builtins that map to those intrinsics.
- We don't have a new type in the C language for AMX; we use a vector type instead.
- When the Clang front end creates LLVM AMX intrinsics from the AMX builtins, it casts the vector type <256 x i32> to x86_amx and feeds the x86_amx value to the AMX intrinsics. However, there are still load/store <256 x i32> instructions generated in O0 builds, so we created a pass that transforms load/store <256 x i32> when the input or output of the load/store is propagated from AMX intrinsics. The transform starts from a bitcast of <256 x i32> or x86_amx. A bitcast of <256 x i32>* to x86_amx* is unexpected and takes a lot of work to support, so we want to prevent generating x86_amx* in our IR. See the sketch after this list.
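A rough sketch of the O0 pattern and how the lower-amx-type pass is meant to rewrite it (value names and the shape/stride operands are hypothetical):

```
; O0 IR as emitted by Clang: a plain vector load, cast at the
; intrinsic boundary.
%v = load <256 x i32>, <256 x i32>* %buf
%t = bitcast <256 x i32> %v to x86_amx

; The lower-amx-type pass rewrites the pair into a shape-aware tile
; load, along the lines of:
;   %t = call x86_amx @llvm.x86.tileloadd64.internal(
;            i16 %row, i16 %col, i8* %ptr, i64 %stride)
```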
@lebedev.ri, may I commit the patch? I think the risk of this patch is pretty low. We can revise our solution if we have any better ideas in the future.
I still believe this is the wrong approach and that the code that is having trouble with the 'optimized' IR should be fixed, but I'll defer to @spatel.
I don't know anything about the AMX type / functionality, so I'm probably not the best judge.
I agree that we want to avoid type-based hacks (but what does it mean that we even have target-specific types in IR?)...
OTOH there is already precedent for AMX exceptions in instcombine (and MMX before that). I think we managed to make some of the MMX hacks less obviously bad by excluding all target-specific types from a given transform. Is that a possibility here? That is, could we limit the transform using isIntOrIntVectorTy() or similar?
> I don't know anything about the AMX type / functionality, so I'm probably not the best judge.
> I agree that we want to avoid type-based hacks (but what does it mean that we even have target-specific types in IR?)...
> OTOH there is already precedent for AMX exceptions in instcombine (and MMX before that). I think we managed to make some of the MMX hacks less obviously bad by excluding all target-specific types from a given transform. Is that a possibility here? That is, could we limit the transform using isIntOrIntVectorTy() or similar?
Excluding all target-specific types from a given transform looks good to me. If we agree on this, I may refactor some of the code to exclude both MMX and AMX with a common type interface.
I don't think all target-specific types always have the same behavior, so partitioning by functionality looks better to me. For example, in this AMX case the limitation is that we don't want any pointer type to be bitcast to an x86_amx pointer, since such loads are much more "expensive". So we may need an interface something like:

```
bool isTriviallyBitcastToPointer() const { return isX86_AMXTy(); }
```
How much work?
I honestly still don't understand how handling the LHS version of the IR in this diff is harder than the RHS.
Two aspects: 1) It will increase compile time. We actually need to handle it in codegen by turning it into load <256 x i32> + bitcast to x86_amx and lowering it to a tileload in the "lower-amx-type" pass. We also need to infer the tile shape from its uses. 2) It will increase tile register pressure. We only have 8 tile registers, and what makes it worse is that tiles with different shapes cannot be reused after tileconfig.
But if we keep it as a non-AMX pointer, it will probably be lowered to an optimized memcpy. That's why I think it is "cheap".
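For example (a sketch, names hypothetical): as long as the value keeps the vector type, a load/store pair is an ordinary memory-to-memory copy, while the x86_amx form pins one of the eight tile registers with a configured shape.

```
; Cheap: a plain copy that codegen can lower like a memcpy.
%v = load <256 x i32>, <256 x i32>* %src
store <256 x i32> %v, <256 x i32>* %dst

; Expensive: must be lowered to tileload/tilestore and occupy a tile
; register with a specific shape.
; %t = load x86_amx, x86_amx* %src.amx
; store x86_amx %t, x86_amx* %dst.amx
```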
@lebedev.ri, after applying Sanjay's and Pengfei's suggestions, the current code checks whether the bitcast is lossless before doing the transform. I think it conforms to the existing code infrastructure and looks neat. It is much better than the previous patch. May I commit the patch now?