⚙ D98247 [X86][AMX] Prevent transforming load pointer from <256 x i32>* to x86

LuoYuanke created this revision.Mar 9 2021, 3:57 AM

Herald added a subscriber: hiraditya. · View Herald TranscriptMar 9 2021, 3:57 AM

LuoYuanke requested review of this revision.Mar 9 2021, 3:57 AM

Herald added a project: Restricted Project. · View Herald TranscriptMar 9 2021, 3:57 AM

Herald added a subscriber: llvm-commits. · View Herald Transcript

LuoYuanke edited the summary of this revision. (Show Details)Mar 9 2021, 4:01 AM

LuoYuanke added reviewers: craig.topper, pengfei, yubing.

LuoYuanke added a subscriber: annita.zhang.

I'm not really convinced this is moving in the right direction.
other side of the problem should be fixed.

In D98247#2613648, @lebedev.ri wrote:

I'm not really convinced this is moving in the right direction.
other side of the problem should be fixed.

Thanks for review. Prohibiting the transform of pointer cast from <256 x i32>* to x86_amx* looks straight forward. I also try another way in patch https://reviews.llvm.org/D93788, but it seems not as good as prohibit pointer cast. Do you have any suggestions? I would like to try if you have any good idea on it.

In D98247#2613689, @LuoYuanke wrote:

In D98247#2613648, @lebedev.ri wrote:

I'm not really convinced this is moving in the right direction.
other side of the problem should be fixed.

Thanks for review. Prohibiting the transform of pointer cast from <256 x i32>* to x86_amx* looks straight forward.

Hm. Two things: pointer types are going away, and we can't make any optimizations based on the pointer type regardless.

I also try another way in patch https://reviews.llvm.org/D93788, but it seems not as good as prohibit pointer cast. Do you have any suggestions? I would like to try if you have any good idea on it.

This revision now requires changes to proceed.Mar 9 2021, 4:59 AM

Hm. Two things: pointer types are going away, and we can't make any optimizations based on the pointer type regardless.

Would you post any linkage for the information that pointer types are going away? I'd like to investigate how it will impact AMX.
This patch is not for optimization. Since there is no HW instruction corresponding to "load x86_amx, x86_amx*", I'd like keep the load as "load <256 x i32>, <256 x i32>*" form.

In D98247#2613716, @LuoYuanke wrote:

Hm. Two things: pointer types are going away, and we can't make any optimizations based on the pointer type regardless.

Would you post any linkage for the information that pointer types are going away? I'd like to investigate how it will impact AMX.

See e.g. https://lists.llvm.org/pipermail/llvm-dev/2019-December/137684.html

This patch is not for optimization. Since there is no HW instruction corresponding to "load x86_amx, x86_amx*", I'd like keep the load as "load <256 x i32>, <256 x i32>*" form.

Let me reword: we can not use pointer element type to decide whether or not to perform a change.

See e.g. https://lists.llvm.org/pipermail/llvm-dev/2019-December/137684.html

There are already several pointer cast transform in LLVM. Does opaque pointer require change of all pointer cast in LLVM? Nevertheless, it seems opaque pointer is not related with this patch.

This patch is not for optimization. Since there is no HW instruction corresponding to "load x86_amx, x86_amx*", I'd like keep the load as "load <256 x i32>, <256 x i32>*" form.

Let me reword: we can not use pointer element type to decide whether or not to perform a change.

Do you have any proposal for my case? The transform break the lowering of "load x86_amx, x86_amx*". I'm appreciated if you have good ideas.

Harbormaster completed remote builds in B92845: Diff 329286.Mar 9 2021, 11:54 AM

Let me reword: we can not use pointer element type to decide whether or not to perform a change.

It's not precise to say we are using pointer element type. The DestTy is the dest type of a bitcast after phi. See the comments in line 2364.
What we are doing is to block the phi+bitcast optimization when it will change loading to amx* type. I don't think the opaque ptr solution can solve the problem. The amx type load is quite different from other loads. You can think its expensive. We need to reduce it, or at least stop optimizing to it.

pengfei added inline comments.Mar 9 2021, 8:57 PM

llvm/lib/Transforms/InstCombine/InstCombineCasts.cpp
2398	Better to add a comment for the reason.

Address Pengfei's comments.

I do not understand how load x86_amx can cause trouble while load <256 x i32> works.
More generally, if this proceeds, what will happen if i directly feed the current (bad) IR to llc?
It will still have issues, will it not?
If they are interchangeable like that, why the lowering can not be extended to be agnostic of the actual type?

In D98247#2615789, @lebedev.ri wrote:

I do not understand how load x86_amx can cause trouble while load <256 x i32> works.
More generally, if this proceeds, what will happen if i directly feed the current (bad) IR to llc?
It will still have issues, will it not?
If they are interchangeable like that, why the lowering can not be extended to be agnostic of the actual type?

It is a long story, but let me try to explain.

We have dedicate x86_amx type for AMX hardware.
LLVM provide several intrinsics for AMX, and Clang have corresponding builtin that map to those intrinsics.
We don't have a new type in C language for AMX, we use vector type instead.
When Clang front-end create LLVM AMX intrinsics from AMX buildin, it cast the vector type <256 x i32> to x86_amx and feed x86_amx to AMX intrinsics. However there are still load/store <256 x i32> instructions generated in O0 build. So we create a pass to transform load/store <256 x i32> if the input or output of load/store is propagated from AMX intrinsics. We do the transform start from bitcast of <256 x i32> or x86_amx. The bitcast of <256 x i32>* to x86_amx is unexpected. And it cause much work to support it. So we want to prevent generating x86_amx* in our IR.

Harbormaster completed remote builds in B93010: Diff 329531.Mar 10 2021, 9:17 AM

@lebedev.ri, may I commit the patch? I think the risk of this patch is pretty low. We can revise our solution if we have any better ideas in the future.

LGTM anyway :)

Rebase.

I still believe this is a wrong approach and the code
that is having trouble with the 'optimized' should be fixed,
but i'll defer to @spatel.

Harbormaster completed remote builds in B93203: Diff 329823.Mar 11 2021, 4:25 AM

In D98247#2618631, @lebedev.ri wrote:

I still believe this is a wrong approach and the code
that is having trouble with the 'optimized' should be fixed,
but i'll defer to @spatel.

I don't know anything about the AMX type / functionality, so I'm probably not the best judge.
I agree that we want to avoid type-based hacks (but what does it mean that we even have target-specific types in IR?)...
OTOH there is already precedent for AMX exceptions in instcombine (and MMX before that). I think we managed to make some of the MMX hacks less obviously bad by excluding all target-specific types from a given transform. Is that a possibility here? That is, could we limit the transform using isIntOrIntVectorTy() or similar?

Address Sanjay's comments.

Herald added a subscriber: dexonsmith. · View Herald TranscriptMar 11 2021, 5:01 PM

I don't know anything about the AMX type / functionality, so I'm probably not the best judge.
I agree that we want to avoid type-based hacks (but what does it mean that we even have target-specific types in IR?)...
OTOH there is already precedent for AMX exceptions in instcombine (and MMX before that). I think we managed to make some of the MMX hacks less obviously bad by excluding all target-specific types from a given transform. Is that a possibility here? That is, could we limit the transform using isIntOrIntVectorTy() or similar?

Excluding all target-specific types from a given transform looks good to me. If we agree on this, I may refactor some of the code to exclude both MMX and AMX with a common type interface.

In D98247#2620989, @LuoYuanke wrote:

I don't know anything about the AMX type / functionality, so I'm probably not the best judge.
I agree that we want to avoid type-based hacks (but what does it mean that we even have target-specific types in IR?)...
OTOH there is already precedent for AMX exceptions in instcombine (and MMX before that). I think we managed to make some of the MMX hacks less obviously bad by excluding all target-specific types from a given transform. Is that a possibility here? That is, could we limit the transform using isIntOrIntVectorTy() or similar?

Excluding all target-specific types from a given transform looks good to me. If we agree on this, I may refactor some of the code to exclude both MMX and AMX with a common type interface.

I don't think all target-specific types always have the same behavior. So a functionality repartition looks better to me. For example, under this AMX circumstance, the limitation is we are not preferring any pointer type been bitcasted to AMX* since it much more "expensive". So we may need an interface something like

bool isTriviallyBitcastToPointer() const { return isX86_AMXTy(); }

Address Pengfei's comments.

Harbormaster completed remote builds in B93402: Diff 330108.Mar 11 2021, 11:31 PM

Fix bug. We need check bitcast is lossless.

In D98247#2615797, @LuoYuanke wrote:

And it cause much work to support it. So we want to prevent generating x86_amx* in our IR.

How much work?
I honestly still don't understand how handling the LHS version of IR in this diff is harder than RHS.

In D98247#2621399, @lebedev.ri wrote:

In D98247#2615797, @LuoYuanke wrote:

And it cause much work to support it. So we want to prevent generating x86_amx* in our IR.

How much work?
I honestly still don't understand how handling the LHS version of IR in this diff is harder than RHS.

2 aspects. 1) It will increase compiling time. We actually need to handle it in codegen by truning it into load <256 x i32> + bitcast to x86_amx and lower it to a tileload in pass "lower-amx-type". We also need to infer the tile shape by its use. 2) It will increase registers pressure of tile. We only have 8 tile registers. And what makes it worse is that the different shapes cannot be reused after tileconfig.
But if we keep it as non AMX ptr, it's probably lowed to optimizated memcpy. That's why I think it is "cheap".

Harbormaster completed remote builds in B93421: Diff 330147.Mar 12 2021, 2:17 AM

Harbormaster completed remote builds in B93428: Diff 330156.Mar 12 2021, 2:55 AM

@lebedev.ri, after applying Sanjay and Pengfei's suggestion for the patch, current code check if the the bitcast is lossless before transform. I think it conform the previous code infrastructure and looks neat. It is much better that previous patch. May I commit the patch now?

This is a backend bug.

This revision is now accepted and ready to land.Mar 12 2021, 10:07 PM

Closed by commit rG66fbf5fafb16: [X86][AMX] Prevent transforming load pointer from <256 x i32>* to x86_amx*. (authored by LuoYuanke). · Explain WhyMar 13 2021, 5:25 PM

This revision was automatically updated to reflect the committed changes.

LuoYuanke added a commit: rG66fbf5fafb16: [X86][AMX] Prevent transforming load pointer from <256 x i32>* to x86_amx*..

dexonsmith removed a subscriber: dexonsmith.Mar 15 2021, 1:31 PM

This is an archive of the discontinued LLVM Phabricator instance.

[X86][AMX] Prevent transforming load pointer from <256 x i32>* to x86_amx*.
ClosedPublic

Details

Diff Detail

Event Timeline

Revision Contents

Diff 329823

llvm/lib/Transforms/InstCombine/InstCombineCasts.cpp

llvm/test/Transforms/InstCombine/X86/x86-amx.ll