- [Clang] Declare AMDGPU target as supporting BF16 for storage-only purposes on amdgcn
- Add Sema & CodeGen tests cases.
- Also add cases that D138651 would have covered as this patch replaces it.
- [AMDGPU] Add BF16 storage-only support
- Support legalization/dealing with bf16 operations in DAGIsel.
- bf16 as a type remains illegal and is represented as i16 for storage purposes.
Details
- Reviewers
arsenm foad yaxunl - Commits
- rG678d8946ba2b: [AMDGPU] Add bf16 storage support
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
- Only accept bf16 on AMDGCN; r600 doesn't support it (we could but it's not worth the effort I think; I'll look at it if we find out it's needed)
- Remove bf16 types from a few register classes
clang/lib/Basic/Targets/AMDGPU.h | ||
---|---|---|
119 | Don't understand this mangling. What is u6? | |
clang/test/CodeGenCUDA/amdgpu-bf16.cu | ||
31 | should also test a load | |
clang/test/SemaCUDA/amdgpu-bf16.cu | ||
44 | check casts to different int and float types? Is construction of bf16 vectors allowed? | |
llvm/lib/Target/AMDGPU/SIISelLowering.cpp | ||
4796–4808 | The generic legalizer should have handled this? | |
5147–5160 | Ditto | |
llvm/test/CodeGen/AMDGPU/bf16-ops.ll | ||
2–5 | Drop -verify-machineinstrs | |
23 | Use opaque pointers | |
llvm/test/CodeGen/AMDGPU/bf16.ll | ||
433 | Missing v3 test | |
900 | Ret of vectors | |
954 | Don't use anonymous values. Also use opaque pointers | |
957 | Should also test call argument, call return, passed in byval, sret, implicit sret, and passed in argument in overflow stack area |
llvm/lib/Target/AMDGPU/SIISelLowering.cpp | ||
---|---|---|
147 | Do you really need to add this to a register class? The only thing this would be useful for is for the calling convention contexts, which should promote to i16. If you do that you don't need most of the rest of this patch |
clang/lib/Basic/Targets/AMDGPU.h | ||
---|---|---|
119 | Not sure; for that one I just copy-pasted the implementation of other targets. All other targets use that mangling scheme | |
clang/test/SemaCUDA/amdgpu-bf16.cu | ||
44 | Added cast + vec sema test and vec assign codegen test too No conversions are allowed apparently but I don't think it matters for the initial patch; if needed we can always add it later I think | |
llvm/lib/Target/AMDGPU/SIISelLowering.cpp | ||
4796–4808 | It looks like those operations are not implemented in the generic legalizer, e.g. I get Do not know how to promote this operator's operand! |
llvm/lib/Target/AMDGPU/SIISelLowering.cpp | ||
---|---|---|
4796–4808 | Right, this is the code that would go there |
llvm/lib/Target/AMDGPU/SIISelLowering.cpp | ||
---|---|---|
4796–4808 | Do I just copy/paste this code in that PromoteInt function, and keep a copy here too in LowerOperation? (not really a fan of copy-pasting code in different files, I'd rather keep it all here) |
llvm/lib/Target/AMDGPU/SIISelLowering.cpp | ||
---|---|---|
4796–4808 | I'm not following why you need to handle it here |
clang/lib/Basic/Targets/AMDGPU.h | ||
---|---|---|
119 | Ah I remember now, it's just C++ mangling. I don't quite understand the lowercase "u" but a quick search in Clang tells me it's vendor-extended types. | |
llvm/lib/Target/AMDGPU/SIISelLowering.cpp | ||
4796–4808 | IIRC:
|
clang/lib/Basic/Targets/AMDGPU.h | ||
---|---|---|
119 | Do we really need an override for this? I'd expect a reasonable default. Plus I think a virtual function for something that's only a parameterless, static string is a bit ridiculous | |
llvm/lib/Target/AMDGPU/AMDGPUCallingConv.td | ||
49 ↗ | (On Diff #480476) | Without being added to a register class, all the tablegen changes should not do anything |
Comments
clang/lib/Basic/Targets/AMDGPU.h | ||
---|---|---|
119 | Default impl asserts if not implemented. I think it's to make sure targets are all aware of what it takes to support bfloat and they don't end up partially implementing it? /// Return the mangled code of bfloat. virtual const char *getBFloat16Mangling() const { llvm_unreachable("bfloat not implemented on this target"); } I'd say let's stick to the current pattern in this diff; I created D139608 to change it | |
llvm/lib/Target/AMDGPU/AMDGPUCallingConv.td | ||
49 ↗ | (On Diff #480476) | bf16 ones seem to not be needed but if I don't have the v2bf16 ones I get "cannot allocate arguments" in "test_arg_store_v2bf16" |
llvm/lib/Target/AMDGPU/AMDGPUCallingConv.td | ||
---|---|---|
49 ↗ | (On Diff #480476) | Sounds like a type legality logic bug which I don't want to spend time fighting |
llvm/lib/Target/AMDGPU/SIISelLowering.cpp | ||
911 | If you wanted the promote to i32, you could have done it here instead of in the tablegen cc handling | |
4796–4808 | But why are they custom? We don't have to handle FP16_TO_FP or FP_TO_FP16 there, and they aren't custom lowered. They have the same basic properties. We have this: setOperationAction(ISD::FP16_TO_FP, MVT::i16, Promote); AddPromotedToType(ISD::FP16_TO_FP, MVT::i16, MVT::i32); setOperationAction(ISD::FP_TO_FP16, MVT::i16, Promote); AddPromotedToType(ISD::FP_TO_FP16, MVT::i16, MVT::i32); I'd expect the same basic pattern | |
5552 | Op.getValueType() | |
5557 | Should be specific cast, not FPExtOrRound. I don't think the FP_ROUND case would be correct | |
5563 | Should use Op.getValueType() instead of Op->getValueType(0) | |
5567–5570 | ExpandNode covers lowering BF16_TO_FP. It also has a shift by 16-bits into the high bits. Is this correct? |
Address some comments; I wasn't able to get rid of the custom legalization/CC tablegen.
llvm/lib/Target/AMDGPU/SIISelLowering.cpp | ||
---|---|---|
911 | Do you mean somewhere else in that function? Changing v2bf16 to i32 here doesn't fix it | |
4796–4808 | PromoteIntegerOperand, PromoteFloatOperand and PromoteIntegerResult don't handle FP_TO_BF16 and BF16_TO_FP, and unless we put a Custom lowering mode it'll assert/unreachable. | |
5557 | But we need to do f32 -> f16, isn't FP_ROUND used for that? I thought it's what we needed | |
5567–5570 | Ah I didn't know that, though as long as we use custom lowering, and our FP_TO_BF16/BF16_TO_FP methods are consistent, it should be fine, no? |
llvm/lib/Target/AMDGPU/SIISelLowering.cpp | ||
---|---|---|
911 | Yes, that should force the bitcast of the argument type | |
4796–4808 | What about Expand? that's where the implemented part is | |
5567–5570 | bfloat16 has the same number of exponent bits in the same high bits as f32; I kind of think the idea is you can just do a bitshift and then operate on f32? I think the fp_extend here is wrong |
llvm/lib/Target/AMDGPU/SIISelLowering.cpp | ||
---|---|---|
5567–5570 | The default legalization also looks wrong to me. I don't understand why it isn't shifting down the mantissa bit |
llvm/lib/Target/AMDGPU/SIISelLowering.cpp | ||
---|---|---|
4796–4808 | Last I tried, Expand will emit a libcall in many cases that we don't handle | |
5567–5570 | Indeed it was terribly wrong. I rewrote both legalizations following what I found online: https://en.wikipedia.org/wiki/Bfloat16_floating-point_format bf16 is designed to be very easily convertible from/to f32, save for some edge cases with denormalized numbers I think, thus:
|
llvm/lib/Target/AMDGPU/SIISelLowering.cpp | ||
---|---|---|
4796–4808 | Library call is supposed to be a distinct action now, the DAG only did about 5% of the work to migrate to using it. This code can go to the default expand action | |
5557 | This is just this | |
5571 | can just hardcode i32 | |
5583 | This is just this | |
5589 | This can be any_extend and the combiner will probably turn it into one | |
5591 | Can just hardcode i32 | |
llvm/test/CodeGen/AMDGPU/bf16.ll | ||
12 | Doesn't cover the different f32/f64 conversions? | |
2954 | Use poison instead of undef in tests |
llvm/lib/Target/AMDGPU/SIISelLowering.cpp | ||
---|---|---|
4796–4808 | Does it need to happen in this commit? It'll delay the review quite a bit I think if other people have to review it I feel like it might be better suited for a follow-up patch; I can create a task and pick it up when I come back from vacation if you want |
Don't understand this mangling. What is u6?