This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Add bf16 storage support
ClosedPublic

Authored by Pierre-vh on Dec 6 2022, 12:24 AM.

Details

Summary
  • [Clang] Declare AMDGPU target as supporting BF16 for storage-only purposes on amdgcn
    • Add Sema & CodeGen tests cases.
    • Also add cases that D138651 would have covered as this patch replaces it.
  • [AMDGPU] Add BF16 storage-only support
    • Support legalization/dealing with bf16 operations in DAGIsel.
    • bf16 as a type remains illegal and is represented as i16 for storage purposes.

Diff Detail

Event Timeline

Pierre-vh created this revision.Dec 6 2022, 12:24 AM
Herald added a project: Restricted Project. · View Herald TranscriptDec 6 2022, 12:24 AM
Pierre-vh requested review of this revision.Dec 6 2022, 12:24 AM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptDec 6 2022, 12:24 AM
Pierre-vh updated this revision to Diff 480431.Dec 6 2022, 4:34 AM
  • Only accept bf16 on AMDGCN; r600 doesn't support it (we could but it's not worth the effort I think; I'll look at it if we find out it's needed)
  • Remove bf16 types from a few register classes
arsenm requested changes to this revision.Dec 6 2022, 4:57 AM
arsenm added inline comments.
clang/lib/Basic/Targets/AMDGPU.h
119

Don't understand this mangling. What is u6?

clang/test/CodeGenCUDA/amdgpu-bf16.cu
31

should also test a load

clang/test/SemaCUDA/amdgpu-bf16.cu
44

check casts to different int and float types? Is construction of bf16 vectors allowed?

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
4796–4808

The generic legalizer should have handled this?

5147–5160

Ditto

llvm/test/CodeGen/AMDGPU/bf16-ops.ll
2–5

Drop -verify-machineinstrs

23

Use opaque pointers

llvm/test/CodeGen/AMDGPU/bf16.ll
433

Missing v3 test

900

Ret of vectors

954

Don't use anonymous values. Also use opaque pointers

957

Should also test call argument, call return, passed in byval, sret, implicit sret, and passed in argument in overflow stack area

This revision now requires changes to proceed.Dec 6 2022, 4:57 AM
arsenm added inline comments.Dec 6 2022, 5:00 AM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
147

Do you really need to add this to a register class? The only thing this would be useful for is for the calling convention contexts, which should promote to i16. If you do that you don't need most of the rest of this patch

Pierre-vh updated this revision to Diff 480476.Dec 6 2022, 7:03 AM
Pierre-vh marked 12 inline comments as done.

Address comments: diff is lighter and added a lot more tests

Pierre-vh edited the summary of this revision. (Show Details)Dec 6 2022, 7:03 AM
Pierre-vh added inline comments.
clang/lib/Basic/Targets/AMDGPU.h
119

Not sure; for that one I just copy-pasted the implementation of other targets. All other targets use that mangling scheme

clang/test/SemaCUDA/amdgpu-bf16.cu
44

Added cast + vec sema test and vec assign codegen test too

No conversions are allowed apparently but I don't think it matters for the initial patch; if needed we can always add it later I think

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
4796–4808

It looks like those operations are not implemented in the generic legalizer, e.g. I get

Do not know how to promote this operator's operand!
arsenm added inline comments.Dec 6 2022, 7:28 AM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
4796–4808

Right, this is the code that would go there

Pierre-vh added inline comments.Dec 6 2022, 7:33 AM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
4796–4808

Do I just copy/paste this code in that PromoteInt function, and keep a copy here too in LowerOperation? (not really a fan of copy-pasting code in different files, I'd rather keep it all here)
We need to have the lowering too AFAIK, it didn't go well when I tried to remove it

arsenm added inline comments.Dec 6 2022, 8:15 AM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
4796–4808

I'm not following why you need to handle it here

Pierre-vh added inline comments.Dec 7 2022, 12:07 AM
clang/lib/Basic/Targets/AMDGPU.h
119

Ah I remember now, it's just C++ mangling. I don't quite understand the lowercase "u" but a quick search in Clang tells me it's vendor-extended types.
So it's just u6 -> vendor extended type, 6 characters following + __bf16 (name of the type).

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
4796–4808

IIRC:

  • I need to handle FP_TO_BF16 in ReplaceNodeResult because that's what the Integer Legalizer calls (through CustomLowerNode)
  • I need to handle both opcodes in LowerOperation because otherwise they'll fail selection. They can be left over from expanding/legalizing other operations.
arsenm added inline comments.Dec 7 2022, 8:19 AM
clang/lib/Basic/Targets/AMDGPU.h
119

Do we really need an override for this? I'd expect a reasonable default. Plus I think a virtual function for something that's only a parameterless, static string is a bit ridiculous

llvm/lib/Target/AMDGPU/AMDGPUCallingConv.td
49 ↗(On Diff #480476)

Without being added to a register class, all the tablegen changes should not do anything

Pierre-vh updated this revision to Diff 481169.Dec 8 2022, 12:27 AM
Pierre-vh marked 2 inline comments as done.

Comments

clang/lib/Basic/Targets/AMDGPU.h
119

Default impl asserts if not implemented. I think it's to make sure targets are all aware of what it takes to support bfloat and they don't end up partially implementing it?

/// Return the mangled code of bfloat.
virtual const char *getBFloat16Mangling() const {
  llvm_unreachable("bfloat not implemented on this target");
}

I'd say let's stick to the current pattern in this diff; I created D139608 to change it

llvm/lib/Target/AMDGPU/AMDGPUCallingConv.td
49 ↗(On Diff #480476)

bf16 ones seem to not be needed but if I don't have the v2bf16 ones I get "cannot allocate arguments" in "test_arg_store_v2bf16"

arsenm added inline comments.Dec 8 2022, 7:43 AM
llvm/lib/Target/AMDGPU/AMDGPUCallingConv.td
49 ↗(On Diff #480476)

Sounds like a type legality logic bug which I don't want to spend time fighting

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
911

If you wanted the promote to i32, you could have done it here instead of in the tablegen cc handling

4796–4808

But why are they custom? We don't have to handle FP16_TO_FP or FP_TO_FP16 there, and they aren't custom lowered. They have the same basic properties. We have this:

setOperationAction(ISD::FP16_TO_FP, MVT::i16, Promote);
AddPromotedToType(ISD::FP16_TO_FP, MVT::i16, MVT::i32);
setOperationAction(ISD::FP_TO_FP16, MVT::i16, Promote);
AddPromotedToType(ISD::FP_TO_FP16, MVT::i16, MVT::i32);

I'd expect the same basic pattern

5552

Op.getValueType()

5557

Should be specific cast, not FPExtOrRound. I don't think the FP_ROUND case would be correct

5563

Should use Op.getValueType() instead of Op->getValueType(0)

5567–5570

ExpandNode covers lowering BF16_TO_FP. It also has a shift by 16-bits into the high bits. Is this correct?

Pierre-vh updated this revision to Diff 481557.Dec 9 2022, 1:31 AM
Pierre-vh marked 9 inline comments as done.

Address some comments; I wasn't able to get rid of the custom legalization/CC tablegen.

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
911

Do you mean somewhere else in that function? Changing v2bf16 to i32 here doesn't fix it
I also tried changing the function above but I kept running into asserts so I just left the TableGen CC for now

4796–4808

PromoteIntegerOperand, PromoteFloatOperand and PromoteIntegerResult don't handle FP_TO_BF16 and BF16_TO_FP, and unless we put a Custom lowering mode it'll assert/unreachable.
I tried to make it work (for a while) using the default expand but I can't quite get it to work. It feels like there is some legalizer work missing for handling BF16 like we want to.
Even though it's not ideal I think the custom lowering is easiest

5557

But we need to do f32 -> f16, isn't FP_ROUND used for that? I thought it's what we needed

5567–5570

Ah I didn't know that, though as long as we use custom lowering, and our FP_TO_BF16/BF16_TO_FP methods are consistent, it should be fine, no?

arsenm added inline comments.Dec 9 2022, 9:23 AM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
911

Yes, that should force the bitcast of the argument type

4796–4808

What about Expand? that's where the implemented part is

5567–5570

bfloat16 has the same number of exponent bits in the same high bits as f32; I kind of think the idea is you can just do a bitshift and then operate on f32? I think the fp_extend here is wrong

arsenm added inline comments.Dec 9 2022, 9:39 AM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
5567–5570

The default legalization also looks wrong to me. I don't understand why it isn't shifting down the mantissa bit

Pierre-vh marked 3 inline comments as done.

Fix f32/bf16 conversions, remove CC tablegen

Pierre-vh edited the summary of this revision. (Show Details)Dec 12 2022, 12:51 AM
Pierre-vh added inline comments.
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
4796–4808

Last I tried, Expand will emit a libcall in many cases that we don't handle

5567–5570

Indeed it was terribly wrong. I rewrote both legalizations following what I found online: https://en.wikipedia.org/wiki/Bfloat16_floating-point_format

bf16 is designed to be very easily convertible from/to f32, save for some edge cases with denormalized numbers I think, thus:

  • bf16 -> f32 is just left-shift by 16, filling the least-significant bits with zeroes.
  • f32 -> bf16 is just cutting off the 16 least-significant bits.
ye-luo added a subscriber: ye-luo.Dec 12 2022, 2:10 PM
arsenm added inline comments.Dec 12 2022, 2:51 PM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
4796–4808

Library call is supposed to be a distinct action now, the DAG only did about 5% of the work to migrate to using it. This code can go to the default expand action

5557

This is just this

5571

can just hardcode i32

5583

This is just this

5589

This can be any_extend and the combiner will probably turn it into one

5591

Can just hardcode i32

llvm/test/CodeGen/AMDGPU/bf16.ll
12

Doesn't cover the different f32/f64 conversions?

2954

Use poison instead of undef in tests

Pierre-vh marked 7 inline comments as done.

Comments + add (f32/f64) from/to bf16 conversion tests

Pierre-vh added inline comments.Dec 13 2022, 12:36 AM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
4796–4808

Does it need to happen in this commit? It'll delay the review quite a bit I think if other people have to review it
If it needs to happen, when what do I need to do? Use the Expand action & fix the legalizer in places where it needs to be fixed?

I feel like it might be better suited for a follow-up patch; I can create a task and pick it up when I come back from vacation if you want

Pierre-vh updated this revision to Diff 482459.Dec 13 2022, 6:59 AM

Move to common DAG legalization code

arsenm accepted this revision.Dec 13 2022, 7:07 AM

LGTM. with nits The lower could handle the vector case easily, but it didn't before either

llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp
2915

Braces

2943

Braces

2947

Don't see the point of this assert, should work fine for vectors

This revision is now accepted and ready to land.Dec 13 2022, 7:07 AM
Pierre-vh updated this revision to Diff 482476.Dec 13 2022, 7:34 AM
Pierre-vh marked 3 inline comments as done.

Comments

This revision was landed with ongoing or failed builds.Dec 13 2022, 7:34 AM
This revision was automatically updated to reflect the committed changes.