Page MenuHomePhabricator

[AMDGPU] Match udot4 pattern.
ClosedPublic

Authored by FarhanaAleen on Aug 17 2018, 1:47 PM.

Details

Summary

D.u32 = S0.u8[0] * S1.u8[0] +

S0.u8[1] * S1.u8[1] + 
S0.u8[2] * S1.u8[2] + 
S0.u8[3] * S1.u8[3] + S2.u32

Diff Detail

Repository
rL LLVM

Event Timeline

FarhanaAleen created this revision.Aug 17 2018, 1:47 PM

Can you also add tests/support for the negated form? i.e. -S0.u8[1] * S1.u8[1] - S0.u8[2] * S1.u8[2] - S0.u8[3] * S1.u8[3] - S2.u32. I'm not sure how this will canonicalize, but I don't think we do as much as we do with FP negates since we don't have int source modifiers

Can you also add tests/support for the negated form? i.e. -S0.u8[1] * S1.u8[1] - S0.u8[2] * S1.u8[2] - S0.u8[3] * S1.u8[3] - S2.u32. I'm not sure how this will canonicalize, but I don't think we do as much as we do with FP negates since we don't have int source modifiers

I would like to support it in a separate patch.

First of all, once we negate one of the operands, we will need to generate signed dot instruction. This patch only defines/matches unsigned dot4.

Secondly, I am not sure about the profitability. It's not going to be a trivial negate operation since the negation needs to happen on certain number of bits on any location inside a temp instead of on a temp. Roughly, we will need 12 instructions. May be there is a better way of doing it. Even then, to me it seems like it will defeat the whole purpose of having these new dot instructions as opposed to run them as a series of MAD. May be it's better to support it in the hardware by defining a separate dot4 instruction. I will perform some analysis. In any case, I would like to handle this in a separate patch.

Thanks.

Can you also add tests/support for the negated form? i.e. -S0.u8[1] * S1.u8[1] - S0.u8[2] * S1.u8[2] - S0.u8[3] * S1.u8[3] - S2.u32. I'm not sure how this will canonicalize, but I don't think we do as much as we do with FP negates since we don't have int source modifiers

I would like to support it in a separate patch.

First of all, once we negate one of the operands, we will need to generate signed dot instruction. This patch only defines/matches unsigned dot4.

Secondly, I am not sure about the profitability. It's not going to be a trivial negate operation since the negation needs to happen on certain number of bits on any location inside a temp instead of on a temp. Roughly, we will need 12 instructions. May be there is a better way of doing it. Even then, to me it seems like it will defeat the whole purpose of having these new dot instructions as opposed to run them as a series of MAD. May be it's better to support it in the hardware by defining a separate dot4 instruction. I will perform some analysis. In any case, I would like to handle this in a separate patch.

Thanks.

I mean you can factor out the negate to match this, the opposite of for fneg. As a separate patch is fine

arsenm added inline comments.Aug 21 2018, 8:20 AM
lib/Target/AMDGPU/VOP3PInstructions.td
199–202 ↗(On Diff #161317)

It's not immediately clear to me what MADU1 is supposed to mean. U1 = ?

204–209 ↗(On Diff #161317)

You generate all of these intermediate MAD PatFrags just to implement the final one. Can you use the new tablegen fold operator?

FarhanaAleen added inline comments.Aug 24 2018, 3:36 PM
lib/Target/AMDGPU/VOP3PInstructions.td
204–209 ↗(On Diff #161317)

It's not going to be tremendouly less if we were to implement it using fold operator.
This is how it would look like using fold operator.
def : GCNPat <

(!cast<dag>(!foldl(node:$src2, [1, 2, 3, 4], lhs, y,
                   (add_oneuse lhs, (!cast<PatFrag>("MulU_Elt"#y) node:$src0, node:$src1))))),
(V_DOT4_U32_U8 (i32 8), $src0, (i32 8), $src1, (i32 8), $src2, (i1 0))

;

Currently, it does not support a dag pattern. Looks like there is a bug in the fold input parser. I am debugging the root cause. In the meantime, I would like to go ahead with this patch. I will revisit this once the fold operator supports a dag pattern.

Defined the pattern using foldl (I was wrong, foldl does support DAG patterns).

arsenm accepted this revision.Aug 28 2018, 10:42 AM

LGTM

This revision is now accepted and ready to land.Aug 28 2018, 10:42 AM
Closed by commit rL340936: [AMDGPU] Match udot4 pattern. (authored by faaleen, committed by ). · Explain WhyAug 29 2018, 9:34 AM
This revision was automatically updated to reflect the committed changes.