This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Match udot8 pattern
ClosedPublic

Authored by FarhanaAleen on Sep 11 2018, 1:39 PM.

Details

Summary

D.u32 = S0.u4[0] * S1.u4[0] +

S0.u4[1] * S1.u4[1] +
S0.u4[2] * S1.u4[2] + 
S0.u4[3] * S1.u4[3] +
S0.u4[4] * S1.u4[4] + 
S0.u4[5] * S1.u4[5] +
S0.u4[6] * S1.u4[6] + 
S0.u4[7] * S1.u4[7] +
S2.u32

Negated form will be supported with idot8.

Diff Detail

Event Timeline

FarhanaAleen created this revision.Sep 11 2018, 1:39 PM

As for the testcases, what about vectorized multiplicaton, i.e.:

%vec1 = load <8 x i4>, ...
vec2 = load <8 x i4>, ...
%ext1 = zext <8 x i4> %vec1 to <8 x i32>
%ext2 = zext <8 x i4> %vec2 to <8 x i32>
%mul = mul nuw nsw <8 x i32> %ext1, %ext2
... then extractelement and add up the result
... or possibly the same thing without the zext

The TableGen itself looks good to me, except for one nitpick (inline).

lib/Target/AMDGPU/VOP3PInstructions.td
206–210

I realize this was already done this way for the 8bit case, but it would be cleaner to use Index 0-7 instead 1-8. 0-based indexing is more natural here, and would avoid the !add(.., -1).

FarhanaAleen added a reviewer: nhaehnle.

Thanks Nicolai.

  • Added testcases for the vector version of the multiplication.
  • Also updated the pattern to start with 0.
arsenm added inline comments.Sep 12 2018, 8:14 PM
test/CodeGen/AMDGPU/idot8.ll
299–308

This is a lot of hardcoded registers for a non-generated test. Either add the checks or use update_llc_test_checks?

FarhanaAleen added inline comments.Sep 13 2018, 12:25 PM
test/CodeGen/AMDGPU/idot8.ll
299–308

Hi Matt,

This is generated by update_llc_test_checks with a little modification. I just used common label for some of the common instructions and removed some of the instructions that are unrelated with this testing in order to have a concise form.

nhaehnle added inline comments.Sep 14 2018, 1:33 AM
test/CodeGen/AMDGPU/idot8.ll
299–308

But this means somebody will have to manually repeat that work when some arbitrary register allocation changes happen. Better to just take the raw output from update_llc_test_checks, since the common label/instructions don't actually seem to save that many lines.

Updated test checks purely generated by update_llc_test_checks.

Thanks, this mostly looks good to me. Looks like this may be running into a serious limitation of the ISel infrastructure with commutativity / associativity, but it makes sense to land this patch without addressing it. I do have one last question.

test/CodeGen/AMDGPU/idot8.ll
415–425

Why isn't this testcase using the v_dot instruction? Could this be fixed relatively easily by extending the MulU#Index#"_4bit" PatFrag with a second pattern?

Thanks, this mostly looks good to me. Looks like this may be running into a serious limitation of the ISel infrastructure with commutativity / associativity, but it makes sense to land this patch without addressing it. I do have one last question.

I have been thinking about different solutions to handle it. One easiest solution would be to put a threshold during permutation. Thanks, yes I would like to go ahead with this patch.

test/CodeGen/AMDGPU/idot8.ll
415–425

Yes, it can be fixed by writing a second pattern but that would be a workaround. The issue here is the additional 'and' instruction which should be removed(it's not generated in the earlier generation such as GFX7). Otherwise it will keep causing problems somewhere else and we might end up keep adding patterns or additional unnecessary/temporary code to work around it all those places. I will look in to it as a separate action.

nhaehnle accepted this revision.Sep 18 2018, 3:55 AM

Okay, thanks for the explanation, that seems fair.

This revision is now accepted and ready to land.Sep 18 2018, 3:55 AM
This revision was automatically updated to reflect the committed changes.