This is an archive of the discontinued LLVM Phabricator instance.

[Polly] Use the size of the widest type of the matrix multiplication operands
ClosedPublic

Authored by gareevroman on Jan 29 2017, 11:26 PM.

Details

Summary

The size of the operand type is one of the parameters required to determine the BLIS micro-kernel. If the matrix multiplication operands have different types, we use the size of the widest one.
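The following is a minimal C++ sketch of the idea, not Polly's actual code. The `OperandInfo` and `MicroKernel` types and both functions are hypothetical. The micro-kernel shape follows our reading of the BLIS analytical model: Mr * Nr >= Nvec * L_vfma * N_vfma, with Nr a multiple of Nvec. The register width and the FMA latency and throughput are assumed inputs that would have to come from the target.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical description of one matrix multiplication operand (A, B or C).
struct OperandInfo {
  const char *Name;
  uint64_t SizeInBits; // element size, e.g. 32 for float, 64 for double
};

// Size in bits of the widest element type among the operands.
uint64_t widestTypeSizeInBits(const std::vector<OperandInfo> &Ops) {
  uint64_t Widest = 0;
  for (const OperandInfo &Op : Ops)
    Widest = std::max(Widest, Op.SizeInBits);
  return Widest;
}

struct MicroKernel {
  uint64_t Nvec; // elements of the widest type per vector register
  uint64_t Mr;
  uint64_t Nr;
};

// Sketch (assumed model): pick Mr x Nr with Nr a multiple of Nvec such that
// Mr * Nr >= Nvec * LatencyFma * ThroughputFma.
MicroKernel microKernelParams(uint64_t RegisterBits, uint64_t ElementBits,
                              uint64_t LatencyFma, uint64_t ThroughputFma) {
  assert(ElementBits > 0 && "element size must be known");
  assert(LatencyFma > 0 && ThroughputFma > 0);
  uint64_t Nvec = RegisterBits / ElementBits;
  if (Nvec == 0)
    Nvec = 1; // element wider than a register: treat it as one element
  uint64_t Product = Nvec * LatencyFma * ThroughputFma;
  double Root = std::sqrt(static_cast<double>(Product));
  // Round sqrt(Product) up to a multiple of Nvec.
  uint64_t Nr = static_cast<uint64_t>(std::ceil(Root / Nvec)) * Nvec;
  uint64_t Mr = (Product + Nr - 1) / Nr; // integer ceil(Product / Nr)
  return {Nvec, Mr, Nr};
}
```

For example, with a float A and a double B and C, the widest size is 64 bits. Assuming 256-bit registers, an FMA latency of 5 and 2 FMAs per cycle, `microKernelParams(256, 64, 5, 2)` gives Nvec = 4, Nr = 8 and Mr = 5.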

Event Timeline

Meinersbur accepted this revision. Jan 31 2017, 6:07 AM

Undoubtedly this change makes sense.

lib/Transform/ScheduleOptimizer.cpp
907–908

Is there a guarantee/check somewhere that A and B are both primitive types? I wonder about the case where, e.g., A is a float but B is a larger struct, so that B->getPrimitiveSizeInBits() would return zero. getMatMulTypeSize would still return a nonzero value, but there would not be enough space in a register for the elements of B.
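To make the concern concrete, here is a hypothetical C++ model, not LLVM's `Type` API. In it, a non-primitive operand such as a struct reports a primitive size of 0, as getPrimitiveSizeInBits() does. A plain maximum silently ignores that operand. The guarded variant is one possible way to address the question: it reports that no usable size exists, so the caller can skip the transformation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical stand-in for an operand's element type.
struct TypeInfo {
  bool IsPrimitive;
  uint64_t PrimitiveSizeInBits; // 0 for non-primitive types such as structs
};

// Naive maximum: a struct operand contributes 0 and is silently ignored.
uint64_t naiveWidestSize(const std::vector<TypeInfo> &Types) {
  uint64_t Widest = 0;
  for (const TypeInfo &T : Types)
    Widest = std::max(Widest, T.PrimitiveSizeInBits);
  return Widest;
}

// Guarded maximum: no result if any operand lacks a primitive size.
std::optional<uint64_t> guardedWidestSize(const std::vector<TypeInfo> &Types) {
  assert(!Types.empty() && "expected at least one operand");
  uint64_t Widest = 0;
  for (const TypeInfo &T : Types) {
    if (!T.IsPrimitive || T.PrimitiveSizeInBits == 0)
      return std::nullopt; // caller should not apply the optimization
    Widest = std::max(Widest, T.PrimitiveSizeInBits);
  }
  return Widest;
}
```

For a float A and a struct B, `naiveWidestSize` returns 32, while `guardedWidestSize` returns `std::nullopt`.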

test/ScheduleOptimizer/pattern-matching-based-opts_6.ll
19–102

I don't see how the elements-per-register result manifests in this schedule. Is it the number of Stmt_for_body6 instances in the innermost loop for register tiling? Could you add a small comment explaining what these tests are supposed to check?

This revision is now accepted and ready to land. Jan 31 2017, 6:07 AM

Hi Michael,

thanks for the comments! I've tried to address them and also to fix a hard-coded type size that I had missed.

Is there a guarantee/check somewhere that A and B are both primitive types? I wonder about the case where, e.g., A is a float but B is a larger struct, so that B->getPrimitiveSizeInBits() would return zero. getMatMulTypeSize would still return a nonzero value, but there would not be enough space in a register for the elements of B.

Right. Could we use getTypeAllocSize() to get the sizes?

Right. Could we use getTypeAllocSize() to get the sizes?

getTypeAllocSize() is wrong. E.g., for an 8-bit char it would return 64 on most 64-bit platforms (its alignment), but SSE, for example, can put 16 of them into a 128-bit xmm register. I suggest using getTypeSizeInBits().

Michael

getTypeAllocSize() is wrong. E.g., for an 8-bit char it would return 64 on most 64-bit platforms (its alignment), but SSE, for example, can put 16 of them into a 128-bit xmm register.

I've tried to reproduce it on x86-64 (the test case can be found in https://reviews.llvm.org/D29814). However, getTypeAllocSize() returns 1 for an 8-bit char. Could you please advise me on how to reproduce it?

I suggest using getTypeSizeInBits().

Right. We should probably use it to compute the number of elements that can be held by a vector register.

However, when mapping elements to the L1 cache ([1], p. 11), we should probably use getTypeAllocSize(), since we rely on the placement of consecutive elements in memory.

This approach is implemented in the new version of the patch.

Refs.:

[1] - http://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf
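A minimal C++ sketch of this split follows. The function names and the cache and micro-kernel parameters are hypothetical; this is not Polly's code. Register capacity is computed from the size in bits. The L1 blocking parameter Kc uses the alloc size, because packed data is stored contiguously with alloc-size strides. The Kc formulas are our reading of the model in [1] and should be checked against it.

```cpp
#include <cassert>
#include <cstdint>

// Elements of one type that fit into a vector register. Uses the size in
// bits, i.e. the quantity getTypeSizeInBits() would provide.
uint64_t elementsPerVectorRegister(uint64_t RegisterBits, uint64_t SizeInBits) {
  assert(SizeInBits > 0);
  return RegisterBits / SizeInBits;
}

// Hypothetical description of the first-level data cache.
struct L1Cache {
  uint64_t SizeBytes;
  uint64_t Associativity; // number of ways
};

// Kc following our reading of the L1 model in [1]:
//   C_Ar = floor((W_L1 - 1) / (1 + Nr / Mr))
//   Kc   = C_Ar * (S_L1 / W_L1) / (Mr * S_data)
// S_data is the distance between consecutive elements in memory, i.e. the
// quantity getTypeAllocSize() would provide.
uint64_t computeKc(const L1Cache &L1, uint64_t Mr, uint64_t Nr,
                   uint64_t AllocSizeBytes) {
  assert(L1.Associativity > 1 && Mr > 0 && AllocSizeBytes > 0);
  // (W - 1) / (1 + Nr / Mr) == (W - 1) * Mr / (Mr + Nr); integer division floors.
  uint64_t Car = (L1.Associativity - 1) * Mr / (Mr + Nr);
  uint64_t BytesPerWay = L1.SizeBytes / L1.Associativity;
  return Car * BytesPerWay / (Mr * AllocSizeBytes);
}
```

For example, a 128-bit xmm register holds 16 8-bit chars. Assume a 32 KiB, 8-way L1 cache with Mr = 4 and Nr = 8, which gives C_Ar = 2. Then 8-byte doubles give Kc = 256. A 16-byte alloc size (as for `x86_fp80` on x86-64) gives Kc = 128. Using its 10-byte store size instead would give 204 and overestimate how many elements fit.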

getTypeAllocSize() is wrong. E.g., for an 8-bit char it would return 64 on most 64-bit platforms (its alignment), but SSE, for example, can put 16 of them into a 128-bit xmm register.

I've tried to reproduce it on x86-64 (the test case can be found in https://reviews.llvm.org/D29814). However, getTypeAllocSize() returns 1 for an 8-bit char. Could you please advise me on how to reproduce it?

It is a bit more complicated than I imagined. What this takes into account is the "ABIAlignment", which depends on the platform. What we are looking for is a type that occupies more space per element in an array than sizeof() returns. This is possible with a struct such as struct { int i; char c; }.

An example without a struct is X86's long double type, with a TypeSize of 80 bits and an AllocSize of 128 bits.

I suggest using getTypeSizeInBits().

Right. We should probably use it to compute the number of elements that can be held by a vector register.

However, when mapping elements to the L1 cache ([1], p. 11), we should probably use getTypeAllocSize(), since we rely on the placement of consecutive elements in memory.

Agreed.
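The difference between TypeSize and AllocSize described in this exchange can be modeled in a few lines. This is a hypothetical sketch, not LLVM's `DataLayout`. It assumes LLVM's documented rules: the store size is the size in bits rounded up to whole bytes, and the alloc size is the store size rounded up to the ABI alignment. The alignment values used below are assumptions: 1 byte for an 8-bit char and 16 bytes for `x86_fp80`, as on x86-64.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical description of a scalar type under some target data layout.
struct ScalarLayout {
  uint64_t SizeInBits;        // e.g. 8 for i8, 80 for x86_fp80
  uint64_t ABIAlignmentBytes; // assumed ABI alignment, a power of two
};

// Bytes written by a store: the size in bits rounded up to whole bytes.
uint64_t storeSizeBytes(const ScalarLayout &T) {
  return (T.SizeInBits + 7) / 8;
}

// Offset between consecutive array elements: the store size rounded up
// to the ABI alignment.
uint64_t allocSizeBytes(const ScalarLayout &T) {
  assert(T.ABIAlignmentBytes > 0);
  uint64_t Align = T.ABIAlignmentBytes;
  return (storeSizeBytes(T) + Align - 1) / Align * Align;
}
```

With `{8, 1}` for an 8-bit char, both sizes are 1 byte, matching the observation that getTypeAllocSize() returns 1. With `{80, 16}` for X86's long double, the store size is 10 bytes and the alloc size is 16 bytes, i.e. 128 bits.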

It is a bit more complicated than I imagined. What this takes into account is the "ABIAlignment", which depends on the platform. What we are looking for is a type that occupies more space per element in an array than sizeof() returns. This is possible with a struct such as struct { int i; char c; }.

An example without a struct is X86's long double type, with a TypeSize of 80 bits and an AllocSize of 128 bits.

OK. Thanks.

This revision was automatically updated to reflect the committed changes.