This is an archive of the discontinued LLVM Phabricator instance.

[Matrix] Factor and distribute transposes across multiplies
ClosedPublic

Authored by anemet on May 18 2021, 4:22 PM.

Details

Summary

Now that we can fold some transposes into multiplies (column-major: A * B^t and
row-major: A^t * B), we want to move them around to create the optimal expressions:

  • fold away double transposes while still using them to assert the shape
  • sink transposes hoping they cancel out
  • lift transposes when both operands are transposed

This also modifies the matrix remarks to include the number of exposed
transposes (i.e. transposes that we couldn't fold into a multiply).

The adjustment to the test remarks-inlining is a bit subtle: I am changing the
double transpose to a single transpose so that we don't remove it completely.
More importantly, this changes some of the total instruction counts, most
notably stores, because we can no longer use a vector store.

Diff Detail

Event Timeline

anemet created this revision.May 18 2021, 4:22 PM
anemet requested review of this revision.May 18 2021, 4:22 PM
Herald added a project: Restricted Project.May 18 2021, 4:22 PM
anemet edited the summary of this revision.May 18 2021, 4:23 PM
fhahn added inline comments.May 20 2021, 6:28 AM
llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
683

For the later transforms, we collect a worklist once which contains all matrix instructions. Could we use the same here to avoid having to iterate over each function again?

752

If we have a TT matmul, lift the transpose until we have a non-TT situation.

Is this comment accurate? IIUC we only convert TT multiplies to versions where we can fold one transpose into the multiply?

llvm/test/Transforms/LowerMatrixIntrinsics/transpose-and-multiply-fold.ll
11 (On Diff #346295)

I think it would also be good to have tests that check the generated IR, together with some combinations with non-square matrixes.

fhahn added inline comments.May 20 2021, 6:32 AM
llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
692

It should be enough to capture &II

anemet marked 3 inline comments as done.May 21 2021, 8:29 AM
anemet added inline comments.
llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
683

Unless we really think this is a performance issue, I'd like to avoid the extra bookkeeping and just represent everything in the IR, with no on-the-side data structure that needs updating. As I was saying offline, I think we already have too much bookkeeping going on (e.g. for the remarks), so it's hard to know what to update at times. Having a backward and a forward matrix algebraic simplification pass (which is what optimizeTransposes is) that is logically separated from the lowering pass makes good sense in terms of "separation of concerns". What do you think?

752

Rephrased the comment.

anemet marked an inline comment as done.May 21 2021, 8:59 AM
anemet added inline comments.
llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
683

(The only state that is live across optimizeTransposes is the shape info, so that we gather as much shape info as possible before removing shape-carrying operations like a double transpose. I should probably add a comment about this.)

fhahn added a comment.May 21 2021, 9:12 AM

We also have a stripped-down version of the pass that runs in the backend pipelines (LowerMatrixIntrinsicsMinimalLegacyPass). We should probably not perform the optimizations there.

llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
683

Having a backward and a forward matrix algebraic simplification pass (which is what optimizeTransposes is) that is logically separated from the lowering pass makes good sense in terms of "separation of concerns". What do you think?

I think the worklist for matrix instructions is slightly different, as it would not directly impact any code related to transformations, just the way we visit them. It's not needed right now, but I think we will need it sooner or later as we add more simplifications that may depend on each other.

Compile-time wise it should be fine, given that it is only run when the pass is explicitly enabled. But we should keep an eye on it, to avoid increasing compile time for our adopters too much.

fhahn added inline comments.May 21 2021, 9:22 AM
llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
683

While looking at this, I realized that we can do an easy early exit if there are no matrix intrinsics at all D102931. That should alleviate the impact on functions without matrix code :)

anemet updated this revision to Diff 347587.May 24 2021, 11:21 PM

Address Florian's comments.

fhahn accepted this revision.May 25 2021, 8:35 AM

LGTM, thanks!

llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
207

nit: /// for doc-comment as the other members?

This revision is now accepted and ready to land.May 25 2021, 8:35 AM
This revision was landed with ongoing or failed builds.May 25 2021, 11:16 AM
This revision was automatically updated to reflect the committed changes.
int3 added a subscriber: int3.May 25 2021, 12:20 PM
int3 added inline comments.
llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
820

I am getting a link-time failure when doing a local ninja lld:

Undefined symbols for architecture x86_64:
  "llvm::Value::dump() const", referenced from:
      (anonymous namespace)::LowerMatrixIntrinsics::Visit() in libLLVMScalarOpts.a(LowerMatrixIntrinsics.cpp.o)
ld: symbol(s) not found for architecture x86_64

Not sure if/how it passes the buildbots, but it does indeed look like Value.cpp doesn't implement dump().

820

(and commenting out this line does indeed fix my local build)

int3 added inline comments.May 25 2021, 12:23 PM
llvm/lib/Transforms/Scalar/LowerMatrixIntrinsics.cpp
820

nevermind, I see a fix has just been pushed :)