This is an archive of the discontinued LLVM Phabricator instance.

[mlir][tosa] Improve performance of tosa.transpose constant folding
ClosedPublic

Authored by sabauma on Mar 21 2023, 6:57 AM.

Details

Summary

Folding of the tosa.transpose operation is both time and memory
intensive as the underlying ElementsAttr is processed as a sequence of
Attributes. This change attempts operate on the underlying raw data of
the ElementsAttr.

In an example resnet50 network, this change reduces the time spent in
folding transpose ops from 35s to 1.5s.

Diff Detail

Event Timeline

sabauma created this revision.Mar 21 2023, 6:57 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 21 2023, 6:57 AM
sabauma edited the summary of this revision. (Show Details)Mar 21 2023, 7:06 AM
sabauma updated this revision to Diff 507030.Mar 21 2023, 10:05 AM
sabauma edited the summary of this revision. (Show Details)

Add testcase and remove impossible case

sabauma updated this revision to Diff 507074.Mar 21 2023, 11:46 AM

Remove dead comment

sabauma updated this revision to Diff 507077.Mar 21 2023, 12:00 PM

Try to publish

sabauma updated this revision to Diff 507080.Mar 21 2023, 12:07 PM
sabauma retitled this revision from [mlir][tosa] Improve performance of tosa.transpose constant folding to Review Requested [mlir][tosa] Improve performance of tosa.transpose constant folding.

Request review

sabauma retitled this revision from Review Requested [mlir][tosa] Improve performance of tosa.transpose constant folding to Request Review [mlir][tosa] Improve performance of tosa.transpose constant folding.Mar 21 2023, 12:08 PM
sabauma retitled this revision from Request Review [mlir][tosa] Improve performance of tosa.transpose constant folding to [mlir][tosa] Improve performance of tosa.transpose constant folding.
sabauma edited the summary of this revision. (Show Details)
sabauma edited the summary of this revision. (Show Details)
sabauma updated this revision to Diff 507083.Mar 21 2023, 12:18 PM
sabauma retitled this revision from [mlir][tosa] Improve performance of tosa.transpose constant folding to [Request Review][mlir][tosa] Improve performance of tosa.transpose constant folding.
sabauma edited the summary of this revision. (Show Details)

Request Review

sabauma retitled this revision from [Request Review][mlir][tosa] Improve performance of tosa.transpose constant folding to [mlir][tosa] Improve performance of tosa.transpose constant folding.Mar 21 2023, 12:19 PM
sabauma published this revision for review.Mar 22 2023, 5:37 AM
sabauma updated this revision to Diff 507724.Mar 23 2023, 6:37 AM

Avoid converting data to mlir::Attributes in all cases

stellaraccident requested changes to this revision.Mar 24 2023, 8:19 AM

Thank you for the work!

mlir/lib/Dialect/Tosa/Transforms/TosaFoldConstantTranspose.cpp
26

I haven't looked at compiler explorer, but I'd be willing to bet that the mixing of signed and unsigned arithmetic in the indexing is not generating the best code. May be worth going completely signed (unless if factoring this to not be hot per below).

67

I'm not sure that breaking permuteLinearIndex out like this is paying for itself. As written, it will have very bad, per element memory overhead for rank > 6 (reallocating 2 SmallVector storages for each element).

Further, I suspect that those indexing tables can be constructed independent of it.index(), simplifying the per element computation.

I suggest inlining the indexing helper and minimally reusing the vectors for the entire iteration. Bear would be factoring more of the per element computation outside of the loop.

This revision now requires changes to proceed.Mar 24 2023, 8:19 AM
sabauma added inline comments.Mar 24 2023, 10:23 AM
mlir/lib/Dialect/Tosa/Transforms/TosaFoldConstantTranspose.cpp
67

I've reworked the indexing calculation logic to avoid the need to temporarily store the indices for the input and output tensors. The only overhead is to pre-compute the strides on the output tensor and invert the permutation map, both of which are loop invariant.

Based on my simple resnet50 example, this halves the runtime from my initial solution.

stellaraccident accepted this revision.Mar 24 2023, 10:32 AM

Thank you for going the extra mile. This looks good to me and is a much needed improvement.

This revision is now accepted and ready to land.Mar 24 2023, 10:32 AM
GeorgeARM accepted this revision.Mar 24 2023, 11:43 AM

That's really nice @sabauma! LGTM!

rsuderman accepted this revision.Mar 24 2023, 12:19 PM

@stellaraccident @GeorgeARM if this change looks acceptable, would one of you mind submitting it? I do not have commit access.

Nice speedup! Sorry I missed the review notification