This is an archive of the discontinued LLVM Phabricator instance.

[mlir][Vector] Add a vblendps-based impl for transpose8x8 (both intrin and inline_asm)
ClosedPublic

Authored by nicolasvasilache on Nov 21 2021, 5:53 AM.

Details

Summary

This revision follows up on the conversation titled:

[llvm-dev] Understanding and controlling some of the AVX shuffle emission paths

The revision adds a vblendps-based implementation for transpose8x8 and further distinguishes between and intrinsics and an inline_asm implementation.

This results in roughly 20% fewer cycles as reported by llvm-mca:

After this revision (intrinsic version, resolves to virtually identical assembly as per the llvm-dev discussion, no vblendps instruction is emitted):

Iterations:        100
Instructions:      5900
Total Cycles:      2415
Total uOps:        7300

Dispatch Width:    6
uOps Per Cycle:    3.02
IPC:               2.44
Block RThroughput: 24.0

Cycles with backend pressure increase [ 89.90% ]
Throughput Bottlenecks:
  Resource Pressure       [ 89.65% ]
  - SKXPort1  [ 0.04% ]
  - SKXPort2  [ 12.42% ]
  - SKXPort3  [ 12.42% ]
  - SKXPort5  [ 89.52% ]
  Data Dependencies:      [ 37.06% ]
  - Register Dependencies [ 37.06% ]
  - Memory Dependencies   [ 0.00% ]

After this revision (inline_asm version, vblendps instructions are indeed emitted):

Iterations:        100
Instructions:      6300
Total Cycles:      2015
Total uOps:        7700

Dispatch Width:    6
uOps Per Cycle:    3.82
IPC:               3.13
Block RThroughput: 20.0

Cycles with backend pressure increase [ 83.47% ]
Throughput Bottlenecks:
  Resource Pressure       [ 83.18% ]
  - SKXPort0  [ 14.49% ]
  - SKXPort1  [ 14.54% ]
  - SKXPort2  [ 19.70% ]
  - SKXPort3  [ 19.70% ]
  - SKXPort5  [ 83.03% ]
  - SKXPort6  [ 14.49% ]
  Data Dependencies:      [ 39.75% ]
  - Register Dependencies [ 39.75% ]
  - Memory Dependencies   [ 0.00% ]

An accessible copy of the conversation is available here.

Diff Detail

Event Timeline

nicolasvasilache requested review of this revision.Nov 21 2021, 5:53 AM
Herald added a project: Restricted Project. · View Herald TranscriptNov 21 2021, 5:53 AM

Use uint64_t everywhere for masks.

ftynse accepted this revision.Nov 22 2021, 12:50 AM
ftynse added inline comments.
mlir/test/lib/Dialect/Vector/TestVectorTransforms.cpp
17

It won't save you because the build dependency is still there.

206

Yes if using InlineAsm is controlled by a pass option, you can check the same option here. Note that the dialects are never unloaded so if one instance of a pass loaded the dialect, it doesn't matter if the following ones skip it.

This revision is now accepted and ready to land.Nov 22 2021, 12:50 AM

LGTM. Just some minor comments. It's great to see that we can workaround the ISel issue.
Thanks!
Diego

mlir/include/mlir/Dialect/X86Vector/Transforms.h
30

Should we use uint for b0, b1, b2... since we are checking <=1?

100

Make "asm" implicit in the name somehow?

mlir/lib/Dialect/X86Vector/Transforms/AVXTranspose.cpp
178

yeah, I think making "asm" implicit in the name would help here because after the using inline_asm is actually not clear with ones are intrintics and inline assembly.

dcaballe accepted this revision.Nov 22 2021, 2:00 AM
nicolasvasilache marked 5 inline comments as done.Nov 22 2021, 2:27 AM
nicolasvasilache added inline comments.
mlir/include/mlir/Dialect/X86Vector/Transforms.h
30

yes, thanks!

mlir/test/lib/Dialect/Vector/TestVectorTransforms.cpp
206

ok, dropping the TODO then, thanks!

nicolasvasilache marked 2 inline comments as done.

Address and rebase.

This revision was landed with ongoing or failed builds.Nov 22 2021, 2:32 AM
This revision was automatically updated to reflect the committed changes.

I reverted because this broke the mlid-windows bot (You didn't get notified?).

Nit: the official archives are here: https://lists.llvm.org/pipermail/llvm-dev/2021-November/153627.html. ;)