This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Load into zero vector patterns
ClosedPublic

Authored by dmgreen on Feb 15 2023, 3:16 AM.

Details

Summary

A LDR will implicitly zero the rest of the vector, so vector_insert(zeros, load, 0) can use a single load. This adds tablegen patterns for both scaled and unscaled loads, detecting where we are inserting a load into the lower element of a zero vector.
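
As a rough illustration of why this fold is sound, the sketch below models a 128-bit vector register as 16 byte lanes. It is not taken from the patch: the helper names (ldr_scalar, insert_into_zero_vector) and the byte-lane model are assumptions made purely for illustration. It relies only on the architectural fact stated above, that a scalar LDR into a B/H/S/D register zeroes the remaining bits of the vector register.

```python
# Illustrative model only: a 128-bit vector register as 16 byte lanes,
# lane 0 being the lowest. Helper names are hypothetical.
VECTOR_BYTES = 16


def ldr_scalar(memory, address, size):
    """Model LDR Bt/Ht/St/Dt: fill the low `size` bytes, zero the rest."""
    loaded = list(memory[address:address + size])
    return loaded + [0] * (VECTOR_BYTES - size)


def insert_into_zero_vector(memory, address, size):
    """Model vector_insert(zeros, load, 0): start from zeros, overwrite lane 0."""
    reg = [0] * VECTOR_BYTES
    reg[0:size] = memory[address:address + size]
    return reg


if __name__ == "__main__":
    memory = bytes(range(1, 33))  # arbitrary test contents
    for size in (1, 2, 4, 8):     # b, h, s, d element sizes in bytes
        assert ldr_scalar(memory, 5, size) == insert_into_zero_vector(memory, 5, size)
    print("single load matches insert-into-zero for all element sizes")
```

In this model the two helpers agree for every element size, which is the property the new patterns depend on.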

Event Timeline

dmgreen created this revision. Feb 15 2023, 3:16 AM
Herald added a project: Restricted Project. Feb 15 2023, 3:16 AM
dmgreen requested review of this revision. Feb 15 2023, 3:16 AM
SjoerdMeijer accepted this revision. Feb 16 2023, 12:41 AM

Very nice.

This revision is now accepted and ready to land. Feb 16 2023, 12:41 AM
This revision was landed with ongoing or failed builds. Mar 1 2023, 5:54 AM
This revision was automatically updated to reflect the committed changes.
Benoit added a subscriber: Benoit. Mar 7 2023, 7:10 PM

This commit (https://github.com/llvm/llvm-project/commit/83bbd3fdbd75295669cf97967c38810d427c5c25) causes a regression in a downstream project: https://github.com/openxla/iree/issues/12546.

The effect is incorrect results in matrix multiplications, where the result is now filled with zeros instead of the correct, nonzero matrix entries. I will try to debug this some more.

Benoit added a comment. Mar 7 2023, 8:41 PM

Minimized end-to-end MLIR testcase here: https://github.com/openxla/iree/pull/12556

To make this more helpful, here is:

Hi - thanks for the report. From a look at the assembly, it sounds like the offset might be wrong. These instructions specifically:

10534: 40 f4 7f 3d  	ldr	b0, [x2, #4093]
vs
10534: 46 0c 00 d1  	sub	x6, x2, #3
1053c: c0 00 40 0d  	ld1	{ v0.b }[0], [x6]

I think I see the problem: it looks like it should be using an LDUR for those instructions. When printing assembly it will produce an ldr b0, [x2, -3] instruction, but emitting object files gives the large positive offset. I will put a fix in for that issue now.
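
To make the arithmetic concrete, here is a small sketch that decodes the four bytes from the objdump line above and checks which addressing form can hold an offset of -3. The field positions follow the A64 unsigned-offset encoding of LDR (immediate, SIMD&FP) for a byte load; the function names are hypothetical and this is not how LLVM itself selects the instruction.

```python
# Sketch only: decode the LDR Bt, [Xn, #imm] unsigned-offset form and show
# why a byte offset of -3 shows up as #4093. Helper names are hypothetical.

def decode_ldr_b_unsigned(word):
    """Return (rt, rn, imm12) from a 32-bit LDR Bt, [Xn, #imm12] encoding."""
    rt = word & 0x1F              # bits 4:0, destination register
    rn = (word >> 5) & 0x1F       # bits 9:5, base register
    imm12 = (word >> 10) & 0xFFF  # bits 21:10, unsigned offset (scale 1 for bytes)
    return rt, rn, imm12


def fits_ldr_b_unsigned(offset):
    """LDR (unsigned offset) for a byte load accepts 0..4095."""
    return 0 <= offset <= 4095


def fits_ldur(offset):
    """LDUR takes a signed 9-bit unscaled offset, -256..255."""
    return -256 <= offset <= 255


if __name__ == "__main__":
    # Bytes as printed by objdump: 40 f4 7f 3d (little-endian instruction word).
    word = int.from_bytes(bytes.fromhex("40f47f3d"), "little")
    rt, rn, imm12 = decode_ldr_b_unsigned(word)
    print(f"ldr b{rt}, [x{rn}, #{imm12}]")
    # -3 truncated to the 12-bit unsigned field is the same bit pattern.
    assert (-3) & 0xFFF == imm12
    assert not fits_ldr_b_unsigned(-3) and fits_ldur(-3)
```

The offset -3 is outside the unsigned range but inside the LDUR range, which is consistent with the diagnosis that the unscaled form should have been emitted.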

I've been unable to produce the same output from https://gist.github.com/bjacob/2ed1bce14ae4d67b4261adee70089e29 though - I probably don't know the right set of commands and was just using mlir-translate to convert the file to LLVM IR. Do you know what commands are needed to compile it to assembly?

There is hopefully a fix in 1c6ea961938488997712763762079e535b8b704. Please let me know if that does or doesn't fix your issue, and if you have details on getting assembly from mlir. Thanks

Benoit added a comment. Edited Mar 8 2023, 7:47 AM

Thank you very much for the quick fix. I confirm that https://reviews.llvm.org/rG1c6ea961938488997712763762079e535b8b704e fixes the regression.

You probably won't need this anymore since you were able to fix this without it, but just for completeness, here is how to reproduce:

  1. Build https://github.com/openxla/iree - following normal build instructions - note that IREE uses its own submodule third_party/llvm-project.
  2. Run the IREE compiler from the build directory with these flags:
tools/iree-compile --iree-llvm-target-triple=aarch64-none-linux-android29 --iree-hal-target-backends=llvm-cpu ~/pack_testcase.mlir -o /tmp/a.vmfb --iree-llvm-keep-linker-artifacts

Where the input file pack_testcase.mlir is:

func.func @pack_pad_transpose_1x9xi8_into_2x4x8x4xi8(%arg0 : tensor<1x9xi8>) -> tensor<2x4x8x4xi8> {
  %empty = tensor.empty() : tensor<2x4x8x4xi8>
  %c0_i8 = arith.constant 0 : i8
  %pack = tensor.pack %arg0 padding_value(%c0_i8 : i8) outer_dims_perm = [1, 0] inner_dims_pos = [1, 0] inner_tiles = [8, 4] into %empty : tensor<1x9xi8> -> tensor<2x4x8x4xi8>
  return %pack : tensor<2x4x8x4xi8>
}

Thanks to the --iree-llvm-keep-linker-artifacts flag, it will print the path to the generated .so, like this:

/usr/local/google/home/benoitjacob/pack_testcase.mlir:4:11: remark: linker artifacts for embedded_elf_arm_64 preserved:
    /tmp/pack_pad_transpose_1x9xi8_into_2x4x8x4xi8_dispatch_0-9c98ea.so

So you can then objdump that as usual:

$ANDROID_NDK/toolchains/llvm/prebuilt/linux-x86_64/bin/llvm-objdump -d /tmp/pack_pad_transpose_1x9xi8_into_2x4x8x4xi8_dispatch_0-9c98ea.so