
[AArch64] Optimize muls with operands having enough zero bits.
ClosedPublic

Authored by bipmis on Dec 6 2022, 3:46 AM.

Details

Summary

For muls with 64-bit operands where each operand has its top 32 bits zero, we can generate a single umull instruction on the 32-bit halves of the operands.

Diff Detail

Event Timeline

bipmis requested review of this revision. Dec 6 2022, 3:46 AM
bipmis created this revision.
david-arm added inline comments.
llvm/test/CodeGen/AArch64/aarch64-mull-masks.ll
908

Perhaps it's worth pre-committing these tests so that we can see what's changed? It's not immediately obvious from the patch what effect the changes have.

Can you also add a couple of test cases for the edge cases? Something like %and = and i64 %ext64, 8589934591 and %and = and i64 %ext64, 2147483647.

bipmis updated this revision to Diff 480436. Dec 6 2022, 5:08 AM

Corner-case tests added. Rebased with the tests committed.

dmgreen accepted this revision.Dec 8 2022, 3:19 AM

LGTM Thanks.

This revision is now accepted and ready to land. Dec 8 2022, 3:19 AM
This revision was landed with ongoing or failed builds. Dec 20 2022, 6:35 AM
This revision was automatically updated to reflect the committed changes.

It looks like this might be running into bootstrap issues, according to the bisect I just did.

I also bisected a regression to this commit.

The misoptimization occurs in this preprocessed file: https://martin.st/temp/lagarith-preproc.c

When built with clang -target aarch64-linux-gnu -S lagarith-preproc.c -o out.s -O3, the difference before/after this change looks like this:

--- good.s      2022-12-21 11:42:08.816749585 +0200
+++ bad.s       2022-12-21 11:43:20.672946992 +0200
@@ -1647,7 +1647,7 @@
        lsr     w14, w20, #1
        mov     x12, xzr
        mov     w9, wzr
-       msub    x11, x13, x20, x11
+       umsubl  x11, w13, w20, x11
        lsl     x11, x11, x10
        lsl     x10, x13, x10
        add     x11, x11, x14
@@ -1661,9 +1661,9 @@
 .LBB2_125:                              // %for.body66.i
                                         // =>This Inner Loop Header: Depth=1
        ldr     w15, [x13, x12]
-       mul     x16, x10, x15
-       mul     x15, x11, x15
-       add     x15, x15, x16, lsr #32
+       umull   x16, w10, w15
+       lsr     x17, x16, #32
+       umaddl  x15, w11, w15, x17
        lsr     x17, x15, #21
        orr     w17, w17, #0x1
        clz     w17, w17
@@ -1687,9 +1687,9 @@
 .LBB2_128:                              // %for.body89.i
                                         // =>This Inner Loop Header: Depth=1
        ldr     w15, [x13, x12]
-       mul     x16, x10, x15
-       mul     x15, x11, x15
-       add     x15, x15, x16, lsr #32
+       umull   x16, w10, w15
+       lsr     x17, x16, #32
+       umaddl  x15, w11, w15, x17
        lsr     x17, x15, #21
        orr     w17, w17, #0x1
        clz     w17, w17

The full runtime case can be reproduced on Linux on aarch64 like this:

$ git clone git://source.ffmpeg.org/ffmpeg
$ mkdir ffmpeg-build
$ cd ffmpeg-build
$ ../ffmpeg/configure --samples=$(pwd)/../samples --cc=clang
$ make fate-rsync # download test inputs
$ make -j$(nproc) fate-lagarith

The object file with the misoptimization is libavcodec/lagarith.o here.

The misoptimization occurs in this preprocessed file: https://martin.st/temp/lagarith-preproc.c

Thanks for reporting. Should be fixed with the latest commit. The bootstrap passed fine as well.

Thanks! Yes, the issue seems to be fixed now.