Muls with 64bit operands where each of the operand is having top 32 bits as zero, we can generate a single umull instruction on a 32bit operand.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
llvm/test/CodeGen/AArch64/aarch64-mull-masks.ll | ||
---|---|---|
908 | Perhaps it's worth pre-committing these tests so that we can see what's changed? It's not immediately obvious from the patch what effect the changes have that's all. |
Can you add a couple of testcases for the edgecases too. Something like %and = and i64 %ext64, 8589934591 and %and = and i64 %ext64, 2147483647
It looks like this might be running into bootstrap issues, according to the bisect I just did.
The misoptimization occurs in this preprocessed file: https://martin.st/temp/lagarith-preproc.c
When built with clang -target aarch64-linux-gnu -S lagarith-preproc.c -o out.s -O3, the difference before/after this change looks like this:
--- good.s 2022-12-21 11:42:08.816749585 +0200 +++ bad.s 2022-12-21 11:43:20.672946992 +0200 @@ -1647,7 +1647,7 @@ lsr w14, w20, #1 mov x12, xzr mov w9, wzr - msub x11, x13, x20, x11 + umsubl x11, w13, w20, x11 lsl x11, x11, x10 lsl x10, x13, x10 add x11, x11, x14 @@ -1661,9 +1661,9 @@ .LBB2_125: // %for.body66.i // =>This Inner Loop Header: Depth=1 ldr w15, [x13, x12] - mul x16, x10, x15 - mul x15, x11, x15 - add x15, x15, x16, lsr #32 + umull x16, w10, w15 + lsr x17, x16, #32 + umaddl x15, w11, w15, x17 lsr x17, x15, #21 orr w17, w17, #0x1 clz w17, w17 @@ -1687,9 +1687,9 @@ .LBB2_128: // %for.body89.i // =>This Inner Loop Header: Depth=1 ldr w15, [x13, x12] - mul x16, x10, x15 - mul x15, x11, x15 - add x15, x15, x16, lsr #32 + umull x16, w10, w15 + lsr x17, x16, #32 + umaddl x15, w11, w15, x17 lsr x17, x15, #21 orr w17, w17, #0x1 clz w17, w17
The full runtime case can be reproduced on Linux on aarch64 like this:
$ git clone git://source.ffmpeg.org/ffmpeg $ mkdir ffmpeg-build $ cd ffmpeg-build $ ../ffmpeg/configure --samples=$(pwd)/../samples --cc=clang $ make fate-rsync # download test inputs $ make -j$(nproc) fate-lagarith
The object file with the misoptimization is libavcodec/lagarith.o here.
Thanks for reporting. Should be fixed with the latest commit. The bootstrap passed fine as well.
Perhaps it's worth pre-committing these tests so that we can see what's changed? It's not immediately obvious from the patch what effect the changes have that's all.