- Expanding X%C into the equivalent X-X/C*C is not always the fastest path when no matching SDIV exists, so first check whether the target has a faster expansion for srem alone. (A standalone sketch of the power-of-two identity follows below.)
- Add an AArch64 fast path for the SREM-only power-of-two case.
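As context (this is my illustration, not code from the patch), a minimal standalone C++ sketch of the signed power-of-two remainder identity the srem-only fast path exploits: the remainder can be formed from a mask and a sign-based select, with no division.

```cpp
#include <cassert>

// Illustrative only: computes x % C for C = 1 << k without a division,
// mirroring the mask-and-conditional-negate shape of the srem-only path.
// (Avoids x == INT_MIN for simplicity; the real lowering handles all inputs.)
int srem_pow2(int x, int k) {
  int mask = (1 << k) - 1;   // C - 1
  int pos = x & mask;        // remainder when x >= 0
  int neg = -((-x) & mask);  // remainder when x <  0
  return x >= 0 ? pos : neg;
}

int main() {
  for (int x = -9; x <= 9; ++x)
    for (int k = 1; k <= 3; ++k)
      assert(srem_pow2(x, k) == x % (1 << k));
  return 0;
}
```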
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | ||
---|---|---|
13591 | Probably worth a comment explaining this. Lg2 == 0 folds to a constant. Not obvious why we're special-casing Lg2 == (VT.getScalarSizeInBits() - 1); does this sequence not work somehow? | |
llvm/test/CodeGen/AArch64/srem-lkk.ll | ||
---|---|---|
171 ↗ | (On Diff #420710) | Maybe we also want some tests for i16? Please post a separate patch to add the new tests, so it's easier to compare the impact. And actually, while you're at it, maybe stick all the power-of-two tests in a separate file with a better name. |
llvm/test/CodeGen/AArch64/srem-vector-lkk.ll | ||
---|---|---|
188 | I think this somehow ended up shorter with D122829? That might be fine, though; it looks like this version avoids the expensive add-with-shift instructions. |
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp | ||
---|---|---|
4325 | I'm not sure I like the modified API for visitSDIVLike? It's sort of weird that the return value is either the quotient or the remainder, depending on the value of BuildRem. And that BuildRem is both an input and an output. |
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp | ||
---|---|---|
4325 | Yeah, this part is ugly. I will refactor it today. |
- refactor visitSDIVLike
- remove Lg2 == (VT.getScalarSizeInBits() - 1)
- add srem-pow2.ll; will rebase after D123671 lands
Please ignore the check-in rGe2d77a160c5b; I made a mistake while checking in another baseline.
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp | ||
---|---|---|
4582 | Do we need to do something special for something like the following (IR reproduced below the table), to match the existing handling for div/rem pairs? I guess it's only a couple instructions different either way, but maybe worth considering. | |
23902 | Fix this comment? Not sure how you compute srem "by right shifting". |

define void @sdivrem(i32 %x, i32* %ap, i32* %bp) {
  %a = sdiv i32 %x, 4
  %b = srem i32 %x, 4
  store i32 %a, i32* %ap
  store i32 %b, i32* %bp
  ret void
}
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp | ||
---|---|---|
4582 | For now I have no AArch64 device to test on; do we have some tool to verify instructions on AArch64, similar to alive2.llvm.org/ce/? | |
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp | ||
---|---|---|
4582 | I wrote out the logic in IR. Is this the behavior you want? | |
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp | ||
---|---|---|
4582 | Based on running the IR through llc, it looks like an sdiv/srem pair can't benefit from this pattern. It also looks like we lack a DAG combine from neg + cmp to negs, based on the test. | |
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp | ||
---|---|---|
4582 | I just meant that I wasn't sure if we would end up with the same code sequence as we would without this patch; I wasn't suggesting any specific transform. (I guess what happens might be sensitive to whether we visit the SDIV or the SREM first...) |
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp | ||
---|---|---|
4582 | I checked the !DAG.doesNodeExist(ISD::SDIV, N->getVTList(), {N0, N1}) guard at line 4412; an SDIV/SREM pair won't go down the srem-only path. |
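To make that dispatch concrete, here is a small self-contained model of the decision described above; the names and structure are mine for illustration, not LLVM's actual DAGCombiner code.

```cpp
#include <cassert>

// Toy model of the guard at line 4412 discussed above: an sdiv/srem pair
// keeps the shared x - (x/C)*C expansion, while a lone srem by a power of
// two may take the target's srem-only fast path.
enum class Expansion { SharedSdivPair, SremOnlyFastPath };

Expansion chooseSremExpansion(bool matchingSdivExists, bool divisorIsPow2,
                              bool targetHasSremOnlyPath) {
  if (!matchingSdivExists && divisorIsPow2 && targetHasSremOnlyPath)
    return Expansion::SremOnlyFastPath;
  return Expansion::SharedSdivPair;
}

int main() {
  // An sdiv/srem pair never goes down the srem-only path.
  assert(chooseSremExpansion(true, true, true) == Expansion::SharedSdivPair);
  // A lone srem by a power of two does, when the target supports it.
  assert(chooseSremExpansion(false, true, true) == Expansion::SremOnlyFastPath);
  return 0;
}
```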
FYI, I’ve bisected a misoptimization down to this commit. I’ll follow up later with a proper repro…
This commit causes a miscompilation of the following function:
int func(int val) { val += val % 2; return val; }
Tested with a caller like this:
#include <stdio.h>

int func(int val);

int main(int argc, char* argv[]) {
  for (int i = -3; i <= 3; i++)
    printf("%d -> %d\n", i, (int)func(i));
  return 0;
}
Originally this function returns the following values:
-3 -> -4
-2 -> -2
-1 -> -2
0 -> 0
1 -> 2
2 -> 2
3 -> 4
After this commit, it prints the following:
-3 -> -2
-2 -> -2
-1 -> 0
0 -> 0
1 -> 0
2 -> 2
3 -> 2
Originally, the function was compiled into the following (with -O2):
func:
        cmp     w0, #0
        cinc    w8, w0, lt
        and     w8, w8, #0xfffffffe
        sub     w8, w0, w8
        add     w0, w8, w0
        ret
After this change, it gets incorrectly compiled into this:
func:
        and     w8, w0, #0x1
        cmp     w0, #0
        cneg    w8, w8, ge
        add     w0, w8, w0
        ret
The cneg should be 'csneg w8, w8, w8, ge' here; the corrected tail is:

        csneg   w8, w8, w8, ge
        add     w0, w8, w0
        ret
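For reference, a minimal standalone model of why the emitted cneg inverts the remainder's sign while the suggested csneg does not, assuming the Arm ARM semantics of both instructions (the helper functions below are mine, for illustration):

```cpp
#include <cassert>
#include <cstdint>

// cneg  Wd, Wn, cond      =>  Wd = cond ? -Wn : Wn
// csneg Wd, Wn, Wm, cond  =>  Wd = cond ?  Wn : -Wm
static int32_t cneg(int32_t n, bool cond) { return cond ? -n : n; }
static int32_t csneg(int32_t n, int32_t m, bool cond) { return cond ? n : -m; }

// Buggy lowering: and w8, w0, #1; cmp w0, #0; cneg w8, w8, ge; add; ret
static int32_t func_buggy(int32_t x) {
  int32_t r = cneg(x & 1, /*ge=*/x >= 0);  // negates exactly when x >= 0: wrong
  return x + r;
}

// Fixed lowering: csneg keeps x & 1 when x >= 0, negates it otherwise.
static int32_t func_fixed(int32_t x) {
  int32_t r = csneg(x & 1, x & 1, /*ge=*/x >= 0);
  return x + r;
}

int main() {
  for (int32_t i = -3; i <= 3; ++i)
    assert(func_fixed(i) == i + i % 2);  // matches the original C function
  assert(func_buggy(-3) == -2);          // reproduces the bad output above
  return 0;
}
```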
Thanks for finding this. I have no AArch64 device at the moment, so could you help by modifying line 13600 to:
SDValue Cmp = getAArch64Cmp(N0, Zero, ISD::SETGE, CCVal, DAG, DL);
If you want to try things out yourself on Linux, it isn't too hard to run binaries using qemu-user. Get a cross-compile toolchain (e.g. https://developer.arm.com/-/media/Files/downloads/gnu/11.2-2022.02/binrel/gcc-arm-11.2-2022.02-x86_64-aarch64-none-linux-gnu.tar.xz ). Then do something like:
clang --gcc-toolchain=/path/to/toolchain --target=aarch64-none-linux-gnu test.c -o test -static
qemu-aarch64 test
I'll run this through some tests today.
Thanks for the tip on running under qemu user mode.
I will set up the environment on WSL and test on my local machine.
Please rename this so it's clear it's not the primary visitor function for SREM. Maybe call it "buildOptimizedSREM" or something like that.