This is an archive of the discontinued LLVM Phabricator instance.

[AArch64][ISel] Always use pre-inc/post-inc addressing mode for auto-indexed load/store with constant offset.
Closed, Public

Authored by huihuiz on Feb 10 2023, 4:45 PM.

Details

Summary

Unlike the ARM target, the AArch64 target currently has no facility to encode the
operation bit: whether to add an offset to the base pointer for the pre-inc/post-inc
addressing modes, or to subtract an offset from the base pointer for the
pre-dec/post-dec addressing modes.

A miscompile (https://github.com/llvm/llvm-project/issues/60645) was noticed
due to this limitation.

Therefore, for AArch64 auto-indexed loads/stores with a constant offset, always
use the pre-inc/post-inc addressing mode; the constant offset is simply negated
for what would otherwise be pre-dec/post-dec (a sketch of the idea follows below).
An auto-indexed address with a non-constant offset is currently not split into
base and offset parts. If non-constant offsets are handled in the future, the
offset node will need an explicit negate.
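
To illustrate the intent, here is a minimal sketch (not the actual patch; the helper name is hypothetical) of splitting a base +/- constant address so that a subtraction is folded into a negated constant offset and the access can always be reported as pre-inc/post-inc:

#include "llvm/CodeGen/SelectionDAG.h"
using namespace llvm;

// Hypothetical helper, sketch only: split `base +/- constant` into Base and a
// signed Offset, folding a subtraction into a negated constant so the caller
// can always report ISD::PRE_INC / ISD::POST_INC.
static bool splitConstantIndexedAddress(SDNode *AddrOp, SDValue &Base,
                                        SDValue &Offset, SelectionDAG &DAG) {
  if (AddrOp->getOpcode() != ISD::ADD && AddrOp->getOpcode() != ISD::SUB)
    return false;
  auto *C = dyn_cast<ConstantSDNode>(AddrOp->getOperand(1));
  if (!C)
    return false; // Non-constant offsets are not split out.
  int64_t Imm = C->getSExtValue();
  if (AddrOp->getOpcode() == ISD::SUB)
    Imm = -Imm; // What would be pre-dec/post-dec becomes pre-inc/post-inc of -Imm.
  Base = AddrOp->getOperand(0);
  Offset = DAG.getConstant(Imm, SDLoc(AddrOp), AddrOp->getValueType(0));
  return true;
}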

Diff Detail

Event Timeline

huihuiz created this revision.Feb 10 2023, 4:45 PM
Herald added a project: Restricted Project. Feb 10 2023, 4:45 PM
huihuiz requested review of this revision.Feb 10 2023, 4:45 PM

The test case was reduced from an internal benchmark. I assume we would run into similar issues with the post-dec addressing mode;
if needed, I can try to synthesize a test for post-dec.

The way this fix is implemented seems suspicious. As far as I can tell, this would basically imply that PRE_DEC never worked on any target. A quick grep shows three targets using it. If the other targets are working, we should probably be fixing this in AArch64-specific code.

Apparently, the AArch64 backend can also generate PRE_INC operations with a negative offset, which might explain why nobody noticed this issue before on AArch64.

I dumped -debug-only=isel,dagcombine output for a few targets (armv7a, aarch64, x86_64, riscv64) to see what is going on. Here is what I found:

1. Not all targets support the PRE_DEC addressing mode; we only replace a store with the equivalent <pre-dec> store when it is legal to do so.
Take the earlier AArch64 t.ll as an example:

t6: i64 = add nuw nsw t4, Constant:i64<8>
t15: i64 = shl t6, Constant:i64<2>
t23: i64 = sub t2, t15
t13: ch = store<(store (s32) into %ir.t4)> t0, Constant:i32<0>, t23, undef:i64

t13 was replaced with the <pre-dec> store t32:

t27: i64 = shl t4, Constant:i64<2>
t30: i64 = sub t2, t27
t32: i64,ch = store<(store (s32) into %ir.t4), <pre-dec>> t0, Constant:i32<0>, t30, Constant:i64<32>

In DAGCombiner::CombineToPreIndexedLoadStore(), before DAG.getIndexedStore() is called, we first check TLI.isIndexedStoreLegal() in getCombineLoadStoreParts(),
and then set the addressing mode AM accordingly.

For targets like x86_64, indexed stores are not legal, so no combine happens.
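
For reference, the flow is roughly the following (a simplified sketch, not verbatim DAGCombiner code; the wrapper name is made up):

#include "llvm/CodeGen/SelectionDAG.h"
#include "llvm/CodeGen/TargetLowering.h"
using namespace llvm;

// Simplified sketch of the combine: the indexed store is only formed when the
// target declares PRE_INC or PRE_DEC stores legal for this value type, and the
// target then supplies Base/Offset and the addressing mode AM.
static SDValue tryFormPreIndexedStore(StoreSDNode *ST, SelectionDAG &DAG,
                                      const TargetLowering &TLI) {
  EVT VT = ST->getMemoryVT();
  // Targets like x86_64 bail out here: neither indexed mode is legal.
  if (!TLI.isIndexedStoreLegal(ISD::PRE_INC, VT) &&
      !TLI.isIndexedStoreLegal(ISD::PRE_DEC, VT))
    return SDValue();

  SDValue BasePtr, Offset;
  ISD::MemIndexedMode AM = ISD::UNINDEXED;
  if (!TLI.getPreIndexedAddressParts(ST, BasePtr, Offset, AM, DAG))
    return SDValue();

  // Morph the plain store into the equivalent <pre-inc>/<pre-dec> store.
  return DAG.getIndexedStore(SDValue(ST, 0), SDLoc(ST), BasePtr, Offset, AM);
}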

2. For the armv7a target, I changed the attached t.ll to use a 32-bit GEP offset. CodeGen is correct, for two reasons:
i) In ARM target lowering, commute-with-shift is turned off after legalization, whereas for AArch64 isDesirableToCommuteWithShift() returns true. So the constant 32 is pulled out for the AArch64 target, but not for ARM.
ii) For the ARM target, negating the offset register is done in SelectAddrMode2OffsetReg(), in the ARM_AM::getAM2Opc() part.
We generate:
add r1, r1, #8
str r2, [r3, -r1, lsl #2]! // correct

When selecting node t26:
t6: i32 = add nuw nsw t4, Constant:i32<8>
t14: i32 = shl t6, Constant:i32<2>
t26: i32,ch = store<(store (s32) into %ir.t4), <pre-dec>> t0, Constant:i32<0>, t2, t14

we get:
Morphed node: t26: i32,ch = STRr_preidx<Mem:(store (s32) into %ir.t4)> Constant:i32<0>, t2, t6, TargetConstant:i32<20482>, TargetConstant:i32<14>, Register:i32 $noreg, t0

In TargetConstant:i32<20482>, the operation bit (bit 12, counting from 0) is set to 1 to indicate subtraction.
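
To make the encoding concrete, here is a small standalone sketch that mirrors the addrmode2 opcode layout packed by ARM_AM::getAM2Opc() in llvm/lib/Target/ARM/MCTargetDesc/ARMAddressingModes.h (the function below is a made-up stand-in, not the LLVM helper itself):

#include <cassert>
#include <cstdio>

// Mirrors the addrmode2 opcode layout:
//   bits 0-11 : immediate (here, the lsl shift amount)
//   bit  12   : 1 = subtract the offset register, 0 = add it
//   bits 13-15: shift opcode (2 = lsl)
static unsigned encodeAM2Opc(bool IsSub, unsigned Imm12, unsigned ShiftOpc) {
  assert(Imm12 < (1u << 12) && "immediate too large");
  return Imm12 | (unsigned(IsSub) << 12) | (ShiftOpc << 13);
}

int main() {
  // The TargetConstant seen above: subtract, lsl #2.
  unsigned Opc = encodeAM2Opc(/*IsSub=*/true, /*Imm12=*/2, /*ShiftOpc=*/2);
  std::printf("%u\n", Opc);             // prints 20482
  std::printf("%u\n", (Opc >> 12) & 1); // the subtraction bit: prints 1
  return 0;
}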

3. I can't find similar handling for the AArch64 target. If I am looking in the wrong direction, please point me to it.

I think the targets that use PRE_DEC are ARM and AVR. If existing testcases for PRE_DEC work on those targets, maybe we don't need to be too concerned about other targets? I guess maybe they don't hit this code, somehow?

This still doesn't seem conceptually correct, though. A PRE_DEC addressing mode doesn't have to be an immediate; it can be an arbitrary runtime value. And there isn't any way to negate an arbitrary value without creating extra nodes. So I'd expect any necessary negation to happen in target-specific code.

Probably the simplest thing is just to fix the AArch64 code so it never generates PRE_DEC/POST_DEC. They really just make things more complicated, and we don't need them on AArch64 because all PRE_INC/PRE_DEC ops there have an immediate offset.

The pre_store pattern in TableGen seems like sort of a trap: it encompasses both PRE_INC and PRE_DEC, but doesn't provide any way to actually tell whether you matched a PRE_INC or a PRE_DEC.
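
For reference, the predicate embedded in the generic pre_store PatFrag is roughly the following (paraphrased here as a standalone C++ function, not the exact TableGen code fragment), which is why a pattern written against pre_store cannot tell the two modes apart:

#include "llvm/CodeGen/SelectionDAGNodes.h"
using namespace llvm;

// Roughly the check behind pre_store: it accepts both PRE_INC and PRE_DEC,
// and nothing in the matched pattern records which one it was.
static bool isPreIndexedStore(const SDNode *N) {
  ISD::MemIndexedMode AM = cast<StoreSDNode>(N)->getAddressingMode();
  return AM == ISD::PRE_INC || AM == ISD::PRE_DEC;
}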

huihuiz planned changes to this revision.Feb 15 2023, 11:29 AM
huihuiz updated this revision to Diff 497828.Feb 15 2023, 3:50 PM
huihuiz retitled this revision from [SelectionDAG] Negate constant offset before morphing load/store node with pre-dec/post-dec addressing mode. to [AArch64][ISel] Always use pre-inc/post-inc addressing mode for auto-indexed load/store with constant offset..
huihuiz edited the summary of this revision. (Show Details)

Thanks Eli for the feedback!

The current AArch64TargetLowering::getIndexedAddressParts() does not split out a non-constant offset. If we ever do split out a non-constant offset, we need to make sure a negate is applied to it, e.g. as sketched below.
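
As a hypothetical sketch (not part of this patch), splitting out a non-constant offset from a base - offset address would require materializing an explicit negation, for example:

#include "llvm/CodeGen/SelectionDAG.h"
using namespace llvm;

// Hypothetical only: build 0 - Offset so the indexed node can still be
// reported as PRE_INC/POST_INC. The extra SUB node is the cost that makes
// non-constant offsets unattractive to handle for now.
static SDValue negateOffset(SDValue Offset, const SDLoc &DL, SelectionDAG &DAG) {
  EVT VT = Offset.getValueType();
  return DAG.getNode(ISD::SUB, DL, VT, DAG.getConstant(0, DL, VT), Offset);
}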

Take another t.ll below, using a non-constant offset:

define i8* @test(i8* %ptr, i64 %t0, i64 %off) {
  %t1 = add nuw nsw i64 %t0, %off
  %t2 = mul i64 %t1, -4
  %t3 = getelementptr i8, i8* %ptr, i64 %t2
  %t4 = bitcast i8* %t3 to i32*
  store i32 0, i32* %t4, align 4
  %t5 = shl i64 %t1, 2
  %t6 = sub nuw nsw i64 -8, %t5
  %t7 = getelementptr i8, i8* %ptr, i64 %t6
  %t8 = bitcast i8* %t7 to i32*
  store i32 0, i32* %t8, align 4
  ret i8* %ptr
}

We are generating:

// %bb.0:
        add     x8, x1, x2
        sub     x8, x0, x8, lsl #2
        str     wzr, [x8]  // We are not splitting out a non-constant offset
        stur    wzr, [x8, #-8]
        ret

The logic here seems fine, but please clean up the dead code (the IsInc argument of getIndexedAddressParts is always set to true, so you can just remove it and clean up the callers).

huihuiz updated this revision to Diff 498089.Feb 16 2023, 11:13 AM

Addressed review comments; also cleaned "ISD::MemIndexedMode &AM" out of getIndexedAddressParts().

This revision is now accepted and ready to land.Feb 16 2023, 1:43 PM