In @dot3_float3 we know that the pointer has sufficient dereferenceable bytes, but the load itself is not sufficiently aligned, and alignment is currently the only number used to decide whether performing the wide load is legal.
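To make the problem concrete, here is a minimal sketch of the idea, not the actual patch: LD and DAG mirror the code quoted later in this thread, WidenBytes is the widened size in bytes, and the alignment-only predicate in the comment is a deliberate simplification of what the legalizer effectively does today.

#include "llvm/CodeGen/SelectionDAG.h"
using namespace llvm;

// Roughly, the existing code only trusts the alignment:
//   bool CanWiden = LD->getAlignment() >= WidenBytes;
// The idea is to also accept the widening when the pointer is known to be
// dereferenceable for the full widened width, a query MachinePointerInfo
// already provides.
static bool canWidenLoad(SelectionDAG &DAG, LoadSDNode *LD,
                         unsigned WidenBytes) {
  if (LD->getAlignment() >= WidenBytes)
    return true;
  return LD->getPointerInfo().isDereferenceable(WidenBytes, *DAG.getContext(),
                                                DAG.getDataLayout());
}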
Diff Detail
- Repository: rG LLVM Github Monorepo
Event Timeline
llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp:5286
Is this a trick that turns an unaligned memory access into an aligned one when the widened bits are dereferenceable?
Thanks for taking a look!
Make it more obvious that it's not really an alignment, but a count of known-dereferenceable bytes,
and fix another pessimization in the process.
llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp:5286
Note that LdAlign is *not* used as the alignment for the new load. Hopefully this is better.
llvm/test/CodeGen/AMDGPU/copy-illegal-type.ll:165 (On Diff #360472)
This looks alarming. It's loading 32 bytes instead of 8.
llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp:5346
Do we have to reduce NumDereferenceableBytes in the loop?
llvm/test/CodeGen/AMDGPU/copy-illegal-type.ll:165 (On Diff #360472)
Could you please be more specific: do you believe this is a correctness concern, or a performance one?

Legalizing node: t67: v10i32,ch = load<(dereferenceable invariant load (s320) from %ir.1, align 4, addrspace 4)> t0, t28, undef:i64
Analyzing result type: v10i32
Widen node result 0: t67: v10i32,ch = load<(dereferenceable invariant load (s320) from %ir.1, align 4, addrspace 4)> t0, t28, undef:i64
NumDereferenceableBytes 40
Creating new node: t86: v8i32,ch = load<(dereferenceable invariant load (s256) from %ir.1, align 4, addrspace 4)> t0, t28, undef:i64
Creating constant: t87: i64 = Constant<32>
Creating new node: t88: i64 = add nuw t28, Constant:i64<32>
Creating new node: t89: v8i32,ch = load<(dereferenceable invariant load (s256) from %ir.1 + 32, align 4, addrspace 4)> t0, t88, undef:i64
Creating new node: t90: v16i32 = concat_vectors t86, t89
Creating new node: t91: ch = TokenFactor t86:1, t89:1

*** IR Dump Before Module Verifier (verify) *** (function: test_copy_v4i8_x4)
; ModuleID = '/tmp/test.ll'
source_filename = "/tmp/test.ll"
target datalayout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-G1-ni:7"

; Function Attrs: nounwind
define amdgpu_kernel void @test_copy_v4i8_x4(<4 x i8> addrspace(1)* %out0, <4 x i8> addrspace(1)* %out1, <4 x i8> addrspace(1)* %out2, <4 x i8> addrspace(1)* %out3, <4 x i8> addrspace(1)* %in) #0 {
  %test_copy_v4i8_x4.kernarg.segment = call nonnull align 16 dereferenceable(76) i8 addrspace(4)* @llvm.amdgcn.kernarg.segment.ptr()
  %out0.kernarg.offset = getelementptr inbounds i8, i8 addrspace(4)* %test_copy_v4i8_x4.kernarg.segment, i64 36
  %out0.kernarg.offset.cast = bitcast i8 addrspace(4)* %out0.kernarg.offset to <4 x i8> addrspace(1)* addrspace(4)*
  %1 = bitcast <4 x i8> addrspace(1)* addrspace(4)* %out0.kernarg.offset.cast to <5 x i64> addrspace(4)*, !amdgpu.uniform !0
  %2 = load <5 x i64>, <5 x i64> addrspace(4)* %1, align 4, !invariant.load !0
  %out0.load1 = extractelement <5 x i64> %2, i32 0
  %3 = inttoptr i64 %out0.load1 to <4 x i8> addrspace(1)*
  %out1.load2 = extractelement <5 x i64> %2, i32 1
  %4 = inttoptr i64 %out1.load2 to <4 x i8> addrspace(1)*
  %out2.load3 = extractelement <5 x i64> %2, i32 2
  %5 = inttoptr i64 %out2.load3 to <4 x i8> addrspace(1)*
  %out3.load4 = extractelement <5 x i64> %2, i32 3
  %6 = inttoptr i64 %out3.load4 to <4 x i8> addrspace(1)*
  %in.load5 = extractelement <5 x i64> %2, i32 4
  %7 = inttoptr i64 %in.load5 to <4 x i8> addrspace(1)*
  %out1.kernarg.offset = getelementptr inbounds i8, i8 addrspace(4)* %test_copy_v4i8_x4.kernarg.segment, i64 44
  %out1.kernarg.offset.cast = bitcast i8 addrspace(4)* %out1.kernarg.offset to <4 x i8> addrspace(1)* addrspace(4)*
  %out2.kernarg.offset = getelementptr inbounds i8, i8 addrspace(4)* %test_copy_v4i8_x4.kernarg.segment, i64 52
  %out2.kernarg.offset.cast = bitcast i8 addrspace(4)* %out2.kernarg.offset to <4 x i8> addrspace(1)* addrspace(4)*
  %out3.kernarg.offset = getelementptr inbounds i8, i8 addrspace(4)* %test_copy_v4i8_x4.kernarg.segment, i64 60
  %out3.kernarg.offset.cast = bitcast i8 addrspace(4)* %out3.kernarg.offset to <4 x i8> addrspace(1)* addrspace(4)*
  %in.kernarg.offset = getelementptr inbounds i8, i8 addrspace(4)* %test_copy_v4i8_x4.kernarg.segment, i64 68
  %in.kernarg.offset.cast = bitcast i8 addrspace(4)* %in.kernarg.offset to <4 x i8> addrspace(1)* addrspace(4)*
  %tid.x = call i32 @llvm.amdgcn.workitem.id.x(), !range !1
  %idxprom = sext i32 %tid.x to i64
  %gep = getelementptr <4 x i8>, <4 x i8> addrspace(1)* %7, i64 %idxprom
  %val = load <4 x i8>, <4 x i8> addrspace(1)* %gep, align 4
  store <4 x i8> %val, <4 x i8> addrspace(1)* %3, align 4
  store <4 x i8> %val, <4 x i8> addrspace(1)* %4, align 4
  store <4 x i8> %val, <4 x i8> addrspace(1)* %5, align 4
  store <4 x i8> %val, <4 x i8> addrspace(1)* %6, align 4
  ret void
}

; Function Attrs: nounwind readnone speculatable willreturn
declare i32 @llvm.amdgcn.workitem.id.x() #1

; Function Attrs: nounwind readnone speculatable willreturn
declare i32 @llvm.amdgcn.workitem.id.y() #1

; Function Attrs: nounwind readnone speculatable willreturn
declare align 4 i8 addrspace(4)* @llvm.amdgcn.kernarg.segment.ptr() #2

; Function Attrs: convergent nounwind willreturn
declare { i1, i64 } @llvm.amdgcn.if.i64(i1) #3

; Function Attrs: convergent nounwind willreturn
declare { i1, i64 } @llvm.amdgcn.else.i64.i64(i64) #3

; Function Attrs: convergent nounwind readnone willreturn
declare i64 @llvm.amdgcn.if.break.i64(i1, i64) #4

; Function Attrs: convergent nounwind willreturn
declare i1 @llvm.amdgcn.loop.i64(i64) #3

; Function Attrs: convergent nounwind willreturn
declare void @llvm.amdgcn.end.cf.i64(i64) #3

attributes #0 = { nounwind "amdgpu-memory-bound"="true" "amdgpu-wave-limiter"="true" "target-cpu"="tahiti" }
attributes #1 = { nounwind readnone speculatable willreturn "target-cpu"="tahiti" }
attributes #2 = { nounwind readnone speculatable willreturn }
attributes #3 = { convergent nounwind willreturn }
attributes #4 = { convergent nounwind readnone willreturn }

!0 = !{}
!1 = !{i32 0, i32 1024}
Don't forget to decrease the dereferenceable size in the loop.
llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp:5346
Oh, hmm, yes, I do believe we do indeed, good spot!
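To make the bookkeeping explicit, here is a small, self-contained illustration of the fix being discussed (the names and the chunk-selection rule are invented for the example; the real code works on SDNodes and legal memory types): once a chunk has been emitted and the pointer advanced, the count of bytes still known to be dereferenceable from the new pointer has to shrink by the same amount.

#include <algorithm>
#include <cassert>
#include <cstdio>

// Hypothetical stand-in for the widening chunk loop.
void planChunks(unsigned TotalBytes, unsigned DerefBytes,
                unsigned MaxChunkBytes) {
  assert(DerefBytes >= TotalBytes && "the original load must itself be legal");
  unsigned Offset = 0;
  while (TotalBytes > 0) {
    // Stand-in for FindMemType: the widest chunk covered by the remaining
    // dereferenceable budget. It may be wider than what is left to load
    // (that is the point of widening), but never wider than that budget.
    unsigned Chunk = std::min(DerefBytes, MaxChunkBytes);
    std::printf("load %u bytes at offset %u\n", Chunk, Offset);
    Offset += Chunk;
    TotalBytes -= std::min(TotalBytes, Chunk);
    DerefBytes -= Chunk; // the decrement this review comment asks for
  }
}

For example, planChunks(40, 40, 32) emits a 32-byte chunk at offset 0 followed by an 8-byte chunk at offset 32; without the DerefBytes decrement, the second chunk could again be chosen as 32 bytes and read past the 40 bytes that were actually proven dereferenceable.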
llvm/test/CodeGen/AMDGPU/copy-illegal-type.ll:165 (On Diff #360472)
I meant it looks like an alarming performance regression. I have no reason to believe it's incorrect.
llvm/test/CodeGen/AMDGPU/copy-illegal-type.ll:165 (On Diff #360472)
The problem you're going to hit is that FindMemType searches for the biggest possible valid type, and the dereferenceable change has relaxed that validity check even more. You're probably going to have to get FindMemType to continue searching the smaller types until it fails a second validity test?
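A hedged sketch of that suggestion (the helper and its signature are invented; the real FindMemType works on EVTs and TargetLowering queries): walk the candidate widths from widest to narrowest until one passes both the dereferenceability check and a second, target-specific validity test.

#include <vector>

// Hypothetical helper; candidate sizes are in bytes, widest first.
unsigned findWidestAcceptable(const std::vector<unsigned> &CandidatesWidestFirst,
                              unsigned DerefBytes,
                              bool (*PassesSecondCheck)(unsigned Bytes)) {
  for (unsigned Bytes : CandidatesWidestFirst) {
    if (Bytes > DerefBytes)
      continue; // first test: every byte we'd touch must be dereferenceable
    if (!PassesSecondCheck(Bytes))
      continue; // second test: e.g. "don't grow the load too much"
    return Bytes; // widest candidate passing both tests
  }
  return 0; // caller falls back to splitting into element-sized loads
}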
llvm/test/CodeGen/X86/widen_load-3.ll:118 (On Diff #360497)
How are we proving this transform is legal?
Attempt to fix issues.
Unless I'm doing it wrong, this seems to have fixed the regressions,
and maybe fixed a few existing miscompiles?
Ok, now *this* should be it.
llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp:5330
This is rather horrible :(
llvm/test/CodeGen/ARM/vector-load.ll:257 (On Diff #360594)
We probably shouldn't be widening volatile loads like this. Not that this particular testcase reflects anything useful, but still.
llvm/test/CodeGen/ARM/vector-load.ll:257 (On Diff #360594)
Indeed. Looks like we discovered this bug at the same time.
Hi,
we talked a bit about this internally and found some potential problems.
There were concerns that the assumption "dereferenceable bytes >= alignment" is incorrect: memory protection may work at 4-byte or even byte granularity, and objects may be over-aligned in memory. In these cases, loading the padding bytes does not work.
Even if it can be proven that enough bytes are dereferenceable, we do not want to widen loads for amdgpu (at least not in cases where the load gets a lot larger, i.e. loading 64 bytes instead of 32). Widening a load places more restrictions on the register allocator, as a larger set of consecutive registers needs to be allocated. And, if a widened load hits more cache lines than before, it also consumes more memory bandwidth.
Thanks for taking a look!
This is FUD.
/// isDereferenceable - Return true if V is always dereferenceable for
/// Offset + Size byte.
bool MachinePointerInfo::isDereferenceable(<...>

/// Returns true if V is always dereferenceable for Size byte with alignment
/// greater or equal than requested. If the context instruction is specified
/// performs context-sensitive analysis and returns true if the pointer is
/// dereferenceable at the specified instruction.
bool isDereferenceableAndAlignedPointer(<...>

/// Check if executing a load of this pointer value cannot trap.
///
/// If DT and ScanFrom are specified this method performs context-sensitive
/// analysis and returns true if it is safe to load immediately before ScanFrom.
///
/// If it is not obviously safe to load from the specified pointer, we do
/// a quick local scan of the basic block containing \c ScanFrom, to determine
/// if the address is already accessed.
///
/// This uses the pointee type to determine how many bytes need to be safe to
/// load from the pointer.
bool llvm::isSafeToLoadUnconditionally(Value *V, Align Alignment, APInt &Size,
                                       const DataLayout &DL,
                                       Instruction *ScanFrom,
                                       const DominatorTree *DT,
                                       const TargetLibraryInfo *TLI) {
  // If DT is not specified we can't make context-sensitive query
  const Instruction* CtxI = DT ? ScanFrom : nullptr;
  if (isDereferenceableAndAlignedPointer(V, Alignment, Size, DL, CtxI, DT, TLI))
    return true;
We cannot second-guess LLVM semantics.
If we can tell that we are always allowed to load N bytes without trapping, then we can do that.
If we actually can't, then the problem is elsewhere: we shouldn't have deduced that we can.
> we do not want to widen loads for amdgpu (at least not in cases where the load gets a lot larger, i.e. loading 64 bytes instead of 32). Widening a load places more restrictions on the register allocator, as a larger set of consecutive registers needs to be allocated. And, if a widened load hits more cache lines than before, it also consumes more memory bandwidth.
Ok, so we need a target hook, that's doable.
I agree with that; I misunderstood the code then.
if (LD->isSimple()) {
  NumDereferenceableBytes = LD->getAlignment();
  if (!LdWidth.isScalable())
    NumDereferenceableBytes =
        std::max<unsigned>(NumDereferenceableBytes, LdWidth / 8);
  if (!WidenWidth.isScalable() && NumDereferenceableBytes < WidenWidth / 8 &&
      LD->getPointerInfo().isDereferenceable(
          WidenWidth / 8, *DAG.getContext(), DAG.getDataLayout()))
    NumDereferenceableBytes =
        std::max<unsigned>(NumDereferenceableBytes, WidenWidth / 8);
}
As far as I understand this part, NumDereferenceableBytes is set to the (guaranteed minimum) alignment of the load.
In some cases, where it’s ok, e.g. because the dereferenceable size is set, NumDereferenceableBytes is increased.
So, NumDereferenceableBytes is always at least the alignment?
isDereferenceable(LD->getAlignment()) is never checked, as far as I can see.
CC @nlopes @aqjune - is alive2 correct for LLVM IR semantics?
That is true, and please note that it is already what is happening regardless of this patch, as you can see from the LHS of the diff.
Looks like that was originally added by rL94338.
Now that I look: indeed, while deref is the source of truth (https://alive2.llvm.org/ce/z/9TAAwj vs https://alive2.llvm.org/ce/z/bmKMm4),
as per the LLVM IR semantics as modelled by alive2, alignment does not guarantee dereferenceability: https://alive2.llvm.org/ce/z/-HxoLA
(CC @nlopes @aqjune)
> In some cases, where it's ok, e.g. because the dereferenceable size is set, NumDereferenceableBytes is increased.
> So, NumDereferenceableBytes is always at least the alignment?
> isDereferenceable(LD->getAlignment()) is never checked, as far as I can see.
Right, but that is so regardless of this patch.
@lebedev.ri Yep, Alive2's encoding assumes that dereferenceable has no relation to the align value.
LangRef does not seem to impose that a pointer must be dereferenceable by at least its alignment.
Hmm, that's interesting.
So how does the current existing transform motivate the legality of what it does, I wonder?
I guess the motivation was that the assumption was valid for certain architectures (I'm not an expert in this area, though).
But I agree that such information is valuable not only in codegen but also for middle-end transformations (e.g. the Attributor).
Alignment implying lower granularity of "dereferenceable" isn't justified in LangRef anywhere. In practice, it works out on common targets because it's impossible to allocate less than a page of memory. (At least, in userspace...)
It's something that would be nice to preserve in the cases where it's justified, but maybe not worth the complexity of trying to track the relevant page size on every target.
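If one did want to keep the alignment-based widening where it is justified, the reasoning above could be captured along these lines (a sketch under the stated assumption; MinAllocationGranularity is an invented parameter standing in for "the smallest unit the target can ever allocate or protect", e.g. a page in userspace):

#include <algorithm>
#include <cstdint>

// Bytes known dereferenceable from a pointer that is AlignBytes-aligned and
// dereferenceable for at least KnownDerefBytes, assuming nothing smaller than
// MinAllocationGranularity bytes can ever be allocated or protected.
uint64_t derefImpliedByAlign(uint64_t KnownDerefBytes, uint64_t AlignBytes,
                             uint64_t MinAllocationGranularity) {
  // Under the stated assumption, an AlignBytes-aligned pointer that is
  // dereferenceable at all has the whole [ptr, ptr + AlignBytes) range inside
  // a single minimum-granularity block, so the alignment is a valid lower
  // bound on the dereferenceable size.
  if (KnownDerefBytes > 0 && AlignBytes <= MinAllocationGranularity)
    return std::max(KnownDerefBytes, AlignBytes);
  // Otherwise (byte-granular protection, over-aligned objects, ...) the
  // alignment proves nothing beyond what is already known.
  return KnownDerefBytes;
}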
llvm/test/CodeGen/AMDGPU/kernel-args.ll:2266–2402
I think this diff is misleading. I've pushed an alternative version here, which is based on first autogenerating the checks in this file before applying D106447: https://github.com/jayfoad/llvm-project/commit/ed4a53da7c265675510caeb2c773f1b461354347
In this particular case (v5i64_arg for SI) the relevant part of the diff is:

-; SI-NEXT: s_load_dwordx8 s[4:11], s[0:1], 0x19
-; SI-NEXT: s_load_dwordx2 s[12:13], s[0:1], 0x9
-; SI-NEXT: s_load_dwordx2 s[0:1], s[0:1], 0x21
-; SI-NEXT: s_mov_b32 s15, 0xf000
-; SI-NEXT: s_mov_b32 s14, -1
+; SI-NEXT: s_load_dwordx16 s[4:19], s[0:1], 0x19
+; SI-NEXT: s_load_dwordx2 s[0:1], s[0:1], 0x9
+; SI-NEXT: s_mov_b32 s3, 0xf000
+; SI-NEXT: s_mov_b32 s2, -1

I.e. the load from scaled offset 0x9 is untouched; the significant difference is that the dwordx8 and dwordx2 loads from offsets 0x19 and 0x21 have been replaced by a single dwordx16 load from offset 0x19.
llvm/test/CodeGen/AMDGPU/kernel-args.ll:2266–2402
That's why nobody likes manually-written check lines...
llvm/test/CodeGen/AMDGPU/kernel-args.ll:2266–2402
Actually, I see that you have not pushed https://github.com/llvm/llvm-project/commit/00b0f1ed7d5833695ba266854ee1d07467ddcf9c, so there isn't much point in rebasing.
Rebased after D107052 (thanks!)
I think the AMDGPU test change looks somewhat less scary now.
Thoughts?
I think we will still need a target hook or some other way of restricting this. In the v5i64_arg case I quoted earlier we're loading 16 dwords instead of 10. Yes the 16-dword load is legal, but that doesn't mean it's a good idea to use it. It increases register pressure (which is especially bad on GPUs) and may load from more cache lines.
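For what it's worth, a hedged sketch of what such a restriction could look like (the hook name, signature, and policy are invented for illustration; no such hook exists in TargetLowering today):

#include <cstdint>

struct LoadWideningPolicy {
  // Should a LoadBytes-wide load be widened to WidenedBytes, given that the
  // extra bytes are already known to be dereferenceable?
  virtual bool isLoadWideningProfitable(uint64_t LoadBytes,
                                        uint64_t WidenedBytes) const {
    return true; // current behaviour: widen whenever it is legal
  }
  virtual ~LoadWideningPolicy() = default;
};

// A GPU-like target might cap the growth, to avoid tying up a larger block of
// consecutive registers and touching extra cache lines.
struct GpuLikeWideningPolicy : LoadWideningPolicy {
  bool isLoadWideningProfitable(uint64_t LoadBytes,
                                uint64_t WidenedBytes) const override {
    return WidenedBytes - LoadBytes <= 8; // illustrative threshold only
  }
};

Under a policy like that, the v5i64_arg case (widening a 40-byte load to 64 bytes) would be rejected, while small round-ups would still be allowed.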
clang-tidy: warning: invalid case style for function 'FindMemType' [readability-identifier-naming]
not useful