- If a load is naturally aligned but not DWORD aligned, load wideningg should handle the case where the address is the sum of a base pointer and a constant offset. That load is able to be widened by aligning that address and extracting the narrow value from the high part.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Unit Tests
Time | Test | |
---|---|---|
390 ms | linux > HWAddressSanitizer-x86_64.TestCases::sizes.cpp |
Event Timeline
llvm/lib/Target/AMDGPU/SIISelLowering.cpp | ||
---|---|---|
7469 ↗ | (On Diff #265435) | Typo quaranteed |
7472–7475 ↗ | (On Diff #265435) | This will be true for all of the preloaded SGPRs. However I think looking for the argument copies here is the wrong approach. The original intrinsic calls should have been annotated with the align return value attribute. Currently this is a burden on the frontend/library call emitting the intrinsic. We could either annotate the intrinsic calls in one of the later passes (maybe AMDGPUCodeGenPrepare), or add a minimum alignment to the intrinsic definition which will always be applied, similar to how intrinsic declarations already get their other attributes |
7509–7550 ↗ | (On Diff #265435) | We already have essentially the same code in lowerKernargMemParameter (plus an IR version in AMDGPULowerKernelArguments). This just generalizes to any isBaseWithConstantOffset. Can you refactor these to avoid the duplication? |
7535 ↗ | (On Diff #265435) | Can you use one of the simpler getLoad overloads? We don't really ever need to mention unindexed |
7537 ↗ | (On Diff #265435) | I think you could have a better alignment than 4 here |
llvm/test/CodeGen/AMDGPU/llvm.amdgcn.dispatch.ptr.ll | ||
32 | Should also have some tests with multiple loads. You should make sure that a use of the base address load, and a use of the offset address both CSE into a single load with 1 shift. Also could test with an explicit align attribute on the dispatch ptr call |
llvm/lib/Target/AMDGPU/SIISelLowering.cpp | ||
---|---|---|
7472–7475 ↗ | (On Diff #265435) | do we have that attribute available? could you elaborate in detail? |
llvm/lib/Target/AMDGPU/SIISelLowering.cpp | ||
---|---|---|
7472–7475 ↗ | (On Diff #265435) | OK, I found that. |
llvm/lib/Target/AMDGPU/SIISelLowering.cpp | ||
---|---|---|
7472–7475 ↗ | (On Diff #265435) | In another review D80422, relevant intrinsics are being annotated with assumed alignment on the return pointer. Unfortunately, in SelectionDAG, we won't have the facility to keep tracking of that hint. I will enhance AMDGPUCodeGenPrepare pass to do that similar thing. |
llvm/lib/Target/AMDGPU/SIISelLowering.cpp | ||
---|---|---|
7472–7475 ↗ | (On Diff #265435) | We don't need to explicitly know the pointer alignment. If the intrinsic was properly annotated from the beginning (as would be the case after D80422), the downstream load users would have the correct alignment assigned. The optimizer propagates alignment information already |
llvm/lib/Target/AMDGPU/SIISelLowering.cpp | ||
---|---|---|
7472–7475 ↗ | (On Diff #265435) | That load won't be marked as DWORD aligned even after those intrinsics are annotated correctly as they are just not DWORD aligned. For example, the newly-added test has an i16 load from pointer (dippatch.ptr + 6). It's not DOWRD aligned. The transformation performed here is to widen such load if that pointer is calculated in the form a DWORD aligned pointer (base) with a constant offset (off). With that, (i16 load (add base, off)) is translated into (trunc (lshr (i32 load (add base, off - 2))) We need to know that base is DWORD aligned. However, in SDNode, we don't have any facility to retain the original alignment from LLVM IR. |
llvm/lib/Target/AMDGPU/SIISelLowering.cpp | ||
---|---|---|
7472–7475 ↗ | (On Diff #265435) | is computeKnownBitsForTargetNode called for CopyFromReg? If so, we could move the argument check in there |
llvm/lib/Target/AMDGPU/SIISelLowering.cpp | ||
---|---|---|
7472–7475 ↗ | (On Diff #265435) | unfortunately NO |
this change depends on D80422 to add align attribute properly on relevant intrinsics.
I'd still like to find a way to avoid a whole extra pass run for this. In the test here, the LoadStoreVectorizer should have vectorized these? Why didn't it?
llvm/lib/Target/AMDGPU/AMDGPULateCodeGenPrepare.cpp | ||
---|---|---|
33–37 | If a separate pass is going to do the load widening, you can remove it from AMDGPUCodeGenPrepare | |
llvm/test/CodeGen/AMDGPU/llvm.amdgcn.dispatch.ptr.ll | ||
33 | Missing align? |
That's due to the misaligned load after coalescing. This example is written intentionally to skip LSV and verify that common widened load could be CSEd within DAG. In practice, there won't be such input as the 1st i16 load should be properly annotated with align 4.
But the load isn't misaligned. The point of adding the alignment to the intrinsic declaration was so that the whole optimization pipeline would know about the alignment. Taking this example:
after opt -instcombine:
define amdgpu_kernel void @test3(i32 addrspace(1)* nocapture %out) local_unnamed_addr #0 { %dispatch_ptr = tail call noalias i8 addrspace(4)* @llvm.amdgcn.dispatch.ptr() %d1 = getelementptr inbounds i8, i8 addrspace(4)* %dispatch_ptr, i64 4 %h1 = bitcast i8 addrspace(4)* %d1 to i16 addrspace(4)* %v1 = load i16, i16 addrspace(4)* %h1, align 4 %e1 = zext i16 %v1 to i32 %d2 = getelementptr inbounds i8, i8 addrspace(4)* %dispatch_ptr, i64 6 %h2 = bitcast i8 addrspace(4)* %d2 to i16 addrspace(4)* %v2 = load i16, i16 addrspace(4)* %h2, align 2 %e2 = zext i16 %v2 to i32 %o = add nuw nsw i32 %e2, %e1 store i32 %o, i32 addrspace(1)* %out, align 4 ret void }
Now the load's alignment was correctly inferred to be higher, so then the vectorizer handles this as expected:
opt -instcombine -load-store-vectorizer -instsimplify gives:
define amdgpu_kernel void @test3(i32 addrspace(1)* nocapture %out) local_unnamed_addr #0 { %dispatch_ptr = tail call noalias i8 addrspace(4)* @llvm.amdgcn.dispatch.ptr() %d1 = getelementptr inbounds i8, i8 addrspace(4)* %dispatch_ptr, i64 4 %h1 = bitcast i8 addrspace(4)* %d1 to i16 addrspace(4)* %1 = bitcast i16 addrspace(4)* %h1 to <2 x i16> addrspace(4)* %2 = load <2 x i16>, <2 x i16> addrspace(4)* %1, align 4 %v11 = extractelement <2 x i16> %2, i32 0 %v22 = extractelement <2 x i16> %2, i32 1 %e1 = zext i16 %v11 to i32 %e2 = zext i16 %v22 to i32 %o = add nuw nsw i32 %e2, %e1 store i32 %o, i32 addrspace(1)* %out, align 4 ret void }
So I think with the alignment patch, for any real world example, this would do the right thing. Your example fails here because the IR wasn't pre-optimized. We don't need to expect perfect optimization from the backend for any random IR and need to consider the entire pass pipeline
but we run llc only in this test and skip all middle-end optimizations. That's why I said this test is impractical in the normal scenario where middle-end is always run. This test is added to address your previous concern on CSE of widened loads. If not required, I could remove this test.
define amdgpu_kernel void @test3(i32 addrspace(1)* nocapture %out) local_unnamed_addr #0 { %dispatch_ptr = tail call noalias i8 addrspace(4)* @llvm.amdgcn.dispatch.ptr() %d1 = getelementptr inbounds i8, i8 addrspace(4)* %dispatch_ptr, i64 4 %h1 = bitcast i8 addrspace(4)* %d1 to i16 addrspace(4)* %v1 = load i16, i16 addrspace(4)* %h1, align 4 %e1 = zext i16 %v1 to i32 %d2 = getelementptr inbounds i8, i8 addrspace(4)* %dispatch_ptr, i64 6 %h2 = bitcast i8 addrspace(4)* %d2 to i16 addrspace(4)* %v2 = load i16, i16 addrspace(4)* %h2, align 2 %e2 = zext i16 %v2 to i32 %o = add nuw nsw i32 %e2, %e1 store i32 %o, i32 addrspace(1)* %out, align 4 ret void }Now the load's alignment was correctly inferred to be higher, so then the vectorizer handles this as expected:
opt -instcombine -load-store-vectorizer -instsimplify gives:
define amdgpu_kernel void @test3(i32 addrspace(1)* nocapture %out) local_unnamed_addr #0 { %dispatch_ptr = tail call noalias i8 addrspace(4)* @llvm.amdgcn.dispatch.ptr() %d1 = getelementptr inbounds i8, i8 addrspace(4)* %dispatch_ptr, i64 4 %h1 = bitcast i8 addrspace(4)* %d1 to i16 addrspace(4)* %1 = bitcast i16 addrspace(4)* %h1 to <2 x i16> addrspace(4)* %2 = load <2 x i16>, <2 x i16> addrspace(4)* %1, align 4 %v11 = extractelement <2 x i16> %2, i32 0 %v22 = extractelement <2 x i16> %2, i32 1 %e1 = zext i16 %v11 to i32 %e2 = zext i16 %v22 to i32 %o = add nuw nsw i32 %e2, %e1 store i32 %o, i32 addrspace(1)* %out, align 4 ret void }So I think with the alignment patch, for any real world example, this would do the right thing. Your example fails here because the IR wasn't pre-optimized. We don't need to expect perfect optimization from the backend for any random IR and need to consider the entire pass pipeline
This was more of a concern when letting the DAG handle it. Since you added the implied align attribute, I don't think anything else should be needed
I mean beyond the test, I think the whole patch is unnecessary now. Do you have an example of real source that still needs this?
test2, that's a real example. The widening done in this patch is not supported in any code in LLVM. The load, by itself, is not DWORD aligned but only naturally aligned no matter whether alignment marking is correct or not.
I did some experiments locally and think this can stay in AMDGPUCodeGenPrepare, and doesn't need the split pass. Since you restrict this widening to the case where you're rebasing the load anyway, I don't think this will cause the same problems with the vectorizer the previous IR load widening had (and may help it even?)
test3 should also come back, but should have the explicit align 4 added to the load. This could also use some loads of i8, and <2 x i8>. We could also extend this to handle wider, sub-dword aligned types but that's a separate patch.
Scalar load widening should run after LSV to generate redundant loads. Cases like a sequence of consecutive loads of i16 benefit from such an organization to avoid redundant load generation. Here's the details
for 4 loads of i16
ld.i16 (ptr + 0) ld.i16 (ptr + 2) ld.i16 (ptr + 4) ld.i16 (ptr + 6)
If we run scalar load widening before LSV. After widening, we have
ld.i16 (ptr + 0) ld.i32 (ptr + 0) ld.i16 (ptr + 4) ld.i32 (ptr + 4)
After LSV, we have
ld.i16 (ptr + 0) ld.i32x2 (ptr + 0) ld.i16 (ptr + 4)
That 2 i16 loads are redundant. If we run scalar load widening after LSV, we won't have that result.
llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp | ||
---|---|---|
801 | It can simply to to GCNPassConfig::addPreISel(). |
This patch is still required as MMO's alignment is calculated based on the offset from the base alignment. As the base alignment is the alignment from the pointer in the IR, it cannot be modified. We need extra logic to re-align MMO operand if we widen the original one. For instance of a 16-bit load from ptr has an alignment of 2, if ptr is equivalent to base - 2 and base's alignment is 4, we could widen that 16-bit load to 32-bit load from ptr - 2with an alignment 4. But, as we cannot change IR in MMO, we need extra stuff to in the new MMO could assume that new alignment.
If a separate pass is going to do the load widening, you can remove it from AMDGPUCodeGenPrepare