The previous change to fix a crash in the vectorizer introduced
performance regressions. The condition to preserve the pointer
address space during the search is too tight; we only need to
match the pointer size.
Repository: rL LLVM
test/Transforms/LoadStoreVectorizer/AMDGPU/vect-ptr-ptr-size-mismatch.ll:29–30 (On Diff #212880)
It would probably be a more useful test to use an address space that will actually be vectorized (i.e. use global instead of flat).
test/Transforms/LoadStoreVectorizer/AMDGPU/vect-ptr-ptr-size-mismatch.ll:29–30 (On Diff #212880)
It is neither flat nor global; I have removed all references to AMD here. There is no triple and no calling convention, only a data layout string. It looks to me that the root cause is in:
lib/Transforms/Vectorize/LoadStoreVectorizer.cpp:347–348 (On Diff #212880)
If we were to use addrspacecast(1) instead of 5 in the example below, we'd proceed with the checks. I guess one way to handle mismatched address spaces would be to normalize the pointers to a common address space (generic?) and then run the consecutiveness checks.
test/Transforms/LoadStoreVectorizer/AMDGPU/vect-ptr-ptr-size-mismatch.ll:3 (On Diff #212880)
It would be good to add a comment that p5:32:32 is the critical part here, as that's what causes the original problem.
Test update.
lib/Transforms/Vectorize/LoadStoreVectorizer.cpp:347–348 (On Diff #212880)
In the meanwhile I am working on a follow-up patch to handle pointer size differences in a graceful way. That requires more code and time, though.
lib/Transforms/Vectorize/LoadStoreVectorizer.cpp:347–348 (On Diff #212880)
The reason SCEV is unhappy is that we're asking it to do something with different types. Bailing out early will avoid the problem, but it's still too conservative, IMO. We may still be able to do useful optimizations with such pointers.
lib/Transforms/Vectorize/LoadStoreVectorizer.cpp:347–348 (On Diff #212880)
Look, as I said, I am working on the more relaxed patch to handle it. I am just trying to think about the corner cases, for example this one:

```
target datalayout = "e-p:64:64-p1:64:64-p5:32:32"

define amdgpu_kernel void @ext_ptr_overflow_unsigned(i32 addrspace(5)* %p) {
entry:
  %gep1 = getelementptr inbounds i32, i32 addrspace(5)* %p, i64 4294967297
  %gep2 = getelementptr inbounds i32, i32 addrspace(5)* %p, i64 4294967298
  %a.ascast = addrspacecast i32 addrspace(5)* %gep1 to i32*
  %b.ascast = addrspacecast i32 addrspace(5)* %gep2 to i32*
  %tmp1 = load i32, i32* %a.ascast, align 8
  %tmp2 = load i32, i32* %b.ascast, align 8
  unreachable
}
```
lib/Transforms/Vectorize/LoadStoreVectorizer.cpp:347–348 (On Diff #212880)
Fair enough. Let's land your patch as is to unblock things for now.
lib/Transforms/Vectorize/LoadStoreVectorizer.cpp:347–348 (On Diff #212880)
Looks like stripAndAccumulateInBoundsConstantOffsets() takes care of it, so the offset always fits the smallest data type in the chain.
llvm/trunk/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp:347
Why the store size?
llvm/trunk/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp:347
This is a memory access, so it should use the in-memory size.