The motivating example is:
#include <tuple>

enum ResultType {
  a, b, c, d,
  Error,
};

struct Result {
  Result(ResultType type = Error, unsigned hash = 0) : type(type), hash(hash) {}
  ResultType type;
  unsigned hash;
};

template<typename Function>
inline Result foo(Function function) {
  bool done;
  Result result;
  std::tie(done, result) = function();
  if (done)
    return result;
  return Result(Error);
}

int main(int argc, char** argv) {
  auto function = [] {
    return std::make_tuple(false, Result());
  };
  Result result = foo(function);
  return int(result.type);
}
When compiled with libc++, LLVM generates:
movb  $0, -16(%rsp)
movq  $4, -12(%rsp)
movq  -16(%rsp), %rcx
movq  %rcx, -16(%rsp)
movl  $4, %eax
testb %cl, %cl
je    .LBB0_2
All of the memory accesses are redundant.
The problem is that the underlying tuple structure looks like
{i8, {i32, i32}}
Its total size is 96 bits, small enough to be returned in registers, but as a function return value its type is changed to
{i64, i32}
So the temporary alloca that receives the result of the lambda function is written and read as different types. When alloca slices are built from these memory accesses, the slices overlap each other:
Slices of alloca:   %6 = alloca %"struct.std::__u::__tuple_impl", align 8
  [0,8) slice #0
    used by:   store i64 %20, i64* %22
  [0,1) slice #1
    used by:   %31 = load i8, i8* %30, align 8
  [0,12) slice #2 (splittable)
    used by:   call void @llvm.lifetime.end.p0i8(i64 12, i8* %40)
  [0,12) slice #3 (splittable)
    used by:   call void @llvm.lifetime.start.p0i8(i64 12, i8* %12)
  [4,12) slice #4
    used by:   %37 = load i64, i64* %36, align 4
  [8,12) slice #5
    used by:   store i32 %21, i32* %23, align 8
All of the slices are then grouped together into a single partition, so no SROA occurs.
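The grouping effect can be reproduced with a small interval-merging sketch. It is only an illustration of why the non-splittable slices in the dump above collapse into one range; `ByteRange`, `mergeOverlapping`, and `exampleSlices` are hypothetical names, and this is not SROA's actual partitioning code.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Half-open byte range [begin, end) inside the alloca.
struct ByteRange {
  uint64_t begin;
  uint64_t end;
};

// Merge ranges that overlap into maximal groups. Hypothetical helper that
// mimics the effect of overlapping non-splittable slices; it is an
// assumption-level model, not the pass's real algorithm.
inline std::vector<ByteRange> mergeOverlapping(std::vector<ByteRange> ranges) {
  std::sort(ranges.begin(), ranges.end(),
            [](const ByteRange &a, const ByteRange &b) {
              return a.begin < b.begin || (a.begin == b.begin && a.end < b.end);
            });
  std::vector<ByteRange> merged;
  for (const ByteRange &r : ranges) {
    if (!merged.empty() && r.begin < merged.back().end) {
      // Overlaps the current group: extend it.
      merged.back().end = std::max(merged.back().end, r.end);
    } else {
      // Starts at or after the end of the current group: open a new one.
      merged.push_back(r);
    }
  }
  return merged;
}

// The four non-splittable slices from the dump above (#0, #1, #4, #5).
inline std::vector<ByteRange> exampleSlices() {
  return {{0, 8}, {0, 1}, {4, 12}, {8, 12}};
}
```

Calling `mergeOverlapping(exampleSlices())` yields a single range covering bytes 0 through 12, because the [0,8) store and the [4,12) load chain every other slice together.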
This patch solves the problem by splitting integer loads and stores that cover multiple fields of the alloca aggregate when those fields have different parent structures. In the following example:
{i32, {i32, i32}}

%ptrval = ptrtoint %struct.ptr* %ptr to i64
%ptrval2 = add i64 %ptrval, 4
%ptr1 = inttoptr i64 %ptrval to i64*
%ptr2 = inttoptr i64 %ptrval2 to i64*
%val1 = load i64, i64* %ptr1, align 4
%val2 = load i64, i64* %ptr2, align 4
The first 64-bit load will be rewritten as two 32-bit loads because it actually accesses two fields of the original aggregate, and the two fields do not belong to the same inner structure.
The second load will not be rewritten because all fields it accesses belong to the same inner structure, which is a common case in LLVM IR.
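The rule for the two loads can be sketched as a predicate over a flattened field layout. This is a minimal model under stated assumptions, not the patch's implementation: `LeafField`, `shouldSplitAccess`, `exampleLayout`, and the integer `parent` ids are hypothetical names introduced for illustration.

```cpp
#include <cstdint>
#include <vector>

// One leaf field of the flattened aggregate. `parent` is a hypothetical id
// for the innermost struct that directly contains the field.
struct LeafField {
  uint64_t offset; // byte offset from the start of the alloca
  uint64_t size;   // size in bytes
  int parent;      // id of the enclosing struct
};

// Returns true when the access [begin, end) touches fields that do not all
// share the same parent struct, i.e. the case described above as splittable.
inline bool shouldSplitAccess(const std::vector<LeafField> &fields,
                              uint64_t begin, uint64_t end) {
  bool sawField = false;
  int firstParent = -1;
  for (const LeafField &f : fields) {
    bool overlaps = f.offset < end && begin < f.offset + f.size;
    if (!overlaps)
      continue;
    if (!sawField) {
      sawField = true;
      firstParent = f.parent;
    } else if (f.parent != firstParent) {
      return true; // fields from different parent structs
    }
  }
  return false;
}

// Layout of {i32, {i32, i32}}: parent 0 is the outer struct, 1 the inner one.
inline std::vector<LeafField> exampleLayout() {
  return {{0, 4, 0}, {4, 4, 1}, {8, 4, 1}};
}
```

With this layout, `shouldSplitAccess(exampleLayout(), 0, 8)` models the first load and returns true, while `shouldSplitAccess(exampleLayout(), 4, 12)` models the second load and returns false.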
Rearranging functions like this makes it harder to read the patch.