This is an archive of the discontinued LLVM Phabricator instance.

[SROA] Limit the number of allowed slices when trying to split allocas
ClosedPublic

Authored by 0xdc03 on Sep 1 2023, 5:55 AM.

Details

Summary

This patch adds a hidden CLI option "--sroa-max-alloca-slices", an integer
that controls the maximum number of alloca slices SROA will consider
before bailing out. This is useful because it may not be profitable to
split a memcpy into (possibly tens of) thousands of loads and stores. It
also prevents an exponential compile-time explosion in passes like DSE and
MemCpyOpt caused by excessive alloca splitting.

Fixes https://github.com/rust-lang/rust/issues/88580.
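Assuming the option lands under the name given above, an invocation would look something like the following (the pipeline and input file name are illustrative, not part of the patch):

```
# Hypothetical usage: cap SROA at 1024 slices per alloca while running
# the default O3 pipeline on some input module.
opt -passes='default<O3>' --sroa-max-alloca-slices=1024 input.ll -S -o output.ll
```

Being a hidden cl::opt, the flag would not appear in the default `opt --help` listing, but could still be set explicitly as shown.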

Diff Detail

Event Timeline

0xdc03 created this revision.Sep 1 2023, 5:55 AM
0xdc03 requested review of this revision.Sep 1 2023, 5:55 AM
Herald added a project: Restricted Project.Sep 1 2023, 5:55 AM
nikic added inline comments.Sep 1 2023, 6:15 AM
llvm/lib/Transforms/Scalar/SROA.cpp
127

This limit is probably way too low. I'd expect something on the order of 1024 here, otherwise there will be regressions.

0xdc03 added inline comments.Sep 1 2023, 8:17 AM
llvm/lib/Transforms/Scalar/SROA.cpp
127

This limit is probably way too low. I'd expect something on the order of 1024 here, otherwise there will be regressions.

The reason I put 32 is that I thought 1024 was quite slow... here are some statistics for different values (running opt -O3 on unoptimized IR produced by rustc):

2    : 0.05s user
4    : 0.04s user
8    : 0.06s user
16   : 0.06s user
32   : 0.08s user
64   : 0.05s user
128  : 0.12s user
256  : 0.56s user
512  : 0.59s user
1024 : 6.62s user
2048 : 53.27s user

I tried to run opt without the patch, but I had to give up after 20 minutes. The exponential growth here is quite crazy, but it seems manageable at 512.
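The blow-up in these numbers can be quantified by taking the ratio between consecutive rows; a small Python sketch over the timings quoted above (the data is copied from the table, nothing else is assumed):

```python
# Slice-limit -> `opt -O3` user time (seconds), copied from the table above.
timings = {
    2: 0.05, 4: 0.04, 8: 0.06, 16: 0.06, 32: 0.08,
    64: 0.05, 128: 0.12, 256: 0.56, 512: 0.59,
    1024: 6.62, 2048: 53.27,
}

limits = sorted(timings)
for prev, cur in zip(limits, limits[1:]):
    # Ratio of runtimes each time the slice limit doubles; a constant-factor
    # ratio above 1 indicates super-linear (here roughly exponential) growth.
    ratio = timings[cur] / timings[prev]
    print(f"{prev:>4} -> {cur:>4}: {ratio:5.2f}x")
```

The last two doublings cost roughly 11x and 8x respectively, which is why 512 still looks manageable while 2048 does not.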

nikic added a comment.Sep 5 2023, 3:48 AM

As you are adding the limit as an option, you could also include the motivating test case (but specify a lower limit to not blow up the test output).

llvm/lib/Transforms/Scalar/SROA.cpp
127

I think it's okay if it doesn't fully address the compile-time issue, just get it down to a reasonable range. We can't make this limit too low (at least without a more sophisticated heuristic), because things like unrolled loops might legitimately want to SROA into a fairly large number of values.

The primary justification for this change should IMHO be that it is not really profitable to split up a single memcpy into tens or hundreds of thousands of load+store operations. Avoiding compile-time issues is a nice extra benefit -- but probably also something we can mitigate by additional changes to the passes that actually end up being slow.

0xdc03 updated this revision to Diff 556282.Sep 8 2023, 10:15 AM
  • Address reviewer comments
    • Add a test case
    • Increase default value from 32 to 1024
    • Change patch motivation
0xdc03 edited the summary of this revision. (Show Details)Sep 8 2023, 10:16 AM
nikic accepted this revision.Sep 8 2023, 7:39 PM

LGTM

This revision is now accepted and ready to land.Sep 8 2023, 7:39 PM
alex-t added a subscriber: alex-t.EditedNov 7 2023, 7:41 AM

This change breaks AMDGPU backend performance. It causes an 85% performance drop in certain rocBLAS benchmarks.

0xdc03 added a comment.Nov 7 2023, 8:10 AM

This change breaks AMDGPU backend performance. It causes an 85% performance drop in certain rocBLAS benchmarks.
The approach itself seems weird to me. You'd better add a target hook to let each target decide on the threshold instead of deciding for others.

This is also being discussed over at https://github.com/llvm/llvm-project/issues/69785, see specifically @nikic's comment. If it's causing issues for too many users, then it may be a good idea to revert for now.