Details

Reviewers

nemanjai
echristo
steven.zhang
hfinkel
hiraditya
jsji
efriedma

Commits

rG2d51adcb5714: [PowerPC] Set the innermost hot loop to align 32 bytes
rL363495: [PowerPC] Set the innermost hot loop to align 32 bytes

Summary

If the nested loop is an innermost loop, prefer a 32-byte alignment, so that we can decrease cache misses and branch-prediction misses. The actual alignment of the loop will depend on the hotness check and other logic in alignBlocks.

The old code aligns a hot loop to 32 bytes only when LoopSize is larger than 16 bytes and smaller than 32 bytes. This patch aligns the innermost hot loop to 32 bytes regardless of whether its size falls in the 16~32 byte range.
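
The policy change can be sketched as a small standalone model. This is an illustration, not the LLVM code: `LoopShape` and `prefLoopAlignLog2` are hypothetical names, the 16-byte fallback is an assumption taken from the `.p2align 4` expectations in the test, and the exact boundaries of the 16~32 byte window are assumed to match the "between 5 and 8 instructions" comment in the diff. The return value is the log2 of the alignment, so 5 means 32 bytes.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified description of a loop; not an LLVM type.
struct LoopShape {
  unsigned Depth;            // 1 = outermost loop, 2 or more = nested loop
  bool HasSubLoops;          // false for an innermost loop
  std::uint64_t SizeInBytes; // total size of the loop's instructions
};

// Returns log2 of the preferred alignment (4 -> 16 bytes, 5 -> 32 bytes).
constexpr unsigned prefLoopAlignLog2(const LoopShape &L,
                                     bool InnermostAlign32Enabled) {
  // Rule added by this patch: a nested innermost loop prefers 32 bytes.
  if (InnermostAlign32Enabled && L.Depth > 1 && !L.HasSubLoops)
    return 5;
  // Pre-existing rule: only small loops (assumed window: 17..32 bytes).
  if (L.SizeInBytes > 16 && L.SizeInBytes <= 32)
    return 5;
  return 4; // assumed default: 16 bytes
}

// A 48-byte nested innermost loop gets 32-byte alignment only with the new rule.
static_assert(prefLoopAlignLog2(LoopShape{2, false, 48}, true) == 5, "");
static_assert(prefLoopAlignLog2(LoopShape{2, false, 48}, false) == 4, "");
```

Whether the alignment directive is actually emitted still depends on the hotness check in alignBlocks, which this model does not cover.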

For some special cases, the performance can improve more than 30% after adding the patch for ppc.

This patch has a dependency on the patch D61227: [NFC]][PowerPC] Use -check-prefixes to simplify the check in code-align.ll.

Diff Detail

Event Timeline

ZhangKang created this revision. Apr 27 2019, 9:41 AM

Herald added a project: Restricted Project. Apr 27 2019, 9:41 AM

ZhangKang edited the summary of this revision. Apr 27 2019, 9:44 AM

For some special cases, the performance can improve more than 30% after adding the patch for ppc.

Any significant regressions?

jsji added a parent revision: D61227: [NFC]][PowerPC] Use -check-prefixes to simplify the check in code-align.ll. Apr 29 2019, 7:33 AM

In D61228#1482324, @hfinkel wrote:

For some special cases, the performance can improve more than 30% after adding the patch for ppc.

Any significant regressions?

@hfinkel, I have run the SPEC tests with this patch applied, and the performance is the same. The old code aligns a loop whose size is more than 32 bytes to 16 bytes. To get better performance from this patch, the outer loop needs to be very large and the innermost loop very small, and I think few cases meet that condition.

My statement that "For some special cases, the performance can improve more than 30%" refers to a small special case I wrote. Below is the test case on P9:

cat foo.c

struct parm {
  int *arr;
  int m;
  int n;
};
void foo(struct parm *arg) {
  struct parm localArg = *arg;
  int m = localArg.m;
  int *s = localArg.arr;
  int n = localArg.n;
  do{
    int k = n;
    do{
      s[++k] = k++;
      s[k++] = k;
      s[k++] = k;
      s[k] = k;
      s[--k] = k--;
      s[k--] = k;
      s[--k] = k;
    }while(k--);
  } while(m--);

  s[n]=0;
}

cat main.c

struct parm {
  int *arr;
  int m;
  int n;
};
void foo(struct parm*);
int main() {
  int a[5000];
  struct parm arg = {a, 2000000000, 5};
  foo(&arg);
  return 0;
}

cat run.ksh

set -x
# profile-generate
rm t t.* t_* *.o *.s *.profraw *.profdata
clang -c main.c -O -fprofile-generate
clang -S foo.c -O -fno-vectorize -mllvm -unroll-count=0 -fprofile-generate
clang -o t main.o foo.s -fprofile-generate
objdump -dr t > t.dis
time -p ./t

# merge
llvm-profdata merge *.profraw -output=merge.profdata

# profile-use
clang -c main.c -O -fprofile-use=merge.profdata
clang -S foo.c -O -fno-vectorize -mllvm -unroll-count=0 -fprofile-use=merge.profdata
clang -o t_pgo main.o foo.s -fprofile-use=merge.profdata
objdump -dr t_pgo > t_pgo.dis
time -p ./t_pgo

The original result (without aligning the loop to 32 bytes) is below:

real 21.74
user 21.74
sys 0.00

After adding the patch, the result is below:

real 14.37
user 14.37
sys 0.00
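
For reference, the two timings above can be turned into a relative figure. The sketch below uses hypothetical helper names (`timeReduction`, `speedup`) and only the two `real` values reported in this comment; it is arithmetic on those numbers, not an additional measurement.

```cpp
#include <cassert>

// Fraction of run time saved: (before - after) / before.
constexpr double timeReduction(double Before, double After) {
  return (Before - After) / Before;
}

// Speedup factor: before / after.
constexpr double speedup(double Before, double After) {
  return Before / After;
}

// Timings reported above, in seconds.
constexpr double RealBefore = 21.74; // without 32-byte alignment
constexpr double RealAfter = 14.37;  // with the patch

// Roughly a 34% reduction in run time, i.e. about a 1.51x speedup.
static_assert(timeReduction(RealBefore, RealAfter) > 0.33, "");
static_assert(timeReduction(RealBefore, RealAfter) < 0.35, "");
static_assert(speedup(RealBefore, RealAfter) > 1.51, "");
static_assert(speedup(RealBefore, RealAfter) < 1.52, "");
```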

Note that the speedup differs depending on the parameters passed to foo. In general, if the outer loop is larger and the inner loop is smaller, branch prediction is more likely to fail, and the speedup is larger. For example, the speedup for foo(a, 2000000000, 5) is larger than for foo(a, 20000000, 500).

@hfinkel , do you have any other comments?

In D61228#1504303, @ZhangKang wrote:

@hfinkel , do you have any other comments?

Can you explain why we don't always do this? You're checking for profiling data but then not using it?

Can you explain why we don't always do this? You're checking for profiling data but then not using it?

I think the idea is that with PGO data, we are more certain that the hotness information is actually meaningful - since the alignment directive will only be emitted for loops that are "hot". Without PGO data, we will align a majority of loops which may be overkill. But this should be clearly stated in the comment though.

In D61228#1505513, @nemanjai wrote:

Can you explain why we don't always do this? You're checking for profiling data but then not using it?

I think the idea is that with PGO data, we are more certain that the hotness information is actually meaningful - since the alignment directive will only be emitted for loops that are "hot". Without PGO data, we will align a majority of loops which may be overkill. But this should be clearly stated in the comment though.

Okay, but we just call MBB->getParent()->getFunction().hasProfileData(), where do we actually check that the loop is hot?

Also, even if we align the majority of loops, how much does that really cost us? The code-size impact could be minor compared to the perf improvement, and if so, we should just always do it. It is still true that most users don't use PGO.

Okay, but we just call MBB->getParent()->getFunction().hasProfileData(), where do we actually check that the loop is hot?

Also, even if we align the majority of loops, how much does that really cost us? The code-size impact could be minor compared to the perf improvement, and if so, we should just always do it. It is still true that most users don't use PGO.

Short story: yes, I agree that we should probably just do this regardless of PGO if we don't see any significant performance regressions on important benchmarks.

TL; DR;
The hotness of the loop is checked in MachineBlockPlacement::alignBlocks(). I think the concern with aligning all loops statically determined to be "hot" to 32-bytes runs the risk of a pathologically bad case such as the following:

for (int i = 0; i < HugeValue; i++) {
  // Enough instructions to make the inner loop fall one instruction past a 32-byte boundary
  for (int j = 0; j < UnpredictableValueHighlyLikelyToBeZero; j++)
    // Do something short
}

Such a case would end up with 7 nops to align the inner loop which would presumably tie up dispatch slots. All that being said, I just ran an experiment with exactly that pathological case and the performance degrades by about 1% (which may even be in the noise).
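
The "7 nops" figure follows from the fixed 4-byte instruction size on these processors: a loop header that sits one instruction past a 32-byte boundary needs 28 bytes of padding to reach the next boundary. A minimal sketch of that arithmetic, with hypothetical helper names and an arbitrary example address:

```cpp
#include <cassert>
#include <cstdint>

// Bytes of padding needed to move Addr up to the next multiple of
// 2^Log2Align (0 if Addr is already aligned).
constexpr std::uint64_t paddingBytes(std::uint64_t Addr, unsigned Log2Align) {
  const std::uint64_t Align = std::uint64_t{1} << Log2Align;
  return (Align - (Addr & (Align - 1))) & (Align - 1);
}

// Number of 4-byte nop instructions that fill the padding.
constexpr std::uint64_t paddingNops(std::uint64_t Addr, unsigned Log2Align) {
  return paddingBytes(Addr, Log2Align) / 4;
}

// One instruction (4 bytes) past a 32-byte boundary: 28 bytes, i.e. 7 nops.
static_assert(paddingNops(0x1004, 5) == 7, "");
// The same address with 16-byte alignment needs only 3 nops.
static_assert(paddingNops(0x1004, 4) == 3, "");
// An already aligned address needs no padding.
static_assert(paddingNops(0x1000, 5) == 0, "");
```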

In D61228#1506487, @nemanjai wrote:
Okay, but we just call MBB->getParent()->getFunction().hasProfileData(), where do we actually check that the loop is hot?

Also, even if we align the majority of loops, how much does that really cost us? The code-size impact could be minor compared to the perf improvement, and if so, we should just always do it. It is still true that most users don't use PGO.

Short story: yes, I agree that we should probably just do this regardless of PGO if we don't see any significant performance regressions on important benchmarks.

TL; DR;
The hotness of the loop is checked in MachineBlockPlacement::alignBlocks(). I think the concern with aligning all loops statically determined to be "hot" to 32-bytes runs the risk of a pathologically bad case such as the following:
for (int i = 0; i < HugeValue; i++) {
  // Enough instructions to make the inner loop fall one instruction past a 32-byte boundary
  for (int j = 0; j < UnpredictableValueHighlyLikelyToBeZero; j++)
    // Do something short
}
Such a case would end up with 7 nops to align the inner loop which would presumably tie up dispatch slots. All that being said, I just ran an experiment with exactly that pathological case and the performance degrades by about 1% (which may even be in the noise).

@nemanjai @hfinkel, I have tested SPEC performance without PGO for this patch; there are no performance regressions from aligning to 32 bytes without PGO.

Of the 14 SPEC tests I ran, the size of 554.roms_r increased by 23% after aligning to 32 bytes for the base test. The sizes of the other cases are unchanged.

@nemanjai, I have tested the case you gave. After aligning to 32 bytes without PGO, the performance doesn't degrade, so I think we had better align the inner loop even without PGO data. What do you think?

Updated the patch to align the innermost hot loop to 32 bytes even if there is no PGO data.

@nemanjai @hfinkel I have updated the patch to align to 32 bytes even without PGO data.

In D61228#1541866, @ZhangKang wrote:

@nemanjai @hfinkel I have updated the patch to align to 32 bytes even without PGO data.

Thanks for all of the additional benchmarking. LGTM.

LGTM, some comments/renaming can be done before committing. Thanks.

llvm/lib/Target/PowerPC/PPCISelLowering.cpp
116	We only apply to innermost loops, can we use something like `DisableInnerMostLoopAlign32` / `disable-ppc-innermost-loop-align32`
117	We don't read information from PGO any more, so maybe remove `from PGO` please.

This revision is now accepted and ready to land. Jun 13 2019, 10:51 AM

Also, please make sure that the summary and text of the commit message do not mention PGO since it is not really considered any longer.

ZhangKang retitled this revision from [PowerPC] Set the innermost hot loop(from PGO) to align 32 bytes to [PowerPC] Set the innermost hot loop to align 32 bytes. Jun 13 2019, 8:27 PM

ZhangKang edited the summary of this revision.

Modify the comments.

ZhangKang marked 2 inline comments as done. Jun 13 2019, 8:34 PM

steven.zhang added inline comments. Jun 13 2019, 8:50 PM

llvm/test/CodeGen/PowerPC/loop-align-pgo.ll
1 ↗	(On Diff #204690)	remove the PGO here. And please update the test case name. And please update the test to remove the PGO.

Update the patch to remove the info about PGO.

LGTM

The old test case fails with the latest code, so I have updated the test case.

Closed by commit rL363495: [PowerPC] Set the innermost hot loop to align 32 bytes (authored by ZhangKang). Jun 15 2019, 8:07 AM

This revision was automatically updated to reflect the committed changes.

Diff 204710

llvm/lib/Target/PowerPC/PPCISelLowering.cpp

[...]
 cl::desc("disable setting the node scheduling preference to ILP on PPC"), cl::Hidden);

 static cl::opt<bool> DisablePPCUnaligned("disable-ppc-unaligned",
 cl::desc("disable unaligned load/store generation on PPC"), cl::Hidden);

 static cl::opt<bool> DisableSCO("disable-ppc-sco",
 cl::desc("disable sibling call optimization on ppc"), cl::Hidden);

+static cl::opt<bool> DisableInnermostLoopAlign32("disable-ppc-innermost-loop-align32",
+cl::desc("don't always align innermost loop to 32 bytes on ppc"), cl::Hidden);
+
 static cl::opt<bool> EnableQuadPrecision("enable-ppc-quad-precision",
 cl::desc("enable quad precision float support on ppc"), cl::Hidden);

 STATISTIC(NumTailCalls, "Number of tail calls");
 STATISTIC(NumSiblingCalls, "Number of sibling calls");

 static bool isNByteElemShuffleMask(ShuffleVectorSDNode *, unsigned, int);

[...]
 unsigned PPCTargetLowering::getPrefLoopAlignment(MachineLoop *ML) const {
[...]
   case PPC::DIR_PWR6:
   case PPC::DIR_PWR6X:
   case PPC::DIR_PWR7:
   case PPC::DIR_PWR8:
   case PPC::DIR_PWR9: {
     if (!ML)
       break;

+    if (!DisableInnermostLoopAlign32) {
+      // If the nested loop is an innermost loop, prefer to a 32-byte alignment,
+      // so that we can decrease cache misses and branch-prediction misses.
+      // Actual alignment of the loop will depend on the hotness check and other
+      // logic in alignBlocks.
+      if (ML->getLoopDepth() > 1 && ML->getSubLoops().empty())
+        return 5;
+    }
+
     const PPCInstrInfo *TII = Subtarget.getInstrInfo();

     // For small loops (between 5 and 8 instructions), align to a 32-byte
     // boundary so that the entire loop fits in one instruction-cache line.
     uint64_t LoopSize = 0;
     for (auto I = ML->block_begin(), IE = ML->block_end(); I != IE; ++I)
       for (auto J = (*I)->begin(), JE = (*I)->end(); J != JE; ++J) {
         LoopSize += TII->getInstSizeInBytes(*J);
[...]

llvm/test/CodeGen/PowerPC/loop-align.ll

This file was added.

				; Test the loop alignment.
				; RUN: llc -verify-machineinstrs -mcpu=a2 -mtriple powerpc64le-unknown-linux-gnu < %s | FileCheck %s -check-prefixes=CHECK,GENERIC
				; RUN: llc -verify-machineinstrs -mcpu=pwr8 -mtriple powerpc64le-unknown-linux-gnu < %s | FileCheck %s -check-prefixes=CHECK,PWR
				; RUN: llc -verify-machineinstrs -mcpu=pwr9 -mtriple powerpc64le-unknown-linux-gnu < %s | FileCheck %s -check-prefixes=CHECK,PWR
				; RUN: llc -verify-machineinstrs -mcpu=pwr8 -mtriple powerpc64-unknown-linux-gnu < %s | FileCheck %s -check-prefixes=CHECK,PWR
				; RUN: llc -verify-machineinstrs -mcpu=pwr9 -mtriple powerpc64-unknown-linux-gnu < %s | FileCheck %s -check-prefixes=CHECK,PWR

				; Test the loop alignment and the option -disable-ppc-innermost-loop-align32.
				; RUN: llc -verify-machineinstrs -mcpu=a2 -disable-ppc-innermost-loop-align32 -mtriple powerpc64le-unknown-linux-gnu < %s | FileCheck %s -check-prefixes=CHECK,GENERIC-DISABLE-PPC-INNERMOST-LOOP-ALIGN32
				; RUN: llc -verify-machineinstrs -mcpu=pwr8 -disable-ppc-innermost-loop-align32 -mtriple powerpc64le-unknown-linux-gnu < %s | FileCheck %s -check-prefixes=CHECK,PWR-DISABLE-PPC-INNERMOST-LOOP-ALIGN32
				; RUN: llc -verify-machineinstrs -mcpu=pwr9 -disable-ppc-innermost-loop-align32 -mtriple powerpc64le-unknown-linux-gnu < %s | FileCheck %s -check-prefixes=CHECK,PWR-DISABLE-PPC-INNERMOST-LOOP-ALIGN32
				; RUN: llc -verify-machineinstrs -mcpu=pwr8 -disable-ppc-innermost-loop-align32 -mtriple powerpc64-unknown-linux-gnu < %s | FileCheck %s -check-prefixes=CHECK,PWR-DISABLE-PPC-INNERMOST-LOOP-ALIGN32
				; RUN: llc -verify-machineinstrs -mcpu=pwr9 -disable-ppc-innermost-loop-align32 -mtriple powerpc64-unknown-linux-gnu < %s | FileCheck %s -check-prefixes=CHECK,PWR-DISABLE-PPC-INNERMOST-LOOP-ALIGN32


				%struct.parm = type { i32*, i32, i32 }

				; Test the loop alignment when the innermost hot loop has more than 8 instructions.
				define void @big_loop(%struct.parm* %arg) {
				entry:
				%localArg.sroa.0.0..sroa_idx = getelementptr inbounds %struct.parm, %struct.parm* %arg, i64 0, i32 0
				%localArg.sroa.0.0.copyload = load i32*, i32** %localArg.sroa.0.0..sroa_idx, align 8
				%localArg.sroa.4.0..sroa_idx56 = getelementptr inbounds %struct.parm, %struct.parm* %arg, i64 0, i32 1
				%localArg.sroa.4.0.copyload = load i32, i32* %localArg.sroa.4.0..sroa_idx56, align 8
				%localArg.sroa.5.0..sroa_idx58 = getelementptr inbounds %struct.parm, %struct.parm* %arg, i64 0, i32 2
				%localArg.sroa.5.0.copyload = load i32, i32* %localArg.sroa.5.0..sroa_idx58, align 4
				%0 = sext i32 %localArg.sroa.5.0.copyload to i64
				br label %do.body

				do.body: ; preds = %do.end, %entry
				%m.0 = phi i32 [ %localArg.sroa.4.0.copyload, %entry ], [ %dec24, %do.end ]
				br label %do.body3

				do.body3: ; preds = %do.body3, %do.body
				%indvars.iv = phi i64 [ %indvars.iv.next, %do.body3 ], [ %0, %do.body ]
				%1 = add nsw i64 %indvars.iv, 2
				%arrayidx = getelementptr inbounds i32, i32* %localArg.sroa.0.0.copyload, i64 %1
				%2 = add nsw i64 %indvars.iv, 3
				%3 = trunc i64 %1 to i32
				%4 = add nsw i64 %indvars.iv, 4
				%arrayidx10 = getelementptr inbounds i32, i32* %localArg.sroa.0.0.copyload, i64 %2
				%5 = trunc i64 %2 to i32
				store i32 %5, i32* %arrayidx10, align 4
				%arrayidx12 = getelementptr inbounds i32, i32* %localArg.sroa.0.0.copyload, i64 %4
				%6 = trunc i64 %4 to i32
				store i32 %6, i32* %arrayidx12, align 4
				store i32 %3, i32* %arrayidx, align 4
				%arrayidx21 = getelementptr inbounds i32, i32* %localArg.sroa.0.0.copyload, i64 %indvars.iv
				%7 = trunc i64 %indvars.iv to i32
				%8 = add i32 %7, 1
				store i32 %8, i32* %arrayidx21, align 4
				%indvars.iv.next = add nsw i64 %indvars.iv, -1
				%9 = icmp eq i64 %indvars.iv, 0
				br i1 %9, label %do.end, label %do.body3

				do.end: ; preds = %do.body3
				%dec24 = add nsw i32 %m.0, -1
				%tobool25 = icmp eq i32 %m.0, 0
				br i1 %tobool25, label %do.end26, label %do.body

				do.end26: ; preds = %do.end
				%arrayidx28 = getelementptr inbounds i32, i32* %localArg.sroa.0.0.copyload, i64 %0
				store i32 0, i32* %arrayidx28, align 4
				ret void


				; CHECK-LABEL: @big_loop
				; CHECK: mtctr
				; GENERIC: .p2align 4
				; PWR: .p2align 5
				; GENERIC-DISABLE-PPC-INNERMOST-LOOP-ALIGN32: .p2align 4
				; PWR-DISABLE-PPC-INNERMOST-LOOP-ALIGN32: .p2align 4
				; CHECK: bdnz
				}

				; Test the loop alignment when the innermost hot loop has 5-8 instructions.
				define void @general_loop(i32* %s, i64 %m) {
				entry:
				%tobool40 = icmp eq i64 %m, 0
				br i1 %tobool40, label %while.end18, label %while.body3.lr.ph

				while.cond.loopexit: ; preds = %while.body3
				%tobool = icmp eq i64 %dec, 0
				br i1 %tobool, label %while.end18, label %while.body3.lr.ph

				while.body3.lr.ph: ; preds = %entry, %while.cond.loopexit
				%m.addr.041 = phi i64 [ %dec, %while.cond.loopexit ], [ %m, %entry ]
				%dec = add nsw i64 %m.addr.041, -1
				%conv = trunc i64 %m.addr.041 to i32
				%conv11 = trunc i64 %dec to i32
				br label %while.body3

				while.body3: ; preds = %while.body3.lr.ph, %while.body3
				%n.039 = phi i64 [ %m.addr.041, %while.body3.lr.ph ], [ %dec16, %while.body3 ]
				%inc = add nsw i64 %n.039, 1
				%arrayidx = getelementptr inbounds i32, i32* %s, i64 %n.039
				%inc5 = add nsw i64 %n.039, 2
				%arrayidx6 = getelementptr inbounds i32, i32* %s, i64 %inc
				%sub = sub nsw i64 %dec, %inc5
				%conv7 = trunc i64 %sub to i32
				%arrayidx9 = getelementptr inbounds i32, i32* %s, i64 %inc5
				store i32 %conv7, i32* %arrayidx9, align 4
				store i32 %conv11, i32* %arrayidx6, align 4
				store i32 %conv, i32* %arrayidx, align 4
				%dec16 = add nsw i64 %n.039, -1
				%tobool2 = icmp eq i64 %dec16, 0
				br i1 %tobool2, label %while.cond.loopexit, label %while.body3

				while.end18: ; preds = %while.cond.loopexit, %entry
				ret void


				; CHECK-LABEL: @general_loop
				; CHECK: mtctr
				; GENERIC: .p2align 4
				; PWR: .p2align 5
				; GENERIC-DISABLE-PPC-INNERMOST-LOOP-ALIGN32: .p2align 4
				; PWR-DISABLE-PPC-INNERMOST-LOOP-ALIGN32: .p2align 5
				; CHECK: bdnz
				}

				; Test the small loop alignment when the innermost hot loop has less than 4 instructions.
				define void @small_loop(i64 %m) {
				entry:
				br label %do.body

				do.body: ; preds = %do.end, %entry
				%m.addr.0 = phi i64 [ %m, %entry ], [ %1, %do.end ]
				br label %do.body1

				do.body1: ; preds = %do.body1, %do.body
				%n.0 = phi i64 [ %m.addr.0, %do.body ], [ %0, %do.body1 ]
				%0 = tail call i64 asm "subi $0,$0,1", "=r,0"(i64 %n.0)
				%tobool = icmp eq i64 %0, 0
				br i1 %tobool, label %do.end, label %do.body1

				do.end: ; preds = %do.body1
				%1 = tail call i64 asm "subi $1,$1,1", "=r,0"(i64 %m.addr.0)
				%tobool3 = icmp eq i64 %1, 0
				br i1 %tobool3, label %do.end4, label %do.body

				do.end4: ; preds = %do.end
				ret void


				; CHECK-LABEL: @small_loop
				; CHECK: beqlr
				; GENERIC: .p2align 4
				; PWR: .p2align 5
				; GENERIC-DISABLE-PPC-INNERMOST-LOOP-ALIGN32: .p2align 4
				; PWR-DISABLE-PPC-INNERMOST-LOOP-ALIGN32: .p2align 4
				; CHECK: bne
				}

				; Test the loop alignment when the innermost cold loop has more than 8 instructions.
				define void @big_loop_cold_innerloop(%struct.parm* %arg) {
				entry:
				%localArg.sroa.0.0..sroa_idx = getelementptr inbounds %struct.parm, %struct.parm* %arg, i64 0, i32 0
				%localArg.sroa.0.0.copyload = load i32*, i32** %localArg.sroa.0.0..sroa_idx, align 8
				%localArg.sroa.4.0..sroa_idx56 = getelementptr inbounds %struct.parm, %struct.parm* %arg, i64 0, i32 1
				%localArg.sroa.4.0.copyload = load i32, i32* %localArg.sroa.4.0..sroa_idx56, align 8
				%localArg.sroa.5.0..sroa_idx58 = getelementptr inbounds %struct.parm, %struct.parm* %arg, i64 0, i32 2
				%localArg.sroa.5.0.copyload = load i32, i32* %localArg.sroa.5.0..sroa_idx58, align 4
				%0 = sext i32 %localArg.sroa.5.0.copyload to i64
				br label %do.body

				do.body: ; preds = %do.end, %entry
				%m.0 = phi i32 [ %localArg.sroa.4.0.copyload, %entry ], [ %dec24, %do.end ]
				br label %do.body3

				do.body3: ; preds = %do.body3, %do.body
				%indvars.iv = phi i64 [ %indvars.iv.next, %do.body3 ], [ %0, %do.body ]
				%1 = add nsw i64 %indvars.iv, 2
				%arrayidx = getelementptr inbounds i32, i32* %localArg.sroa.0.0.copyload, i64 %1
				%2 = add nsw i64 %indvars.iv, 3
				%3 = trunc i64 %1 to i32
				%4 = add nsw i64 %indvars.iv, 4
				%arrayidx10 = getelementptr inbounds i32, i32* %localArg.sroa.0.0.copyload, i64 %2
				%5 = trunc i64 %2 to i32
				store i32 %5, i32* %arrayidx10, align 4
				%arrayidx12 = getelementptr inbounds i32, i32* %localArg.sroa.0.0.copyload, i64 %4
				%6 = trunc i64 %4 to i32
				store i32 %6, i32* %arrayidx12, align 4
				store i32 %3, i32* %arrayidx, align 4
				%arrayidx21 = getelementptr inbounds i32, i32* %localArg.sroa.0.0.copyload, i64 %indvars.iv
				%7 = trunc i64 %indvars.iv to i32
				%8 = add i32 %7, 1
				store i32 %8, i32* %arrayidx21, align 4
				%indvars.iv.next = add nsw i64 %indvars.iv, -1
				%9 = icmp eq i64 %indvars.iv, 0
				br i1 %9, label %do.end, label %do.body3

				do.end: ; preds = %do.body3
				%dec24 = add nsw i32 %m.0, -1
				%tobool25 = icmp eq i32 %m.0, 0
				br i1 %tobool25, label %do.end26, label %do.body

				do.end26: ; preds = %do.end
				%arrayidx28 = getelementptr inbounds i32, i32* %localArg.sroa.0.0.copyload, i64 %0
				store i32 0, i32* %arrayidx28, align 4
				ret void


				; CHECK-LABEL: @big_loop_cold_innerloop
				; CHECK: mtctr
				; PWR: .p2align 5
				; CHECK-NOT: .p2align 5
				; CHECK: bdnz
				}

This is an archive of the discontinued LLVM Phabricator instance.

[PowerPC] Set the innermost hot loop to align 32 bytes
ClosedPublic
