Page MenuHomePhabricator

[PowerPC] Set the innermost hot loop to align 32 bytes
ClosedPublic

Authored by ZhangKang on Apr 27 2019, 9:41 AM.

Details

Summary

If the nested loop is an innermost loop, prefer to a 32-byte alignment, so that we can decrease cache misses and branch-prediction misses. Actual alignment of the loop will depend on the hotness check and other logic in alignBlocks.

The old code will only align hot loop to 32 bytes when the LoopSize larger than 16 bytes and smaller than 32 bytes, this patch will align the innermost hot loop to 32 bytes not only for the hot loop whose size is 16~32 bytes.

For some special cases, the performance can improve more than 30% after adding the patch for ppc.

This patch have a dependency on the patch D61227: [NFC]][PowerPC] Use -check-prefixes to simplify the check in code-align.ll.

Diff Detail

Event Timeline

ZhangKang created this revision.Apr 27 2019, 9:41 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 27 2019, 9:41 AM
ZhangKang edited the summary of this revision. (Show Details)Apr 27 2019, 9:44 AM

For some special cases, the performance can improve more than 30% after adding the patch for ppc.

Any significant regressions?

For some special cases, the performance can improve more than 30% after adding the patch for ppc.

Any significant regressions?

@hfinkel , I have run spec test after adding this patch, the performance is same. The old code will alignment the loop whose size is more than 32 bytes to align 16 bytes, and if we want get a better performance, the outer loop need very large, and the innermost loop need very small. I think there are few cases meet the condition(the outer loop is very large, and the innermost loop is very small.).

I say that "For some special cases, the performance can improve more than 30%" is for the special small case I write. Below is the test case on P9:

cat foo.c

cpp
struct parm {
  int *arr;
  int m;
  int n;
};
void foo(struct parm *arg) {
  struct parm localArg = *arg;
  int m = localArg.m;
  int *s = localArg.arr;
  int n = localArg.n;
  do{
    int k = n;
    do{
      s[++k] = k++;
      s[k++] = k;
      s[k++] = k;
      s[k] = k;
      s[--k] = k--;
      s[k--] = k;
      s[--k] = k;
    }while(k--);
  } while(m--);

  s[n]=0;
}

cat main.c

cpp
struct parm {
  int *arr;
  int m;
  int n;
};
void foo(struct parm*);
int main() {
  int a[5000];
  struct parm arg = {a, 2000000000, 5};
  foo(&arg);
  return 0;
}

cat run.ksh

shell
set -x
# profile-generate
rm t t.* t_* *.o *.s *.profraw *.profdata
clang -c main.c -O -fprofile-generate
clang -S foo.c -O -fno-vectorize -mllvm -unroll-count=0 -fprofile-generate
clang -o t main.o foo.s -fprofile-generate
objdump -dr t > t.dis
time -p ./t

# merge
llvm-profdata merge *.profraw -output=merge.profdata

# profile-use
clang -c main.c -O -fprofile-use=merge.profdata
clang -S foo.c -O -fno-vectorize -mllvm -unroll-count=0 -fprofile-use=merge.profdata
clang -o t_pgo main.o foo.s -fprofile-use=merge.profdata
objdump -dr t_pgo > t_pgo.dis
time -p ./t_pgo

The origin result(not set loop align to 32 bytes) is below:

real 21.74
user 21.74
sys 0.00

After adding the patch, the result is below:

real 14.37
user 14.37
sys 0.00

Note that, the performance speedup rate is different when using different parameters to call the foo function. In general, if the outer loop is larget and the inner loop is smaller, the branch prediction is more likely failed, and the speedup rate is larger. For example, the speedup rate of foo(a, 2000000000, 5) is larger than foo(a, 20000000, 500).

@hfinkel , do you have any other comments?

@hfinkel , do you have any other comments?

Can you explain why we don't always do this? You're checking for profiling data but then not using it?

Can you explain why we don't always do this? You're checking for profiling data but then not using it?

I think the idea is that with PGO data, we are more certain that the hotness information is actually meaningful - since the alignment directive will only be emitted for loops that are "hot". Without PGO data, we will align a majority of loops which may be overkill. But this should be clearly stated in the comment though.

Can you explain why we don't always do this? You're checking for profiling data but then not using it?

I think the idea is that with PGO data, we are more certain that the hotness information is actually meaningful - since the alignment directive will only be emitted for loops that are "hot". Without PGO data, we will align a majority of loops which may be overkill. But this should be clearly stated in the comment though.

Okay, but we just call MBB->getParent()->getFunction().hasProfileData(), where do we actually check that the loop is hot?

Also, even if we align the majority of loops, how much does that really cost us? The code-size impact could be minor compared to the perf improvement, and if so, we should just always do it. It is still true that most users don't use PGO.

Okay, but we just call MBB->getParent()->getFunction().hasProfileData(), where do we actually check that the loop is hot?

Also, even if we align the majority of loops, how much does that really cost us? The code-size impact could be minor compared to the perf improvement, and if so, we should just always do it. It is still true that most users don't use PGO.

Short story: yes, I agree that we should probably just do this regardless of PGO if we don't see any significant performance regressions on important benchmarks.

TL; DR;
The hotness of the loop is checked in MachineBlockPlacement::alignBlocks(). I think the concern with aligning all loops statically determined to be "hot" to 32-bytes runs the risk of a pathologically bad case such as the following:

for (int i = 0; i < HugeValue; i++) {
  // Enough instructions to make the inner loop fall one instruction past a 32-byte boundary
  for (int j = 0; j < UnpredictableValueHighlyLikelyToBeZero; j++)
    // Do something short
}

Such a case would end up with 7 nops to align the inner loop which would presumably tie up dispatch slots. All that being said, I just ran an experiment with exactly that pathological case and the performance degrades by about 1% (which may even be in the noise).

ZhangKang added a comment.EditedMay 30 2019, 7:08 PM

Okay, but we just call MBB->getParent()->getFunction().hasProfileData(), where do we actually check that the loop is hot?

Also, even if we align the majority of loops, how much does that really cost us? The code-size impact could be minor compared to the perf improvement, and if so, we should just always do it. It is still true that most users don't use PGO.

Short story: yes, I agree that we should probably just do this regardless of PGO if we don't see any significant performance regressions on important benchmarks.

TL; DR;
The hotness of the loop is checked in MachineBlockPlacement::alignBlocks(). I think the concern with aligning all loops statically determined to be "hot" to 32-bytes runs the risk of a pathologically bad case such as the following:

for (int i = 0; i < HugeValue; i++) {
  // Enough instructions to make the inner loop fall one instruction past a 32-byte boundary
  for (int j = 0; j < UnpredictableValueHighlyLikelyToBeZero; j++)
    // Do something short
}

Such a case would end up with 7 nops to align the inner loop which would presumably tie up dispatch slots. All that being said, I just ran an experiment with exactly that pathological case and the performance degrades by about 1% (which may even be in the noise).

@nemanjai @hfinkel , I have tested the spec performance without PGO for this patch, there is no performance regressions for alignment 32 without PGO.

For the 14 spec tests I tested, the size of case 554.roms_r has increased 23% after align to 32 bytes for the base test. The size of other cases are same.

Okay, but we just call MBB->getParent()->getFunction().hasProfileData(), where do we actually check that the loop is hot?

Also, even if we align the majority of loops, how much does that really cost us? The code-size impact could be minor compared to the perf improvement, and if so, we should just always do it. It is still true that most users don't use PGO.

Short story: yes, I agree that we should probably just do this regardless of PGO if we don't see any significant performance regressions on important benchmarks.

TL; DR;
The hotness of the loop is checked in MachineBlockPlacement::alignBlocks(). I think the concern with aligning all loops statically determined to be "hot" to 32-bytes runs the risk of a pathologically bad case such as the following:

for (int i = 0; i < HugeValue; i++) {
  // Enough instructions to make the inner loop fall one instruction past a 32-byte boundary
  for (int j = 0; j < UnpredictableValueHighlyLikelyToBeZero; j++)
    // Do something short
}

Such a case would end up with 7 nops to align the inner loop which would presumably tie up dispatch slots. All that being said, I just ran an experiment with exactly that pathological case and the performance degrades by about 1% (which may even be in the noise).

@nemanjai , I have tested the case you give, after doing the alignment to 32 bytes without PGO, the performance doesn't degrade, I think we'd better align the inner loop without PGO data. What do you think?

ZhangKang updated this revision to Diff 204550.Jun 13 2019, 8:23 AM

Updated the patch to loop 32 bytes for innermost hot loop even if there is no PGO data.

@nemanjai @hfinkel I have updated the patch to align to 32 bytes even if wthout PGO data.

@nemanjai @hfinkel I have updated the patch to align to 32 bytes even if wthout PGO data.

Thanks for all of the additional benchmarking. LGTM.

jsji accepted this revision.Jun 13 2019, 10:51 AM

LGTM, some comments /renaming can be done before committing. Thanks.

llvm/lib/Target/PowerPC/PPCISelLowering.cpp
116

We only apply to innermost loops, can we use something like DisableInnerMostLoopAlign32 / disable-ppc-innermost-loop-align32

117

We don't read information from PGO any more, so maybe remove from PGO please.

This revision is now accepted and ready to land.Jun 13 2019, 10:51 AM

Also, please make sure that the summary and text of the commit message does not mention PGO since it is not really considered any longer.

ZhangKang retitled this revision from [PowerPC] Set the innermost hot loop(from PGO) to align 32 bytes to [PowerPC] Set the innermost hot loop to align 32 bytes.Jun 13 2019, 8:27 PM
ZhangKang edited the summary of this revision. (Show Details)
ZhangKang updated this revision to Diff 204690.Jun 13 2019, 8:32 PM

Modify the comments.

ZhangKang marked 2 inline comments as done.Jun 13 2019, 8:34 PM
steven.zhang added inline comments.Jun 13 2019, 8:50 PM
llvm/test/CodeGen/PowerPC/loop-align-pgo.ll
1 ↗(On Diff #204690)

remove the PGO here. And please update the test case name. And please update the test to remove the PGO.

Update the patch to remove the info about PGO.

ZhangKang updated this revision to Diff 204919.Jun 15 2019, 7:55 AM

This old test case will be failed for the latest code, so I have updated the test case.

This revision was automatically updated to reflect the committed changes.