This is an archive of the discontinued LLVM Phabricator instance.

[LoopDataPrefetch/AArch64] Allow selective prefetching of irregular symbolic strided accesses
AbandonedPublic

Authored by bmakam on Sep 22 2016, 9:50 AM.

Details

Reviewers
anemet
mcrosier
Summary

Irregular streams typically consist of symbolic strided accesses.

For example:

struct MyStruct {
int field;
char kk[60];
} *my_struct;

int f(struct MyStruct *p, struct MyStruct *q, int N) {
  int total = 0;
  struct MyStruct *r = p;
  for (int i = 0; i < N/300; ++i)
    for (r = p + i; r < q; r += N)
      total += r->field;
  return total;
}

This software prefetching scheme looks for such irregular symbolic strides and prefetches
'PrefetchDegree' cache lines ahead of the visiting load address. Adds a TTI interface
'getPrefetchDegree' to query the target parameter.

This is currently enabled for Kryo only. A target would have to provide this information
to opt in to prefetch 'PrefetchDegree' cache lines ahead.

This improves performance of these Spec200x benchmarks on Kryo:

BenchmarkDiff (%)
spec2006/povray:ref+1.738%
spec2006/gcc:ref+1.749%
spec2006/mcf:ref+7.936%
spec2000/gap:ref+16.51%

Diff Detail

Event Timeline

bmakam updated this revision to Diff 72187.Sep 22 2016, 9:50 AM
bmakam retitled this revision from to [LoopDataPrefetch/AArch64] Allow selective prefetching of symbolic strided accesses.
bmakam updated this object.
bmakam added reviewers: anemet, mcrosier.
bmakam added a subscriber: llvm-commits.
anemet edited edge metadata.Sep 22 2016, 9:57 AM

Irregular streams typically consist of array accesses in which a subscripted variable
appears in one of the subscript positions, such as: A[B[i]].

For example:

for (unsigned i = 0; i < 100; i++)
  A[i + 1] = A[Stride * i] + B[i];

There is something confusing here. Is Stride loop-variant here? Otherwise I don't see how this is a A[B[i]]-style access.

Irregular streams typically consist of array accesses in which a subscripted variable
appears in one of the subscript positions, such as: A[B[i]].

For example:

for (unsigned i = 0; i < 100; i++)
  A[i + 1] = A[Stride * i] + B[i];

There is something confusing here. Is Stride loop-variant here? Otherwise I don't see how this is a A[B[i]]-style access.

Sorry for the confusion. What I meant was address of B[i] can be represented as i*sizeof(B), so sizeof(B) is the loop-invariant stride here.

bmakam updated this object.Sep 22 2016, 11:20 AM
bmakam edited edge metadata.
mssimpso added inline comments.Sep 23 2016, 11:06 AM
lib/Transforms/Scalar/LoopDataPrefetch.cpp
178–201

Would it be possible to reuse llvm::getStrideFromPointer instead of re-implementing some of the logic here?

bmakam added inline comments.Sep 23 2016, 11:18 AM
lib/Transforms/Scalar/LoopDataPrefetch.cpp
178–201

It seems like llvm::getStrideFromPointer is not suitable for reuse here, especially it somehow assumes that the PtrAccessSize is always 1 and returns null when PtrAccessSize != StepVal. For all the interesting cases this scheme targets the stepval is a non-unit constant.

Irregular streams typically consist of array accesses in which a subscripted variable
appears in one of the subscript positions, such as: A[B[i]].

For example:

for (unsigned i = 0; i < 100; i++)
  A[i + 1] = A[Stride * i] + B[i];

There is something confusing here. Is Stride loop-variant here? Otherwise I don't see how this is a A[B[i]]-style access.

Sorry for the confusion. What I meant was address of B[i] can be represented as i*sizeof(B), so sizeof(B) is the loop-invariant stride here.

Sorry but I still don't understand this. Can you please elaborate on where the A[B[i]] style access is in this loop?

bmakam added a comment.EditedSep 23 2016, 11:52 AM

Irregular streams typically consist of array accesses in which a subscripted variable
appears in one of the subscript positions, such as: A[B[i]].

For example:

for (unsigned i = 0; i < 100; i++)
  A[i + 1] = A[Stride * i] + B[i];

There is something confusing here. Is Stride loop-variant here? Otherwise I don't see how this is a A[B[i]]-style access.

Sorry for the confusion. What I meant was address of B[i] can be represented as i*sizeof(B), so sizeof(B) is the loop-invariant stride here.

Sorry but I still don't understand this. Can you please elaborate on where the A[B[i]] style access is in this loop?

Sorry, this is not for a A[B[i]] style access. This is currently only restricted to a symbolic stride access. I plan to expand this to other irregular patterns such as A[B[i]] style access, hash maps like A[f(i)]->B style access and recursive data structures like A->next style accesses, but it will require some more testing.

mssimpso added inline comments.Sep 23 2016, 11:57 AM
lib/Transforms/Scalar/LoopDataPrefetch.cpp
178–201

That's just for the non-gep case, though, right? For example, if the pointer operand of a load/store is a pointer induction variable instead of a gep. In that case it checks that the pointer operand is an addrec like Ptr + V, where V is non-constant. Shouldn't that work?

bmakam updated this object.Sep 26 2016, 3:12 AM
bmakam updated this revision to Diff 72451.Sep 26 2016, 3:15 AM

Added more testscases.

bmakam added inline comments.Sep 26 2016, 3:20 AM
lib/Transforms/Scalar/LoopDataPrefetch.cpp
178–201

Sometimes GEP is a pointer but not an index, for example in mcf's case the Access Ptr is:
%128 = getelementptr inbounds %struct.arc, %struct.arc* %111, i64 0, i32 0

and the SCEV is:
{((64 * %105)<nsw> + %1)<nsw>,+,(64 * %100)<nsw>}<nw><%108>

Here 64 is the size of %struct.arc and is a non-unit constant. I have added a test case to show where this won't work with llvm::getStrideFromPointer

bmakam updated this revision to Diff 72453.Sep 26 2016, 3:23 AM

Hi Balaram,

This seems like a well made patch. Correctly enabling the feature, using the pre-fetch when it's profitable and with good tests.

I'll leave the remaining of the reviews and approval to Adam et al, but from my side, the change looks good.

cheers,
--renato

test/Transforms/LoopDataPrefetch/AArch64/kryo-large-stride.ll
7

Don't force the CPU here, we have the -prefetch-degree for that. Once we have a CPU that pre-fetches aren't profitable, we can use Kryo vs that one as an example, *in addition* to the flag-based ones.

Oh, I forgot, you state all the improvements on SPEC, did you have any regressions? What's the overall geomean?

I was expecting some regressions when your prefetch guess was wrong (there are always those cases), so this may indicate a slightly better heuristics when enabling this for other cores / workloads.

cheers,
--renato

Hi Balaram,

This seems like a well made patch. Correctly enabling the feature, using the pre-fetch when it's profitable and with good tests.

I'll leave the remaining of the reviews and approval to Adam et al, but from my side, the change looks good.

Thanks for the review, Renato.

Oh, I forgot, you state all the improvements on SPEC, did you have any regressions? What's the overall geomean?

I was expecting some regressions when your prefetch guess was wrong (there are always those cases), so this may indicate a slightly better heuristics when enabling this for other cores > / workloads.

I have included all the gains and regressions in the summary. Only regression I saw on Kryo was a -1.28% in spec2006/h264ref and rest all were solid gains.

Thanks,
Balaram

test/Transforms/LoopDataPrefetch/AArch64/kryo-large-stride.ll
7

The LoopDataPrefetch pass is gated by targets that have PrefetchDistance set, so I am forcing the CPU to enable the software prefetcher. The -prefetch-degree flag is to enable the next line prefetching heuristic for targets that set the PrefetchDegree which is only Kryo currently.

I have included all the gains and regressions in the summary. Only regression I saw on Kryo was a -1.28% in spec2006/h264ref and rest all were solid gains.

Oh, I missed the negative result. :)

test/Transforms/LoopDataPrefetch/AArch64/kryo-large-stride.ll
7

Right, makes sense. Thanks!

@anemet/@mssimpso,

Do you have any additional comments on this change?

@anemet/@mssimpso,

Do you have any additional comments on this change?

Not yet, will look at it later today hopefully.

anemet requested changes to this revision.Sep 28 2016, 8:33 PM
anemet edited edge metadata.
anemet added inline comments.
lib/Target/AArch64/AArch64Subtarget.cpp
76–77

So you are saying on one hand (MinPrefetchStride = 1024) that we shouldn't bother prefetching unless the stride is at least 1K but then you say (PrefetchDegree = 1) that you want to prefetch the very next cache line anytime the stride is not known at compile time.

I feel that there is a contradiction here. The former suggest that you have a pretty powerful HW prefetcher, the latter that you don't and willing to speculate aggressively to compensate for it.

It seems that something is wrong with the model here.

This revision now requires changes to proceed.Sep 28 2016, 8:33 PM
bmakam added inline comments.Sep 28 2016, 10:21 PM
lib/Target/AArch64/AArch64Subtarget.cpp
76–77

Thanks for the review, Adam.

You are right, If the stride is unknown at compile time we speculate and prefetch the next cache line and if the stride is known we do not need to speculate so we conservatively prefetch for strides > 1K.

Sorry, I do not see a contradiction here. Are you suggesting to insert runtime checks to determine that the unknown stride is > 1K and only then prefetch? I just do not see a justification for the additional complexity and probably it might hurt performance due to the runtime checks.

anemet added inline comments.Sep 28 2016, 10:39 PM
lib/Target/AArch64/AArch64Subtarget.cpp
76–77

No I am not suggesting run-time checks.

Accesses with unknown strides are *only* omitted if MinPrefetchStrides is set (since we can't compare them at compile time). It seems to me that you don't want to set MinPrefetchStride for your target. Have you experimented with that?

To explain the contradiction a bit more, your model says that you *have* a HW perfetcher that is able to track regular strides < 1024.

Your patch contradicts this by saying that even small regular strides are worth prefetching -- the next cache line corresponds to a stride that is way less than 1024.

bmakam added inline comments.Sep 28 2016, 11:06 PM
lib/Target/AArch64/AArch64Subtarget.cpp
76–77

It seems to me that you don't want to set MinPrefetchStride for your target. Have you experimented with that?

We do need MinPrefetchStride for Kryo as it seems to benefit spec2006/milc. Spec2006/milc is interesting because it exhibits short stream behavior. These streams are too short (3 x3 matrices using 16 bytes per element, so the total size of matrices is 144 bytes [1]) to train the hardware prefetcher. However, the current model catches these because the stride is > 1K.

Your patch contradicts this by saying that even small regular strides are worth prefetching -- the next cache line corresponds to a stride that is way less than 1024

The next cache line prefetching need not necessarily correspond to a stride < 1K. Consider for example an interleaved stream
a, b, a+k, b+k, a+2k, b+2k.... The inner loop stride is k but a and b are cache line apart. In the first iteration of the outer loop we access a, a+k,a+2k and so on.. and in second iteration of outer loop we access b, b+k, b+2k and so on... With the next line prefetching we prefetch b, b+k, b+2k... whenever we access a, a+k, a+2k... respectively. This is the reason spec2006/mcf improves.


[1] When Prefetching Works, When It Doesn’t, and Why. JAEKYU LEE et. al.

anemet added inline comments.Sep 28 2016, 11:23 PM
lib/Target/AArch64/AArch64Subtarget.cpp
76–77

We do need MinPrefetchStride for Kryo as it seems to benefit spec2006/milc. Spec2006/milc is interesting because it exhibits short stream behavior. These streams are too short (3 x3 matrices using 16 bytes per element, so the total size of matrices is 144 bytes [1]) to train the hardware prefetcher. However, the current model catches these because the stride is > 1K.

The stride is milc is 2048 bytes but you don't need to set MinPrefetchStride to get that. MinPrefetchStride is the minimum. Even if you don't set it (i.e. = 0), it will cover the milc case.

The next cache line prefetching need not necessarily correspond to a stride < 1K. Consider for example an interleaved stream
a, b, a+k, b+k, a+2k, b+2k.... The inner loop stride is k but a and b are cache line apart. In the first iteration of the outer loop we access a, a+k,a+2k and so on.. and in second iteration of outer loop we access b, b+k, b+2k and so on... With the next line prefetching we prefetch b, b+k, b+2k... whenever we access a, a+k, a+2k... respectively. This is the reason spec2006/mcf improves.

Possibly, but this is not the situation you're analyzing for. You are really getting this by the chance of how you set up your heuristic.

bmakam added inline comments.Sep 28 2016, 11:39 PM
lib/Target/AArch64/AArch64Subtarget.cpp
76–77

The stride is milc is 2048 bytes but you don't need to set MinPrefetchStride to get that. MinPrefetchStride is the minimum. Even if you don't set it (i.e. = 0), it will cover the milc case.

Yes, but this heuristic only improves spec2006/milc.

Possibly, but this is not the situation you're analyzing for. You are really getting this by the chance of how you set up your heuristic.

The chance that the data in the next cache line will be soon accessed is high when the access pattern is irregular (unknown compile time stride). This is the situation I am analyzing for and the chance seems pretty high as it improves several benchmarks not just spec2006/mcf.

bmakam updated this revision to Diff 73591.Oct 4 2016, 10:31 PM
bmakam updated this object.
bmakam edited edge metadata.

Restricted to only irregular symbolic strides such as those found in spec2006/mcf and spec2000/gap. Please take a look.

bmakam retitled this revision from [LoopDataPrefetch/AArch64] Allow selective prefetching of symbolic strided accesses to [LoopDataPrefetch/AArch64] Allow selective prefetching of irregular symbolic strided accesses.Oct 4 2016, 10:38 PM
bmakam edited edge metadata.
bmakam updated this revision to Diff 74403.Oct 12 2016, 11:02 AM
bmakam updated this object.

Rebase and ping.

bmakam planned changes to this revision.Nov 23 2016, 7:50 AM
bmakam abandoned this revision.Jan 26 2017, 11:04 AM