This is an archive of the discontinued LLVM Phabricator instance.

[test-suite] Add unit tests for vectorizer memory runtime checks.
ClosedPublic

Authored by fhahn on Feb 7 2022, 2:30 AM.

Details

Summary

This patch adds a first set of tests to check memory runtime checks
generated by the vectorizer.

The it runs scalar and vectorized versions of a loop requiring runtime
checks on the same inputs with pointers to the same buffer using various
offsets. It fails if they do not produce the same results.

The test functions are provided as lambdas, which are passed to a
driver function that generates the inputs and calls the lambdas with
pointers to overlapping buffers. The driver functions are marked as
noinline, which should act as an optimization barrier so the lambdas in
turn cannot be inlined and optimized without runtime checks.

Unfortunately 2 separate lambdas need to be specified for the scalar and
vector versions, with the only difference being the pragma to disable
vectorization. If anybody knows a nice generic & convenient way to
specify the loop once, what would be great.

Event Timeline

fhahn created this revision.Feb 7 2022, 2:30 AM
fhahn requested review of this revision.Feb 7 2022, 2:30 AM
xbolva00 added inline comments.
SingleSource/UnitTests/Vectorizer/runtime-checks.cpp
212
template <int F>
int foo(int *d, int N) {
    int s = 0;
#pragma unroll F
    for (int i = 0; i < N; ++i) {
        s += d[i];
    }

    return s;
}

int p(int *d, int N) { return foo<2>(d, N); }

Maybe similar idea can be used here? or with _Pragma..

fhahn updated this revision to Diff 406445.Feb 7 2022, 6:56 AM

Add macro to define both ScalarFn and VectorFn, given a loop.

fhahn added inline comments.Feb 7 2022, 6:58 AM
SingleSource/UnitTests/Vectorizer/runtime-checks.cpp
212

Thanks, I added a macro that generates both ScalarFn and VectorFn, given a loop body. It uses _Pragma. Alternative would be a template + vectorize_width, but template lambdas are a bit awkward.

I doubt that __attribute__((noinline)) suffices as an optimization barrier, IPO such as IPConstantPropagation/FunctionSpecialization can still take place. Even worse, the allocation are made within the check functions and the lambda being inlined such that the optimizer can see the allocation.

But do we even need the optimization barrier? How could the memory check be optimized away?

[not a change request] If we vary interleaving as well, we would not be restricted to the target architecture's vector width.

SingleSource/UnitTests/Vectorizer/CMakeLists.txt
2
SingleSource/UnitTests/Vectorizer/runtime-checks.cpp
8

Something that would have made me understand the purpose easer.

25

I assume it is not possible for NaNs to appear?

58–59

I'd have preferred some explicit expression that gives confidence in that we don't accidently go out of range e.g. when increasing the span for offsets ant to know when they actually overlap.

61–63

It might be more predictable if reusing the same allocation. Otherwise each allocation may have a different alignment and effectively some offsets (relative to virtual address space, eg. page boundaries) are skipped. In case the vectorizer adds a prologue to ensure vector memory accesses are aligned.

72

[style] No almost-always-auto

76
83–84

Do we really need that many offsets? The trip count is just 100 so everything outside [-100,100] already seems redundant.

219
Meinersbur added inline comments.Feb 9 2022, 5:31 PM
SingleSource/UnitTests/Vectorizer/runtime-checks.cpp
8

I just noticed the added "from/to same memory buffer" would be redundant.

xbolva00 added a comment.EditedFeb 9 2022, 11:43 PM

I doubt that attribute((noinline)) suffices as an optimization barrier, IPO such as IPConstantPropagation/FunctionSpecialization can still take place. Even worse, the allocation are made within the check functions and the lambda being inlined such that the optimizer can see the allocation.

Sadly, work on ‘noipa’ stalled. And we totally miss “noclone”, but maybe not a issue yet, since funcspec pass is not enabled yet.

fhahn updated this revision to Diff 407652.Feb 10 2022, 12:52 PM
fhahn marked 7 inline comments as done.

Thank you very much for taking a look! Comments should be addressed.

I doubt that attribute((noinline)) suffices as an optimization barrier, IPO such as IPConstantPropagation/FunctionSpecialization can still take place. Even worse, the allocation are made within the check functions and the lambda being inlined such that the optimizer can see the allocation.

Sadly, work on ‘noipa’ stalled. And we totally miss “noclone”, but maybe not a issue yet, since funcspec pass is not enabled yet.

My thinking was that because checkOverlappingMemoryTwoRuntimeChecks won't be inlined and called with different lambda arguments, the scalar/vector lambdas won't be inlined. It's true that this would not prevent function specialization & co from interfering, but it seems unlikely that it would be profitable.

But do we even need the optimization barrier? How could the memory check be optimized away?

In the current version, I don't think there's a need for a real barrier any longer. Previously, with the larger offset there may have been cases where AA could prove that the 2 accessed regions won't overlap.

[not a change request] If we vary interleaving as well, we would not be restricted to the target architecture's vector width.

I was wondering whether we should rely on the vectorizer choosing vectorization factors and interleave counts automatically or if we should force them instead. My reasoning for letting the compiler chose is that we get different combinations for different targets, possibly increasing coverage overall. We could chose the VF automatically and cover a range of user-provided interleave counts ?

fhahn marked 3 inline comments as done.Feb 10 2022, 12:54 PM
fhahn added inline comments.
SingleSource/UnitTests/Vectorizer/CMakeLists.txt
2

That looks better, thanks!

SingleSource/UnitTests/Vectorizer/runtime-checks.cpp
8

Thanks, I added between reads and writes. I hope this makes it clearer.

25

At the moment it is only used with integer types. I think uniform_int_distribution also only works with integer types, but I think. we can skip testing with floating point types .

58–59

Agreed, but Im not sure how to best provide such an expression, combined with enforcing it when specifying the test loops. I've left it as is for now, but I'm more than happy to adjust it if there's a better alternative.

61–63

Sounds good, I moved it out to checkOverlappingMemoryOneRuntimeCheck. Was that what you had in mind?

72

Fixed, thanks!

76

Fixed, thanks!

83–84

Good point! Originally the tests used larger trip counts, but now [-100, 100] should be sufficient to cover all cases.

219

Updated, thanks!

fhahn marked 2 inline comments as done.Feb 15 2022, 11:05 AM

ping :)

I tried copying this into a godbolt link, and is suggest you may need to #include <functional> under some systems. There are also some warnings that might be worth cleaning up: https://godbolt.org/z/sMTjfKbGo

SingleSource/UnitTests/Vectorizer/runtime-checks.cpp
209

Should this one have this loop unroll Pragma here? It's different to the others.

[suggestion] I am still not convinced __attribute__((noinline)) is useful and give the wrong impression that its semantics solves the problem. What should not be inlined is the VectorFn lambda. Is it possible to attach a noinline attribute to it? Or maybe only call it indirectly through a __attribute__((optnone)) function. As a suggestion, I will leave it up to you whether you think it is worth it.

[not a change request] If we vary interleaving as well, we would not be restricted to the target architecture's vector width.

I was wondering whether we should rely on the vectorizer choosing vectorization factors and interleave counts automatically or if we should force them instead. My reasoning for letting the compiler chose is that we get different combinations for different targets, possibly increasing coverage overall. We could chose the VF automatically and cover a range of user-provided interleave counts ?

[not a change request] My reasoning was to force range of VFs (or interleave count since it does not require a specific instruction set), so the test covers all possible VFs without requiring access to platforms for each of them to test. The native choice by LoopVectorize can still be tested in addition to that, in case the overlap test makes a difference.

I tried copying this into a godbolt link, and is suggest you may need to #include <functional> under some systems. There are also some warnings that might be worth cleaning up: https://godbolt.org/z/sMTjfKbGo

gcc indeed does not compile it: https://godbolt.org/z/3acnP796E It would be nice to keep the test-suite buildable with gcc as a comparison/reference.

SingleSource/UnitTests/Vectorizer/runtime-checks.cpp
61–63

yes, thank you.

72

Can make the &Reference[NumArrayElements / 2] change too? Its the exact same as &Reference[0] + NumArrayElements / 2. &Reference[0] is only a more convoluted way to say Reference.get().

125
fhahn updated this revision to Diff 409676.Feb 17 2022, 8:50 AM
fhahn marked 3 inline comments as done.

Address latest comments, thanks!

[suggestion] I am still not convinced __attribute__((noinline)) is useful and give the wrong impression that its semantics solves the problem. What should not be inlined is the VectorFn lambda. Is it possible to attach a noinline attribute to it? Or maybe only call it indirectly through a __attribute__((optnone)) function. As a suggestion, I will leave it up to you whether you think it is worth it.

Updated to call the vector functions through a optnone function :)

[not a change request] If we vary interleaving as well, we would not be restricted to the target architecture's vector width.

I was wondering whether we should rely on the vectorizer choosing vectorization factors and interleave counts automatically or if we should force them instead. My reasoning for letting the compiler chose is that we get different combinations for different targets, possibly increasing coverage overall. We could chose the VF automatically and cover a range of user-provided interleave counts ?

[not a change request] My reasoning was to force range of VFs (or interleave count since it does not require a specific instruction set), so the test covers all possible VFs without requiring access to platforms for each of them to test. The native choice by LoopVectorize can still be tested in addition to that, in case the overlap test makes a difference.

Yeah, I think that might be worth as follow-up.

I tried copying this into a godbolt link, and is suggest you may need to #include <functional> under some systems. There are also some warnings that might be worth cleaning up: https://godbolt.org/z/sMTjfKbGo

gcc indeed does not compile it: https://godbolt.org/z/3acnP796E It would be nice to keep the test-suite buildable with gcc as a comparison/reference.

Thanks, I added the missing include (probably a slight libc++/libstdc++ difference).

fhahn added inline comments.Feb 17 2022, 8:50 AM
SingleSource/UnitTests/Vectorizer/runtime-checks.cpp
72

Missed that originally, update, thanks!

125

Adjusted, thanks!

209

I *think* Clang performs runtime unrolling here before vectorization and that's why I added the pragma. It's not strictly necessary though.

Meinersbur accepted this revision.Feb 17 2022, 10:27 AM

LGTM

SingleSource/UnitTests/Vectorizer/runtime-checks.cpp
49–50

Could you add a description for why we are using this function?

This revision is now accepted and ready to land.Feb 17 2022, 10:27 AM
fhahn marked an inline comment as done.Feb 19 2022, 1:15 PM
fhahn added inline comments.
SingleSource/UnitTests/Vectorizer/runtime-checks.cpp
49–50

Added in the committed version, thanks!

Kai added a subscriber: Kai.Mar 16 2022, 6:30 AM
Kai added inline comments.
SingleSource/UnitTests/Vectorizer/runtime-checks.cpp
18

According to https://reviews.llvm.org/D120630 std::uniform_int_distribution<char> is UB. It gives a compile error with latest C++lib.

Herald added a project: Restricted Project. · View Herald TranscriptMar 16 2022, 6:30 AM
fhahn marked an inline comment as done.Mar 21 2022, 9:05 AM
fhahn added inline comments.
SingleSource/UnitTests/Vectorizer/runtime-checks.cpp
18

Thanks for the heads-up! IIUC it needs to be an unsigned type. So probably the best way to fix this is to instantiate std::uniform_int_distribution with something like uint64_t and then have the result converted?

fhahn marked an inline comment as done.Mar 28 2022, 10:31 AM
fhahn added inline comments.
SingleSource/UnitTests/Vectorizer/runtime-checks.cpp
18