Page MenuHomePhabricator

[LV] Clamp VF hint when unsafe
ClosedPublic

Authored by c-rhodes on Nov 3 2020, 8:07 AM.

Details

Summary

In the following loop the dependence distance is 2 and can only be
vectorized if the vector length is no larger than this.

void foo(int *a, int *b, int N) {
  #pragma clang loop vectorize(enable) vectorize_width(4)
  for (int i=0; i<N; ++i) {
    a[i + 2] = a[i] + b[i];
  }
}

However, when specifying a VF of 4 via a loop hint this loop is
vectorized. According to [1][2], loop hints are ignored if the
optimization is not safe to apply.

This patch introduces a check to bail of vectorization if the user
specified VF is greater than the maximum feasible VF, unless explicitly
forced with '-force-vector-width=X'.

[1] https://llvm.org/docs/LangRef.html#llvm-loop-vectorize-and-llvm-loop-interleave
[2] https://clang.llvm.org/docs/LanguageExtensions.html#extensions-for-loop-hint-optimizations

Diff Detail

Event Timeline

c-rhodes created this revision.Nov 3 2020, 8:07 AM
Herald added a project: Restricted Project. · View Herald TranscriptNov 3 2020, 8:07 AM
Herald added a subscriber: hiraditya. · View Herald Transcript
c-rhodes requested review of this revision.Nov 3 2020, 8:07 AM
dmgreen added a subscriber: dmgreen.
dmgreen added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5275

I don't think computeFeasibleMaxVF is the maximum safe width. It does a lot of things and is largely based on the backend vector widths. I'm guessing that's why so many tests are changing.

It should probably be based on Legal.getMaxSafeRegisterWidth()?

fhahn added inline comments.Nov 3 2020, 9:46 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5275

It should probably be based on Legal.getMaxSafeRegisterWidth()?

computeFeasibleMaxVF should only return legal vectorization factors (using getMaxSafeRegisterWidht) I think. Given that we are free to ignore the hint, if it is not useful, why not just use the largest safe vectorization factors instead of bailing out?

Unfortunately the handling of UserVF is a bit of a mess, but I think it might be preferable to clamp the UserVF to the maximum vectorization factor in the caller of computeMaxVF where UserVF is actually used.

llvm/test/Transforms/LoopVectorize/X86/libm-vector-calls.ll
192 ↗(On Diff #302584)

Why those test changes?

#pragma clang loop vectorize_witdh(..) ignoring safety checks is indeed bad. Instead of not vectorizing at all in this case, did you consider using min(UserVF,FeasibleVF) instead?

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5275

Legal.getMaxSafeRegisterWidth() is called within computeFeasibleMaxVF. With computeFeasibleMaxVF considering additional architecture concerns, can these be just ignored?

5275

With computeFeasibleMaxVF already called later, could you store its result for both uses?

I suggest to move UserVF ? UserVF : computeFeasibleMaxVF(TC) which is duplicated multiple times below before this condition. Such as:

auto MaxVF = computeFeasibleMaxVF(TC);
if (UserVF) {
  if (UserVF > MaxVF) {
    ...
  }
  MaxVF = UserFC;
}
c-rhodes added inline comments.Nov 3 2020, 12:48 PM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5275

It should probably be based on Legal.getMaxSafeRegisterWidth()?

computeFeasibleMaxVF should only return legal vectorization factors (using getMaxSafeRegisterWidht) I think. Given that we are free to ignore the hint, if it is not useful, why not just use the largest safe vectorization factors instead of bailing out?

Yeah that's right, computeFeasibleMaxVF is based on Legal->getMaxSafeRegisterWidth() and bounded by the widest register according to the TTI.

Unfortunately the handling of UserVF is a bit of a mess, but I think it might be preferable to clamp the UserVF to the maximum vectorization factor in the caller of computeMaxVF where UserVF is actually used.

Sure, clamping sounds good to me and @Meinersbur suggested this as well, I'll update the patch.

5275

I don't think computeFeasibleMaxVF is the maximum safe width. It does a lot of things and is largely based on the backend vector widths. I'm guessing that's why so many tests are changing.

I did find for the X86 tests that have changed it was because of a backend vector width of 128-bit, I expect these changes will still be required when clamping the UserVF to the maximum vectorization factor as suggested.

llvm/test/Transforms/LoopVectorize/X86/libm-vector-calls.ll
192 ↗(On Diff #302584)

Why those test changes?

The maximum VF=2, it's computed as WidestRegister / WidestType where the the widest register is 128-bit.

c-rhodes updated this revision to Diff 302809.Nov 4 2020, 3:42 AM

Rather than bail out of vectorization if UserVF > MaxVF, clamp UserVF to the maximum feasible VF.

dmgreen added inline comments.Nov 4 2020, 10:56 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5274

Clamping sounds good.

But I think that there is a difference between the "maximum safe vector width" and "the maximum register bitwidth". The maximum safe distance should be a legality constraint that doesn't depend on the size of the backend registers.

It's perhaps easier to show with examples. This code should produce the same thing (the same width vectors) with and without this patch, I believe, as there is nothing "unsafe" about vectorizing at a higher bitwidth than the vector registers:
https://godbolt.org/z/ePdv3K

(I am trying to not use the term "legal", as it has too many meanings. There is a difference between "legal to vectorize" (as the safety constraints in LoopVectorizationLegality) and "legal vector widths" which just means that the llvm-ir vector can be lowered to a single vector register and I don't think should be very relevant here).

c-rhodes updated this revision to Diff 303119.Nov 5 2020, 7:39 AM

Address @dmgreen's comments, response in thread.

c-rhodes added inline comments.Nov 5 2020, 7:40 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5274

But I think that there is a difference between the "maximum safe vector width" and "the maximum register bitwidth". The maximum safe distance should be a legality constraint that doesn't depend on the size of the backend registers.

So only considering the VF computed by LAA for dependencies rather the backend register widths, I think that makes sense.

Whilst looking into this I discovered the example I gave in the commit message doesn't actually vectorize when only specifying -force-vector-width=4 and no loop hint:

void foo(int *a, int *b, int N) {
  for (int i=0; i<N; ++i) {
    a[i + 2] = a[i] + b[i];
  }
}

clang -S -emit-llvm -o - ../dependence.c -O3 -Rpass-missed=loop-vectorize -mllvm -force-vector-width=4
../dependence.c:2:3: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
  for (int i=0; i<N; ++i) {
  ^

It's a bit of a mess how UserVF is handled, it seems LAA only knows about the UserVF specified by -force-vector-width. With the pragma LAA is operating on VF=2 and LV on VF=4. What's also interesting is the loop metadata takes precedence over the flag since in LoopVectorizationLegality the vector width is initialized with the flag then populated with loop metadata, so the VF according to LAA would come from the flag and in LV the loop hint, assuming both were specified by the user that is.

I've updated the patch such that computeFeasibleMaxVF now takes UserVF and clamps this to VF based on Legal->getMaxSafeRegisterWidth(); if it exceeds it. This simplifies the patch a fair bit since no existing tests change, although I think the handling of UserVF needs refactoring. Now I know -force-vector-width won't apply unless it's safe, the loop hint and flag feel semantically equivalent but I'm not sure if there's anything I'm missing. Any thoughts?

sdesmalen added inline comments.Nov 5 2020, 7:46 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5274

That's a good point @dmgreen, MaxSafeRegisterWidth and corresponding getMaxSafeRegisterWidth in LoopVectorize.cpp and LoopAccessAnalysis.cpp are actually misnomers because it isn't the maximum safe register width that is calculated, but rather the maximum safe vector bitwidth.

Without this patch, this example is vectorized with VF=8 when compiling for Neon (128bit vectors):

void foo(int *a, int *b, int c, int N) {
  #pragma clang loop vectorize(enable) vectorize_width(8)
  for (int i=0; i<N; ++i) {
    a[i + 16] = a[i] + b[i];
  }
}

Where with this patch, it is now vectorized with VF=4. It seems like the limitation with regards to the actual physical vector register can and should be removed.

sdesmalen added inline comments.Nov 5 2020, 7:53 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5274

the loop hint and flag feel semantically equivalent but I'm not sure if there's anything I'm missing. Any thoughts?

To me they read as slightly different things, but it's good to clear their semantics up.

I interpret:

  • -force-vector-width as "vectorize it with this width and this width only, and fail if not legal".
  • The LoopHint as "try to vectorize with this width but if not legal, feel free to ignore the hint and pick a different width".

At least, that's what the implementation does today and I assumed that was deliberate. Maybe someone else can clarify that?

Meinersbur added inline comments.Nov 9 2020, 4:26 PM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5274

-force-vector-width is internal and should be used for regression tests only. Whether it applies a different vector width or does not vectorize at all, the regression tests should fail. Because it is internal, we it shouldn't matter which semantics we chose. Personally, I'd chose the LoopHint semantics to reduce the number of code paths.

5382

[style] LLVM coding style prefers explicit types in declarations.

5386–5388

Should there be (also) a diagnostic warning (-Rpass) to inform the user that the value has been clamped? (Or maybe there is already and I don't see where it is done)

fhahn added inline comments.Nov 10 2020, 2:35 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5274

I've updated the patch such that computeFeasibleMaxVF now takes UserVF and clamps this to VF based on Legal->getMaxSafeRegisterWidth(); if it exceeds it. This simplifies the patch a fair bit since no existing tests change, although I think the handling of UserVF needs refactoring. Now I know -force-vector-width won't apply unless it's safe, the loop hint and flag feel semantically equivalent but I'm not sure if there's anything I'm missing. Any thoughts?

The logic looks good I think. My only concern is that passing UserVF even further down seems to make handling of it even more complicated to follow, but I don't think there's a good way around that because we need some info not available here.

At least, that's what the implementation does today and I assumed that was deliberate. Maybe someone else can clarify that?

-force-vector-width is internal and should be used for regression tests only. Whether it applies a different vector width or does not vectorize at all, the regression tests should fail. Because it is internal, we it shouldn't matter which semantics we chose. Personally, I'd chose the LoopHint semantics to reduce the number of code paths.

Yes that is how it is used today AFAIK, to write LV tests independent of cost-modeling. Agreed with @Meinersbur that it should not matter for regression tests, because I think it is mostly used for testing codegen in legal scenarios. If we decide to adjust the behavior, that's probably best done in a separate patch.

5386–5388

I think we currently would only issue a remark with the chosen VF, but it would probably good to add a separate remark so it is easy for users to spot (OptimizationRemarkAnalysis seems to be a suitable category here).

c-rhodes updated this revision to Diff 304217.Nov 10 2020, 8:52 AM
c-rhodes retitled this revision from [LV] Ignore VF hint when unsafe to [LV] Clamp VF hint when unsafe.
  • Fix style issue.
  • Add optimization remark when clamping.
c-rhodes marked 3 inline comments as done.Nov 10 2020, 9:03 AM
c-rhodes added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5274

@fhahn @Meinersbur thanks for clarifying

5386–5388

I think we currently would only issue a remark with the chosen VF, but it would probably good to add a separate remark so it is easy for users to spot (OptimizationRemarkAnalysis seems to be a suitable category here).

Done, thanks for pointing out OptimizationRemarkAnalysis

Meinersbur accepted this revision.Nov 10 2020, 1:46 PM

Thanks, looks good to me.

This revision is now accepted and ready to land.Nov 10 2020, 1:46 PM
fhahn accepted this revision.Nov 10 2020, 2:48 PM

LGTM as well, thanks!

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5387

nit: it seems like the message here and of the remark slightly diverged. I think it would be worth using the same 'clamping' wording as in the remark here.

llvm/test/Transforms/LoopVectorize/unsafe-vf-remark.ll
3 ↗(On Diff #304217)

It might also be interesting to add a test cases where the user provided VF is large (say 64), the max legal width is something like 32 and the profitable width selected by the cost model is something smaller (might be easier if this is a target-specific test for a specific architecture ).

sdesmalen accepted this revision.Nov 11 2020, 12:32 AM

LGTM as well. It seems the current patch fixes the example I pasted in my previous comment.

LGTM as well. It seems the current patch fixes the example I pasted in my previous comment.

Yeah, Thanks for that.

c-rhodes added inline comments.Nov 11 2020, 5:32 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5387

nit: it seems like the message here and of the remark slightly diverged. I think it would be worth using the same 'clamping' wording as in the remark here.

Good spot, I'll fix it before merging or update this patch once I get a better idea about the other suggestion you made.

llvm/test/Transforms/LoopVectorize/unsafe-vf-remark.ll
3 ↗(On Diff #304217)

It might also be interesting to add a test cases where the user provided VF is large (say 64), the max legal width is something like 32 and the profitable width selected by the cost model is something smaller (might be easier if this is a target-specific test for a specific architecture ).

Is this loop like what you had in mind?

void foo(int *a, int *b, int N) {
  #pragma clang loop vectorize(enable) vectorize_width(64)
  for (int i=0; i<N; ++i) {
    a[i + 32] = a[i] / b[i];
  }
}

When compiling with:

./bin/clang -S -emit-llvm -o - ../dependence.c -O2 -mllvm -debug-only=loop-vectorize,loop-accesses -target aarch64-linux-gnu

The user VF of 64 is unsafe so it's clamped to 32 and the vector loop of width 32 is more expensive (cost 13) than the scalar loop (cost 10), although the vectorization is forced so the VF=32 is still chosen.

c-rhodes updated this revision to Diff 306069.Nov 18 2020, 5:31 AM
  • Rebased (computeFeasibleMaxVF now returns an ElementCount).
  • Address one of @fhahn's comments.
  • Added an assert at the top of computeFeasibleMaxVF that UserVF isn't scalable.
c-rhodes marked an inline comment as done.Nov 18 2020, 5:31 AM
c-rhodes added inline comments.Nov 20 2020, 3:24 AM
llvm/test/Transforms/LoopVectorize/unsafe-vf-remark.ll
3 ↗(On Diff #304217)

It might also be interesting to add a test cases where the user provided VF is large (say 64), the max legal width is something like 32 and the profitable width selected by the cost model is something smaller (might be easier if this is a target-specific test for a specific architecture ).

I might be missing something but I don't think this test will add any value, although I'm not comfortable landing this patch until this is resolved. I suggest landing this patch as is, @fhahn unless you feel strongly about it?

fhahn added inline comments.Nov 24 2020, 11:37 AM
llvm/test/Transforms/LoopVectorize/unsafe-vf-remark.ll
3 ↗(On Diff #304217)

It might also be interesting to add a test cases where the user provided VF is large (say 64), the max legal width is something like 32 and the profitable width selected by the cost model is something smaller (might be easier if this is a target-specific test for a specific architecture ).

I might be missing something but I don't think this test will add any value, although I'm not comfortable landing this patch until this is resolved. I suggest landing this patch as is, @fhahn unless you feel strongly about it?

Agreed, such a test doesn't really add much. What I was suggesting was one where the cost model does pick a different VF than the maximum safe one. This is the case that should be handled differently with the current version compared to the first version.

I was thinking about a test like the one below. It requests VF = 32, but only 16 is safe. When built with opt -loop-vectorize -mtriple=arm64-apple-iphoneos, the cost model should pick VF = 2 instead of the higher alternatives.

define void @test(i64* nocapture %a, i64* nocapture readonly %b) {
entry:
  br label %loop.header

loop.header:
  %iv = phi i64 [ 0, %entry ], [ %iv.next, %latch ]
  %arrayidx = getelementptr inbounds i64, i64* %a, i64 %iv
  %0 = load i64, i64* %arrayidx, align 4
  %arrayidx2 = getelementptr inbounds i64, i64* %b, i64 %iv
  %1 = load i64, i64* %arrayidx2, align 4
  %add = add nsw i64 %1, %0
  %2 = add nuw nsw i64 %iv, 16
  %arrayidx5 = getelementptr inbounds i64, i64* %a, i64 %2
  %c = icmp eq i64 %1, 120
  br i1 %c, label %then, label %latch

then:
  store i64 %add, i64* %arrayidx5, align 4
  br label %latch

latch:
  %iv.next = add nuw nsw i64 %iv, 1
  %exitcond.not = icmp eq i64 %iv.next, 1024
  br i1 %exitcond.not, label %exit, label %loop.header, !llvm.loop !0

exit:
  ret void
}

!0 = !{!0, !1, !2}
!1 = !{!"llvm.loop.vectorize.width", i64 32}
!2 = !{!"llvm.loop.vectorize.enable", i1 true}
20 ↗(On Diff #304217)

nit: Is this required? LV will add a minimum iteration check anyways, so this could be dropped to make the IR more compact? Alternatively we could also use a constant trip count.

24 ↗(On Diff #304217)

nit: Is this required? We could just change %N to be a i64 to make the IR more compact.

31 ↗(On Diff #304217)

nit: can we drop the indvars. prefix to make the IR slightly more readable?

c-rhodes updated this revision to Diff 307575.Nov 25 2020, 4:01 AM

Address Florian's comments

c-rhodes marked 2 inline comments as done.Nov 25 2020, 4:03 AM
c-rhodes added inline comments.
llvm/test/Transforms/LoopVectorize/unsafe-vf-remark.ll
3 ↗(On Diff #304217)

I was thinking about a test like the one below. It requests VF = 32, but only 16 is safe. When built with opt -loop-vectorize -mtriple=arm64-apple-iphoneos, the cost model should pick VF = 2 instead of the higher alternatives.

That makes sense cheers, I've added the test

20 ↗(On Diff #304217)

nit: Is this required? LV will add a minimum iteration check anyways, so this could be dropped to make the IR more compact? Alternatively we could also use a constant trip count.

Changed it to use a constant trip count

I think I've addressed all comments now so I'll land this at the beginning of next week unless there's any objections between then, thanks for reviewing all!

This revision was automatically updated to reflect the committed changes.
iajbar reopened this revision.Jan 15 2021, 8:53 AM
iajbar added a subscriber: iajbar.

I configured and built llvm with:
cmake -G Ninja -DLLVM_TARGETS_TO_BUILD:STRING=Hexagon -DLLVM_DEFAULT_TARGET_TRIPLE:STRING=hexagon-unknown-elf -DLLVM_TARGET_ARCH:STRING=hexagon-unknown-elf -DLLVM_ENABLE_ASSERTIONS:BOOL=ON -DLLVM_PARALLEL_LINK_JOBS=1 '-DLLVM_ENABLE_PROJECTS=llvm;clang' -DBUILD_SHARED_LIBS:BOOL=ON ..

This patch causes an assert with this testcase:
typedef struct {

char a;

} b;
b *c;
int d, e;
int f() {

int g = 0;
for (; d; d++) {
  e = 0;
  for (; e < c[d].a; e++)
    g++;
}
return g;

}
clang -Os -mhvx -fvectorize -mv67 testcase.i -S -o -

llvm::LoopVectorizationCostModel::computeMaxVF(llvm::ElementCount, unsigned int): Assertion `WideningDecisions.empty() && Uniforms.empty() && Scalars.empty() && "No decisions should have been taken at this point"' failed.

This revision is now accepted and ready to land.Jan 15 2021, 8:53 AM

clang can also be configured for all targets
cmake -G Ninja -DLLVM_ENABLE_ASSERTIONS:BOOL=ON -DLLVM_PARALLEL_LINK_JOBS=1 '-DLLVM_ENABLE_PROJECTS=llvm;clang' -DBUILD_SHARED_LIBS:BOOL=ON ..

clang -Os -target hexagon -mhvx -fvectorize -mv67 testcase.i -S -o -

I configured and built llvm with:
cmake -G Ninja -DLLVM_TARGETS_TO_BUILD:STRING=Hexagon -DLLVM_DEFAULT_TARGET_TRIPLE:STRING=hexagon-unknown-elf -DLLVM_TARGET_ARCH:STRING=hexagon-unknown-elf -DLLVM_ENABLE_ASSERTIONS:BOOL=ON -DLLVM_PARALLEL_LINK_JOBS=1 '-DLLVM_ENABLE_PROJECTS=llvm;clang' -DBUILD_SHARED_LIBS:BOOL=ON ..

This patch causes an assert with this testcase:
typedef struct {

char a;

} b;
b *c;
int d, e;
int f() {

int g = 0;
for (; d; d++) {
  e = 0;
  for (; e < c[d].a; e++)
    g++;
}
return g;

}
clang -Os -mhvx -fvectorize -mv67 testcase.i -S -o -

llvm::LoopVectorizationCostModel::computeMaxVF(llvm::ElementCount, unsigned int): Assertion `WideningDecisions.empty() && Uniforms.empty() && Scalars.empty() && "No decisions should have been taken at this point"' failed.

Thanks for reporting. I reproduced the crash but found it also occurs before this landed (I tried commit eeba70a), so I'm not sure it's this patch causing the issue. Could you confirm this?

The patch that causes the assert is patch cba4accda08f90. I tried commit c63799fc52ff247 that is before your patch and there is no assert. Thank you.

The patch that causes the assert is patch cba4accda08f90. I tried commit c63799fc52ff247 that is before your patch and there is no assert. Thank you.

Apologies, I got this patch mixed up with D91718. You're right the crash is introduced by this patch. I've posted a fix D94869, see patch for more details.

Unrelated to this issue but might be of interest to you, I hit a crash:

Loop IV: clang: /home/culrho01/llvm-project/llvm/include/llvm/Support/Casting.h:104: static bool llvm::isa_impl_cl<To, const From*>::doit(const From*) [with To = llvm::Instruction; From = llvm::Value]: Assertion `Val && "isa<> used on a null pointer"' failed.

when compiling your testcase with -mllvm -debug. Full invocation: ./bin/clang -Os -mhvx -fvectorize -mv67 ../testcase.c -S -o - -mllvm -debug.

bjope added a subscriber: bjope.Jan 17 2021, 12:56 AM

Thanks Cullen! I tried your fix D94869 and my benchmark passes.

c-rhodes closed this revision.Feb 1 2021, 5:19 AM

Thanks Cullen! I tried your fix D94869 and my benchmark passes.

Fix has landed, closing this again.