
[SLP] Schedule only sub-graph of vectorizable instructions
ClosedPublic

Authored by reames on Jan 29 2022, 7:50 AM.

Details

Summary

SLP currently schedules all instructions within a scheduling window which stretches from the first instruction potentially vectorized to the last. This window can include a very large number of unrelated instructions which are not being considered for vectorization. This change switches the code to only schedule the sub-graph consisting of the instructions being vectorized and their transitive users.

This has the effect of greatly reducing the amount of work performed in large basic blocks, and thus greatly improves compile time on degenerate examples. To understand the effects, I added some statistics (not planned for upstream contribution). Here's an illustration from my motivating example:

Before this patch:

 704357 SLP                          - Number of calcDeps actions
 699021 SLP                          - Number of schedule calls
   5598 SLP                          - Number of ReSchedule actions
     59 SLP                          - Number of ReScheduleOnFail actions
  10084 SLP                          - Number of schedule resets
   8523 SLP                          - Number of vector instructions generated

After this patch:

 102895 SLP                          - Number of calcDeps actions
 161916 SLP                          - Number of schedule calls
   5637 SLP                          - Number of ReSchedule actions
     55 SLP                          - Number of ReScheduleOnFail actions
  10083 SLP                          - Number of schedule resets
   8403 SLP                          - Number of vector instructions generated

I do want to highlight that there is a small difference in the number of generated vector instructions. I have to admit I'm confused by this, as in theory the scheduling should not change this at all. I have not been able to reduce an example that differs, and I don't see such a change in the tests. Any ideas what might be going on here?

(Edit: I think I see the cause. This example is hitting the bailout due to maximum window size, and the change in scheduling is slightly perturbing when and how we hit it. This can be seen in the RescheduleOnFail counter change. Given that, I think we can safely ignore.)

The downside of this change can be seen in the large test diff. We group all vectorizable instructions together at the bottom of the scheduling region. This means that vector instructions can move quite far from their original point in code. While maybe undesirable, I don't see this as being a major problem as this pass is not intended to be a general scheduling pass.

For context, it's worth noting that the pre-scheduling that SLP does while building the vector tree is exactly the sub-graph scheduling implemented by this patch.
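For readers unfamiliar with the approach, the sub-graph construction described in the summary can be sketched as a simple worklist walk. This is a hypothetical illustration, not the actual SLPVectorizer code; `SchedNode` and `computeScheduleSubGraph` are invented names:

```cpp
#include <cassert>
#include <queue>
#include <set>
#include <vector>

// Hypothetical stand-in for an instruction inside the scheduling window.
struct SchedNode {
  int Id;
  std::vector<SchedNode *> Users; // instructions consuming this one's result
};

// Collect the vectorized (bundle) instructions plus all transitive users.
// Only this sub-graph needs dependence calculation and scheduling; the
// remaining, unrelated instructions in the window are left untouched.
std::set<SchedNode *>
computeScheduleSubGraph(const std::vector<SchedNode *> &Bundle) {
  std::set<SchedNode *> SubGraph;
  std::queue<SchedNode *> Worklist;
  for (SchedNode *N : Bundle)
    Worklist.push(N);
  while (!Worklist.empty()) {
    SchedNode *N = Worklist.front();
    Worklist.pop();
    if (!SubGraph.insert(N).second)
      continue; // already visited
    for (SchedNode *U : N->Users)
      Worklist.push(U);
  }
  return SubGraph;
}
```

With a chain a -> b -> c and an unrelated node d in the same window, starting from {a} visits only a, b, and c; this is why the calcDeps and schedule counts drop so sharply in the statistics above.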

Diff Detail

Event Timeline

reames created this revision.Jan 29 2022, 7:50 AM
reames requested review of this revision.Jan 29 2022, 7:50 AM
reames edited the summary of this revision. (Show Details)Jan 29 2022, 8:52 AM
ABataev added inline comments.Jan 31 2022, 5:41 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
7926

Can we keep this assert here or replace it with another one? It helps catch incorrect scheduling in many cases.

reames added inline comments.Jan 31 2022, 11:23 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
7926

Not easily. We'd need to track the increments through the calls to calculateDependencies since the set size now depends on the transitive use walk.

I get why you want this, but I don't see an easy way to preserve it.

Do you think it's worth the complexity of plumbing an assert-only param through calculateDependencies?

ABataev added inline comments.Jan 31 2022, 12:19 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
7926

No, certainly not. But can you try to implement something simple here?

reames added inline comments.Jan 31 2022, 12:25 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
7926

Did you have something particular in mind?

Not trying to be difficult, I just don't see a simple assert here.

reames updated this revision to Diff 405696.Feb 3 2022, 10:01 AM

Rebase and remove dependency on NFC changes. This should be able to land either before or after those.

reames added inline comments.Feb 4 2022, 1:20 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
7926

Thinking about this, I could at least add an assert that all of the bundled instructions were scheduled. That wouldn't handle transitive users of the vector tree, but it would be better than nothing. Would that satisfy you?

ABataev added inline comments.Feb 11 2022, 9:55 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
7926

This code is very sensitive and as you already noted it might be very useful to keep this (or similar) assert. Could you add an assert for bundled instructions and transitive users to be absolutely sure that neither this patch, nor future ones, break anything in scheduling?

reames updated this revision to Diff 409015.Feb 15 2022, 12:39 PM

Rebase over 2e507607 which includes the requested scheduling assert.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
7926

Landed in 2e507607, patch rebased.

This turned out much simpler than I'd pictured, and is clearly warranted. Thank you for pushing on this.

ABataev added inline comments.Feb 15 2022, 1:07 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
7926

Will try to test it tomorrow.

ping

I asked in rG2e507607754c25fae82c35d93d2ab53395be6ff8, duplicating here. Can you enable those extra checks for asserts build with EXPENSIVE_CHECKS?

I just replied there.

This revision is now accepted and ready to land.Feb 22 2022, 7:41 AM
This revision was landed with ongoing or failed builds.Feb 22 2022, 10:16 AM
This revision was automatically updated to reflect the committed changes.

Hi, we've bisected some win32 miscompiles to this patch: https://crbug.com/1300322.
It seems like some inalloca allocas are getting moved around relative to their corresponding llvm.stacksave/stackrestore.
Not being familiar with SLPVectorizer, I'm not sure if this patch caused it or exposed some existing issue. Any advice?

rnk added a subscriber: rnk.Mar 1 2022, 2:08 PM

Based on some debug logs, it seems like the llvm.stacksave()/llvm.stackrestore()s are moving around, not the allocas.

Better reduced repro:

$ cat /tmp/a.ll
target datalayout = "e-m:x-p:32:32-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32-a:0:32-S32"
target triple = "i386-pc-windows-msvc19.16.0"

declare i8* @llvm.stacksave()

declare void @llvm.stackrestore(i8*)

declare i8* @wibble(i8*)

declare void @quux(i32* inalloca(i32))

define void @ham() #1 {
  %tmp2 = alloca i8
  %tmp3 = alloca i8
  %tmp4 = alloca i8
  %tmp5 = alloca i8
  %tmp12 = alloca [12 x i8*]
  %tmp15 = call i8* @wibble(i8* %tmp2)
  %tmp16 = call i8* @wibble(i8* %tmp3)
  %tmp17 = call i8* @wibble(i8* %tmp4)
  %tmp23 = call i8* @llvm.stacksave()
  %tmp24 = alloca inalloca i32
  call void @quux(i32* inalloca(i32) %tmp24)
  call void @llvm.stackrestore(i8* %tmp23)
  %tmp32 = getelementptr inbounds [12 x i8*], [12 x i8*]* %tmp12, i32 0, i32 0
  store i8* %tmp4, i8** %tmp32
  %tmp33 = getelementptr inbounds [12 x i8*], [12 x i8*]* %tmp12, i32 0, i32 1
  store i8* %tmp4, i8** %tmp33
  %tmp34 = getelementptr inbounds [12 x i8*], [12 x i8*]* %tmp12, i32 0, i32 2
  store i8* %tmp4, i8** %tmp34
  %tmp35 = getelementptr inbounds [12 x i8*], [12 x i8*]* %tmp12, i32 0, i32 3
  store i8* %tmp4, i8** %tmp35
  %tmp36 = getelementptr inbounds [12 x i8*], [12 x i8*]* %tmp12, i32 0, i32 4
  store i8* %tmp4, i8** %tmp36
  %tmp37 = getelementptr inbounds [12 x i8*], [12 x i8*]* %tmp12, i32 0, i32 5
  store i8* %tmp5, i8** %tmp37
  %tmp38 = getelementptr inbounds [12 x i8*], [12 x i8*]* %tmp12, i32 0, i32 6
  store i8* %tmp5, i8** %tmp38
  %tmp39 = getelementptr inbounds [12 x i8*], [12 x i8*]* %tmp12, i32 0, i32 7
  store i8* %tmp5, i8** %tmp39
  ret void
}

attributes #0 = { nofree nosync nounwind willreturn }
attributes #1 = { "target-features"="+cx8,+fxsr,+mmx,+sse,+sse2,+sse3,+x87" }

$ bin/opt -passes=slp-vectorizer /tmp/a.ll -S
target datalayout = "e-m:x-p:32:32-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32-a:0:32-S32"
target triple = "i386-pc-windows-msvc19.16.0"

; Function Attrs: nofree nosync nounwind willreturn
declare i8* @llvm.stacksave() #0

; Function Attrs: nofree nosync nounwind willreturn
declare void @llvm.stackrestore(i8*) #0

declare i8* @wibble(i8*)

declare void @quux(i32* inalloca(i32))

define void @ham() #1 {
  %tmp2 = alloca i8, align 1
  %tmp3 = alloca i8, align 1
  %tmp12 = alloca [12 x i8*], align 4
  %tmp15 = call i8* @wibble(i8* %tmp2)
  %tmp16 = call i8* @wibble(i8* %tmp3)
  %tmp24 = alloca inalloca i32, align 4
  %tmp32 = getelementptr inbounds [12 x i8*], [12 x i8*]* %tmp12, i32 0, i32 0
  %tmp33 = getelementptr inbounds [12 x i8*], [12 x i8*]* %tmp12, i32 0, i32 1
  %tmp34 = getelementptr inbounds [12 x i8*], [12 x i8*]* %tmp12, i32 0, i32 2
  %tmp35 = getelementptr inbounds [12 x i8*], [12 x i8*]* %tmp12, i32 0, i32 3
  %1 = bitcast i8** %tmp32 to <4 x i8*>*
  %tmp36 = getelementptr inbounds [12 x i8*], [12 x i8*]* %tmp12, i32 0, i32 4
  %tmp37 = getelementptr inbounds [12 x i8*], [12 x i8*]* %tmp12, i32 0, i32 5
  %tmp38 = getelementptr inbounds [12 x i8*], [12 x i8*]* %tmp12, i32 0, i32 6
  %tmp39 = getelementptr inbounds [12 x i8*], [12 x i8*]* %tmp12, i32 0, i32 7
  %tmp4 = alloca i8, align 1
  %tmp5 = alloca i8, align 1
  %tmp17 = call i8* @wibble(i8* %tmp4)
  %tmp23 = call i8* @llvm.stacksave()
  call void @quux(i32* inalloca(i32) %tmp24)
  call void @llvm.stackrestore(i8* %tmp23)
  %2 = insertelement <4 x i8*> poison, i8* %tmp4, i32 0
  %shuffle = shufflevector <4 x i8*> %2, <4 x i8*> poison, <4 x i32> zeroinitializer
  store <4 x i8*> %shuffle, <4 x i8*>* %1, align 4
  %3 = insertelement <4 x i8*> %2, i8* %tmp5, i32 1
  %shuffle1 = shufflevector <4 x i8*> %3, <4 x i8*> poison, <4 x i32> <i32 0, i32 1, i32 1, i32 1>
  %4 = bitcast i8** %tmp36 to <4 x i8*>*
  store <4 x i8*> %shuffle1, <4 x i8*>* %4, align 4
  ret void
}

attributes #0 = { nofree nosync nounwind willreturn }
attributes #1 = { "target-features"="+cx8,+fxsr,+mmx,+sse,+sse2,+sse3,+x87" }

%tmp24 is no longer between the stacksave/stackrestore

Ideally we'd revert this patch, but there are a lot of patches on top of this and the revert isn't clean.

I can try to do it tomorrow, if Philip won't revert it himself.

Actually this sequence seems to pass check-llvm with expensive checks on. Is this reasonable?
(only reverting the top two causes the test added in ee48b2f1c3f646a32557d21b7cf476466a8b8a2f to fail)

commit c0918b5685df34a1df56eaec9408b0053adce145 (HEAD)
Author: Arthur Eubanks <aeubanks@google.com>
Date:   Tue Mar 1 16:23:04 2022 -0800

    Revert "[SLP] Schedule only sub-graph of vectorizable instructions"
    
    This reverts commit 0539a26d91a1b7c74022fa9cf33bd7faca87544d.

commit 97c3a3aea222b6cdc5972ebfa411a0f43ff45e7d
Author: Arthur Eubanks <aeubanks@google.com>
Date:   Tue Mar 1 16:22:59 2022 -0800

    Revert "[SLP] Remove SchedulingPriority from ScheduleData [NFC]"
    
    This reverts commit a3e9b32c00959ad5c73189d8378d019fbe80ade5.

commit ee48b2f1c3f646a32557d21b7cf476466a8b8a2f
Author: Arthur Eubanks <aeubanks@google.com>
Date:   Tue Mar 1 16:21:43 2022 -0800

    Revert "[SLP][NFC]Add a test for bottom to top reordering."
    
    This reverts commit b3f4535a039918965adb21509700739afc25f9f1.

commit b04a828ae3c19f5aa699f5e9821b04cb413bca9d
Author: Arthur Eubanks <aeubanks@google.com>
Date:   Tue Mar 1 16:21:38 2022 -0800

    Revert "[SLP]Improve bottom-to-top reordering."
    
    This reverts commit e4b9640867150723b33f81c6479682fc955b55aa.

Can you try to revert only this:

commit c0918b5685df34a1df56eaec9408b0053adce145 (HEAD)
Author: Arthur Eubanks <aeubanks@google.com>
Date:   Tue Mar 1 16:23:04 2022 -0800

    Revert "[SLP] Schedule only sub-graph of vectorizable instructions"
    
    This reverts commit 0539a26d91a1b7c74022fa9cf33bd7faca87544d.

commit 97c3a3aea222b6cdc5972ebfa411a0f43ff45e7d
Author: Arthur Eubanks <aeubanks@google.com>
Date:   Tue Mar 1 16:22:59 2022 -0800

    Revert "[SLP] Remove SchedulingPriority from ScheduleData [NFC]"
    
    This reverts commit a3e9b32c00959ad5c73189d8378d019fbe80ade5.

That causes the test introduced in b3f4535a03991896 to fail (llvm/test/Transforms/SLPVectorizer/X86/bottom-to-top-reorder.ll)

Could you just update test checks for this test?

Ah yes will do.

Thanks!

reames added a comment.EditedMar 2 2022, 8:28 AM

Thank you for the revert. I agree that the test case above shows a correctness problem in SLP. I don't yet see how that's related to this change, but will investigate and see what falls out.

Initial investigation - SLP is allowing the formation of a bundle containing allocas. This is highly suspect as there is a dependency edge between an alloca and any following stacksave which isn't modeled in the code. There's also a question of whether static allocas should be rescheduled such that they're no longer static.

I can definitely see why this interacts badly with the reverted patch. I'm currently suspicious the existing code is also wrong, working on trying to find a test case now.

Further update - the old code is "correct" for definitions of correct which require multiple accidents to keep correct behavior. More than that, the code relies on forming an illegal bundle (e.g. one which has missing dependencies) in order to hit the shufflevector code path. Have I mentioned recently that this code is less than ideal? Working on a fix now.

Root issue fixed (689bab) and patch reapplied. Please confirm the unreduced example is also fixed; it's always possible there were two issues and we only reduced/fixed one.

uabelho added a comment.EditedMar 3 2022, 12:08 AM

Hello,

We see other problems that started appearing with this commit.
With

build-all-builtins/bin/clang -finline-hint-functions -fstack-protector-all -fwrapv -std=c99 -fsanitize=undefined -O3 'vla_sum_4.c' -fsanitize=undefined -l gcc_s -o 'vla_sum_4.out'
./vla_sum_4.out

we see

*** stack smashing detected ***: ./vla_sum_4.out terminated
======= Backtrace: =========
/lib64/libc.so.6(__fortify_fail+0x37)[0x7fc20889c697]
/lib64/libc.so.6(+0x118652)[0x7fc20889c652]
./vla_sum_4.out[0x42b618]
./vla_sum_4.out[0x42b92d]
./vla_sum_4.out[0x42bb45]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7fc2087a6555]
./vla_sum_4.out[0x402d25]
======= Memory map: ========
00400000-0043f000 r-xp 00000000 fd:01 51014653                           /repo/uabelho/master-github/llvm/vla_sum_4.out
0063f000-00640000 r--p 0003f000 fd:01 51014653                           /repo/uabelho/master-github/llvm/vla_sum_4.out
00640000-00643000 rw-p 00040000 fd:01 51014653                           /repo/uabelho/master-github/llvm/vla_sum_4.out
00643000-00f84000 rw-p 00000000 00:00 0 
023ce000-023ef000 rw-p 00000000 00:00 0                                  [heap]
7fc208784000-7fc208948000 r-xp 00000000 fd:00 33598593                   /usr/lib64/libc-2.17.so
7fc208948000-7fc208b47000 ---p 001c4000 fd:00 33598593                   /usr/lib64/libc-2.17.so
7fc208b47000-7fc208b4b000 r--p 001c3000 fd:00 33598593                   /usr/lib64/libc-2.17.so
7fc208b4b000-7fc208b4d000 rw-p 001c7000 fd:00 33598593                   /usr/lib64/libc-2.17.so
7fc208b4d000-7fc208b52000 rw-p 00000000 00:00 0 
7fc208b52000-7fc208b54000 r-xp 00000000 fd:00 34860521                   /usr/lib64/libdl-2.17.so
7fc208b54000-7fc208d54000 ---p 00002000 fd:00 34860521                   /usr/lib64/libdl-2.17.so
7fc208d54000-7fc208d55000 r--p 00002000 fd:00 34860521                   /usr/lib64/libdl-2.17.so
7fc208d55000-7fc208d56000 rw-p 00003000 fd:00 34860521                   /usr/lib64/libdl-2.17.so
7fc208d56000-7fc208e57000 r-xp 00000000 fd:00 34860522                   /usr/lib64/libm-2.17.so
7fc208e57000-7fc209056000 ---p 00101000 fd:00 34860522                   /usr/lib64/libm-2.17.so
7fc209056000-7fc209057000 r--p 00100000 fd:00 34860522                   /usr/lib64/libm-2.17.so
7fc209057000-7fc209058000 rw-p 00101000 fd:00 34860522                   /usr/lib64/libm-2.17.so
7fc209058000-7fc20905f000 r-xp 00000000 fd:00 34860529                   /usr/lib64/librt-2.17.so
7fc20905f000-7fc20925e000 ---p 00007000 fd:00 34860529                   /usr/lib64/librt-2.17.so
7fc20925e000-7fc20925f000 r--p 00006000 fd:00 34860529                   /usr/lib64/librt-2.17.so
7fc20925f000-7fc209260000 rw-p 00007000 fd:00 34860529                   /usr/lib64/librt-2.17.so
7fc209260000-7fc209277000 r-xp 00000000 fd:00 33555219                   /usr/lib64/libpthread-2.17.so
7fc209277000-7fc209476000 ---p 00017000 fd:00 33555219                   /usr/lib64/libpthread-2.17.so
7fc209476000-7fc209477000 r--p 00016000 fd:00 33555219                   /usr/lib64/libpthread-2.17.so
7fc209477000-7fc209478000 rw-p 00017000 fd:00 33555219                   /usr/lib64/libpthread-2.17.so
7fc209478000-7fc20947c000 rw-p 00000000 00:00 0 
7fc20947c000-7fc209491000 r-xp 00000000 fd:00 34815442                   /usr/lib64/libgcc_s-4.8.5-20150702.so.1
7fc209491000-7fc209690000 ---p 00015000 fd:00 34815442                   /usr/lib64/libgcc_s-4.8.5-20150702.so.1
7fc209690000-7fc209691000 r--p 00014000 fd:00 34815442                   /usr/lib64/libgcc_s-4.8.5-20150702.so.1
7fc209691000-7fc209692000 rw-p 00015000 fd:00 34815442                   /usr/lib64/libgcc_s-4.8.5-20150702.so.1
7fc209692000-7fc2096b4000 r-xp 00000000 fd:00 34860516                   /usr/lib64/ld-2.17.so
7fc20987d000-7fc209882000 rw-p 00000000 00:00 0 
7fc2098a2000-7fc2098b3000 rw-p 00000000 00:00 0 
7fc2098b3000-7fc2098b4000 r--p 00021000 fd:00 34860516                   /usr/lib64/ld-2.17.so
7fc2098b4000-7fc2098b5000 rw-p 00022000 fd:00 34860516                   /usr/lib64/ld-2.17.so
7fc2098b5000-7fc2098b6000 rw-p 00000000 00:00 0 
7ffd12f86000-7ffd12fa8000 rw-p 00000000 00:00 0                          [stack]
7ffd12fcc000-7ffd12fce000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]
Abort

Edit: This happens both with the original commit as well as the reapplied one.

Another somewhat similar case but with less different compiler flags needed:

build-all-builtins/bin/clang -finline-hint-functions -std=c99 -fsanitize=undefined -O2 'vla_sum_1.c' -o 'vla_sum_1.out'
./vla_sum_1.out

Result:

UndefinedBehaviorSanitizer:DEADLYSIGNAL
==160192==ERROR: UndefinedBehaviorSanitizer: BUS on unknown address (pc 0x00000042b6c9 bp 0x7fff7d630950 sp 0x1000100010001 T160192)
==160192==The signal is caused by a READ memory access.
==160192==Hint: this fault was caused by a dereference of a high value address (see register values below).  Disassemble the provided pc to learn which register was used.
    #0 0x42b6c9  (/repo/uabelho/master-github/llvm/vla_sum_1.out+0x42b6c9)
    #1 0x42b857  (/repo/uabelho/master-github/llvm/vla_sum_1.out+0x42b857)
    #2 0x7fa2c69b4554  (/lib64/libc.so.6+0x22554) (BuildId: e6847a931dd483773bab779dd3985b17c11caab2)
    #3 0x402cb4  (/repo/uabelho/master-github/llvm/vla_sum_1.out+0x402cb4)

UndefinedBehaviorSanitizer can not provide additional info.
SUMMARY: UndefinedBehaviorSanitizer: BUS (/repo/uabelho/master-github/llvm/vla_sum_1.out+0x42b6c9) 
==160192==ABORTING

We see other problems that started appearing with this commit.

If you have known problems, please file a bug. Waiting until I fix one and then reporting something new in the review after a reland is less than helpful.

I'm about to revert again. I don't have time to investigate this immediately, so it'll probably be a day or two before I report back on the cause of this one.

thakis added a subscriber: thakis.Mar 3 2022, 11:31 AM

Sounds like you're already planning on re-reverting, but the unreduced case on our end started failing again at the reland as well: https://bugs.chromium.org/p/chromium/issues/detail?id=1300322#c15

Running slp-vectorizer on the attached file, you'll see that the same issue with stacksave/restore and inalloca allocas happens around the second call to IsDeprecatedMap.

I wrote
https://github.com/llvm/llvm-project/issues/54197
and
https://github.com/llvm/llvm-project/issues/54198
about the two problems I've seen.

I did take a look at this, and figured out at least one of the problems. The scheduling code does not account for implicit control dependencies (e.g. consider a case where we reorder two may-throw calls). The alloca symptom is one example of this, but there are also a bunch of others. The original code ends up being (mostly?) correct by relying on the scheduler not to change the order of two instructions which are both "ready". I have this bad feeling that the original code is wrong in some corner case, but don't have a reproducer. To reland this patch, I will need to implement implicit control flow dependencies.
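The missing constraint can be illustrated with a toy model (hypothetical names and data structures, not the real fix that landed later). Chaining each instruction with implicit control flow to the previous such instruction guarantees that two may-throw calls can never be swapped, even when both are simultaneously "ready":

```cpp
#include <cassert>
#include <vector>

// Toy model of an instruction in the scheduling window; MayThrow marks
// implicit control flow (a call that may throw, abort, longjmp, ...).
struct CtlInst {
  bool MayThrow;
  std::vector<CtlInst *> Deps; // must execute before this instruction
};

// Conservative sketch: add a dependence edge from each instruction with
// implicit control flow to the previous such instruction, so the
// scheduler preserves their relative order.
void addControlDependencies(std::vector<CtlInst *> &Window) {
  CtlInst *LastCtl = nullptr;
  for (CtlInst *I : Window) {
    if (!I->MayThrow)
      continue;
    if (LastCtl)
      I->Deps.push_back(LastCtl); // I must come after LastCtl
    LastCtl = I;
  }
}
```

In a window of four instructions where only the first and third may throw, this adds exactly one edge: the third must schedule after the first; the pure instructions remain free to move.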

There may be another problem here as well. I discovered the above via code inspection, and I can't quite map how that bug causes the sanitizer failures. However, I haven't looked into that as a) the reproduction is annoyingly complicated and b) I've been focused on MemorySSA in the meantime.

Ok, at this point, I've identified two missing dependency bugs in the original code, and fixed both. I believe these two to explain the regressions reported against this change in their entirety.

I have run a stage2 build with scheduling priority reversed (so as to help expose any further missing dependencies) and expensive checks (in this file only) enabled, both with and without this change applied. I do not see any indication of further missing dependencies, but scheduling defects which aren't caught by the verifier are only moderately likely to be caught by this approach.

I'm going to wait a couple days to make sure the dependency changes stick without issue, and then reapply with this change. In the meantime, if anyone wants to retest on top-of-tree with this change applied, I'd really appreciate positive confirmation that I got everything lurking here.

I've verified that the above two problems I saw both disappeared with 6253b77da9:

[SLP] Respect control dependence within a block during scheduling

Thank you! Positive confirmation here very much appreciated.

Hi,

Heads up that I've seen the following assert (added in 6253b77da) trigger in one of our downstream tests:

opt: ../lib/Transforms/Vectorize/SLPVectorizer.cpp:8066: void llvm::slpvectorizer::BoUpSLP::BlockScheduling::calculateDependencies(llvm::slpvectorizer::BoUpSLP::ScheduleData *, bool, llvm::slpvectorizer::BoUpSLP *): Assertion `DepDest && "must be in schedule window"' failed.

I'll try to extract a reproducer and come back.

uabelho added a comment.EditedMar 21 2022, 11:44 PM

Ok, at commit 31486a9fc27 it crashed with

opt -passes='function<eager-inv>(slp-vectorizer)' -o /dev/null slp.ll

However, I noticed that it doesn't crash on trunk anymore after commit 79a182371e

[SLP]Make stricter check for instructions that do not require
scheduling.

Need to check that the instructions with external operands can be
reordered safely before actually excluding them from the scheduling.

So perhaps/hopefully the problem is already solved.

Edit: Ok, now I realize this has been reported already in https://github.com/llvm/llvm-project/issues/54465

uabelho added a comment.EditedMar 24 2022, 7:05 AM

I've run some more testing with this patch reapplied locally without seeing problems so there at least doesn't seem to be anything that pops up at once for us.
(Always hard to know how much testing is "enough" when running fuzz testing though, but I've run tests for two nights without seeing anything else than the problem above that has already been fixed.)

@uabelho - Thank you! I really appreciate the amount of work you've put towards confirming that no further problems are lurking. You've gone above and beyond here, thanks!

Since we seem to have reasonable confidence that the underlying issues have been fixed, and that the fixes for those issues are stable in tree, I'm going to go ahead and push the original change again this morning. I will monitor for a couple hours in case of immediate fallout, but will be out of office for a good chunk of the day. If anyone spots fallout that looks severe, please feel free to revert without me.

Well, it seems I got slightly ahead of myself. When reviewing the diff before push, I spotted another questionable-looking alloca/stacksave reordering. We appear to have one more missing dependency.

define void @stacksave2(i8* %a, i8** %b, i8* %c) {
; CHECK-LABEL: @stacksave2(
-; CHECK-NEXT: [[V1:%.*]] = alloca i8, align 1
; CHECK-NEXT: [[STACK:%.*]] = call i8* @llvm.stacksave()
+; CHECK-NEXT: [[B2:%.*]] = getelementptr i8*, i8** [[B:%.*]], i32 1
+; CHECK-NEXT: [[V1:%.*]] = alloca i8, align 1

Here we are sinking an alloca from above a stacksave into the region of the save/restore. This looks pretty suspect, as there could be a use of v1 after the stackrestore, and thus the transform could introduce an out-of-bounds access on the stack. The only way I can see this being correct would be if the original program were already undefined, but I don't currently see why that would be the case.
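The constraint being violated can be modeled the same way (again a hypothetical sketch, not the actual SLPVectorizer fix): an alloca that appears before a stacksave in program order must keep that position, since any use of it after the matching stackrestore would otherwise read deallocated stack:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy instruction: only the opcode name matters for this sketch.
struct StkInst {
  std::string Op;
  std::vector<StkInst *> Deps; // must execute before this instruction
};

// Make every stacksave depend on all allocas that precede it, so the
// scheduler cannot sink those allocas into the save/restore region.
void addStackSaveDependencies(std::vector<StkInst *> &Window) {
  std::vector<StkInst *> PriorAllocas;
  for (StkInst *I : Window) {
    if (I->Op == "alloca")
      PriorAllocas.push_back(I);
    else if (I->Op == "stacksave")
      for (StkInst *A : PriorAllocas)
        I->Deps.push_back(A); // alloca stays above the stacksave
  }
}
```

Note that an alloca appearing after the stacksave (like the inalloca case in the earlier repro) gets no such edge here; its ordering relative to the save/restore pair is a separate dependence.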

Here's where I'm very happy I spent time to write extensive tests even though at the time I didn't think we needed this dependence.

I've gone ahead and fixed the last missing dependence case I found, and landed the rebased patch. At this point, I'm not expecting further fallout, but if anyone sees anything, please feel free to immediately revert as long as you've filed a bug with a reproducer. I would not be shocked to find we do have another missing scheduling dependence somewhere.