This is an archive of the discontinued LLVM Phabricator instance.

[SLP]Improve graph reordering.
ClosedPublic

Authored by ABataev on Jun 28 2021, 6:12 AM.

Details

Summary

Reworked reordering algorithm. Originally, the compiler just tried to
detect the most common order in the reordarable nodes (loads, stores,
extractelements,extractvalues) and then fully rebuilding the graph in
the best order. This was not effecient, since it required an extra
memory and time for building/rebuilding tree, double the use of the
scheduling budget, which could lead to missing vectorization due to
exausted scheduling resources.

Patch provide 2-way approach for graph reodering problem. At first, all
reordering is done in-place, it doe not required tree
deleting/rebuilding, it just rotates the scalars/orders/reuses masks in
the graph node.

The first step (top-to bottom) rotates the whole graph, similarly to the previous
implementation. Compiler counts the number of the most used orders of
the graph nodes with the same vectorization factor and then rotates the
subgraph with the given vectorization factor to the most used order, if
it is not empty. Then repeats the same procedure for the subgraphs with
the smaller vectorization factor. We can do this because we still need
to reshuffle smaller subgraph when buildiong operands for the graph
nodes with lasrger vectorization factor, we can rotate just subgraph,
not the whole graph.

The second step (bottom-to-top) scans through the leaves and tries to
detect the users of the leaves which can be reordered. If the leaves can
be reorder in the best fashion, they are reordered and their user too.
It allows to remove double shuffles to the same ordering of the operands in
many cases and just reorder the user operations instead. Plus, it moves
the final shuffles closer to the top of the graph and in many cases
allows to remove extra shuffle because the same procedure is repeated
again and we can again merge some reordering masks and reorder user nodes
instead of the operands.

Also, patch improves cost model for gathering of loads, which improves
x264 benchmark in some cases.

Gives about +2% on AVX512 + LTO (more expected for AVX/AVX2) for {625,525}x264,
+3% for 508.namd, improves most of other benchmarks.
The compile and link time are almost the same, though in some cases it
should be better (we're not doing an extra instruction scheduling
anymore) + we may vectorize more code for the large basic blocks again
because of saving scheduling budget.

Diff Detail

Event Timeline

ABataev created this revision.Jun 28 2021, 6:12 AM
ABataev requested review of this revision.Jun 28 2021, 6:12 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 28 2021, 6:12 AM
RKSimon added inline comments.Jul 5 2021, 9:33 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3060

This is very similar to the EntryState enum - merge them?

llvm/test/Transforms/SLPVectorizer/X86/addsub.ll
340–341

Update comment? I had to do something similar on D103925, although the wording here might be different.

ABataev added inline comments.Jul 6 2021, 6:05 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3060

I would not do this. Though the values look similar the meaning is completely different. This state handles only loads, while the entry state handles all possible entries kinds. In the future, we may have different values in these enums, which may lead to some unpredictable results. I would keep it as is.

llvm/test/Transforms/SLPVectorizer/X86/addsub.ll
340–341

Ok, will do

ABataev updated this revision to Diff 356698.Jul 6 2021, 6:17 AM

Rebase + address comments

A few minors things I've noticed so far. It looks like there's some renaming / nfc(ish) refactoring in here - if that could be pre-committed to reduce the size of this patch it'd be very welcome!

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
596

Do we need a default value for vector?

2569

Do we need a default value for vector?

2573

Doesn't the IsIdentity pass have to be done after Order[] is updated?

2595

Do we need a default value for vector?

ABataev added inline comments.Jul 9 2021, 9:26 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
596

Not currently, but I'll update these functions to make them compatible with upcoming non-power-2 vectorization.

2569

No, it is swapped with Order and Order then updated. But I'll add default initialization because we may need it in the future for non-pow-2 patch.

2573

Yes, you're right, will fix this. It may affect the cost in some rare cases.

2595

I'll add UndefMaskElem to reduce future updates for non-power-2.

ABataev updated this revision to Diff 357588.Jul 9 2021, 11:57 AM

Address comments

A few minors, but I haven't spotted anything critical - does anyone else have any comments?

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3146

Default value?

4401

Can we remove the loop and avoid the repeated call to TTI->getShuffleCost(TTI::SK_Select, VecTy) ?

4402

Is it worth doing the CommonCost -> ReuseShuffleCost refactor as a NFC pre-commit to simplify this patch?

6241

A lot of this is just a refactor cleanup that looks like a NFC that could be done as a pre-commit to simplify the patch?

ABataev marked an inline comment as done.Jul 14 2021, 8:17 AM
ABataev added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3146

It is initialized with zeroes by default.

4402

Will check.

6241

Yes, will precommit some of these changes.

ABataev added inline comments.Jul 16 2021, 2:38 PM
llvm/test/Transforms/SLPVectorizer/AArch64/PR38339.ll
6–16 ↗(On Diff #359457)

Need to adjust the cost for 4xi16 shuffle in AArch64 target.

40–50 ↗(On Diff #359457)

Same here

RKSimon added a subscriber: dmgreen.
RKSimon added inline comments.
llvm/test/Transforms/SLPVectorizer/AArch64/PR38339.ll
6–16 ↗(On Diff #359457)

https://simd.godbolt.org/z/Pana3396f

@dmgreen Some very basic tests suggests the v4i16 shuffle cost should never be higher than 3 (which encouragingly matches what is already set for v4i32/v4f32) - do you agree?

// PermuteSingleSrc shuffle kinds.
// TODO: handle vXi8/vXi16.
{ TTI::SK_PermuteSingleSrc, MVT::v2i32, 1 }, // mov.
{ TTI::SK_PermuteSingleSrc, MVT::v4i32, 3 }, // perfectshuffle worst case.
{ TTI::SK_PermuteSingleSrc, MVT::v2i64, 1 }, // mov.
{ TTI::SK_PermuteSingleSrc, MVT::v2f32, 1 }, // mov.
{ TTI::SK_PermuteSingleSrc, MVT::v4f32, 3 }, // perfectshuffle worst case.
{ TTI::SK_PermuteSingleSrc, MVT::v2f64, 1 }, // mov.
dmgreen added inline comments.Jul 17 2021, 2:53 PM
llvm/test/Transforms/SLPVectorizer/AArch64/PR38339.ll
6–16 ↗(On Diff #359457)

Yeah, I think that sounds right. GeneratePerfectShuffle applies to 4 element 64bit shuffles as well as 128bit shuffles, so the same 3 instruction worst case would apply.

I don't have a lot of tests that check SLP vectorization. Let me try some things and put a patch together if it looks sensible.

dmgreen added inline comments.Jul 20 2021, 12:01 AM
llvm/test/Transforms/SLPVectorizer/AArch64/PR38339.ll
6–16 ↗(On Diff #359457)

There are some patches in https://reviews.llvm.org/D106241 to add some extra worst case costs.

Matt added a subscriber: Matt.Jul 20 2021, 7:01 AM

I think we're just waiting to confirm that D106241 fixes the regressions?

I think we're just waiting to confirm that D106241 fixes the regressions?

Yep.

D106241 had some dependencies, but I've rebased it to get it out of the way.

Please can you rebase to confirm the aarch64 regressions have gone

Please can you rebase to confirm the aarch64 regressions have gone

Sure, will do it later.

RKSimon accepted this revision.Jul 24 2021, 1:28 AM

LGTM

This revision is now accepted and ready to land.Jul 24 2021, 1:28 AM
This revision was landed with ongoing or failed builds.Jul 28 2021, 5:53 AM
This revision was automatically updated to reflect the committed changes.
fhahn added a subscriber: fhahn.Jul 28 2021, 6:13 AM

It looks like this change might be responsible for a build failure on GreenDragon: https://green.lab.llvm.org/green/job/clang-stage1-RA/22811/console

It looks like this change might be responsible for a build failure on GreenDragon: https://green.lab.llvm.org/green/job/clang-stage1-RA/22811/console

Yes, going to commit a small fix in a minute.

It looks like this change might be responsible for a build failure on GreenDragon: https://green.lab.llvm.org/green/job/clang-stage1-RA/22811/console

Yes, going to commit a small fix in a minute.

It still crashes for me, https://martin.st/temp/vf_perspective-preproc.c, with clang -target aarch64-w32-mingw32 -w -c -O2 vf_perspective-preproc.c.

Hi @ABataev, I ran into an issue when running the LLVM test-suite. It seems to be a different issue than the one that @mstorsjo reported.

I got it reduced to:

target triple = "aarch64-unknown-linux-gnu"

define void @foo() local_unnamed_addr {
entry:
  %0 = load volatile double, double* poison, align 8
  %1 = load volatile double, double* poison, align 8
  %2 = load volatile double, double* poison, align 8
  %3 = load volatile double, double* poison, align 8
  br label %for.body

for.body:                                         ; preds = %for.body, %entry
  %d30.0734 = phi double [ undef, %for.body ], [ %0, %entry ]
  %d01.0733 = phi double [ undef, %for.body ], [ %1, %entry ]
  %d11.0732 = phi double [ undef, %for.body ], [ %2, %entry ]
  %d21.0731 = phi double [ undef, %for.body ], [ %3, %entry ]
  br label %for.body
}

Run with: opt -slp-vectorizer -S < reduced.ll.

I had actually expected one of the aarch64-buildbots would have caught this, so not sure if there was something special about the way I ran the test-suite, but I don't believe so.

It looks like this change might be responsible for a build failure on GreenDragon: https://green.lab.llvm.org/green/job/clang-stage1-RA/22811/console

Yes, going to commit a small fix in a minute.

It still crashes for me, https://martin.st/temp/vf_perspective-preproc.c, with clang -target aarch64-w32-mingw32 -w -c -O2 vf_perspective-preproc.c.

Will check and fix it ASAP, thanks for the report.

Hi @ABataev, I ran into an issue when running the LLVM test-suite. It seems to be a different issue than the one that @mstorsjo reported.

I got it reduced to:

target triple = "aarch64-unknown-linux-gnu"

define void @foo() local_unnamed_addr {
entry:
  %0 = load volatile double, double* poison, align 8
  %1 = load volatile double, double* poison, align 8
  %2 = load volatile double, double* poison, align 8
  %3 = load volatile double, double* poison, align 8
  br label %for.body

for.body:                                         ; preds = %for.body, %entry
  %d30.0734 = phi double [ undef, %for.body ], [ %0, %entry ]
  %d01.0733 = phi double [ undef, %for.body ], [ %1, %entry ]
  %d11.0732 = phi double [ undef, %for.body ], [ %2, %entry ]
  %d21.0731 = phi double [ undef, %for.body ], [ %3, %entry ]
  br label %for.body
}

Run with: opt -slp-vectorizer -S < reduced.ll.

I had actually expected one of the aarch64-buildbots would have caught this, so not sure if there was something special about the way I ran the test-suite, but I don't believe so.

Thanks, will check it too.

Hi @ABataev, I ran into an issue when running the LLVM test-suite. It seems to be a different issue than the one that @mstorsjo reported.

I got it reduced to:

target triple = "aarch64-unknown-linux-gnu"

define void @foo() local_unnamed_addr {
entry:
  %0 = load volatile double, double* poison, align 8
  %1 = load volatile double, double* poison, align 8
  %2 = load volatile double, double* poison, align 8
  %3 = load volatile double, double* poison, align 8
  br label %for.body

for.body:                                         ; preds = %for.body, %entry
  %d30.0734 = phi double [ undef, %for.body ], [ %0, %entry ]
  %d01.0733 = phi double [ undef, %for.body ], [ %1, %entry ]
  %d11.0732 = phi double [ undef, %for.body ], [ %2, %entry ]
  %d21.0731 = phi double [ undef, %for.body ], [ %3, %entry ]
  br label %for.body
}

Run with: opt -slp-vectorizer -S < reduced.ll.

I had actually expected one of the aarch64-buildbots would have caught this, so not sure if there was something special about the way I ran the test-suite, but I don't believe so.

Investigated it, the crash is not related to this patch, caused by one of the previous patches. Going to publish a fix soon.

bjope added a subscriber: bjope.Jul 29 2021, 3:40 PM
bjope added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4330

With my OOT target I ended up here with getMinVecRegSize() returning 32.
E->getMainOp was a load like this %3 = load i24, i24* getelementptr inbounds ([128 x i24], [128 x i24]* @a_ua, i16 0, i16 3), so Sz was 24.

That gives a MinVF that is 0. And after some iterations the inner loop was entered with VF=0, which gives a slice that is empty, hitting assertions when doing Slice.front().

I haven't reduced this failure for any in-tree target (a bit short on people at the office here in the middle of summer). But maybe you should make sure MinVF doesn't go below 1 here (or maybe not even below 2) to avoid getting a VF that is less than 1 (or less than 2). Or check that VF is at least 1 (or 2) in the loop guard.

(I do not know if VF=1 makes any sense. Hence the alternatives above regarding 1 or 2.)

ABataev added inline comments.Jul 29 2021, 3:41 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4330

The fix is ready (see D107058), will commit it tomorrow.

bjope added inline comments.Jul 29 2021, 3:43 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4330

Ok, thanks!

hans added a subscriber: hans.Aug 2 2021, 4:49 AM

Just a heads up that we're seeing test failures in Chromium that bisect to this revision (http://crbug.com/1235252). Not sure what's going on yet, though.

Just a heads up that we're seeing test failures in Chromium that bisect to this revision (http://crbug.com/1235252). Not sure what's going on yet, though.

Hi, thanks for the report, will try to investigate it ASAP.

Just a heads up that we're seeing test failures in Chromium that bisect to this revision (http://crbug.com/1235252). Not sure what's going on yet, though.

Also, would be good if you could provide a reproducer and a command line for it. I took a look at the buildbot but was unable to get info on how to reproduce it.

hans added a comment.Aug 2 2021, 7:25 AM

Just a heads up that we're seeing test failures in Chromium that bisect to this revision (http://crbug.com/1235252). Not sure what's going on yet, though.

Also, would be good if you could provide a reproducer and a command line for it. I took a look at the buildbot but was unable to get info on how to reproduce it.

I don't have a reproducer yet, and it will take some work to get it. I just wanted to give a heads up, especially in case others are bisecting issues and also end up at this change.

hans added a comment.Aug 3 2021, 7:41 AM

Just a heads up that we're seeing test failures in Chromium that bisect to this revision (http://crbug.com/1235252). Not sure what's going on yet, though.

Also, would be good if you could provide a reproducer and a command line for it. I took a look at the buildbot but was unable to get info on how to reproduce it.

I don't have a reproducer yet, and it will take some work to get it. I just wanted to give a heads up, especially in case others are bisecting issues and also end up at this change.

Still not sure what's going on (and it could also be our code that's broken), but we're starting to get some IR to look at now: https://bugs.chromium.org/p/chromium/issues/detail?id=1235252#c15

Just a heads up that we're seeing test failures in Chromium that bisect to this revision (http://crbug.com/1235252). Not sure what's going on yet, though.

Also, would be good if you could provide a reproducer and a command line for it. I took a look at the buildbot but was unable to get info on how to reproduce it.

I don't have a reproducer yet, and it will take some work to get it. I just wanted to give a heads up, especially in case others are bisecting issues and also end up at this change.

Still not sure what's going on (and it could also be our code that's broken), but we're starting to get some IR to look at now: https://bugs.chromium.org/p/chromium/issues/detail?id=1235252#c15

Ok, thanks, I see the wrong mask for the loads, comparing these 2 examples. Looking at the problem already, this should help to fix it ASAP.

phawkins added a subscriber: phawkins.EditedAug 3 2021, 8:04 AM

I bisected a miscompilation in XLA to this change.

Repro:

At this commit

clang driver.cc module_0000.primitive_computation_broadcast_in_dim.3.ir-no-opt.ll -o a.out
./a.out buffer-assignment.txt

(no optimization)

and

clang -O3 -march=haswell driver.cc module_0000.primitive_computation_broadcast_in_dim.3.ir-no-opt.ll -o a.out
./a.out buffer-assignment.txt

produce different outputs: the first 4 values differ.

At the previous revision, both produce identical outputs.

A quick inspection of the optimized IR seems to show it reading memory out of bounds. Disabling SLP vectorization also fixes the problem.

I bisected a miscompilation in XLA to this change.

Repro:

At this commit

clang driver.cc module_0000.primitive_computation_broadcast_in_dim.3.ir-no-opt.ll -o a.out
./a.out buffer-assignment.txt

(no optimization)

and

clang -O3 -march=haswell driver.cc module_0000.primitive_computation_broadcast_in_dim.3.ir-no-opt.ll -o a.out
./a.out buffer-assignment.txt

produce different outputs: the first 4 values differ.

At the previous revision, both produce identical outputs.

A quick inspection of the optimized IR seems to show it reading memory out of bounds. Disabling SLP vectorization also fixes the problem.

Thanks, I suspect what is the cause of the bug, hope to prepare a fix later today.

hans added a comment.Aug 3 2021, 8:19 AM

Just a heads up that we're seeing test failures in Chromium that bisect to this revision (http://crbug.com/1235252). Not sure what's going on yet, though.

Also, would be good if you could provide a reproducer and a command line for it. I took a look at the buildbot but was unable to get info on how to reproduce it.

I don't have a reproducer yet, and it will take some work to get it. I just wanted to give a heads up, especially in case others are bisecting issues and also end up at this change.

Still not sure what's going on (and it could also be our code that's broken), but we're starting to get some IR to look at now: https://bugs.chromium.org/p/chromium/issues/detail?id=1235252#c15

We now have analysis of the bad IR: https://bugs.chromium.org/p/chromium/issues/detail?id=1235252#c16

Thanks, I suspect what is the cause of the bug, hope to prepare a fix later today.

Can you please revert to green in the meantime?

Just a heads up that we're seeing test failures in Chromium that bisect to this revision (http://crbug.com/1235252). Not sure what's going on yet, though.

Also, would be good if you could provide a reproducer and a command line for it. I took a look at the buildbot but was unable to get info on how to reproduce it.

I don't have a reproducer yet, and it will take some work to get it. I just wanted to give a heads up, especially in case others are bisecting issues and also end up at this change.

Still not sure what's going on (and it could also be our code that's broken), but we're starting to get some IR to look at now: https://bugs.chromium.org/p/chromium/issues/detail?id=1235252#c15

We now have analysis of the bad IR: https://bugs.chromium.org/p/chromium/issues/detail?id=1235252#c16

Thanks, I suspect what is the cause of the bug, hope to prepare a fix later today.

Can you please revert to green in the meantime?

It may take some time, there were other commits already. I'll revert it later if won't be able to prepare a quick fix today.

hans added a comment.Aug 3 2021, 8:35 AM

Can you please revert to green in the meantime?

It may take some time, there were other commits already. I'll revert it later if won't be able to prepare a quick fix today.

If there's already a chain of commits depending on this, that's an even stronger reason to revert. What if the quick fix doesn't fix everything? Then the commit chain has just become longer and even harder to revert.

I'd suggest reverting first, and then fixing without the stress.

Can you please revert to green in the meantime?

It may take some time, there were other commits already. I'll revert it later if won't be able to prepare a quick fix today.

If there's already a chain of commits depending on this, that's an even stronger reason to revert. What if the quick fix doesn't fix everything? Then the commit chain has just become longer and even harder to revert.

I'd suggest reverting first, and then fixing without the stress.

I would not call it a stress, the reason is known, but I’ll try to revert it.

if you can get alive-tv working from https://github.com/AliveToolkit/alive2

; ModuleID = '/tmp/b1.ll'                                                                                                                                                                          
source_filename = "/tmp/b1.ll"                                                                                                                                                                     
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"                                                                                                       
target triple = "x86_64-unknown-linux-gnu"                                                                                                                                                         
                                                                                                                                                                                                   
%"class.deqp::gls::ShaderEvalContext" = type { %"class.tcu::Vector", %"class.tcu::Vector", %"class.tcu::Vector", [4 x %"class.tcu::Vector"], [4 x %"struct.deqp::gls::ShaderEvalContext::ShaderSamp
ler"], %"class.tcu::Vector", i8, %"class.deqp::gls::QuadGrid"* }                                                                                                                                   
%"struct.deqp::gls::ShaderEvalContext::ShaderSampler" = type { %"class.tcu::Sampler", %"class.tcu::Texture2D"*, %"class.tcu::TextureCube"*, %"class.tcu::Texture2DArray"*, %"class.tcu::Texture3D"*
 }                                                                                                                                                                                                 
%"class.tcu::Sampler" = type { i32, i32, i32, i32, i32, i32, float, i8, i32, i32, %"class.rr::GenericVec4", i8, i32 }                                                                              
%"class.rr::GenericVec4" = type { %union.anon }                                                                                                                                                    
%union.anon = type { [4 x i32] }                                                                                                                                                                   
%"class.tcu::Texture2D" = type { %"class.tcu::TextureLevelPyramid", i32, i32, %"class.tcu::Texture2DView" }                                                                                        
%"class.tcu::TextureLevelPyramid" = type { %"class.tcu::TextureFormat", %"class.std::__Cr::vector", %"class.std::__Cr::vector.1" }                                                                 
%"class.tcu::TextureFormat" = type { i32, i32 }                                                                                                                                                    
%"class.std::__Cr::vector" = type { %"class.std::__Cr::__vector_base" }                                                                                                                            
%"class.std::__Cr::__vector_base" = type { %"class.de::ArrayBuffer"*, %"class.de::ArrayBuffer"*, %"class.std::__Cr::__compressed_pair" }                                                           
%"class.de::ArrayBuffer" = type { i8*, i64 }                                                                                                                                                       
%"class.std::__Cr::__compressed_pair" = type { %"struct.std::__Cr::__compressed_pair_elem" }                                                                                                       
%"struct.std::__Cr::__compressed_pair_elem" = type { %"class.de::ArrayBuffer"* }                                                                                                                   
%"class.std::__Cr::vector.1" = type { %"class.std::__Cr::__vector_base.2" }                                                                                                                        
%"class.std::__Cr::__vector_base.2" = type { %"class.tcu::PixelBufferAccess"*, %"class.tcu::PixelBufferAccess"*, %"class.std::__Cr::__compressed_pair.4" }                                         
%"class.tcu::PixelBufferAccess" = type { %"class.tcu::ConstPixelBufferAccess" }                                                                                                                    
%"class.tcu::ConstPixelBufferAccess" = type { %"class.tcu::TextureFormat", %"class.tcu::Vector.3", %"class.tcu::Vector.3", %"class.tcu::Vector.3", i8* }                                           
%"class.tcu::Vector.3" = type { [3 x i32] }                                                                                                                                                        
%"class.std::__Cr::__compressed_pair.4" = type { %"struct.std::__Cr::__compressed_pair_elem.5" }                                                                                                   
%"struct.std::__Cr::__compressed_pair_elem.5" = type { %"class.tcu::PixelBufferAccess"* }                                                                                                          
%"class.tcu::Texture2DView" = type <{ i32, [4 x i8], %"class.tcu::ConstPixelBufferAccess"*, i8, [7 x i8] }>                                                                                        
%"class.tcu::TextureCube" = type { %"class.tcu::TextureFormat", i32, [6 x %"class.std::__Cr::vector"], [6 x %"class.std::__Cr::vector.1"], %"class.tcu::TextureCubeView" }                         
%"class.tcu::TextureCubeView" = type <{ i32, [4 x i8], [6 x %"class.tcu::ConstPixelBufferAccess"*], i8, [7 x i8] }>                                                                                
%"class.tcu::Texture2DArray" = type { %"class.tcu::TextureLevelPyramid", i32, i32, i32, %"class.tcu::Texture2DArrayView" }                                                                         
%"class.tcu::Texture2DArrayView" = type { i32, %"class.tcu::ConstPixelBufferAccess"* }
%"class.tcu::Texture3D" = type { %"class.tcu::TextureLevelPyramid", i32, i32, i32, %"class.tcu::Texture3DView" }
%"class.tcu::Texture3DView" = type { i32, %"class.tcu::ConstPixelBufferAccess"* }
%"class.tcu::Vector" = type { [4 x float] }
%"class.deqp::gls::QuadGrid" = type opaque

define hidden void @_ZN4deqp5gles210Functional19eval_selection_vec4ERNS_3gls17ShaderEvalContextE(%"class.deqp::gls::ShaderEvalContext"* nocapture align 8 dereferenceable(528) %0) {
  %2 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 3, i64 0, i32 0, i64 2
  %3 = load float, float* %2, align 8
  %4 = fcmp ogt float %3, 0.000000e+00
  %5 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 3, i64 1, i32 0, i64 3
  %6 = load float, float* %5, align 4
  %7 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 3, i64 1, i32 0, i64 2
  %8 = load float, float* %7, align 8
  %9 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 3, i64 1, i32 0, i64 1
  %10 = load float, float* %9, align 4
  %11 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 3, i64 1, i32 0, i64 0
  %12 = load float, float* %11, align 8
  %13 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 3, i64 2, i32 0, i64 0
  %14 = load float, float* %13, align 8
  %15 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 3, i64 2, i32 0, i64 3
  %16 = load float, float* %15, align 4
  %17 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 3, i64 2, i32 0, i64 2
  %18 = load float, float* %17, align 8
  %19 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 3, i64 2, i32 0, i64 1
  %20 = load float, float* %19, align 4
  %21 = select i1 %4, float %6, float %14
  %22 = select i1 %4, float %8, float %16
  %23 = select i1 %4, float %10, float %18
  %24 = select i1 %4, float %12, float %20
  %25 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 5, i32 0, i64 0
  store float %21, float* %25, align 8
  %26 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 5, i32 0, i64 1
  store float %22, float* %26, align 4
  %27 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 5, i32 0, i64 2
  store float %23, float* %27, align 8
  %28 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 5, i32 0, i64 3
  store float %24, float* %28, align 4
  ret void
}
$ bin/opt -passes=slp-vectorizer -S /tmp/b1.ll -o /tmp/b2.ll
$ alive-tv /tmp/b1.ll /tmp/b2.ll

fails at this commit, and times out at the commit before
(sorry, couldn't get it working on https://alive2.llvm.org/ for some reason)

seems worth a revert

if you can get alive-tv working from https://github.com/AliveToolkit/alive2

; ModuleID = '/tmp/b1.ll'                                                                                                                                                                          
source_filename = "/tmp/b1.ll"                                                                                                                                                                     
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"                                                                                                       
target triple = "x86_64-unknown-linux-gnu"                                                                                                                                                         
                                                                                                                                                                                                   
%"class.deqp::gls::ShaderEvalContext" = type { %"class.tcu::Vector", %"class.tcu::Vector", %"class.tcu::Vector", [4 x %"class.tcu::Vector"], [4 x %"struct.deqp::gls::ShaderEvalContext::ShaderSamp
ler"], %"class.tcu::Vector", i8, %"class.deqp::gls::QuadGrid"* }                                                                                                                                   
%"struct.deqp::gls::ShaderEvalContext::ShaderSampler" = type { %"class.tcu::Sampler", %"class.tcu::Texture2D"*, %"class.tcu::TextureCube"*, %"class.tcu::Texture2DArray"*, %"class.tcu::Texture3D"*
 }                                                                                                                                                                                                 
%"class.tcu::Sampler" = type { i32, i32, i32, i32, i32, i32, float, i8, i32, i32, %"class.rr::GenericVec4", i8, i32 }                                                                              
%"class.rr::GenericVec4" = type { %union.anon }                                                                                                                                                    
%union.anon = type { [4 x i32] }                                                                                                                                                                   
%"class.tcu::Texture2D" = type { %"class.tcu::TextureLevelPyramid", i32, i32, %"class.tcu::Texture2DView" }                                                                                        
%"class.tcu::TextureLevelPyramid" = type { %"class.tcu::TextureFormat", %"class.std::__Cr::vector", %"class.std::__Cr::vector.1" }                                                                 
%"class.tcu::TextureFormat" = type { i32, i32 }                                                                                                                                                    
%"class.std::__Cr::vector" = type { %"class.std::__Cr::__vector_base" }                                                                                                                            
%"class.std::__Cr::__vector_base" = type { %"class.de::ArrayBuffer"*, %"class.de::ArrayBuffer"*, %"class.std::__Cr::__compressed_pair" }                                                           
%"class.de::ArrayBuffer" = type { i8*, i64 }                                                                                                                                                       
%"class.std::__Cr::__compressed_pair" = type { %"struct.std::__Cr::__compressed_pair_elem" }                                                                                                       
%"struct.std::__Cr::__compressed_pair_elem" = type { %"class.de::ArrayBuffer"* }                                                                                                                   
%"class.std::__Cr::vector.1" = type { %"class.std::__Cr::__vector_base.2" }                                                                                                                        
%"class.std::__Cr::__vector_base.2" = type { %"class.tcu::PixelBufferAccess"*, %"class.tcu::PixelBufferAccess"*, %"class.std::__Cr::__compressed_pair.4" }                                         
%"class.tcu::PixelBufferAccess" = type { %"class.tcu::ConstPixelBufferAccess" }                                                                                                                    
%"class.tcu::ConstPixelBufferAccess" = type { %"class.tcu::TextureFormat", %"class.tcu::Vector.3", %"class.tcu::Vector.3", %"class.tcu::Vector.3", i8* }                                           
%"class.tcu::Vector.3" = type { [3 x i32] }                                                                                                                                                        
%"class.std::__Cr::__compressed_pair.4" = type { %"struct.std::__Cr::__compressed_pair_elem.5" }                                                                                                   
%"struct.std::__Cr::__compressed_pair_elem.5" = type { %"class.tcu::PixelBufferAccess"* }                                                                                                          
%"class.tcu::Texture2DView" = type <{ i32, [4 x i8], %"class.tcu::ConstPixelBufferAccess"*, i8, [7 x i8] }>                                                                                        
%"class.tcu::TextureCube" = type { %"class.tcu::TextureFormat", i32, [6 x %"class.std::__Cr::vector"], [6 x %"class.std::__Cr::vector.1"], %"class.tcu::TextureCubeView" }                         
%"class.tcu::TextureCubeView" = type <{ i32, [4 x i8], [6 x %"class.tcu::ConstPixelBufferAccess"*], i8, [7 x i8] }>                                                                                
%"class.tcu::Texture2DArray" = type { %"class.tcu::TextureLevelPyramid", i32, i32, i32, %"class.tcu::Texture2DArrayView" }                                                                         
%"class.tcu::Texture2DArrayView" = type { i32, %"class.tcu::ConstPixelBufferAccess"* }
%"class.tcu::Texture3D" = type { %"class.tcu::TextureLevelPyramid", i32, i32, i32, %"class.tcu::Texture3DView" }
%"class.tcu::Texture3DView" = type { i32, %"class.tcu::ConstPixelBufferAccess"* }
%"class.tcu::Vector" = type { [4 x float] }
%"class.deqp::gls::QuadGrid" = type opaque

define hidden void @_ZN4deqp5gles210Functional19eval_selection_vec4ERNS_3gls17ShaderEvalContextE(%"class.deqp::gls::ShaderEvalContext"* nocapture align 8 dereferenceable(528) %0) {
  %2 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 3, i64 0, i32 0, i64 2
  %3 = load float, float* %2, align 8
  %4 = fcmp ogt float %3, 0.000000e+00
  %5 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 3, i64 1, i32 0, i64 3
  %6 = load float, float* %5, align 4
  %7 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 3, i64 1, i32 0, i64 2
  %8 = load float, float* %7, align 8
  %9 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 3, i64 1, i32 0, i64 1
  %10 = load float, float* %9, align 4
  %11 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 3, i64 1, i32 0, i64 0
  %12 = load float, float* %11, align 8
  %13 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 3, i64 2, i32 0, i64 0
  %14 = load float, float* %13, align 8
  %15 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 3, i64 2, i32 0, i64 3
  %16 = load float, float* %15, align 4
  %17 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 3, i64 2, i32 0, i64 2
  %18 = load float, float* %17, align 8
  %19 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 3, i64 2, i32 0, i64 1
  %20 = load float, float* %19, align 4
  %21 = select i1 %4, float %6, float %14
  %22 = select i1 %4, float %8, float %16
  %23 = select i1 %4, float %10, float %18
  %24 = select i1 %4, float %12, float %20
  %25 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 5, i32 0, i64 0
  store float %21, float* %25, align 8
  %26 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 5, i32 0, i64 1
  store float %22, float* %26, align 4
  %27 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 5, i32 0, i64 2
  store float %23, float* %27, align 8
  %28 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 5, i32 0, i64 3
  store float %24, float* %28, align 4
  ret void
}
$ bin/opt -passes=slp-vectorizer -S /tmp/b1.ll -o /tmp/b2.ll
$ alive-tv /tmp/b1.ll /tmp/b2.ll

fails at this commit, and times out at the commit before
(sorry, couldn't get it working on https://alive2.llvm.org/ for some reason)

seems worth a revert

Yep, I'll revert. The fix is pretty simple but I'll revert and recommit the patch with all the fixes again

ABataev reopened this revision.Aug 24 2021, 9:23 AM
This revision is now accepted and ready to land.Aug 24 2021, 9:23 AM
ABataev planned changes to this revision.Aug 24 2021, 9:23 AM
ABataev updated this revision to Diff 368371.Aug 24 2021, 9:24 AM

Reworked significantly patch + rebase.

This revision is now accepted and ready to land.Aug 24 2021, 9:24 AM
RKSimon accepted this revision.Aug 25 2021, 10:17 AM

LGTM with a few minors

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
540

Why didn't you move these up here instead of forward declaring?

1585

This std::equal + pragma is repeated a lot in this method - worth pulling out?

3429

Indices?

ABataev added inline comments.Aug 25 2021, 10:22 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
540

Just to reduce the number of changes, will move the functions here.

1585

Ok, will try to transform it into a lambda or something similar.

3429

Yep, will fix it, thanks!

This revision was automatically updated to reflect the committed changes.

I'm seeing another crash bisected to this reland:

./build/rel/bin/opt -passes='default<Os>' -disable-output /tmp/a.ll

Instruction does not dominate all uses!

I'm seeing another crash bisected to this reland:

./build/rel/bin/opt -passes='default<Os>' -disable-output /tmp/a.ll

Instruction does not dominate all uses!

Ok, thanks for the reproducer. Going to revert the patch and investigate a crash.

alexfh added a subscriber: alexfh.Aug 30 2021, 3:20 AM

Some of our tests started failing after rG84cbd71c95923f9912512f3051c6ab548a99e016 (and previously after rGa28234e37af877b2b4a23c2091c27fa18c155f9a). Could you revert it again while we're working on a test case?

Some of our tests started failing after rG84cbd71c95923f9912512f3051c6ab548a99e016 (and previously after rGa28234e37af877b2b4a23c2091c27fa18c155f9a). Could you revert it again while we're working on a test case?

Couldyou revert it yourself? I'm on vacation, will be back in 2 weeks.
I would appreciate it if you could send the reproducer. The patch os very complex, always triggers many different corner cases.

Here is a repro I found

$ cat ./repro.cc
#include <iostream>
namespace a {
__attribute__((noinline)) void b(uint8_t *c, float *d, uint16_t e, uint16_t f) {
  uint16_t g = f;
  uint16_t h = c[0] << 2 | c[4] & 3;
  uint16_t l = c[1] << 2 | c[4] >> 2 & 3;
  uint16_t v = c[2] << 2 | 3;
  uint16_t j = c[3] << 2 | c[4] >> 6;
  *(d - 1) = float(l - e) / g;
  *(d - 2) = float(h - e) / g;
  *(d - 3) = float(j - e) / g;
  *(d - 4) = float(v - e) / g;
}
void bar(uint8_t *buffer, size_t k, size_t height, size_t, uint16_t e,
         uint16_t f, float *m) {
  int n = 24, g = f;
  float *t = &m[(24 - 2) * 4];
  size_t u = 2 * k;
  for (int o; o < height; o++) {
    uint8_t *p = buffer;
    uint8_t *q = p;
    float *r = &t[u];
    for (int a = 0; a < u; a += 4) {
      b(q, r, e, g);
      r -= 4;
    }
    t -= n;
  }
}
}  // namespace a
int main() {
  uint8_t s[]{255, 0, 255, 0, 51};
  float m[4 * 24]{};
  a::bar(s, 4, 4, 0, 0, 1023, m);
  int *out = reinterpret_cast<int *>(m);
  int64_t sum;
  for (int i = 0; i < sizeof(m) / sizeof(int); i++) sum += out[i];
  std::cout << sum;
}
$ clang-before -O2 -std=gnu++17 repro.cc
$ ./a.out
17045651456
$ clang-after -O2 -std=gnu++17 repro.cc
$ ./a.out
16609148640

Here is a repro I found

$ cat ./repro.cc
#include <iostream>
namespace a {
__attribute__((noinline)) void b(uint8_t *c, float *d, uint16_t e, uint16_t f) {
  uint16_t g = f;
  uint16_t h = c[0] << 2 | c[4] & 3;
  uint16_t l = c[1] << 2 | c[4] >> 2 & 3;
  uint16_t v = c[2] << 2 | 3;
  uint16_t j = c[3] << 2 | c[4] >> 6;
  *(d - 1) = float(l - e) / g;
  *(d - 2) = float(h - e) / g;
  *(d - 3) = float(j - e) / g;
  *(d - 4) = float(v - e) / g;
}
void bar(uint8_t *buffer, size_t k, size_t height, size_t, uint16_t e,
         uint16_t f, float *m) {
  int n = 24, g = f;
  float *t = &m[(24 - 2) * 4];
  size_t u = 2 * k;
  for (int o; o < height; o++) {
    uint8_t *p = buffer;
    uint8_t *q = p;
    float *r = &t[u];
    for (int a = 0; a < u; a += 4) {
      b(q, r, e, g);
      r -= 4;
    }
    t -= n;
  }
}
}  // namespace a
int main() {
  uint8_t s[]{255, 0, 255, 0, 51};
  float m[4 * 24]{};
  a::bar(s, 4, 4, 0, 0, 1023, m);
  int *out = reinterpret_cast<int *>(m);
  int64_t sum;
  for (int i = 0; i < sizeof(m) / sizeof(int); i++) sum += out[i];
  std::cout << sum;
}
$ clang-before -O2 -std=gnu++17 repro.cc
$ ./a.out
17045651456
$ clang-after -O2 -std=gnu++17 repro.cc
$ ./a.out
16609148640

Thanks, this will help a lot!

Here is a repro I found

$ cat ./repro.cc
#include <iostream>
namespace a {
__attribute__((noinline)) void b(uint8_t *c, float *d, uint16_t e, uint16_t f) {
  uint16_t g = f;
  uint16_t h = c[0] << 2 | c[4] & 3;
  uint16_t l = c[1] << 2 | c[4] >> 2 & 3;
  uint16_t v = c[2] << 2 | 3;
  uint16_t j = c[3] << 2 | c[4] >> 6;
  *(d - 1) = float(l - e) / g;
  *(d - 2) = float(h - e) / g;
  *(d - 3) = float(j - e) / g;
  *(d - 4) = float(v - e) / g;
}
void bar(uint8_t *buffer, size_t k, size_t height, size_t, uint16_t e,
         uint16_t f, float *m) {
  int n = 24, g = f;
  float *t = &m[(24 - 2) * 4];
  size_t u = 2 * k;
  for (int o; o < height; o++) {
    uint8_t *p = buffer;
    uint8_t *q = p;
    float *r = &t[u];
    for (int a = 0; a < u; a += 4) {
      b(q, r, e, g);
      r -= 4;
    }
    t -= n;
  }
}
}  // namespace a
int main() {
  uint8_t s[]{255, 0, 255, 0, 51};
  float m[4 * 24]{};
  a::bar(s, 4, 4, 0, 0, 1023, m);
  int *out = reinterpret_cast<int *>(m);
  int64_t sum;
  for (int i = 0; i < sizeof(m) / sizeof(int); i++) sum += out[i];
  std::cout << sum;
}
$ clang-before -O2 -std=gnu++17 repro.cc
$ ./a.out
17045651456
$ clang-after -O2 -std=gnu++17 repro.cc
$ ./a.out
16609148640

Sorry, the reproducer is not correct. It has not initialized variables, writes after array bounds etc. Unable to use it as a reproducer for the investigation.

Here is a repro I found

$ cat ./repro.cc
#include <iostream>
namespace a {
__attribute__((noinline)) void b(uint8_t *c, float *d, uint16_t e, uint16_t f) {
  uint16_t g = f;
  uint16_t h = c[0] << 2 | c[4] & 3;
  uint16_t l = c[1] << 2 | c[4] >> 2 & 3;
  uint16_t v = c[2] << 2 | 3;
  uint16_t j = c[3] << 2 | c[4] >> 6;
  *(d - 1) = float(l - e) / g;
  *(d - 2) = float(h - e) / g;
  *(d - 3) = float(j - e) / g;
  *(d - 4) = float(v - e) / g;
}
void bar(uint8_t *buffer, size_t k, size_t height, size_t, uint16_t e,
         uint16_t f, float *m) {
  int n = 24, g = f;
  float *t = &m[(24 - 2) * 4];
  size_t u = 2 * k;
  for (int o; o < height; o++) {
    uint8_t *p = buffer;
    uint8_t *q = p;
    float *r = &t[u];
    for (int a = 0; a < u; a += 4) {
      b(q, r, e, g);
      r -= 4;
    }
    t -= n;
  }
}
}  // namespace a
int main() {
  uint8_t s[]{255, 0, 255, 0, 51};
  float m[4 * 24]{};
  a::bar(s, 4, 4, 0, 0, 1023, m);
  int *out = reinterpret_cast<int *>(m);
  int64_t sum;
  for (int i = 0; i < sizeof(m) / sizeof(int); i++) sum += out[i];
  std::cout << sum;
}
$ clang-before -O2 -std=gnu++17 repro.cc
$ ./a.out
17045651456
$ clang-after -O2 -std=gnu++17 repro.cc
$ ./a.out
16609148640

Sorry, the reproducer is not correct. It has not initialized variables, writes after array bounds etc. Unable to use it as a reproducer for the investigation.

So if they are not able to produce UB-free reproducer, you should recommit the patch.

Here is a repro I found

$ cat ./repro.cc
#include <iostream>
namespace a {
__attribute__((noinline)) void b(uint8_t *c, float *d, uint16_t e, uint16_t f) {
  uint16_t g = f;
  uint16_t h = c[0] << 2 | c[4] & 3;
  uint16_t l = c[1] << 2 | c[4] >> 2 & 3;
  uint16_t v = c[2] << 2 | 3;
  uint16_t j = c[3] << 2 | c[4] >> 6;
  *(d - 1) = float(l - e) / g;
  *(d - 2) = float(h - e) / g;
  *(d - 3) = float(j - e) / g;
  *(d - 4) = float(v - e) / g;
}
void bar(uint8_t *buffer, size_t k, size_t height, size_t, uint16_t e,
         uint16_t f, float *m) {
  int n = 24, g = f;
  float *t = &m[(24 - 2) * 4];
  size_t u = 2 * k;
  for (int o; o < height; o++) {
    uint8_t *p = buffer;
    uint8_t *q = p;
    float *r = &t[u];
    for (int a = 0; a < u; a += 4) {
      b(q, r, e, g);
      r -= 4;
    }
    t -= n;
  }
}
}  // namespace a
int main() {
  uint8_t s[]{255, 0, 255, 0, 51};
  float m[4 * 24]{};
  a::bar(s, 4, 4, 0, 0, 1023, m);
  int *out = reinterpret_cast<int *>(m);
  int64_t sum;
  for (int i = 0; i < sizeof(m) / sizeof(int); i++) sum += out[i];
  std::cout << sum;
}
$ clang-before -O2 -std=gnu++17 repro.cc
$ ./a.out
17045651456
$ clang-after -O2 -std=gnu++17 repro.cc
$ ./a.out
16609148640

Sorry, the reproducer is not correct. It has not initialized variables, writes after array bounds etc. Unable to use it as a reproducer for the investigation.

So if they are not able to produce UB-free reproducer, you should recommit the patch.

I believe they have a reproducer and tried to reduce it but the tool just extracted/removed too much code here :). I'll try to check manually some parts of the code and wait for the actual reproducer, will recommit the patch in a couple of days if would not be able to find a bug/no correct reproducer provided.

Here is a repro I found

$ cat ./repro.cc
#include <iostream>
namespace a {
__attribute__((noinline)) void b(uint8_t *c, float *d, uint16_t e, uint16_t f) {
  uint16_t g = f;
  uint16_t h = c[0] << 2 | c[4] & 3;
  uint16_t l = c[1] << 2 | c[4] >> 2 & 3;
  uint16_t v = c[2] << 2 | 3;
  uint16_t j = c[3] << 2 | c[4] >> 6;
  *(d - 1) = float(l - e) / g;
  *(d - 2) = float(h - e) / g;
  *(d - 3) = float(j - e) / g;
  *(d - 4) = float(v - e) / g;
}
void bar(uint8_t *buffer, size_t k, size_t height, size_t, uint16_t e,
         uint16_t f, float *m) {
  int n = 24, g = f;
  float *t = &m[(24 - 2) * 4];
  size_t u = 2 * k;
  for (int o; o < height; o++) {
    uint8_t *p = buffer;
    uint8_t *q = p;
    float *r = &t[u];
    for (int a = 0; a < u; a += 4) {
      b(q, r, e, g);
      r -= 4;
    }
    t -= n;
  }
}
}  // namespace a
int main() {
  uint8_t s[]{255, 0, 255, 0, 51};
  float m[4 * 24]{};
  a::bar(s, 4, 4, 0, 0, 1023, m);
  int *out = reinterpret_cast<int *>(m);
  int64_t sum;
  for (int i = 0; i < sizeof(m) / sizeof(int); i++) sum += out[i];
  std::cout << sum;
}
$ clang-before -O2 -std=gnu++17 repro.cc
$ ./a.out
17045651456
$ clang-after -O2 -std=gnu++17 repro.cc
$ ./a.out
16609148640

Sorry, the reproducer is not correct. It has not initialized variables, writes after array bounds etc. Unable to use it as a reproducer for the investigation.

So if they are not able to produce UB-free reproducer, you should recommit the patch.

I believe they have a reproducer and tried to reduce it but the tool just extracted/removed too much code here :). I'll try to check manually some parts of the code and wait for the actual reproducer, will recommit the patch in a couple of days if would not be able to find a bug/no correct reproducer provided.

Exactly, we reduced the code quite aggressively. We'll try to do this more carefully this time. Please hold on while we're on it. Last time it took multiple hours until we got something we could feed to creduce.

Here is a repro I found

$ cat ./repro.cc
#include <iostream>
namespace a {
__attribute__((noinline)) void b(uint8_t *c, float *d, uint16_t e, uint16_t f) {
  uint16_t g = f;
  uint16_t h = c[0] << 2 | c[4] & 3;
  uint16_t l = c[1] << 2 | c[4] >> 2 & 3;
  uint16_t v = c[2] << 2 | 3;
  uint16_t j = c[3] << 2 | c[4] >> 6;
  *(d - 1) = float(l - e) / g;
  *(d - 2) = float(h - e) / g;
  *(d - 3) = float(j - e) / g;
  *(d - 4) = float(v - e) / g;
}
void bar(uint8_t *buffer, size_t k, size_t height, size_t, uint16_t e,
         uint16_t f, float *m) {
  int n = 24, g = f;
  float *t = &m[(24 - 2) * 4];
  size_t u = 2 * k;
  for (int o; o < height; o++) {
    uint8_t *p = buffer;
    uint8_t *q = p;
    float *r = &t[u];
    for (int a = 0; a < u; a += 4) {
      b(q, r, e, g);
      r -= 4;
    }
    t -= n;
  }
}
}  // namespace a
int main() {
  uint8_t s[]{255, 0, 255, 0, 51};
  float m[4 * 24]{};
  a::bar(s, 4, 4, 0, 0, 1023, m);
  int *out = reinterpret_cast<int *>(m);
  int64_t sum;
  for (int i = 0; i < sizeof(m) / sizeof(int); i++) sum += out[i];
  std::cout << sum;
}
$ clang-before -O2 -std=gnu++17 repro.cc
$ ./a.out
17045651456
$ clang-after -O2 -std=gnu++17 repro.cc
$ ./a.out
16609148640

Sorry, the reproducer is not correct. It has not initialized variables, writes after array bounds etc. Unable to use it as a reproducer for the investigation.

So if they are not able to produce UB-free reproducer, you should recommit the patch.

I believe they have a reproducer and tried to reduce it but the tool just extracted/removed too much code here :). I'll try to check manually some parts of the code and wait for the actual reproducer, will recommit the patch in a couple of days if would not be able to find a bug/no correct reproducer provided.

Exactly, we reduced the code quite aggressively. We'll try to do this more carefully this time. Please hold on while we're on it. Last time it took multiple hours until we got something we could feed to creduce.

That's what I thought. I would appreciate if you could send a better reproducer, though even in its current form it is usable and may help to find a bug. I believe it reveals a bug in vectorization of instructions with main/alternate opcode, will try to fix/improve it.

ABataev reopened this revision.Sep 17 2021, 2:33 PM
This revision is now accepted and ready to land.Sep 17 2021, 2:33 PM
ABataev updated this revision to Diff 373339.Sep 17 2021, 2:33 PM

Fixed reordering of the nodes with alternate opcodes.

Hi @ABataev ,

this time I've run creduce with asan and msan, so it should not do read out of bounds

cat repro.cc
#include <iostream>
namespace a {
__attribute__((noinline)) void b(uint8_t *c, float *d, uint16_t e, uint16_t f) {
  uint16_t g = f;
  uint16_t h = c[0] << 2 | (c[4] & 3);
  uint16_t m = c[1] << 2 | 2;
  uint16_t n = c[2] << 2 | 3;
  uint16_t j = c[3] << 2 | c[4] >> 6;
  *(d - 1) = float(m - e) / g;
  *(d - 2) = float(h - e) / g;
  *(d - 3) = float(j - e) / g;
  *(d - 4) = float(n - e) / g;
}
void bar(uint8_t *buffer, size_t k, size_t height, size_t l, uint16_t e,
         uint16_t f, float *output) {
  float *q = &output[4];
  uint16_t g = f;
  size_t r = k;
  for (size_t s = 0; s < height; s++) {
    uint8_t *o = buffer + l;
    uint8_t *t = o;
    float *p = &q[r];
    b(t, p, e, g);
  }
}
} // namespace a
int main() {
  uint8_t u[]{5, 5, 0, 0, 0};
  float output[4 * 4];
  a::bar(u, 4, 4, 0, 0, 3, output);
  int *out = reinterpret_cast<int *>(output);
  int64_t sum;
  for (size_t i = 0; i < sizeof sizeof(int); i++)
    sum = out[i];
  printf("%ld\n", sum);
}
$ # on revision before: 8441a8eea8007b9eaaaabf76055949180a702d6d
$ clang++ -Wall -Werror -Wextra -O2 -fno-exceptions  -stdlib=libc++ -std=gnu++17 repro.cc && ./a.out
1089120939
$ # revision 84cbd71c95923f9912512f3051c6ab548a99e016 
$ clang++ -Wall -Werror -Wextra -O2 -fno-exceptions  -stdlib=libc++ -std=gnu++17 repro.cc && ./a.out
1059760811

Hi @ABataev ,

this time I've run creduce with asan and msan, so it should not do read out of bounds

cat repro.cc
#include <iostream>
namespace a {
__attribute__((noinline)) void b(uint8_t *c, float *d, uint16_t e, uint16_t f) {
  uint16_t g = f;
  uint16_t h = c[0] << 2 | (c[4] & 3);
  uint16_t m = c[1] << 2 | 2;
  uint16_t n = c[2] << 2 | 3;
  uint16_t j = c[3] << 2 | c[4] >> 6;
  *(d - 1) = float(m - e) / g;
  *(d - 2) = float(h - e) / g;
  *(d - 3) = float(j - e) / g;
  *(d - 4) = float(n - e) / g;
}
void bar(uint8_t *buffer, size_t k, size_t height, size_t l, uint16_t e,
         uint16_t f, float *output) {
  float *q = &output[4];
  uint16_t g = f;
  size_t r = k;
  for (size_t s = 0; s < height; s++) {
    uint8_t *o = buffer + l;
    uint8_t *t = o;
    float *p = &q[r];
    b(t, p, e, g);
  }
}
} // namespace a
int main() {
  uint8_t u[]{5, 5, 0, 0, 0};
  float output[4 * 4];
  a::bar(u, 4, 4, 0, 0, 3, output);
  int *out = reinterpret_cast<int *>(output);
  int64_t sum;
  for (size_t i = 0; i < sizeof sizeof(int); i++)
    sum = out[i];
  printf("%ld\n", sum);
}
$ # on revision before: 8441a8eea8007b9eaaaabf76055949180a702d6d
$ clang++ -Wall -Werror -Wextra -O2 -fno-exceptions  -stdlib=libc++ -std=gnu++17 repro.cc && ./a.out
1089120939
$ # revision 84cbd71c95923f9912512f3051c6ab548a99e016 
$ clang++ -Wall -Werror -Wextra -O2 -fno-exceptions  -stdlib=libc++ -std=gnu++17 repro.cc && ./a.out
1059760811

Thanks again for the reproducer, checked it with the last version uploaded on Friday, the results are correct. I fixed the incorrect reordering of the nodes with the alternate instructions.

RKSimon accepted this revision.Sep 20 2021, 5:55 AM

Thanks again for the reproducer, checked it with the last version uploaded on Friday, the results are correct. I fixed the incorrect reordering of the nodes with the alternate instructions.

LGTM - if you can add another (useful) test case for the latest repro that'd be great.

Thanks again for the reproducer, checked it with the last version uploaded on Friday, the results are correct. I fixed the incorrect reordering of the nodes with the alternate instructions.

LGTM - if you can add another (useful) test case for the latest repro that'd be great.

Actually, already added. I used the previous reproducer, added test/Transforms/SLPVectorizer/X86/vectorize-reorder-alt-shuffle.ll which revealed the bug in the previous version.

This revision was automatically updated to reflect the committed changes.
goncharov reopened this revision.Oct 14 2021, 7:41 AM

Hi @ABataev ,

I have found another regression on the latest version of this change:

> cat repro.cc
#include <cstdio>
struct a {
  int b;
  int c;
  int o;
  int d;
};
class e {
public:
  e(int);
  void f(a *);
  int g;
  int h;
  int i;
};
void e::f(a *p) {
  fprintf(stderr, "MiscompiledFunction\n");
  fprintf(stderr, "%d, %d, %d, %d, %d, %d\n", p->b, p->c, p->o, p->d, h, g);
  int j = (p->b + p->c) / 2, k = (p->o + p->d) / 2, l, m;
  switch (i) {
  case 0:
  case 2:
    l = h - 1 - j;
    m = g - 1 - k;
  }
  p->b = l + p->b - j;
  p->c = l + p->c - j;
  p->o = m + p->o - k;
  p->d = m + p->d - k;
}
int n = 0;
e::e(int q) : g(0), h(0), i(q) {}
int main() {
  a *bb = new a{0, 9, 4, 0}, *p = bb;
  e r(n);
  r.f(p);
  printf("%d %d %d %d\n", bb->b, bb->c, bb->o, bb->d);
}
> # at 5661317f864abf750cf893c6a4cc7a977be0995a
> clang -Wall -Werror -Wextra -O3 -fno-exceptions -stdlib=libc++ -lc++ -std=gnu++17 repro.cc && ./a.out
-9 0 -1 -5
> # at bc69dd62c04a70d29943c1c06c7effed150b70e1
> clang -Wall -Werror -Wextra -O3 -fno-exceptions -stdlib=libc++ -lc++ -std=gnu++17 repro.cc && ./a.out
-7 2 -3 -7

It's quite fragile as e.g. removing fprintf(stderr, "MiscompiledFunction\n"); or running with -fsanitize=null "fixes" the output.

This revision is now accepted and ready to land.Oct 14 2021, 7:41 AM

Hi @ABataev ,

I have found another regression on the latest version of this change:

> cat repro.cc
#include <cstdio>
struct a {
  int b;
  int c;
  int o;
  int d;
};
class e {
public:
  e(int);
  void f(a *);
  int g;
  int h;
  int i;
};
void e::f(a *p) {
  fprintf(stderr, "MiscompiledFunction\n");
  fprintf(stderr, "%d, %d, %d, %d, %d, %d\n", p->b, p->c, p->o, p->d, h, g);
  int j = (p->b + p->c) / 2, k = (p->o + p->d) / 2, l, m;
  switch (i) {
  case 0:
  case 2:
    l = h - 1 - j;
    m = g - 1 - k;
  }
  p->b = l + p->b - j;
  p->c = l + p->c - j;
  p->o = m + p->o - k;
  p->d = m + p->d - k;
}
int n = 0;
e::e(int q) : g(0), h(0), i(q) {}
int main() {
  a *bb = new a{0, 9, 4, 0}, *p = bb;
  e r(n);
  r.f(p);
  printf("%d %d %d %d\n", bb->b, bb->c, bb->o, bb->d);
}
> # at 5661317f864abf750cf893c6a4cc7a977be0995a
> clang -Wall -Werror -Wextra -O3 -fno-exceptions -stdlib=libc++ -lc++ -std=gnu++17 repro.cc && ./a.out
-9 0 -1 -5
> # at bc69dd62c04a70d29943c1c06c7effed150b70e1
> clang -Wall -Werror -Wextra -O3 -fno-exceptions -stdlib=libc++ -lc++ -std=gnu++17 repro.cc && ./a.out
-7 2 -3 -7

It's quite fragile as e.g. removing fprintf(stderr, "MiscompiledFunction\n"); or running with -fsanitize=null "fixes" the output.

I'll check it, thanks!

Hi @ABataev ,

I have found another regression on the latest version of this change:

> cat repro.cc
#include <cstdio>
struct a {
  int b;
  int c;
  int o;
  int d;
};
class e {
public:
  e(int);
  void f(a *);
  int g;
  int h;
  int i;
};
void e::f(a *p) {
  fprintf(stderr, "MiscompiledFunction\n");
  fprintf(stderr, "%d, %d, %d, %d, %d, %d\n", p->b, p->c, p->o, p->d, h, g);
  int j = (p->b + p->c) / 2, k = (p->o + p->d) / 2, l, m;
  switch (i) {
  case 0:
  case 2:
    l = h - 1 - j;
    m = g - 1 - k;
  }
  p->b = l + p->b - j;
  p->c = l + p->c - j;
  p->o = m + p->o - k;
  p->d = m + p->d - k;
}
int n = 0;
e::e(int q) : g(0), h(0), i(q) {}
int main() {
  a *bb = new a{0, 9, 4, 0}, *p = bb;
  e r(n);
  r.f(p);
  printf("%d %d %d %d\n", bb->b, bb->c, bb->o, bb->d);
}
> # at 5661317f864abf750cf893c6a4cc7a977be0995a
> clang -Wall -Werror -Wextra -O3 -fno-exceptions -stdlib=libc++ -lc++ -std=gnu++17 repro.cc && ./a.out
-9 0 -1 -5
> # at bc69dd62c04a70d29943c1c06c7effed150b70e1
> clang -Wall -Werror -Wextra -O3 -fno-exceptions -stdlib=libc++ -lc++ -std=gnu++17 repro.cc && ./a.out
-7 2 -3 -7

It's quite fragile as e.g. removing fprintf(stderr, "MiscompiledFunction\n"); or running with -fsanitize=null "fixes" the output.

The fix is here: https://reviews.llvm.org/D111898

vporpo added a subscriber: vporpo.Nov 11 2021, 8:00 PM
ABataev closed this revision.Nov 30 2021, 6:10 AM
vdmitrie added inline comments.Feb 8 2022, 3:01 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
588–592

The description is stale.