Page MenuHomePhabricator

[SLP]Improve graph reordering.
AcceptedPublic

Authored by ABataev on Jun 28 2021, 6:12 AM.

Details

Summary

Reworked reordering algorithm. Originally, the compiler just tried to
detect the most common order in the reordarable nodes (loads, stores,
extractelements,extractvalues) and then fully rebuilding the graph in
the best order. This was not effecient, since it required an extra
memory and time for building/rebuilding tree, double the use of the
scheduling budget, which could lead to missing vectorization due to
exausted scheduling resources.

Patch provide 2-way approach for graph reodering problem. At first, all
reordering is done in-place, it doe not required tree
deleting/rebuilding, it just rotates the scalars/orders/reuses masks in
the graph node.

The first step (top-to bottom) rotates the whole graph, similarly to the previous
implementation. Compiler counts the number of the most used orders of
the graph nodes with the same vectorization factor and then rotates the
subgraph with the given vectorization factor to the most used order, if
it is not empty. Then repeats the same procedure for the subgraphs with
the smaller vectorization factor. We can do this because we still need
to reshuffle smaller subgraph when buildiong operands for the graph
nodes with lasrger vectorization factor, we can rotate just subgraph,
not the whole graph.

The second step (bottom-to-top) scans through the leaves and tries to
detect the users of the leaves which can be reordered. If the leaves can
be reorder in the best fashion, they are reordered and their user too.
It allows to remove double shuffles to the same ordering of the operands in
many cases and just reorder the user operations instead. Plus, it moves
the final shuffles closer to the top of the graph and in many cases
allows to remove extra shuffle because the same procedure is repeated
again and we can again merge some reordering masks and reorder user nodes
instead of the operands.

Also, patch improves cost model for gathering of loads, which improves
x264 benchmark in some cases.

Gives about +2% on AVX512 + LTO (more expected for AVX/AVX2) for {625,525}x264,
+3% for 508.namd, improves most of other benchmarks.
The compile and link time are almost the same, though in some cases it
should be better (we're not doing an extra instruction scheduling
anymore) + we may vectorize more code for the large basic blocks again
because of saving scheduling budget.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
dmgreen added inline comments.Jul 20 2021, 12:01 AM
llvm/test/Transforms/SLPVectorizer/AArch64/PR38339.ll
6–16 ↗(On Diff #359457)

There are some patches in https://reviews.llvm.org/D106241 to add some extra worst case costs.

Matt added a subscriber: Matt.Jul 20 2021, 7:01 AM

I think we're just waiting to confirm that D106241 fixes the regressions?

I think we're just waiting to confirm that D106241 fixes the regressions?

Yep.

D106241 had some dependencies, but I've rebased it to get it out of the way.

Please can you rebase to confirm the aarch64 regressions have gone

Please can you rebase to confirm the aarch64 regressions have gone

Sure, will do it later.

RKSimon accepted this revision.Jul 24 2021, 1:28 AM

LGTM

This revision is now accepted and ready to land.Jul 24 2021, 1:28 AM
This revision was landed with ongoing or failed builds.Jul 28 2021, 5:53 AM
This revision was automatically updated to reflect the committed changes.
fhahn added a subscriber: fhahn.Jul 28 2021, 6:13 AM

It looks like this change might be responsible for a build failure on GreenDragon: https://green.lab.llvm.org/green/job/clang-stage1-RA/22811/console

It looks like this change might be responsible for a build failure on GreenDragon: https://green.lab.llvm.org/green/job/clang-stage1-RA/22811/console

Yes, going to commit a small fix in a minute.

It looks like this change might be responsible for a build failure on GreenDragon: https://green.lab.llvm.org/green/job/clang-stage1-RA/22811/console

Yes, going to commit a small fix in a minute.

It still crashes for me, https://martin.st/temp/vf_perspective-preproc.c, with clang -target aarch64-w32-mingw32 -w -c -O2 vf_perspective-preproc.c.

Hi @ABataev, I ran into an issue when running the LLVM test-suite. It seems to be a different issue than the one that @mstorsjo reported.

I got it reduced to:

target triple = "aarch64-unknown-linux-gnu"

define void @foo() local_unnamed_addr {
entry:
  %0 = load volatile double, double* poison, align 8
  %1 = load volatile double, double* poison, align 8
  %2 = load volatile double, double* poison, align 8
  %3 = load volatile double, double* poison, align 8
  br label %for.body

for.body:                                         ; preds = %for.body, %entry
  %d30.0734 = phi double [ undef, %for.body ], [ %0, %entry ]
  %d01.0733 = phi double [ undef, %for.body ], [ %1, %entry ]
  %d11.0732 = phi double [ undef, %for.body ], [ %2, %entry ]
  %d21.0731 = phi double [ undef, %for.body ], [ %3, %entry ]
  br label %for.body
}

Run with: opt -slp-vectorizer -S < reduced.ll.

I had actually expected one of the aarch64-buildbots would have caught this, so not sure if there was something special about the way I ran the test-suite, but I don't believe so.

It looks like this change might be responsible for a build failure on GreenDragon: https://green.lab.llvm.org/green/job/clang-stage1-RA/22811/console

Yes, going to commit a small fix in a minute.

It still crashes for me, https://martin.st/temp/vf_perspective-preproc.c, with clang -target aarch64-w32-mingw32 -w -c -O2 vf_perspective-preproc.c.

Will check and fix it ASAP, thanks for the report.

Hi @ABataev, I ran into an issue when running the LLVM test-suite. It seems to be a different issue than the one that @mstorsjo reported.

I got it reduced to:

target triple = "aarch64-unknown-linux-gnu"

define void @foo() local_unnamed_addr {
entry:
  %0 = load volatile double, double* poison, align 8
  %1 = load volatile double, double* poison, align 8
  %2 = load volatile double, double* poison, align 8
  %3 = load volatile double, double* poison, align 8
  br label %for.body

for.body:                                         ; preds = %for.body, %entry
  %d30.0734 = phi double [ undef, %for.body ], [ %0, %entry ]
  %d01.0733 = phi double [ undef, %for.body ], [ %1, %entry ]
  %d11.0732 = phi double [ undef, %for.body ], [ %2, %entry ]
  %d21.0731 = phi double [ undef, %for.body ], [ %3, %entry ]
  br label %for.body
}

Run with: opt -slp-vectorizer -S < reduced.ll.

I had actually expected one of the aarch64-buildbots would have caught this, so not sure if there was something special about the way I ran the test-suite, but I don't believe so.

Thanks, will check it too.

Hi @ABataev, I ran into an issue when running the LLVM test-suite. It seems to be a different issue than the one that @mstorsjo reported.

I got it reduced to:

target triple = "aarch64-unknown-linux-gnu"

define void @foo() local_unnamed_addr {
entry:
  %0 = load volatile double, double* poison, align 8
  %1 = load volatile double, double* poison, align 8
  %2 = load volatile double, double* poison, align 8
  %3 = load volatile double, double* poison, align 8
  br label %for.body

for.body:                                         ; preds = %for.body, %entry
  %d30.0734 = phi double [ undef, %for.body ], [ %0, %entry ]
  %d01.0733 = phi double [ undef, %for.body ], [ %1, %entry ]
  %d11.0732 = phi double [ undef, %for.body ], [ %2, %entry ]
  %d21.0731 = phi double [ undef, %for.body ], [ %3, %entry ]
  br label %for.body
}

Run with: opt -slp-vectorizer -S < reduced.ll.

I had actually expected one of the aarch64-buildbots would have caught this, so not sure if there was something special about the way I ran the test-suite, but I don't believe so.

Investigated it, the crash is not related to this patch, caused by one of the previous patches. Going to publish a fix soon.

bjope added a subscriber: bjope.Jul 29 2021, 3:40 PM
bjope added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4326

With my OOT target I ended up here with getMinVecRegSize() returning 32.
E->getMainOp was a load like this %3 = load i24, i24* getelementptr inbounds ([128 x i24], [128 x i24]* @a_ua, i16 0, i16 3), so Sz was 24.

That gives a MinVF that is 0. And after some iterations the inner loop was entered with VF=0, which gives a slice that is empty, hitting assertions when doing Slice.front().

I haven't reduced this failure for any in-tree target (a bit short on people at the office here in the middle of summer). But maybe you should make sure MinVF doesn't go below 1 here (or maybe not even below 2) to avoid getting a VF that is less than 1 (or less than 2). Or check that VF is at least 1 (or 2) in the loop guard.

(I do not know if VF=1 makes any sense. Hence the alternatives above regarding 1 or 2.)

ABataev added inline comments.Jul 29 2021, 3:41 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4326

The fix is ready (see D107058), will commit it tomorrow.

bjope added inline comments.Jul 29 2021, 3:43 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4326

Ok, thanks!

hans added a subscriber: hans.Aug 2 2021, 4:49 AM

Just a heads up that we're seeing test failures in Chromium that bisect to this revision (http://crbug.com/1235252). Not sure what's going on yet, though.

Just a heads up that we're seeing test failures in Chromium that bisect to this revision (http://crbug.com/1235252). Not sure what's going on yet, though.

Hi, thanks for the report, will try to investigate it ASAP.

Just a heads up that we're seeing test failures in Chromium that bisect to this revision (http://crbug.com/1235252). Not sure what's going on yet, though.

Also, would be good if you could provide a reproducer and a command line for it. I took a look at the buildbot but was unable to get info on how to reproduce it.

hans added a comment.Aug 2 2021, 7:25 AM

Just a heads up that we're seeing test failures in Chromium that bisect to this revision (http://crbug.com/1235252). Not sure what's going on yet, though.

Also, would be good if you could provide a reproducer and a command line for it. I took a look at the buildbot but was unable to get info on how to reproduce it.

I don't have a reproducer yet, and it will take some work to get it. I just wanted to give a heads up, especially in case others are bisecting issues and also end up at this change.

hans added a comment.Aug 3 2021, 7:41 AM

Just a heads up that we're seeing test failures in Chromium that bisect to this revision (http://crbug.com/1235252). Not sure what's going on yet, though.

Also, would be good if you could provide a reproducer and a command line for it. I took a look at the buildbot but was unable to get info on how to reproduce it.

I don't have a reproducer yet, and it will take some work to get it. I just wanted to give a heads up, especially in case others are bisecting issues and also end up at this change.

Still not sure what's going on (and it could also be our code that's broken), but we're starting to get some IR to look at now: https://bugs.chromium.org/p/chromium/issues/detail?id=1235252#c15

Just a heads up that we're seeing test failures in Chromium that bisect to this revision (http://crbug.com/1235252). Not sure what's going on yet, though.

Also, would be good if you could provide a reproducer and a command line for it. I took a look at the buildbot but was unable to get info on how to reproduce it.

I don't have a reproducer yet, and it will take some work to get it. I just wanted to give a heads up, especially in case others are bisecting issues and also end up at this change.

Still not sure what's going on (and it could also be our code that's broken), but we're starting to get some IR to look at now: https://bugs.chromium.org/p/chromium/issues/detail?id=1235252#c15

Ok, thanks, I see the wrong mask for the loads, comparing these 2 examples. Looking at the problem already, this should help to fix it ASAP.

phawkins added a subscriber: phawkins.EditedAug 3 2021, 8:04 AM

I bisected a miscompilation in XLA to this change.

Repro:

At this commit

clang driver.cc module_0000.primitive_computation_broadcast_in_dim.3.ir-no-opt.ll -o a.out
./a.out buffer-assignment.txt

(no optimization)

and

clang -O3 -march=haswell driver.cc module_0000.primitive_computation_broadcast_in_dim.3.ir-no-opt.ll -o a.out
./a.out buffer-assignment.txt

produce different outputs: the first 4 values differ.

At the previous revision, both produce identical outputs.

A quick inspection of the optimized IR seems to show it reading memory out of bounds. Disabling SLP vectorization also fixes the problem.

I bisected a miscompilation in XLA to this change.

Repro:

At this commit

clang driver.cc module_0000.primitive_computation_broadcast_in_dim.3.ir-no-opt.ll -o a.out
./a.out buffer-assignment.txt

(no optimization)

and

clang -O3 -march=haswell driver.cc module_0000.primitive_computation_broadcast_in_dim.3.ir-no-opt.ll -o a.out
./a.out buffer-assignment.txt

produce different outputs: the first 4 values differ.

At the previous revision, both produce identical outputs.

A quick inspection of the optimized IR seems to show it reading memory out of bounds. Disabling SLP vectorization also fixes the problem.

Thanks, I suspect what is the cause of the bug, hope to prepare a fix later today.

hans added a comment.Aug 3 2021, 8:19 AM

Just a heads up that we're seeing test failures in Chromium that bisect to this revision (http://crbug.com/1235252). Not sure what's going on yet, though.

Also, would be good if you could provide a reproducer and a command line for it. I took a look at the buildbot but was unable to get info on how to reproduce it.

I don't have a reproducer yet, and it will take some work to get it. I just wanted to give a heads up, especially in case others are bisecting issues and also end up at this change.

Still not sure what's going on (and it could also be our code that's broken), but we're starting to get some IR to look at now: https://bugs.chromium.org/p/chromium/issues/detail?id=1235252#c15

We now have analysis of the bad IR: https://bugs.chromium.org/p/chromium/issues/detail?id=1235252#c16

Thanks, I suspect what is the cause of the bug, hope to prepare a fix later today.

Can you please revert to green in the meantime?

Just a heads up that we're seeing test failures in Chromium that bisect to this revision (http://crbug.com/1235252). Not sure what's going on yet, though.

Also, would be good if you could provide a reproducer and a command line for it. I took a look at the buildbot but was unable to get info on how to reproduce it.

I don't have a reproducer yet, and it will take some work to get it. I just wanted to give a heads up, especially in case others are bisecting issues and also end up at this change.

Still not sure what's going on (and it could also be our code that's broken), but we're starting to get some IR to look at now: https://bugs.chromium.org/p/chromium/issues/detail?id=1235252#c15

We now have analysis of the bad IR: https://bugs.chromium.org/p/chromium/issues/detail?id=1235252#c16

Thanks, I suspect what is the cause of the bug, hope to prepare a fix later today.

Can you please revert to green in the meantime?

It may take some time, there were other commits already. I'll revert it later if won't be able to prepare a quick fix today.

hans added a comment.Aug 3 2021, 8:35 AM

Can you please revert to green in the meantime?

It may take some time, there were other commits already. I'll revert it later if won't be able to prepare a quick fix today.

If there's already a chain of commits depending on this, that's an even stronger reason to revert. What if the quick fix doesn't fix everything? Then the commit chain has just become longer and even harder to revert.

I'd suggest reverting first, and then fixing without the stress.

Can you please revert to green in the meantime?

It may take some time, there were other commits already. I'll revert it later if won't be able to prepare a quick fix today.

If there's already a chain of commits depending on this, that's an even stronger reason to revert. What if the quick fix doesn't fix everything? Then the commit chain has just become longer and even harder to revert.

I'd suggest reverting first, and then fixing without the stress.

I would not call it a stress, the reason is known, but I’ll try to revert it.

if you can get alive-tv working from https://github.com/AliveToolkit/alive2

; ModuleID = '/tmp/b1.ll'                                                                                                                                                                          
source_filename = "/tmp/b1.ll"                                                                                                                                                                     
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"                                                                                                       
target triple = "x86_64-unknown-linux-gnu"                                                                                                                                                         
                                                                                                                                                                                                   
%"class.deqp::gls::ShaderEvalContext" = type { %"class.tcu::Vector", %"class.tcu::Vector", %"class.tcu::Vector", [4 x %"class.tcu::Vector"], [4 x %"struct.deqp::gls::ShaderEvalContext::ShaderSamp
ler"], %"class.tcu::Vector", i8, %"class.deqp::gls::QuadGrid"* }                                                                                                                                   
%"struct.deqp::gls::ShaderEvalContext::ShaderSampler" = type { %"class.tcu::Sampler", %"class.tcu::Texture2D"*, %"class.tcu::TextureCube"*, %"class.tcu::Texture2DArray"*, %"class.tcu::Texture3D"*
 }                                                                                                                                                                                                 
%"class.tcu::Sampler" = type { i32, i32, i32, i32, i32, i32, float, i8, i32, i32, %"class.rr::GenericVec4", i8, i32 }                                                                              
%"class.rr::GenericVec4" = type { %union.anon }                                                                                                                                                    
%union.anon = type { [4 x i32] }                                                                                                                                                                   
%"class.tcu::Texture2D" = type { %"class.tcu::TextureLevelPyramid", i32, i32, %"class.tcu::Texture2DView" }                                                                                        
%"class.tcu::TextureLevelPyramid" = type { %"class.tcu::TextureFormat", %"class.std::__Cr::vector", %"class.std::__Cr::vector.1" }                                                                 
%"class.tcu::TextureFormat" = type { i32, i32 }                                                                                                                                                    
%"class.std::__Cr::vector" = type { %"class.std::__Cr::__vector_base" }                                                                                                                            
%"class.std::__Cr::__vector_base" = type { %"class.de::ArrayBuffer"*, %"class.de::ArrayBuffer"*, %"class.std::__Cr::__compressed_pair" }                                                           
%"class.de::ArrayBuffer" = type { i8*, i64 }                                                                                                                                                       
%"class.std::__Cr::__compressed_pair" = type { %"struct.std::__Cr::__compressed_pair_elem" }                                                                                                       
%"struct.std::__Cr::__compressed_pair_elem" = type { %"class.de::ArrayBuffer"* }                                                                                                                   
%"class.std::__Cr::vector.1" = type { %"class.std::__Cr::__vector_base.2" }                                                                                                                        
%"class.std::__Cr::__vector_base.2" = type { %"class.tcu::PixelBufferAccess"*, %"class.tcu::PixelBufferAccess"*, %"class.std::__Cr::__compressed_pair.4" }                                         
%"class.tcu::PixelBufferAccess" = type { %"class.tcu::ConstPixelBufferAccess" }                                                                                                                    
%"class.tcu::ConstPixelBufferAccess" = type { %"class.tcu::TextureFormat", %"class.tcu::Vector.3", %"class.tcu::Vector.3", %"class.tcu::Vector.3", i8* }                                           
%"class.tcu::Vector.3" = type { [3 x i32] }                                                                                                                                                        
%"class.std::__Cr::__compressed_pair.4" = type { %"struct.std::__Cr::__compressed_pair_elem.5" }                                                                                                   
%"struct.std::__Cr::__compressed_pair_elem.5" = type { %"class.tcu::PixelBufferAccess"* }                                                                                                          
%"class.tcu::Texture2DView" = type <{ i32, [4 x i8], %"class.tcu::ConstPixelBufferAccess"*, i8, [7 x i8] }>                                                                                        
%"class.tcu::TextureCube" = type { %"class.tcu::TextureFormat", i32, [6 x %"class.std::__Cr::vector"], [6 x %"class.std::__Cr::vector.1"], %"class.tcu::TextureCubeView" }                         
%"class.tcu::TextureCubeView" = type <{ i32, [4 x i8], [6 x %"class.tcu::ConstPixelBufferAccess"*], i8, [7 x i8] }>                                                                                
%"class.tcu::Texture2DArray" = type { %"class.tcu::TextureLevelPyramid", i32, i32, i32, %"class.tcu::Texture2DArrayView" }                                                                         
%"class.tcu::Texture2DArrayView" = type { i32, %"class.tcu::ConstPixelBufferAccess"* }
%"class.tcu::Texture3D" = type { %"class.tcu::TextureLevelPyramid", i32, i32, i32, %"class.tcu::Texture3DView" }
%"class.tcu::Texture3DView" = type { i32, %"class.tcu::ConstPixelBufferAccess"* }
%"class.tcu::Vector" = type { [4 x float] }
%"class.deqp::gls::QuadGrid" = type opaque

define hidden void @_ZN4deqp5gles210Functional19eval_selection_vec4ERNS_3gls17ShaderEvalContextE(%"class.deqp::gls::ShaderEvalContext"* nocapture align 8 dereferenceable(528) %0) {
  %2 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 3, i64 0, i32 0, i64 2
  %3 = load float, float* %2, align 8
  %4 = fcmp ogt float %3, 0.000000e+00
  %5 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 3, i64 1, i32 0, i64 3
  %6 = load float, float* %5, align 4
  %7 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 3, i64 1, i32 0, i64 2
  %8 = load float, float* %7, align 8
  %9 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 3, i64 1, i32 0, i64 1
  %10 = load float, float* %9, align 4
  %11 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 3, i64 1, i32 0, i64 0
  %12 = load float, float* %11, align 8
  %13 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 3, i64 2, i32 0, i64 0
  %14 = load float, float* %13, align 8
  %15 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 3, i64 2, i32 0, i64 3
  %16 = load float, float* %15, align 4
  %17 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 3, i64 2, i32 0, i64 2
  %18 = load float, float* %17, align 8
  %19 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 3, i64 2, i32 0, i64 1
  %20 = load float, float* %19, align 4
  %21 = select i1 %4, float %6, float %14
  %22 = select i1 %4, float %8, float %16
  %23 = select i1 %4, float %10, float %18
  %24 = select i1 %4, float %12, float %20
  %25 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 5, i32 0, i64 0
  store float %21, float* %25, align 8
  %26 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 5, i32 0, i64 1
  store float %22, float* %26, align 4
  %27 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 5, i32 0, i64 2
  store float %23, float* %27, align 8
  %28 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 5, i32 0, i64 3
  store float %24, float* %28, align 4
  ret void
}
$ bin/opt -passes=slp-vectorizer -S /tmp/b1.ll -o /tmp/b2.ll
$ alive-tv /tmp/b1.ll /tmp/b2.ll

fails at this commit, and times out at the commit before
(sorry, couldn't get it working on https://alive2.llvm.org/ for some reason)

seems worth a revert

if you can get alive-tv working from https://github.com/AliveToolkit/alive2

; ModuleID = '/tmp/b1.ll'                                                                                                                                                                          
source_filename = "/tmp/b1.ll"                                                                                                                                                                     
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"                                                                                                       
target triple = "x86_64-unknown-linux-gnu"                                                                                                                                                         
                                                                                                                                                                                                   
%"class.deqp::gls::ShaderEvalContext" = type { %"class.tcu::Vector", %"class.tcu::Vector", %"class.tcu::Vector", [4 x %"class.tcu::Vector"], [4 x %"struct.deqp::gls::ShaderEvalContext::ShaderSamp
ler"], %"class.tcu::Vector", i8, %"class.deqp::gls::QuadGrid"* }                                                                                                                                   
%"struct.deqp::gls::ShaderEvalContext::ShaderSampler" = type { %"class.tcu::Sampler", %"class.tcu::Texture2D"*, %"class.tcu::TextureCube"*, %"class.tcu::Texture2DArray"*, %"class.tcu::Texture3D"*
 }                                                                                                                                                                                                 
%"class.tcu::Sampler" = type { i32, i32, i32, i32, i32, i32, float, i8, i32, i32, %"class.rr::GenericVec4", i8, i32 }                                                                              
%"class.rr::GenericVec4" = type { %union.anon }                                                                                                                                                    
%union.anon = type { [4 x i32] }                                                                                                                                                                   
%"class.tcu::Texture2D" = type { %"class.tcu::TextureLevelPyramid", i32, i32, %"class.tcu::Texture2DView" }                                                                                        
%"class.tcu::TextureLevelPyramid" = type { %"class.tcu::TextureFormat", %"class.std::__Cr::vector", %"class.std::__Cr::vector.1" }                                                                 
%"class.tcu::TextureFormat" = type { i32, i32 }                                                                                                                                                    
%"class.std::__Cr::vector" = type { %"class.std::__Cr::__vector_base" }                                                                                                                            
%"class.std::__Cr::__vector_base" = type { %"class.de::ArrayBuffer"*, %"class.de::ArrayBuffer"*, %"class.std::__Cr::__compressed_pair" }                                                           
%"class.de::ArrayBuffer" = type { i8*, i64 }                                                                                                                                                       
%"class.std::__Cr::__compressed_pair" = type { %"struct.std::__Cr::__compressed_pair_elem" }                                                                                                       
%"struct.std::__Cr::__compressed_pair_elem" = type { %"class.de::ArrayBuffer"* }                                                                                                                   
%"class.std::__Cr::vector.1" = type { %"class.std::__Cr::__vector_base.2" }                                                                                                                        
%"class.std::__Cr::__vector_base.2" = type { %"class.tcu::PixelBufferAccess"*, %"class.tcu::PixelBufferAccess"*, %"class.std::__Cr::__compressed_pair.4" }                                         
%"class.tcu::PixelBufferAccess" = type { %"class.tcu::ConstPixelBufferAccess" }                                                                                                                    
%"class.tcu::ConstPixelBufferAccess" = type { %"class.tcu::TextureFormat", %"class.tcu::Vector.3", %"class.tcu::Vector.3", %"class.tcu::Vector.3", i8* }                                           
%"class.tcu::Vector.3" = type { [3 x i32] }                                                                                                                                                        
%"class.std::__Cr::__compressed_pair.4" = type { %"struct.std::__Cr::__compressed_pair_elem.5" }                                                                                                   
%"struct.std::__Cr::__compressed_pair_elem.5" = type { %"class.tcu::PixelBufferAccess"* }                                                                                                          
%"class.tcu::Texture2DView" = type <{ i32, [4 x i8], %"class.tcu::ConstPixelBufferAccess"*, i8, [7 x i8] }>                                                                                        
%"class.tcu::TextureCube" = type { %"class.tcu::TextureFormat", i32, [6 x %"class.std::__Cr::vector"], [6 x %"class.std::__Cr::vector.1"], %"class.tcu::TextureCubeView" }                         
%"class.tcu::TextureCubeView" = type <{ i32, [4 x i8], [6 x %"class.tcu::ConstPixelBufferAccess"*], i8, [7 x i8] }>                                                                                
%"class.tcu::Texture2DArray" = type { %"class.tcu::TextureLevelPyramid", i32, i32, i32, %"class.tcu::Texture2DArrayView" }                                                                         
%"class.tcu::Texture2DArrayView" = type { i32, %"class.tcu::ConstPixelBufferAccess"* }
%"class.tcu::Texture3D" = type { %"class.tcu::TextureLevelPyramid", i32, i32, i32, %"class.tcu::Texture3DView" }
%"class.tcu::Texture3DView" = type { i32, %"class.tcu::ConstPixelBufferAccess"* }
%"class.tcu::Vector" = type { [4 x float] }
%"class.deqp::gls::QuadGrid" = type opaque

define hidden void @_ZN4deqp5gles210Functional19eval_selection_vec4ERNS_3gls17ShaderEvalContextE(%"class.deqp::gls::ShaderEvalContext"* nocapture align 8 dereferenceable(528) %0) {
  %2 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 3, i64 0, i32 0, i64 2
  %3 = load float, float* %2, align 8
  %4 = fcmp ogt float %3, 0.000000e+00
  %5 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 3, i64 1, i32 0, i64 3
  %6 = load float, float* %5, align 4
  %7 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 3, i64 1, i32 0, i64 2
  %8 = load float, float* %7, align 8
  %9 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 3, i64 1, i32 0, i64 1
  %10 = load float, float* %9, align 4
  %11 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 3, i64 1, i32 0, i64 0
  %12 = load float, float* %11, align 8
  %13 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 3, i64 2, i32 0, i64 0
  %14 = load float, float* %13, align 8
  %15 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 3, i64 2, i32 0, i64 3
  %16 = load float, float* %15, align 4
  %17 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 3, i64 2, i32 0, i64 2
  %18 = load float, float* %17, align 8
  %19 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 3, i64 2, i32 0, i64 1
  %20 = load float, float* %19, align 4
  %21 = select i1 %4, float %6, float %14
  %22 = select i1 %4, float %8, float %16
  %23 = select i1 %4, float %10, float %18
  %24 = select i1 %4, float %12, float %20
  %25 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 5, i32 0, i64 0
  store float %21, float* %25, align 8
  %26 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 5, i32 0, i64 1
  store float %22, float* %26, align 4
  %27 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 5, i32 0, i64 2
  store float %23, float* %27, align 8
  %28 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 5, i32 0, i64 3
  store float %24, float* %28, align 4
  ret void
}
$ bin/opt -passes=slp-vectorizer -S /tmp/b1.ll -o /tmp/b2.ll
$ alive-tv /tmp/b1.ll /tmp/b2.ll

fails at this commit, and times out at the commit before
(sorry, couldn't get it working on https://alive2.llvm.org/ for some reason)

seems worth a revert

Yep, I'll revert. The fix is pretty simple but I'll revert and recommit the patch with all the fixes again

ABataev reopened this revision.Tue, Aug 24, 9:23 AM
This revision is now accepted and ready to land.Tue, Aug 24, 9:23 AM
ABataev planned changes to this revision.Tue, Aug 24, 9:23 AM
ABataev updated this revision to Diff 368371.Tue, Aug 24, 9:24 AM

Reworked significantly patch + rebase.

This revision is now accepted and ready to land.Tue, Aug 24, 9:24 AM
RKSimon accepted this revision.Wed, Aug 25, 10:17 AM

LGTM with a few minors

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
540

Why didn't you move these up here instead of forward declaring?

1616–1638

This std::equal + pragma is repeated a lot in this method - worth pulling out?

3444

Indices?

ABataev added inline comments.Wed, Aug 25, 10:22 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
540

Just to reduce the number of changes, will move the functions here.

1616–1638

Ok, will try to transform it into a lambda or something similar.

3444

Yep, will fix it, thanks!

This revision was automatically updated to reflect the committed changes.

I'm seeing another crash bisected to this reland:

./build/rel/bin/opt -passes='default<Os>' -disable-output /tmp/a.ll

Instruction does not dominate all uses!

I'm seeing another crash bisected to this reland:

./build/rel/bin/opt -passes='default<Os>' -disable-output /tmp/a.ll

Instruction does not dominate all uses!

Ok, thanks for the reproducer. Going to revert the patch and investigate a crash.

alexfh added a subscriber: alexfh.Mon, Aug 30, 3:20 AM

Some of our tests started failing after rG84cbd71c95923f9912512f3051c6ab548a99e016 (and previously after rGa28234e37af877b2b4a23c2091c27fa18c155f9a). Could you revert it again while we're working on a test case?

Some of our tests started failing after rG84cbd71c95923f9912512f3051c6ab548a99e016 (and previously after rGa28234e37af877b2b4a23c2091c27fa18c155f9a). Could you revert it again while we're working on a test case?

Couldyou revert it yourself? I'm on vacation, will be back in 2 weeks.
I would appreciate it if you could send the reproducer. The patch os very complex, always triggers many different corner cases.

Here is a repro I found

$ cat ./repro.cc
#include <iostream>
namespace a {
__attribute__((noinline)) void b(uint8_t *c, float *d, uint16_t e, uint16_t f) {
  uint16_t g = f;
  uint16_t h = c[0] << 2 | c[4] & 3;
  uint16_t l = c[1] << 2 | c[4] >> 2 & 3;
  uint16_t v = c[2] << 2 | 3;
  uint16_t j = c[3] << 2 | c[4] >> 6;
  *(d - 1) = float(l - e) / g;
  *(d - 2) = float(h - e) / g;
  *(d - 3) = float(j - e) / g;
  *(d - 4) = float(v - e) / g;
}
void bar(uint8_t *buffer, size_t k, size_t height, size_t, uint16_t e,
         uint16_t f, float *m) {
  int n = 24, g = f;
  float *t = &m[(24 - 2) * 4];
  size_t u = 2 * k;
  for (int o; o < height; o++) {
    uint8_t *p = buffer;
    uint8_t *q = p;
    float *r = &t[u];
    for (int a = 0; a < u; a += 4) {
      b(q, r, e, g);
      r -= 4;
    }
    t -= n;
  }
}
}  // namespace a
int main() {
  uint8_t s[]{255, 0, 255, 0, 51};
  float m[4 * 24]{};
  a::bar(s, 4, 4, 0, 0, 1023, m);
  int *out = reinterpret_cast<int *>(m);
  int64_t sum;
  for (int i = 0; i < sizeof(m) / sizeof(int); i++) sum += out[i];
  std::cout << sum;
}
$ clang-before -O2 -std=gnu++17 repro.cc
$ ./a.out
17045651456
$ clang-after -O2 -std=gnu++17 repro.cc
$ ./a.out
16609148640

Here is a repro I found

$ cat ./repro.cc
#include <iostream>
namespace a {
__attribute__((noinline)) void b(uint8_t *c, float *d, uint16_t e, uint16_t f) {
  uint16_t g = f;
  uint16_t h = c[0] << 2 | c[4] & 3;
  uint16_t l = c[1] << 2 | c[4] >> 2 & 3;
  uint16_t v = c[2] << 2 | 3;
  uint16_t j = c[3] << 2 | c[4] >> 6;
  *(d - 1) = float(l - e) / g;
  *(d - 2) = float(h - e) / g;
  *(d - 3) = float(j - e) / g;
  *(d - 4) = float(v - e) / g;
}
void bar(uint8_t *buffer, size_t k, size_t height, size_t, uint16_t e,
         uint16_t f, float *m) {
  int n = 24, g = f;
  float *t = &m[(24 - 2) * 4];
  size_t u = 2 * k;
  for (int o; o < height; o++) {
    uint8_t *p = buffer;
    uint8_t *q = p;
    float *r = &t[u];
    for (int a = 0; a < u; a += 4) {
      b(q, r, e, g);
      r -= 4;
    }
    t -= n;
  }
}
}  // namespace a
int main() {
  uint8_t s[]{255, 0, 255, 0, 51};
  float m[4 * 24]{};
  a::bar(s, 4, 4, 0, 0, 1023, m);
  int *out = reinterpret_cast<int *>(m);
  int64_t sum;
  for (int i = 0; i < sizeof(m) / sizeof(int); i++) sum += out[i];
  std::cout << sum;
}
$ clang-before -O2 -std=gnu++17 repro.cc
$ ./a.out
17045651456
$ clang-after -O2 -std=gnu++17 repro.cc
$ ./a.out
16609148640

Thanks, this will help a lot!

Here is a repro I found

$ cat ./repro.cc
#include <iostream>
namespace a {
__attribute__((noinline)) void b(uint8_t *c, float *d, uint16_t e, uint16_t f) {
  uint16_t g = f;
  uint16_t h = c[0] << 2 | c[4] & 3;
  uint16_t l = c[1] << 2 | c[4] >> 2 & 3;
  uint16_t v = c[2] << 2 | 3;
  uint16_t j = c[3] << 2 | c[4] >> 6;
  *(d - 1) = float(l - e) / g;
  *(d - 2) = float(h - e) / g;
  *(d - 3) = float(j - e) / g;
  *(d - 4) = float(v - e) / g;
}
void bar(uint8_t *buffer, size_t k, size_t height, size_t, uint16_t e,
         uint16_t f, float *m) {
  int n = 24, g = f;
  float *t = &m[(24 - 2) * 4];
  size_t u = 2 * k;
  for (int o; o < height; o++) {
    uint8_t *p = buffer;
    uint8_t *q = p;
    float *r = &t[u];
    for (int a = 0; a < u; a += 4) {
      b(q, r, e, g);
      r -= 4;
    }
    t -= n;
  }
}
}  // namespace a
int main() {
  uint8_t s[]{255, 0, 255, 0, 51};
  float m[4 * 24]{};
  a::bar(s, 4, 4, 0, 0, 1023, m);
  int *out = reinterpret_cast<int *>(m);
  int64_t sum;
  for (int i = 0; i < sizeof(m) / sizeof(int); i++) sum += out[i];
  std::cout << sum;
}
$ clang-before -O2 -std=gnu++17 repro.cc
$ ./a.out
17045651456
$ clang-after -O2 -std=gnu++17 repro.cc
$ ./a.out
16609148640

Sorry, the reproducer is not correct. It has not initialized variables, writes after array bounds etc. Unable to use it as a reproducer for the investigation.

Here is a repro I found

$ cat ./repro.cc
#include <iostream>
namespace a {
__attribute__((noinline)) void b(uint8_t *c, float *d, uint16_t e, uint16_t f) {
  uint16_t g = f;
  uint16_t h = c[0] << 2 | c[4] & 3;
  uint16_t l = c[1] << 2 | c[4] >> 2 & 3;
  uint16_t v = c[2] << 2 | 3;
  uint16_t j = c[3] << 2 | c[4] >> 6;
  *(d - 1) = float(l - e) / g;
  *(d - 2) = float(h - e) / g;
  *(d - 3) = float(j - e) / g;
  *(d - 4) = float(v - e) / g;
}
void bar(uint8_t *buffer, size_t k, size_t height, size_t, uint16_t e,
         uint16_t f, float *m) {
  int n = 24, g = f;
  float *t = &m[(24 - 2) * 4];
  size_t u = 2 * k;
  for (int o; o < height; o++) {
    uint8_t *p = buffer;
    uint8_t *q = p;
    float *r = &t[u];
    for (int a = 0; a < u; a += 4) {
      b(q, r, e, g);
      r -= 4;
    }
    t -= n;
  }
}
}  // namespace a
int main() {
  uint8_t s[]{255, 0, 255, 0, 51};
  float m[4 * 24]{};
  a::bar(s, 4, 4, 0, 0, 1023, m);
  int *out = reinterpret_cast<int *>(m);
  int64_t sum;
  for (int i = 0; i < sizeof(m) / sizeof(int); i++) sum += out[i];
  std::cout << sum;
}
$ clang-before -O2 -std=gnu++17 repro.cc
$ ./a.out
17045651456
$ clang-after -O2 -std=gnu++17 repro.cc
$ ./a.out
16609148640

Sorry, the reproducer is not correct. It has not initialized variables, writes after array bounds etc. Unable to use it as a reproducer for the investigation.

So if they are not able to produce UB-free reproducer, you should recommit the patch.

Here is a repro I found

$ cat ./repro.cc
#include <iostream>
namespace a {
__attribute__((noinline)) void b(uint8_t *c, float *d, uint16_t e, uint16_t f) {
  uint16_t g = f;
  uint16_t h = c[0] << 2 | c[4] & 3;
  uint16_t l = c[1] << 2 | c[4] >> 2 & 3;
  uint16_t v = c[2] << 2 | 3;
  uint16_t j = c[3] << 2 | c[4] >> 6;
  *(d - 1) = float(l - e) / g;
  *(d - 2) = float(h - e) / g;
  *(d - 3) = float(j - e) / g;
  *(d - 4) = float(v - e) / g;
}
void bar(uint8_t *buffer, size_t k, size_t height, size_t, uint16_t e,
         uint16_t f, float *m) {
  int n = 24, g = f;
  float *t = &m[(24 - 2) * 4];
  size_t u = 2 * k;
  for (int o; o < height; o++) {
    uint8_t *p = buffer;
    uint8_t *q = p;
    float *r = &t[u];
    for (int a = 0; a < u; a += 4) {
      b(q, r, e, g);
      r -= 4;
    }
    t -= n;
  }
}
}  // namespace a
int main() {
  uint8_t s[]{255, 0, 255, 0, 51};
  float m[4 * 24]{};
  a::bar(s, 4, 4, 0, 0, 1023, m);
  int *out = reinterpret_cast<int *>(m);
  int64_t sum;
  for (int i = 0; i < sizeof(m) / sizeof(int); i++) sum += out[i];
  std::cout << sum;
}
$ clang-before -O2 -std=gnu++17 repro.cc
$ ./a.out
17045651456
$ clang-after -O2 -std=gnu++17 repro.cc
$ ./a.out
16609148640

Sorry, the reproducer is not correct. It has not initialized variables, writes after array bounds etc. Unable to use it as a reproducer for the investigation.

So if they are not able to produce UB-free reproducer, you should recommit the patch.

I believe they have a reproducer and tried to reduce it but the tool just extracted/removed too much code here :). I'll try to check manually some parts of the code and wait for the actual reproducer, will recommit the patch in a couple of days if would not be able to find a bug/no correct reproducer provided.

Here is a repro I found

$ cat ./repro.cc
#include <iostream>
namespace a {
__attribute__((noinline)) void b(uint8_t *c, float *d, uint16_t e, uint16_t f) {
  uint16_t g = f;
  uint16_t h = c[0] << 2 | c[4] & 3;
  uint16_t l = c[1] << 2 | c[4] >> 2 & 3;
  uint16_t v = c[2] << 2 | 3;
  uint16_t j = c[3] << 2 | c[4] >> 6;
  *(d - 1) = float(l - e) / g;
  *(d - 2) = float(h - e) / g;
  *(d - 3) = float(j - e) / g;
  *(d - 4) = float(v - e) / g;
}
void bar(uint8_t *buffer, size_t k, size_t height, size_t, uint16_t e,
         uint16_t f, float *m) {
  int n = 24, g = f;
  float *t = &m[(24 - 2) * 4];
  size_t u = 2 * k;
  for (int o; o < height; o++) {
    uint8_t *p = buffer;
    uint8_t *q = p;
    float *r = &t[u];
    for (int a = 0; a < u; a += 4) {
      b(q, r, e, g);
      r -= 4;
    }
    t -= n;
  }
}
}  // namespace a
int main() {
  uint8_t s[]{255, 0, 255, 0, 51};
  float m[4 * 24]{};
  a::bar(s, 4, 4, 0, 0, 1023, m);
  int *out = reinterpret_cast<int *>(m);
  int64_t sum;
  for (int i = 0; i < sizeof(m) / sizeof(int); i++) sum += out[i];
  std::cout << sum;
}
$ clang-before -O2 -std=gnu++17 repro.cc
$ ./a.out
17045651456
$ clang-after -O2 -std=gnu++17 repro.cc
$ ./a.out
16609148640

Sorry, the reproducer is not correct. It has not initialized variables, writes after array bounds etc. Unable to use it as a reproducer for the investigation.

So if they are not able to produce UB-free reproducer, you should recommit the patch.

I believe they have a reproducer and tried to reduce it but the tool just extracted/removed too much code here :). I'll try to check manually some parts of the code and wait for the actual reproducer, will recommit the patch in a couple of days if would not be able to find a bug/no correct reproducer provided.

Exactly, we reduced the code quite aggressively. We'll try to do this more carefully this time. Please hold on while we're on it. Last time it took multiple hours until we got something we could feed to creduce.

Here is a repro I found

$ cat ./repro.cc
#include <iostream>
namespace a {
__attribute__((noinline)) void b(uint8_t *c, float *d, uint16_t e, uint16_t f) {
  uint16_t g = f;
  uint16_t h = c[0] << 2 | c[4] & 3;
  uint16_t l = c[1] << 2 | c[4] >> 2 & 3;
  uint16_t v = c[2] << 2 | 3;
  uint16_t j = c[3] << 2 | c[4] >> 6;
  *(d - 1) = float(l - e) / g;
  *(d - 2) = float(h - e) / g;
  *(d - 3) = float(j - e) / g;
  *(d - 4) = float(v - e) / g;
}
void bar(uint8_t *buffer, size_t k, size_t height, size_t, uint16_t e,
         uint16_t f, float *m) {
  int n = 24, g = f;
  float *t = &m[(24 - 2) * 4];
  size_t u = 2 * k;
  for (int o; o < height; o++) {
    uint8_t *p = buffer;
    uint8_t *q = p;
    float *r = &t[u];
    for (int a = 0; a < u; a += 4) {
      b(q, r, e, g);
      r -= 4;
    }
    t -= n;
  }
}
}  // namespace a
int main() {
  uint8_t s[]{255, 0, 255, 0, 51};
  float m[4 * 24]{};
  a::bar(s, 4, 4, 0, 0, 1023, m);
  int *out = reinterpret_cast<int *>(m);
  int64_t sum;
  for (int i = 0; i < sizeof(m) / sizeof(int); i++) sum += out[i];
  std::cout << sum;
}
$ clang-before -O2 -std=gnu++17 repro.cc
$ ./a.out
17045651456
$ clang-after -O2 -std=gnu++17 repro.cc
$ ./a.out
16609148640

Sorry, the reproducer is not correct. It has not initialized variables, writes after array bounds etc. Unable to use it as a reproducer for the investigation.

So if they are not able to produce UB-free reproducer, you should recommit the patch.

I believe they have a reproducer and tried to reduce it but the tool just extracted/removed too much code here :). I'll try to check manually some parts of the code and wait for the actual reproducer, will recommit the patch in a couple of days if would not be able to find a bug/no correct reproducer provided.

Exactly, we reduced the code quite aggressively. We'll try to do this more carefully this time. Please hold on while we're on it. Last time it took multiple hours until we got something we could feed to creduce.

That's what I thought. I would appreciate if you could send a better reproducer, though even in its current form it is usable and may help to find a bug. I believe it reveals a bug in vectorization of instructions with main/alternate opcode, will try to fix/improve it.

ABataev reopened this revision.Fri, Sep 17, 2:33 PM
This revision is now accepted and ready to land.Fri, Sep 17, 2:33 PM
ABataev updated this revision to Diff 373339.Fri, Sep 17, 2:33 PM

Fixed reordering of the nodes with alternate opcodes.

Hi @ABataev ,

this time I've run creduce with asan and msan, so it should not do read out of bounds

cat repro.cc
#include <iostream>
namespace a {
__attribute__((noinline)) void b(uint8_t *c, float *d, uint16_t e, uint16_t f) {
  uint16_t g = f;
  uint16_t h = c[0] << 2 | (c[4] & 3);
  uint16_t m = c[1] << 2 | 2;
  uint16_t n = c[2] << 2 | 3;
  uint16_t j = c[3] << 2 | c[4] >> 6;
  *(d - 1) = float(m - e) / g;
  *(d - 2) = float(h - e) / g;
  *(d - 3) = float(j - e) / g;
  *(d - 4) = float(n - e) / g;
}
void bar(uint8_t *buffer, size_t k, size_t height, size_t l, uint16_t e,
         uint16_t f, float *output) {
  float *q = &output[4];
  uint16_t g = f;
  size_t r = k;
  for (size_t s = 0; s < height; s++) {
    uint8_t *o = buffer + l;
    uint8_t *t = o;
    float *p = &q[r];
    b(t, p, e, g);
  }
}
} // namespace a
int main() {
  uint8_t u[]{5, 5, 0, 0, 0};
  float output[4 * 4];
  a::bar(u, 4, 4, 0, 0, 3, output);
  int *out = reinterpret_cast<int *>(output);
  int64_t sum;
  for (size_t i = 0; i < sizeof sizeof(int); i++)
    sum = out[i];
  printf("%ld\n", sum);
}
$ # on revision before: 8441a8eea8007b9eaaaabf76055949180a702d6d
$ clang++ -Wall -Werror -Wextra -O2 -fno-exceptions  -stdlib=libc++ -std=gnu++17 repro.cc && ./a.out
1089120939
$ # revision 84cbd71c95923f9912512f3051c6ab548a99e016 
$ clang++ -Wall -Werror -Wextra -O2 -fno-exceptions  -stdlib=libc++ -std=gnu++17 repro.cc && ./a.out
1059760811