Current implementation of visitShuffleVectorInst emits incorrect IR, when shufflevector instruction is followed by multiple bitcast instructions. Consider the following example:
define void @test(<16 x i8> %w, i32* %o1, float* %o2) { %v = shufflevector <16 x i8> %w, <16 x i8> undef, <4 x i32> <i32 12, i32 13, i32 14, i32 15> %f = bitcast <4 x i8> %v to float %i = bitcast <4 x i8> %v to i32 store i32 %i, i32* %o1, align 4 store float %f, float* %o2, align 4 ret void }
If you try to run opt -instcombine over it, you'll get the following IR:
define void @test(<16 x i8> %w, i32* %o1, float* %o2) { %v.bc = bitcast <16 x i8> %w to <4 x i32> %v.extract = extractelement <4 x i32> %v.bc, i32 3 %v.extract1 = shufflevector <16 x i8> %w, <16 x i8> undef, <16 x i32> <i32 3, i32 4, i32 5, i32 6, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef> %v.bc2 = bitcast <16 x i8> %v.extract1 to <4 x float> %v.extract3 = extractelement <4 x float> %v.bc2, i32 0 store i32 %v.extract, i32* %o1, align 4 store float %v.extract3, float* %o2, align 4 ret void }
The weird shuffle %v.extract1 is emitted, because BegIdx is modified inside for-loop and becomes 3 instead of 12 (as it should be)
If you apply this patch you'll get following:
define void @test(<16 x i8> %w, i32* %o1, float* %o2) { %v.bc = bitcast <16 x i8> %w to <4 x i32> %v.extract = extractelement <4 x i32> %v.bc, i32 3 %v.bc1 = bitcast <16 x i8> %w to <4 x float> %v.extract2 = extractelement <4 x float> %v.bc1, i32 3 store i32 %v.extract, i32* %o1, align 4 store float %v.extract2, float* %o2, align 4 ret void }
The broken optimization causes incorrect calculation of DCT matrix with arm-neon extensions enabled and clang -O3