This is an archive of the discontinued LLVM Phabricator instance.

[SLP]Do not emit extra shuffle for insertelements vectorization.
AbandonedPublic

Authored by ABataev on Aug 3 2021, 5:49 AM.

Details

Summary

If the vectorized insertelements instructions form indentity subvector
(the subvector at the beginning of the long vector), it is just enough
to extend the vector itself, no need to generate inserting subvector
shuffle.

Diff Detail

Event Timeline

ABataev created this revision.Aug 3 2021, 5:49 AM
ABataev requested review of this revision.Aug 3 2021, 5:49 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 3 2021, 5:49 AM
RKSimon accepted this revision.Aug 3 2021, 7:11 AM

LGTM - cheers

This revision is now accepted and ready to land.Aug 3 2021, 7:11 AM
This revision was landed with ongoing or failed builds.Aug 3 2021, 1:19 PM
This revision was automatically updated to reflect the committed changes.
srj added a subscriber: srj.Aug 4 2021, 9:30 AM

This appears to have injected a failure into Halide codegen; I'll be providing a repro case shortly.

This appears to have injected a failure into Halide codegen; I'll be providing a repro case shortly.

Thanks for the report, waiting for the repro.

srj added a comment.Aug 4 2021, 9:47 AM

This appears to have injected a failure into Halide codegen; I'll be providing a repro case shortly.

Thanks for the report, waiting for the repro.

Still trying to get something self-contained for you, but the difference in codegen is something like:

before:

%193 = shufflevector <2 x double> %179, <2 x double> poison, <4 x i32> <i32 0, i32 1, i32 undef, i32 undef>
...
%200 = insertelement <4 x double> %193, double 1.000000e+02, i32 3
%201 = insertelement <4 x double> %200, double %184, i32 2

after:

%193 = shufflevector <2 x double> %179, <2 x double> poison, <4 x i32> <i32 0, i32 1, i32 undef, i32 undef>
...
%200 = insertelement <4 x double> %193, double %184, i32 2

This appears to have injected a failure into Halide codegen; I'll be providing a repro case shortly.

Thanks for the report, waiting for the repro.

Still trying to get something self-contained for you, but the difference in codegen is something like:

before:

%193 = shufflevector <2 x double> %179, <2 x double> poison, <4 x i32> <i32 0, i32 1, i32 undef, i32 undef>
...
%200 = insertelement <4 x double> %193, double 1.000000e+02, i32 3
%201 = insertelement <4 x double> %200, double %184, i32 2

after:

%193 = shufflevector <2 x double> %179, <2 x double> poison, <4 x i32> <i32 0, i32 1, i32 undef, i32 undef>
...
%200 = insertelement <4 x double> %193, double %184, i32 2

Hmm, very strange, the patch removes extra shuffle instruction, it has nothing to do with the insertelements. Waiting for the reproducer for investigation.

srj added a comment.Aug 4 2021, 10:45 AM

Some raw info from debugging one of our likely failure cases:

ShuffleOrOp = InsertElement

E->dump() is:

0.
    Operand 0:
      <4 x double> <double poison, double poison, double poison, double 1.000000e+02>
        %269 = insertelement <4 x double> <double poison, double poison, double poison, double 1.000000e+02>, double %218, i32 0
    Operand 1:
        %218 = load double, double* %210, align 8, !tbaa !38
        %215 = load double, double* %214, align 8, !tbaa !38
    Scalars:
        %269 = insertelement <4 x double> <double poison, double poison, double poison, double 1.000000e+02>, double %218, i32 0
        %270 = insertelement <4 x double> %269, double %215, i32 1
    State: Vectorize
    MainOp:   %269 = insertelement <4 x double> <double poison, double poison, double poison, double 1.000000e+02>, double %218, i32 0
    AltOp:   %269 = insertelement <4 x double> <double poison, double poison, double poison, double 1.000000e+02>, double %218, i32 0
    VectorizedValue: NULL
    ReuseShuffleIndices: Empty
    ReorderIndices:
    UserTreeIndices:

VL0->dump() is:

%269 = insertelement <4 x double> <double poison, double poison, double poison, double 1.000000e+02>, double %218, i32 0

The initial value for V (i.e., vectorizeTree(E->getOperand(1))) is:

%217 = load <2 x double>, <2 x double>* %216, align 8, !tbaa !38

We calculate:

NumElts = 4, NumScalars = 2, IsIdentity = 1 , Offset = 0

Thus we replace V with Builder.CreateShuffleVector(V, Mask), yielding a new V:

%268 = shufflevector <2 x double> %217, <2 x double> poison, <4 x i32> <i32 0, i32 1, i32 undef, i32 undef>

Since Offset = 0, we (now) skip inserting the extra shufflevector, but this is wrong, since the extra shufflevector would produce:

%269 = shufflevector <4 x double> <double poison, double poison, double poison, double 1.000000e+02>, <4 x double> %268, <4 x i32> <i32 4, i32 5, i32 2, i32 3>
srj added a comment.Aug 4 2021, 10:56 AM

It's not entirely clear to me how the intended logic of this patch is supposed to work, but skipping the extra shufflevector when Offset == 0 seems puzzling here; in this case, we are definitely producing a meaningfully different result in this case (essentially, inserting value of 100 at index 3). I'm not sure what the correct fix is here (still looking) but I think this should give you enough info to make progress. (If not, we should should revert this change pending a proper fix.)

It's not entirely clear to me how the intended logic of this patch is supposed to work, but skipping the extra shufflevector when Offset == 0 seems puzzling here; in this case, we are definitely producing a meaningfully different result in this case (essentially, inserting value of 100 at index 3). I'm not sure what the correct fix is here (still looking) but I think this should give you enough info to make progress. (If not, we should should revert this change pending a proper fix.)

Yes, you're right, need an extra check here for the FirstInsert->getOperand(0), will fix it.

As this change is causing miscompiles on halide, I think it wise to revert until fixed. Any objections?

As this change is causing miscompiles on halide, I think it wise to revert until fixed. Any objections?

Doing it already, will be reverted soon.

srj added a comment.Aug 4 2021, 11:08 AM

As this change is causing miscompiles on halide, I think it wise to revert until fixed. Any objections?

To be fair, it's (potentially) causing miscompiles everywhere; Halide just exercises vectorization a lot more than most LLVM clients :-)

Doing it already, will be reverted soon.

Thanks! Please cc me on the updated patch and I will test it locally to verify that Halide is happy with it.

As this change is causing miscompiles on halide, I think it wise to revert until fixed. Any objections?

To be fair, it's (potentially) causing miscompiles everywhere; Halide just exercises vectorization a lot more than most LLVM clients :-)

Doing it already, will be reverted soon.

Thanks! Please cc me on the updated patch and I will test it locally to verify that Halide is happy with it.

Sure

ABataev reopened this revision.Aug 4 2021, 11:57 AM
This revision is now accepted and ready to land.Aug 4 2021, 11:57 AM
ABataev abandoned this revision.Aug 4 2021, 11:58 AM

Abandoned in favor of D107494