This is an archive of the discontinued LLVM Phabricator instance.

[SLP]Need shrink the load vector after reordering.
ClosedPublic

Authored by ABataev on Jan 1 2021, 8:47 AM.

Details

Summary

After merging the shuffles, we cannot rely on the previous shuffle
anymore and need to shrink the final shuffle, if it is required.

Reported in D92668

Diff Detail

Event Timeline

ABataev created this revision.Jan 1 2021, 8:47 AM
ABataev requested review of this revision.Jan 1 2021, 8:47 AM
Herald added a project: Restricted Project. · View Herald TranscriptJan 1 2021, 8:47 AM
anton-afanasyev added inline comments.Jan 1 2021, 9:26 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4265

This comment is obsolete after change.

ABataev updated this revision to Diff 314368.Jan 4 2021, 5:40 AM

Removed outdated comment.

This revision is now accepted and ready to land.Jan 4 2021, 6:59 AM

I can confirm this fixes the non-reduced test case we're seeing internally, thanks!

This revision was automatically updated to reflect the committed changes.

This caused misoptimizations for armv7, where code that previously worked correctly now produce different results. (The code is clean under ubsan, so it shouldn't be relying on anything undefined.)

The issue appears with https://martin.st/temp/interplayvideo-preproc.c, compiled with clang -target armv7-linux-gnueabihf -O2.

The diff in generated code, before/after, looks like this:

        vmov.32 d16[0], lr
        vmov.32 d16[1], r2
 .LBB27_3:                               @ %if.end
                                         @   in Loop: Header=BB27_4 Depth=1
        vmov.32 r2, d16[1]
        add     r3, r3, #1
        vmov.32 r5, d16[0]
        cmp     r3, #8
+       vdup.32 d16, d16[0]
        vmov.16 d17[1], r2
        vmov.16 d18[0], r5
        vdup.16 d21, d17[1]
        vdup.16 d20, d18[0]
        vst1.16 {d20, d21}, [r1], r12
        beq     .LBB27_9
 .LBB27_4:                               @ %for.body
                                         @ =>This Inner Loop Header: Depth=1
        tst     r3, #3
        bne     .LBB27_3

If it loops back to .LBB27_3, the vector element d16[1] no longer has the value it was expected to have.

This caused misoptimizations for armv7, where code that previously worked correctly now produce different results. (The code is clean under ubsan, so it shouldn't be relying on anything undefined.)

The issue appears with https://martin.st/temp/interplayvideo-preproc.c, compiled with clang -target armv7-linux-gnueabihf -O2.

The diff in generated code, before/after, looks like this:

        vmov.32 d16[0], lr
        vmov.32 d16[1], r2
 .LBB27_3:                               @ %if.end
                                         @   in Loop: Header=BB27_4 Depth=1
        vmov.32 r2, d16[1]
        add     r3, r3, #1
        vmov.32 r5, d16[0]
        cmp     r3, #8
+       vdup.32 d16, d16[0]
        vmov.16 d17[1], r2
        vmov.16 d18[0], r5
        vdup.16 d21, d17[1]
        vdup.16 d20, d18[0]
        vst1.16 {d20, d21}, [r1], r12
        beq     .LBB27_9
 .LBB27_4:                               @ %for.body
                                         @ =>This Inner Loop Header: Depth=1
        tst     r3, #3
        bne     .LBB27_3

If it loops back to .LBB27_3, the vector element d16[1] no longer has the value it was expected to have.

Hi, thanks for the report, the patch is correct, the bug is in the existing code. Will fix it soon.

ABataev reopened this revision.Jan 8 2021, 6:35 AM
This revision is now accepted and ready to land.Jan 8 2021, 6:35 AM
ABataev updated this revision to Diff 315381.Jan 8 2021, 6:35 AM

Bug fixes

If you changed the existing patch it won't be clear that this is something that needs review. Can you make a new review with the bug fix?

Thanks!

-eric

ABataev closed this revision.Jan 19 2021, 7:39 AM

If you changed the existing patch it won't be clear that this is something that needs review. Can you make a new review with the bug fix?

Thanks!

-eric

Did it, thanks!