After merging the shuffles, we cannot rely on the previous shuffle
anymore and need to shrink the final shuffle, if it is required.
Reported in D92668
|  Differential  D93967  
[SLP]Need shrink the load vector after reordering. Authored by ABataev on Jan 1 2021, 8:47 AM. 
Details After merging the shuffles, we cannot rely on the previous shuffle Reported in D92668 
Diff Detail 
 Event Timeline
 Comment Actions This caused misoptimizations for armv7, where code that previously worked correctly now produce different results. (The code is clean under ubsan, so it shouldn't be relying on anything undefined.) The issue appears with https://martin.st/temp/interplayvideo-preproc.c, compiled with clang -target armv7-linux-gnueabihf -O2. The diff in generated code, before/after, looks like this:         vmov.32 d16[0], lr
        vmov.32 d16[1], r2
 .LBB27_3:                               @ %if.end
                                         @   in Loop: Header=BB27_4 Depth=1
        vmov.32 r2, d16[1]
        add     r3, r3, #1
        vmov.32 r5, d16[0]
        cmp     r3, #8
+       vdup.32 d16, d16[0]
        vmov.16 d17[1], r2
        vmov.16 d18[0], r5
        vdup.16 d21, d17[1]
        vdup.16 d20, d18[0]
        vst1.16 {d20, d21}, [r1], r12
        beq     .LBB27_9
 .LBB27_4:                               @ %for.body
                                         @ =>This Inner Loop Header: Depth=1
        tst     r3, #3
        bne     .LBB27_3If it loops back to .LBB27_3, the vector element d16[1] no longer has the value it was expected to have. Comment Actions Hi, thanks for the report, the patch is correct, the bug is in the existing code. Will fix it soon. Comment Actions If you changed the existing patch it won't be clear that this is something that needs review. Can you make a new review with the bug fix? Thanks! -eric | ||||||
This comment is obsolete after change.