[DAGCombiner] Enable SimplifyDemandedBits vector support for TRUNCATE…

Authored by RKSimon on Jan 21 2021, 4:58 AM.


[DAGCombiner] Enable SimplifyDemandedBits vector support for TRUNCATE (REAPPLIED).

Add DemandedElts support inside the TRUNCATE analysis.

REAPPLIED - this was reverted by @hans at rGa51226057fc3 due to an issue with vector shift amount types, which was fixed in rG935bacd3a724 and an additional test case added at rG0ca81b90d19d

Differential Revision: https://reviews.llvm.org/D56387

Event Timeline

Carrot added a subscriber: Carrot. Feb 11 2021, 3:42 PM

Compile the following cpp code

#include <vector>

struct S1 {
    int d;
    int f1 : 32;
    int f2 : 32;
    int index : 16;
    int f4 : 16;
    int f5 : 32;
    int f6 : 29;
};

struct S2 {
    const std::vector<S1>& A() const { return a; }

    std::vector<S1> a;
};

struct S3 {
    std::vector<int> data;

    void foo(const S2& B);
};

void S3::foo(const S2& B) {
    for (const S1& s : B.A())
        data[s.index] = s.d;
}

With this patch, LLVM generates

# %bb.0:
        movq    (%rsi), %rax
        movq    8(%rsi), %rcx
        cmpq    %rcx, %rax
        je      .LBB0_3
        .p2align        4, 0x90
.LBB0_1:                                # =>This Inner Loop Header: Depth=1
        movl    (%rax), %edx
        movq    (%rdi), %rsi                            // Wrong address computation
        movl    %edx, -4(%rsi)                        // for data[s.index]
        addq    $24, %rax
        cmpq    %rcx, %rax
        jne     .LBB0_1
.LBB0_3:                                # %._crit_edge

In the loop body, the address computation for data[s.index] is wrong.
Correct loop body is

.LBB0_1:                                # =>This Inner Loop Header: Depth=1
        movl    (%rax), %edx
        movl    12(%rax), %esi
        shll    $16, %esi
        movslq  %esi, %rsi
        sarq    $16, %rsi
        movq    (%rdi), %rcx
        movl    %edx, (%rcx,%rsi,4)
        addq    $24, %rax
        cmpq    %r8, %rax
        jne     .LBB0_1

Nice catch!

@Carrot please can you raise a bug - this affects 12.xxx so we need to track it (and hopefully reduce the test case...)

Reduced : https://godbolt.org/z/snajbG

define i32* @PRXXXXXX(i32* %base, i160* %ptr160) {
  %load160 = load i160, i160* %ptr160, align 4
  %shl = shl i160 %load160, 80
  %ashr160 = ashr i160 %shl, 112
  %trunc = trunc i160 %ashr160 to i64
  %ashr64 = ashr i64 %trunc, 32
  %gep = getelementptr inbounds i32, i32* %base, i64 %ashr64
  ret i32* %gep
}

craig.topper added inline comments.

The bug is in this change. The old code limited the shift amount to the truncate result bitwidth; the new code uses the shift's source bitwidth, which is larger, so it now tries to create a reduced-width shift where it wouldn't before. This shift ends up with an invalid shift amount and becomes undef.

RKSimon added inline comments. Feb 12 2021, 2:32 PM

Thanks for tracking it down - I got the src/dst bitwidths mixed up and thought it was covered by the in-range test in getValidShiftAmountConstant