This is an archive of the discontinued LLVM Phabricator instance.

[SelectionDAG][RISCV][X86][AArch64][AMDGPU][PowerPC] Improve SimplifyDemandedBits for SHL with NUW/NSW flags.
Needs Revision · Public

Authored by liaolucy on Dec 25 2022, 7:59 PM.

Details

Summary

If the shift is NUW/NSW, then it does demand the high bits: the bits shifted out determine whether the no-wrap guarantee holds.

Diff Detail

Event Timeline

liaolucy created this revision. Dec 25 2022, 7:59 PM
liaolucy requested review of this revision. Dec 25 2022, 7:59 PM

Is this a case missing from DAGCombiner::BackwardsPropagateMask?

In fact, I prefer the following modifications:

diff --git a/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp b/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
index dbf318a85e9e..76aa3c6d7623 100644
--- a/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
@@ -1761,6 +1761,8 @@ bool TargetLowering::SimplifyDemandedBits(
       }
 
       APInt InDemandedMask = DemandedBits.lshr(ShAmt);
+      if (Op.getNode()->getFlags().hasNoUnsignedWrap())
+        InDemandedMask.setHighBits(ShAmt);
       if (SimplifyDemandedBits(Op0, InDemandedMask, DemandedElts, Known, TLO,
                                Depth + 1))
         return true;

But this modification is not friendly to x86.

If there are any other suggestions, please let me know.

What prevents SimplifyDemandedBits from turning this back into an aextload and causing an infinite loop?

llvm/lib/Target/RISCV/RISCVISelLowering.cpp
10313

Need to check that the SHL is the only user of the load or this will duplicate a load.

10317

You need to replace the chain output of the original load with the chain output of the new load.
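
The RISC-V combine itself is not shown in this archive, but a rough sketch of how a fold like (shl nuw (aextload X), C) -> (shl nuw (zextload X), C) could address both points might look like the following; the helper name and exact legality checks are hypothetical, not the actual patch:

static SDValue combineShlNuwOfAnyExtLoad(SDNode *N, SelectionDAG &DAG) {
  // Only handle shl nuw of a simple any-extending load.
  if (N->getOpcode() != ISD::SHL || !N->getFlags().hasNoUnsignedWrap())
    return SDValue();
  auto *Ld = dyn_cast<LoadSDNode>(N->getOperand(0));
  if (!Ld || Ld->getExtensionType() != ISD::EXTLOAD || !Ld->isSimple())
    return SDValue();
  // The SHL must be the only user of the load, otherwise this duplicates the load.
  if (!N->getOperand(0).hasOneUse())
    return SDValue();
  SDValue NewLd =
      DAG.getExtLoad(ISD::ZEXTLOAD, SDLoc(Ld), Ld->getValueType(0),
                     Ld->getChain(), Ld->getBasePtr(), Ld->getMemoryVT(),
                     Ld->getMemOperand());
  // Replace the chain output of the original load with the new load's chain.
  DAG.ReplaceAllUsesOfValueWith(SDValue(Ld, 1), NewLd.getValue(1));
  return DAG.getNode(ISD::SHL, SDLoc(N), N->getValueType(0), NewLd,
                     N->getOperand(1), N->getFlags());
}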

craig.topper requested changes to this revision. Dec 25 2022, 10:31 PM
This revision now requires changes to proceed. Dec 25 2022, 10:31 PM

@spatel @RKSimon @lebedev.ri is it a bug that we are neither demanding the upper bits for nuw/nsw nor removing the poison flags if SimplifyDemandedBits returns true?

FMF are pretty rare in DAG, so this could be a simple oversight. It does seem like we should drop poison flags in that case, yes.

liaolucy updated this revision to Diff 485325. Dec 26 2022, 5:43 PM
liaolucy retitled this revision from [RISCV] Add DAG combine to fold (shl nuw (aextload), C) -> (shl nuw (zextload), C). to [SelectionDAG][RISCV][X86][AArch64][AMDGPU][PowerPC] Improve SimplifyDemandedBits for SHL with NUW/NSW flags..
liaolucy edited the summary of this revision.

Improve SimplifyDemandedBits for SHL with NUW/NSW flags.

lebedev.ri requested changes to this revision. Dec 26 2022, 5:53 PM

I think it shouldn't demand them. The only reason it "needs" them is to satisfy its poison-generating flags. I think it would be better to keep not demanding them, and drop the poison-generating flags.
For example, consider:

%y = and %x, 15
%z = shl nuw %y, 4
 =>
%z = shl %x, 4

Now if you demand the high bits (and only because of nuw!), you suddenly can't look past the and.
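
A minimal sketch of that alternative (illustrative only, not the upstream code), assuming the same SHL case in TargetLowering::SimplifyDemandedBits as in the diff quoted above: keep the narrow demanded mask and clear the no-wrap flags whenever the operand is actually simplified.

      APInt InDemandedMask = DemandedBits.lshr(ShAmt);
      if (SimplifyDemandedBits(Op0, InDemandedMask, DemandedElts, Known, TLO,
                               Depth + 1)) {
        // The simplified operand may no longer satisfy the no-wrap
        // guarantees, so clear the poison-generating flags instead of
        // demanding the shifted-out high bits.
        SDNodeFlags Flags = Op.getNode()->getFlags();
        Flags.setNoUnsignedWrap(false);
        Flags.setNoSignedWrap(false);
        Op.getNode()->setFlags(Flags);
        return true;
      }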

This revision now requires changes to proceed. Dec 26 2022, 5:53 PM

You do have a point; in fact I just want to optimize RISCV. I have a question now: I see InstCombineSimplifyDemanded.cpp has similar support, can you help explain why?

You do have a point; in fact I just want to optimize RISCV.

Having re-read the whole thread, it is not obvious to me what the original problem being solved here is.

I have a question now: I see InstCombineSimplifyDemanded.cpp has similar support, can you help explain why?

Back-end optimizations, including this one, are mainly there to deal with optimization opportunities that arise during instruction lowering/legalization, and with other sequences that are not well exploitable in generic IR. OTOH, the middle end is where optimizations should happen in general. It's likely the InstCombineSimplifyDemanded code has the same bug.

It's a bug if we are both propagating flags and not accounting for them - we've definitely caught cases like that in IR.

I agree that it seems like the better choice would be to ignore/drop flags (the alternative is that flags could penalize optimization). But it's also possible that changing it could cause missed folds because we lost information by dropping flags. We probably just need to experiment and see what falls out from it.

arsenm added a subscriber: arsenm. Dec 27 2022, 9:01 AM
arsenm added inline comments.
llvm/test/CodeGen/AMDGPU/shl.ll
492 ↗(On Diff #485325)

This is worse

llvm/test/CodeGen/AMDGPU/wwm-reserved-spill.ll
48 ↗(On Diff #485325)

This is obviously better, but it is also -O0.

I'm not sure what this is trying to fix originally, but I think the test changes clearly show that we should be going in the opposite direction.

llvm/test/CodeGen/AArch64/arm64-shifted-sext.ll
199 ↗(On Diff #485325)

This seems to be a regression

llvm/test/CodeGen/AMDGPU/shl.ll
492 ↗(On Diff #485325)

This seems to be a regression

llvm/test/CodeGen/RISCV/rv64i-complex-float.ll
23 ↗(On Diff #485325)

This seems to be a regression

llvm/test/CodeGen/X86/parity.ll
404 ↗(On Diff #485325)

This seems to be a massive regression

533 ↗(On Diff #485325)

This seems to be a massive regression

657 ↗(On Diff #485325)

This seems to be a massive regression

llvm/test/CodeGen/X86/setcc.ll
76 ↗(On Diff #485325)

This seems to be a regression

llvm/test/CodeGen/X86/split-store.ll
176 ↗(On Diff #485325)

This seems to be a regression

197 ↗(On Diff #485325)

This seems to be a regression

217 ↗(On Diff #485325)

This seems to be a regression

llvm/test/CodeGen/X86/vector-shuffle-combining-avx512bwvl.ll
151 ↗(On Diff #485325)

This seems to be a regression

I'll try to fill in the original problem. See the test in aext-to-zext.ll

We initially have this DAG:

Initial selection DAG: %bb.0 'read:entry'                                        
SelectionDAG has 19 nodes:                                                       
  t0: ch,glue = EntryToken                                                       
  t2: i64,ch = CopyFromReg t0, Register:i64 %0                                   
  t3: i64 = Constant<0>                                                          
  t11: i16 = Constant<8>                                                         
              t8: i64 = add nuw t2, Constant:i64<1>                              
            t9: i8,ch = load<(load (s8) from %ir.arrayidx1)> t0, t8, undef:i64   
          t10: i16 = zero_extend t9                                              
        t13: i16 = shl nuw t10, Constant:i64<8>                                  
          t5: i8,ch = load<(load (s8) from %ir.adr)> t0, t2, undef:i64           
        t6: i16 = zero_extend t5                                                 
      t14: i16 = or t13, t6                                                      
    t15: i64 = zero_extend t14                                                   
  t17: ch,glue = CopyToReg t0, Register:i64 $x10, t15                            
  t18: ch = RISCVISD::RET_FLAG t17, Register:i64 $x10, t17:1

Part way through the first DAGCombine, t10 zero_extend is replaced by an any_extend because the extended bits aren't demanded. The users are a shl by 8 and a zero_extend that reads bits 63:16 of the shl.
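
For reference, the piece of SimplifyDemandedBits that allows this is roughly the ISD::ZERO_EXTEND handling; a simplified sketch (not the exact upstream code):

  SDValue Src = Op.getOperand(0);
  unsigned InBits = Src.getScalarValueSizeInBits();
  // If none of the extended (high) bits are demanded, a zero_extend can be
  // relaxed to an any_extend.
  if (DemandedBits.getActiveBits() <= InBits)
    return TLO.CombineTo(
        Op, TLO.DAG.getNode(ISD::ANY_EXTEND, SDLoc(Op), Op.getValueType(), Src));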

The extends get combined with the loads, leaving this DAG:

Optimized lowered selection DAG: %bb.0 'read:entry'                              
SelectionDAG has 15 nodes:                                                       
  t0: ch,glue = EntryToken                                                       
  t2: i64,ch = CopyFromReg t0, Register:i64 %0                                   
            t8: i64 = add nuw t2, Constant:i64<1>                                
          t20: i16,ch = load<(load (s8) from %ir.arrayidx1), anyext from i8> t0, t8, undef:i64
        t13: i16 = shl nuw t20, Constant:i64<8>                                  
        t21: i16,ch = load<(load (s8) from %ir.adr), zext from i8> t0, t2, undef:i64
      t14: i16 = or t13, t21                                                     
    t15: i64 = zero_extend t14                                                   
  t17: ch,glue = CopyToReg t0, Register:i64 $x10, t15                            
  t18: ch = RISCVISD::RET_FLAG t17, Register:i64 $x10, t17:1

We type-legalize and optimize to this:

SelectionDAG has 16 nodes:                                                       
  t0: ch,glue = EntryToken                                                       
  t2: i64,ch = CopyFromReg t0, Register:i64 %0                                   
            t8: i64 = add nuw t2, Constant:i64<1>                                
          t22: i64,ch = load<(load (s8) from %ir.arrayidx1), anyext from i8> t0, t8, undef:i64
        t23: i64 = shl nuw t22, Constant:i64<8>                                  
        t24: i64,ch = load<(load (s8) from %ir.adr), zext from i8> t0, t2, undef:i64
      t25: i64 = or t23, t24                                                     
    t27: i64 = and t25, Constant:i64<65535>                                      
  t17: ch,glue = CopyToReg t0, Register:i64 $x10, t27                            
  t18: ch = RISCVISD::RET_FLAG t17, Register:i64 $x10, t17:1

The t27 'and' would be unnecessary if we had two zextloads instead of a zextload and an aextload.

This patch happens to prevent the original zero_extend from becoming an any_extend, which solves the problem, but I don't think it's robust.