We can do several optimizations for PDEP using computeKnownBits and SimplifyDemandedBits:
-If the MSBs of the output aren't demanded, the corresponding MSBs of the mask input aren't demanded either. We need to keep the mask bit at the most significant demanded output position and all mask bits below it.
-The number of possible ones in the mask determines how many of the LSBs of the other operand are demanded. Mask bits that the previous rule leaves undemanded should not be counted.
-The result will have zeros in every position where the mask is zero.
-Since non-mask input bits can only be output in their original position or a higher bit position, the result will have at least as many trailing zeros as the non-mask input.
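
The four rules above can be sanity-checked against a small software model of PDEP. The sketch below is not part of the patch and does not use LLVM's KnownBits or APInt. It works on concrete 32-bit values, and every name in it (pdep, smearRight, lowBits, checkPdepRules, exhaustiveCheck) is a hypothetical helper made up for illustration. It also assumes a fully known mask, whereas the patch reasons about the possible ones of a partially known mask.

```cpp
#include <cassert> // only needed by callers that assert on the checks
#include <cstdint>

// Software model of PDEP: deposit the low bits of Src, in order, into the
// positions where Mask has a one. All other result bits are zero.
static uint32_t pdep(uint32_t Src, uint32_t Mask) {
  uint32_t Result = 0;
  for (uint32_t Bit = 1; Mask != 0; Bit <<= 1) {
    uint32_t Lowest = Mask & (~Mask + 1); // isolate the lowest set mask bit
    if (Src & Bit)
      Result |= Lowest;
    Mask &= Mask - 1; // clear that mask bit
  }
  return Result;
}

static unsigned popCount(uint32_t V) {
  unsigned N = 0;
  for (; V != 0; V &= V - 1)
    ++N;
  return N;
}

static unsigned trailingZeros(uint32_t V) {
  if (V == 0)
    return 32;
  unsigned N = 0;
  for (; (V & 1) == 0; V >>= 1)
    ++N;
  return N;
}

// All bits at or below the highest set bit of V.
static uint32_t smearRight(uint32_t V) {
  V |= V >> 1;
  V |= V >> 2;
  V |= V >> 4;
  V |= V >> 8;
  V |= V >> 16;
  return V;
}

// A value with the low N bits set.
static uint32_t lowBits(unsigned N) {
  return N >= 32 ? ~0u : ((1u << N) - 1);
}

// Returns true if all four rules hold for one concrete input.
// Demanded is the set of output bits a user actually reads.
static bool checkPdepRules(uint32_t Src, uint32_t Mask, uint32_t Demanded) {
  uint32_t Full = pdep(Src, Mask);

  // Rule 1: mask bits above the highest demanded output bit can be dropped.
  uint32_t TrimmedMask = Mask & smearRight(Demanded);
  bool Rule1 = (pdep(Src, TrimmedMask) & Demanded) == (Full & Demanded);

  // Rule 2: only as many low source bits as the trimmed mask has ones matter.
  uint32_t TrimmedSrc = Src & lowBits(popCount(TrimmedMask));
  bool Rule2 = (pdep(TrimmedSrc, TrimmedMask) & Demanded) == (Full & Demanded);

  // Rule 3: the result is zero wherever the mask is zero.
  bool Rule3 = (Full & ~Mask) == 0;

  // Rule 4: source bits never move down, so trailing zeros are preserved.
  bool Rule4 = trailingZeros(Full) >= trailingZeros(Src);

  return Rule1 && Rule2 && Rule3 && Rule4;
}

// Checks every (Src, Mask, Demanded) triple below 2^Bits. Bits must be < 32,
// and small in practice, since the loop count is cubic in 2^Bits.
static bool exhaustiveCheck(unsigned Bits) {
  uint32_t Limit = 1u << Bits;
  for (uint32_t Src = 0; Src < Limit; ++Src)
    for (uint32_t Mask = 0; Mask < Limit; ++Mask)
      for (uint32_t Demanded = 0; Demanded < Limit; ++Demanded)
        if (!checkPdepRules(Src, Mask, Demanded))
          return false;
  return true;
}
```

For example, pdep(0x5, 0xF0) deposits source bits 0 and 2 at mask positions 4 and 6, giving 0x50. Calling exhaustiveCheck(5) runs the four checks over all 5-bit inputs and is expected to return true.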
Test cases have not been committed yet, but this patch was written against them so that it shows the test diffs. I wanted to get feedback on the tests before committing.
It might be worth moving this into a KnownBits::DepositBits helper - IIRC RISCV has a similar instruction being finalised - maybe also move the instcombine constant folding handling into APInt?