If every element of a vector is multiplied by a constant '0' or '1', then we can perform the multiply more efficiently as an AND with a clearing mask (which is likely to simplify further to a shuffle blend).
This was noticed when reviewing D87502, but it seems to help the idiv/irem-by-constant cases even more, as '0'/'1' values are often used for 'passthrough' cases.
I'm not sure what to make of the aarch64 changes: replacing a fused mla with and+add seems beneficial, as the mul is by far the costlier op, but I might be wrong.
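For reference, here is a minimal scalar sketch of the equivalence the summary relies on. It is not the actual DAG combine: the `V4` type and all function names are hypothetical, and a 4 x i32 vector is assumed purely for illustration. It shows that a multiply by a 0/1 constant vector matches an AND with a clearing mask, and that the mla form matches and+add.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a 4 x i32 vector; not an LLVM type.
using V4 = std::array<std::uint32_t, 4>;

// Reference form: multiply each lane by its constant (expected to be 0 or 1).
V4 mulByZeroOne(const V4 &X, const V4 &C) {
  V4 R{};
  for (std::size_t I = 0; I < R.size(); ++I)
    R[I] = X[I] * C[I];
  return R;
}

// Build the clearing mask: all-ones where the constant is 1, zero where it is 0.
V4 clearMask(const V4 &C) {
  V4 M{};
  for (std::size_t I = 0; I < M.size(); ++I) {
    assert(C[I] <= 1 && "only valid for 0/1 constants");
    M[I] = C[I] ? 0xFFFFFFFFu : 0u;
  }
  return M;
}

// Masked form: a lane-wise AND replaces the multiply.
V4 andWithMask(const V4 &X, const V4 &M) {
  V4 R{};
  for (std::size_t I = 0; I < R.size(); ++I)
    R[I] = X[I] & M[I];
  return R;
}

// Lane-wise add, used to compare the mla form with and+add.
V4 add(const V4 &A, const V4 &B) {
  V4 R{};
  for (std::size_t I = 0; I < R.size(); ++I)
    R[I] = A[I] + B[I]; // unsigned add wraps, like a vector integer add
  return R;
}
```

With these helpers, `add(Acc, mulByZeroOne(X, C))` models the fused mla and `add(Acc, andWithMask(X, clearMask(C)))` models the and+add replacement. Both produce the same lanes for any 0/1 constant vector `C`, so the only open question is the relative cost on the target.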
It took me 3 tries to read this right. I'd suggest something closer to