There are multiple possible ways to represent the X - urem X, Y pattern. SCEV was not canonicalizing between them, so the result you got depended on which form you happened to analyze. The sub representation appears to produce strictly inferior results in practice, so I decided to canonicalize to the Y * X/Y version.
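For reference, the two forms compute the same value for any unsigned X and nonzero Y, which is what makes picking a single canonical form safe. A minimal sketch that brute-forces this over a small range; the function names are illustrative only and are not SCEV APIs:

```python
def sub_urem_form(x: int, y: int) -> int:
    """X - (urem X, Y) for non-negative x and positive y."""
    return x - (x % y)


def mul_udiv_form(x: int, y: int) -> int:
    """Y * (X udiv Y), the form chosen as canonical."""
    return y * (x // y)


def forms_agree(limit: int = 64) -> bool:
    """Check both forms against each other for all x < limit, 0 < y < limit."""
    return all(
        sub_urem_form(x, y) == mul_udiv_form(x, y)
        for x in range(limit)
        for y in range(1, limit)
    )
```

Neither form can wrap here: X urem Y is at most X, and Y * (X udiv Y) is at most X.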
The motivation here is that runtime unroll produces the sub X - (and X, Y-1) pattern when Y is a power of two. SCEV is thus unable to recognize that an unrolled loop exits, because we don't figure out that the new unrolled step evenly divides the trip count of the unrolled loop. After instcombine runs, we convert to the andn form, which SCEV recognizes, so essentially this is just fixing a nasty pass ordering dependency.
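A small sketch of why the power-of-two case matters: the mask-based form, the andn form, and Y * X/Y all agree, and the result is always a multiple of Y, which is the divisibility fact the unrolled loop's exit test relies on. This assumes "andn form" means X & ~(Y-1); the function names are hypothetical:

```python
def runtime_unroll_form(x: int, y: int) -> int:
    """sub X - (and X, Y-1); only meaningful when y is a power of two."""
    return x - (x & (y - 1))


def andn_form(x: int, y: int) -> int:
    """X & ~(Y-1), assumed here to be the form instcombine produces."""
    return x & ~(y - 1)


def canonical_form(x: int, y: int) -> int:
    """Y * (X udiv Y)."""
    return y * (x // y)


def check_power_of_two_forms(limit: int = 256) -> bool:
    """All three forms agree and yield a multiple of y for y in 1, 2, 4, ..."""
    for shift in range(6):
        y = 1 << shift
        for x in range(limit):
            value = canonical_form(x, y)
            if runtime_unroll_form(x, y) != value or andn_form(x, y) != value:
                return False
            if value % y != 0:
                return False
    return True
```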
Why this appears to have a minor negative impact on hardware loop recognition on ARM, I have no idea. I definitely don't consider that a blocker. I can't even tell from the test if this is actually a regression - the test is too poorly structured to be informative.
What about the more general case of (-1 * urem X, Y) + X + Z --> ((-1 * urem X, Y) + X) + Z --> (Y * X/Y) + Z ?
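To make the question concrete, here is a sketch of the arithmetic behind it, modeling an 8-bit type with wraparound. It only shows that the fold is value-preserving when the urem term and X appear alongside another add operand Z; it says nothing about whether SCEV's n-ary add handling actually performs it. The bit width and names are assumptions for illustration:

```python
BITS = 8  # assumed width, small enough to check exhaustively
MASK = (1 << BITS) - 1


def add_operands_form(x: int, y: int, z: int) -> int:
    """(-1 * urem X, Y) + X + Z, evaluated with wraparound at BITS bits."""
    neg_rem = (-1 * (x % y)) & MASK
    return (neg_rem + x + z) & MASK


def folded_form(x: int, y: int, z: int) -> int:
    """(Y * X/Y) + Z, evaluated with wraparound at BITS bits."""
    return (y * (x // y) + z) & MASK


def general_case_holds() -> bool:
    """Compare both forms for every x, every nonzero y, and a few z values."""
    return all(
        add_operands_form(x, y, z) == folded_form(x, y, z)
        for x in range(MASK + 1)
        for y in range(1, MASK + 1)
        for z in (0, 1, 100, MASK)
    )
```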