This is an archive of the discontinued LLVM Phabricator instance.

[CGP] Strength reduce cmp (xor (a, -1), xor(b, -1)) => cmp (b, a)
Abandoned (Public)

Authored by dmgreen on Feb 6 2018, 1:35 AM.

Details

Summary

This is (hopefully) the last part of PR35875. The code sequence we currently
generate is slightly worse than the original form, due to higher register pressure
causing more spills on thumb1 cores with few registers.

The sequence:

    %an = xor i8 %a, -1
    %bn = xor i8 %b, -1
    %cmp15 = icmp ult i8 %an, %bn
    %cond = select i1 %cmp15, i8 %an, i8 %bn

is strength-reduced to:

    %an = xor i8 %a, -1
    %bn = xor i8 %b, -1
    %cmp15 = icmp ult i8 %b, %a
    %cond = select i1 %cmp15, i8 %an, i8 %bn

I originally tried to do this during ISel optimisation, but by that point constant
hoisting has transformed the -1s into hoisted constants and pulled them out into
higher blocks, so the matcher would need to look through truncate/zext/register
chains into different basic blocks. Instead it is done here in CodeGenPrepare,
late enough in the pipeline so as not to break the representation of min/max.
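
For illustration, here is a minimal sketch of the shape of the transform, written
against LLVM's IR pattern matchers (an assumed reconstruction, not the patch
itself). The xors stay in place for the select's users; only the compare operands
are swapped back to the original values:

    // Sketch only: swap icmp (xor a, -1), (xor b, -1) to icmp b, a.
    // Inverting both sides reverses the (un)signed order, so the
    // predicate stays the same with the operands exchanged.
    #include "llvm/IR/Instructions.h"
    #include "llvm/IR/PatternMatch.h"
    using namespace llvm;
    using namespace PatternMatch;

    static bool swapNotNotCompare(ICmpInst *Cmp) {
      Value *A, *B;
      if (!match(Cmp->getOperand(0), m_Not(m_Value(A))) ||
          !match(Cmp->getOperand(1), m_Not(m_Value(B))))
        return false;
      // The nots remain for the select users; only the compare changes.
      Cmp->setOperand(0, B);
      Cmp->setOperand(1, A);
      return true;
    }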

Diff Detail

Event Timeline

dmgreen created this revision. Feb 6 2018, 1:35 AM

I'm skeptical about trying to solve a register pressure / allocation problem this early in the pipeline (although I know it's easier to match in IR).

If we do justify this somehow, then it must be limited by TTI hooks. Breaking min/max patterns for targets that support those ops would cause regressions. For example, if we fix the matchers to see vector 'not' ops on AArch64, we'll go from:

        mvn     v0.16b, v0.16b
        mvn     v2.16b, v2.16b
        mvn     v1.16b, v1.16b
        umin    v3.4s, v0.4s, v2.4s
        umin    v3.4s, v3.4s, v1.4s
        sub     v0.4s, v0.4s, v3.4s
        sub     v1.4s, v1.4s, v3.4s
        sub     v2.4s, v2.4s, v3.4s
        bl      vuse4

To:

        mvn     v4.16b, v0.16b
        mvn     v5.16b, v2.16b
        cmhi    v0.4s, v0.4s, v2.4s
        mvn     v1.16b, v1.16b
        bsl     v0.16b, v4.16b, v5.16b
        umin    v3.4s, v0.4s, v1.4s
        sub     v0.4s, v4.4s, v3.4s
        sub     v1.4s, v1.4s, v3.4s
        sub     v2.4s, v5.4s, v3.4s
        bl      vuse4

We'd get a similar regression on x86. That failure would show up in test/CodeGen/AArch64/minmax-of-minmax.ll, but the pattern matching in this patch is artificially limited to scalars, which is why we don't see it currently.
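
To make the "limited by TTI hooks" point concrete, here is a rough sketch of the
sort of guard that could gate the swap. The legality query is a real
TargetLowering API, but keying the decision on UMIN/SMIN legality is an
illustrative assumption, not something this patch implements:

    // Sketch only: skip the swap when the target has native integer
    // min/max for this type (e.g. AArch64 umin/smin on vectors), so the
    // backend can still match the min/max idiom.
    #include "llvm/CodeGen/ISDOpcodes.h"
    #include "llvm/CodeGen/TargetLowering.h"
    using namespace llvm;

    static bool shouldSwapNotNotCompare(const TargetLowering &TLI, EVT VT) {
      return !TLI.isOperationLegalOrCustom(ISD::UMIN, VT) &&
             !TLI.isOperationLegalOrCustom(ISD::SMIN, VT);
    }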

Thanks for taking a look. I've been looking into this more today, trying to find a sensible target hook to put this behind and/or perhaps make it more specific. It's a great shame I can't just do this in ISel (right?), after selection has happened.

I agree it's not the best as it is, and there might be something else going on here: it's not specifically register-pressure related, but it causes extra spills in the thumb1 case as a knock-on effect. The swap seems to be better for thumb2/aarch64 codegen too (where there are more registers), at least when put in a loop. Perhaps because it can reason that it does not need to perform the uxtb's / AND 0xff's? In other cases it can make things worse, even on targets without an integer min/max instruction, like arm/aarch64. I know we have some issues with uxtbs that are unnecessary but difficult to prove removable. Some of the guys here have been looking into that lately.

Note that the changes to make the i8 instcombines legal got us ~12%. This gets us an extra 25% on top! At least on these targets, where it's hit us the hardest, and just from a one-line change, switching the operands.

spatel added a comment. Feb 6 2018, 9:52 AM

> Thanks for taking a look. I've been looking into this more today, trying to find a sensible target hook to put this behind and/or perhaps make it more specific. It's a great shame I can't just do this in ISel (right?), after selection has happened.

Can you file (or maybe it's already filed?) a bug report that shows the output for the ARM/Thumb target where things are falling apart? I can see it being harder to match in the DAG, but it's not clear to me why we can't do it. Another possibility is trying to switch the operands around in MachineCombiner if we can show some kind of win via MachineTraceMetrics.
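
For comparison, a hypothetical DAG-combine version of the same swap (a sketch,
not proposed code). As the summary notes, by the time this could run, constant
hoisting has often replaced the -1 with a hoisted constant in another block, so
isBitwiseNot would fail to match:

    // Sketch only: fold setcc (xor a, -1), (xor b, -1), pred into
    // setcc b, a, pred. isBitwiseNot and getSetCC are real SelectionDAG
    // APIs; the surrounding combine plumbing is omitted.
    #include "llvm/CodeGen/SelectionDAG.h"
    using namespace llvm;

    static SDValue combineSetCCOfNots(SDNode *N, SelectionDAG &DAG) {
      SDValue LHS = N->getOperand(0);
      SDValue RHS = N->getOperand(1);
      ISD::CondCode CC = cast<CondCodeSDNode>(N->getOperand(2))->get();
      if (isBitwiseNot(LHS) && isBitwiseNot(RHS))
        return DAG.getSetCC(SDLoc(N), N->getValueType(0),
                            RHS.getOperand(0), LHS.getOperand(0), CC);
      return SDValue();
    }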

> higher register pressure

You're essentially transforming SELECT_CC(ult, %an, %bn, %an, %bn) to SELECT_CC(ult, %b, %a, %an, %bn); that doesn't help register pressure at all, at least not on its own.

From your testcase, I guess the problem has something to do with legalization? We could increase register pressure if we zero-extend the operands to the compare, then don't reuse the zero-extended compare operands to produce the result. That seems like a problem we could solve more effectively some other way, though.

> We could increase register pressure if we zero-extend the operands to the compare, then don't reuse the zero-extended compare operands to produce the result.

Yeah, bang on. Looks like both of the cmp operands are uxtb'd, whereas the select operands are not.

Back to the drawing board with this one I guess.

dmgreen abandoned this revision. Feb 19 2018, 8:35 AM